| T.R | Title | User | Personal Name | Date | Lines |
|---|---|---|---|---|---|
| 114.1 |  | CLT::GILBERT | eager like a child | Sun Nov 02 1986 10:25 | 10 | 
|  |     Check the journals.
    Many DEC compilers use a balanced binary tree for the symbol table,
    since neither performance nor storage are critical aspects.  The
    VMS logical name table (and the DCL symbol table?) uses hashing
    -- there, performance *is* critical.
    There's also been some work on finding perfect and near-perfect hash
    functions, for a (usually given) small set of identifiers.  Again,
    check the journals.
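    For illustration only (a toy example, not taken from the literature):
    a perfect hash for a small fixed set maps each member to a distinct
    slot with no collisions.  The keyword set and table size below are
    made up; (first char + last char) MOD 7 happens to be collision-free
    for the uppercase keywords IF, THEN, ELSE and END (slots 3, 1, 5, 4).

        function Keyword_Slot (Word : String) return Natural is
            First_Code : constant Natural := Character'Pos (Word (Word'First));
            Last_Code  : constant Natural := Character'Pos (Word (Word'Last));
        begin
            return (First_Code + Last_Code) mod 7;   -- perfect, though not minimal
        end Keyword_Slot;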
 | 
| 114.2 | Textbooks; open hashing; bad assumption | TURRIS::AMARTIN | Alan H. Martin | Sun Nov 02 1986 12:32 | 17 | 
|  | Yes, perfect hashing has been a popular topic during the '80s, and the
journals are probably the place to look for information.
Textbooks like Knuth, any subset of {Aho, Hopcroft, Sethi, Ullman},
Horowitz and Sahni, etc., contain more than enough information about
hash tables for the average application.
All the compilers I've worked on have used open hash tables,
which never become full - they only get progressively more cretinous
performance.  I'd say that the deciding factors for using open hashing
were the availability of heap storage for the table entries, and the desire
to avoid having to deal with tables that can become full.
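As a sketch, an open (chained) hash table looks roughly like this: entries
are heap-allocated and chains simply grow, so the table never fills.  The
bucket count, hash function, and names here are invented for the example,
not taken from any particular compiler.

    with Ada.Text_IO; use Ada.Text_IO;

    procedure Open_Hash_Demo is

       Bucket_Count : constant := 211;   -- any convenient prime

       type Name_Access is access String;
       type Entry_Record;
       type Entry_Access is access Entry_Record;
       type Entry_Record is record
          Name : Name_Access;
          Next : Entry_Access;           -- chain of entries sharing a bucket
       end record;

       Buckets : array (0 .. Bucket_Count - 1) of Entry_Access;

       function Bucket_Of (Name : String) return Natural is
          H : Natural := 0;
       begin
          for I in Name'Range loop
             H := (H * 31 + Character'Pos (Name (I))) mod Bucket_Count;
          end loop;
          return H;
       end Bucket_Of;

       procedure Insert (Name : String) is
          B : constant Natural := Bucket_Of (Name);
       begin
          Buckets (B) := new Entry_Record'(new String'(Name), Buckets (B));
       end Insert;

       function Present (Name : String) return Boolean is
          P : Entry_Access := Buckets (Bucket_Of (Name));
       begin
          while P /= null loop
             if P.Name.all = Name then
                return True;
             end if;
             P := P.Next;
          end loop;
          return False;
       end Present;

    begin
       Insert ("ALPHA");
       Insert ("BETA");
       Put_Line (Boolean'Image (Present ("ALPHA")));   -- TRUE
       Put_Line (Boolean'Image (Present ("GAMMA")));   -- FALSE
    end Open_Hash_Demo;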
I think it is wrong to assume that all languages have the same characteristic
set of identifiers.
				/AHM
 | 
| 114.3 |  | SMOP::GLOSSOP | Kent Glossop | Sun Nov 02 1986 16:36 | 12 | 
|  |     I attended a talk on randomized hash functions when I was at Michigan.
    The person that worked on it was from IBM (TJW?).  Anyway, they found
    that a general hashing function with randomly chosen constants out-
    performed the "hand-tuned" hash algorithm in the IBM FORTRAN compiler
    that was supposed to do very well for programs that have the "typical"
    FORTRAN identifiers like I, J, K, etc.  I can't give you any references
    offhand, but it might make you think twice about spending a lot of time
    pursuing the "ideal" hashing function...  Also, if you haven't done it
    already, you should use PCA to find your hot spots.  It may be that
    your time would be much better spent in other areas.
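    For what it's worth, a sketch of the general idea (not the IBM code
    itself; the package name, bucket count and multiplicative scheme are
    assumptions): the multiplier is drawn at random when the compiler
    starts, so no particular identifier style can be consistently unlucky.

        with Ada.Numerics.Discrete_Random;
        with Interfaces; use Interfaces;

        package Randomized_Hash is
            Bucket_Count : constant := 1024;
            function Hash (Name : String) return Natural;
        end Randomized_Hash;

        package body Randomized_Hash is
            package Random_Words is new Ada.Numerics.Discrete_Random (Unsigned_32);
            Gen        : Random_Words.Generator;
            Multiplier : Unsigned_32 := 16#9E37_79B9#;   -- default; replaced below

            function Hash (Name : String) return Natural is
                H : Unsigned_32 := 0;
            begin
                for I in Name'Range loop
                    H := H * Multiplier + Unsigned_32 (Character'Pos (Name (I)));
                end loop;
                return Natural (H mod Bucket_Count);
            end Hash;

        begin
            Random_Words.Reset (Gen);                      -- seed from the clock
            Multiplier := Random_Words.Random (Gen) or 1;  -- random odd multiplier
        end Randomized_Hash;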
    Kent Glossop (VAX PL/I)
 | 
| 114.4 | TPU's solution | CASEE::CLARK | Ward Clark | Mon Nov 10 1986 04:59 | 16 | 
|  | From:  DSSDEV::TANNENBAUM "TPU Developer"
    TPU has similar needs.  The algorithm we're in the process of
    implementing gets around the problems of growth by keeping a linked
    list of symbol tables.  When one fills up, we just add another.
    
    Each symbol table contains 32 hash buckets.  We choose our bucket by
    XORing the first and last character of the symbol, after stripping
    any known prefixes (like 'TPU$_' and 'EVE$_').  Each symbol entry
    is null-padded to a longword boundary, so we can do longword compares
    (for speed).  The symbol entries are stored in symbol length order, so
    we can drop out early. 
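    As a rough sketch of the bucket choice just described (the fold from
    the XOR value down to 32 buckets is a guess at that detail; the chained
    tables, longword padding and length ordering aren't shown):

        with Interfaces; use Interfaces;

        function Bucket_Of (Symbol : String) return Natural is
            First : Positive := Symbol'First;
        begin
            -- Strip known prefixes such as "TPU$_" or "EVE$_".
            if Symbol'Length > 5
              and then (Symbol (First .. First + 4) = "TPU$_"
                        or else Symbol (First .. First + 4) = "EVE$_")
            then
                First := First + 5;
            end if;

            -- XOR the first and last characters, then fold into 32 buckets.
            return Natural ((Unsigned_32 (Character'Pos (Symbol (First)))
                             xor Unsigned_32 (Character'Pos (Symbol (Symbol'Last))))
                            mod 32);
        end Bucket_Of;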
    
    If you want more details, feel free to send mail.
    
    	- Barry
 | 
| 114.5 | My hashing algorithm (copyright) | COMICS::DEMORGAN | Richard De Morgan, UK CSC/CS | Thu Sep 24 1987 11:16 | 28 | 
|  |     I have always used chained hashing, where all the names with the
    same hash code are chained together. For VAX, I advocate the following
    version (it is written in Ada; cshift is a circular (left is positive)
    shift and exor is exclusive OR). It generates a hash code in the
    range 0 .. 255, and has been found to be unbiased (i.e. the
    distribution of hash chain lengths is a Poisson distribution) over
    samples of up to 250,000 names. This function is taken from a
    recently-written Ada program that used a task-driven random number
    generator.
    
        FUNCTION hash(name : IN string; length : IN integer)
    	    RETURN integer IS
    
    	    hash_code : integer := 0;
    
    	BEGIN
    	    FOR character_index IN 1 .. length
    	    LOOP
    		hash_code := exor(cshift(hash_code, 4),
    		    char_to_int(name(character_index)));
    	    END LOOP;
    	    RETURN cshift(hash_code, -2*(length - 1)) MOD 256;
        END hash;
    
    N.B. It has the interesting property that the hash of a single-character
    identifier is the same as the character value. It is also not stem
    or tail sensitive. We used this in the PEARL compiler (32-bit version);
    for the 16-bit version, use shift amounts of 2 and 1 respectively.
 | 
| 114.6 |  | TLE::RMEYERS | Randy Meyers | Thu Sep 24 1987 16:55 | 15 | 
|  | Re: .5
I typically use a very similar algorithm based on circular shifts and
exclusive or.  The differences are that I usually only shift by one
in the inner loop, I don't shift at all outside of the loop, and I
divide by a prime number to get the hash index.
Dividing by a prime number has the effect that the resulting index
depends on every bit in the result of the loop.
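In outline, that variant looks something like this (the prime 509 is just
an example table size, and Rotate_Left here stands in for the cshift of .5):

    with Interfaces; use Interfaces;

    function Hash_Index (Name : String) return Natural is
       Table_Size : constant := 509;     -- a prime
       Code       : Unsigned_32 := 0;
    begin
       for I in Name'Range loop
          Code := Rotate_Left (Code, 1) xor Unsigned_32 (Character'Pos (Name (I)));
       end loop;
       return Natural (Code mod Table_Size);   -- every bit of Code affects the index
    end Hash_Index;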
Like you, I have noticed the rotate-xor family of hashing algorithms
does produce very even use of all the buckets in the hash table.
Would you care to comment on why you shift by 4 in the inner loop,
and the function of the final shift before the division?
 | 
| 114.7 | some reasons | COMICS::DEMORGAN | Richard De Morgan, UK CSC/CS | Fri Sep 25 1987 09:53 | 7 | 
|  |     There are a number of reasons, but the main ones are that it spreads
    the bits of all the characters around, taking into account the
    character encoding. The final shift ensures that single character
    identifiers have unique hashes. The algorithm was also designed
    to take into account the possibility of implementation on machines
    with simple instruction sets, possibly no integer divide (the MOD
    should be optimized to an AND).
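    For instance (a small aside, not part of the original reasoning): with
    a power-of-two range such as 0 .. 255, the MOD really does reduce to a
    mask, so no divide instruction is needed.

        with Interfaces; use Interfaces;

        function Fold_To_Byte (Hash_Code : Unsigned_32) return Natural is
        begin
            return Natural (Hash_Code and 16#FF#);   -- same value as Hash_Code MOD 256
        end Fold_To_Byte;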
 |