Online Book Reader

Home Category

Mastering Algorithms With C - Kyle Loudon [66]

By Root 1583 0
will probably contain no more than one element. Of course, since uniform hashing is only approximated, in actuality we end up encountering somewhat more or less than what the load factor suggests. How close we come to uniform hashing ultimately depends on how well we select our hash function.

Selecting a Hash Function


The goal of a good hash function is to approximate uniform hashing, that is, to spread elements about a hash table in as uniform and random a manner as possible. A hash function h is a function we define to map a key k to some position x in a hash table. x is called the hash coding of k. Formally stated:

h(k) = x

Generally, most hashing methods assume k to be an integer so that it may be easily altered mathematically to make h distribute elements throughout the table more uniformly. When k is not an integer, we can usually coerce it into one without much difficulty.

Precisely how to coerce a set of keys depends a great deal on the characteristics of the keys themselves. Therefore, it is important to gain as much of a qualitative understanding of them in a particular application as we can. For example, if we were to hash the identifiers found in a program, we might observe that many have similar prefixes and suffixes since developers tend to gravitate toward variables such as sampleptr, simpleptr, and sentryptr. A poor way to coerce these keys would be any method depending strictly on characters at the beginning and end of the keys, since this would result in many of the same integers for k. On the other hand, we might try selecting characters from four positions that have the propensity to be somewhat random, permute them in a way that randomizes them further, and stuff them into specific bytes of a four-byte integer. Whatever approach we choose for coercing keys, the most important thing to remember, again, is that a hash function should distribute a set of keys about a hash table in a uniform and random manner.

Division method


Once we have a key k represented as an integer, one of the simplest hashing methods is to map it into one of m positions in a table by taking the remainder of k divided by m. This is called the division method. Formally stated:

h(k) = k mod m

Using this method, if the table has m = 1699 positions, and we hash the key k = 25,657, the hash coding is 25,657 mod 1699 = 172. Typically, we should avoid values for m that are powers of 2. This is because if m = 2 p, h becomes just the p lowest-order bits of k. Usually we choose m to be a prime number not too close to a power of 2, while considering storage constraints and load factor.

For example, if we expect to insert around n = 4500 elements into a chained hash table, we might choose m = 1699, a good prime number between 210 and 211. This results in a load factor of α = 4500/1699 ≈ 2.6, which indicates that generally two or three elements will reside in each bucket, assuming uniform hashing.

Multiplication method


An alternative to the division method is to multiply the integer key k by a constant A in the range < A < 1; extract the fractional part; multiply this value by the number of positions in the table, m; and take the floor of the result. Typically, A is chosen to be 0.618, which is the square root of 5, minus 1, all divided by 2. This method is called the multiplication method. Formally stated:

An advantage to this method is that m, the number of positions in the table, is not as critical as in the division method. For example, if the table contains m = 2000 positions, and we hash the key k = 6341, the hash coding is └(2000)((6341)(0.618) mod 1)┘ = └(2000)(3918.738 mod 1)┘ = └(2000)(0.738)┘ = 1476.

In a chained hash table, if we expect to insert no more than n = 4500 elements, we might let m = 2250. This results in a load factor of α = 4500/2250 = 2, which indicates that no more than two traversals should be required to locate an element in any bucket, assuming uniform hashing. Again, notice how this method of hashing allows more flexibility in choosing m to suit the maximum number of traversals acceptable

Return Main Page Previous Page Next Page

®Online Book Reader