Beautiful Code [230]
Python's Dictionary Implementation: Being All Things to All People > Collisions
18.3. Collisions
For any hash table implementation, an important decision is what to do when two keys hash to the same slot. One approach is chaining (see http://en.wikipedia.org/wiki/Hash_table#Chaining): each slot is the head of a linked list containing all the items that hash to that slot. Python doesn't take this approach because creating linked lists would require allocating memory for each list item, and memory allocations are relatively slow operations. Following all the linked-list pointers would also probably reduce cache locality.
The alternative approach is open addressing (see http://en.wikipedia.org/wiki/Hash_table#Open_addressing): if the first slot i that is tried doesn't contain the key, other slots are tried in a fixed pattern. The simplest pattern is called linear probing: if slot i is full, try i+1, i+2, i+3, and so on, wrapping around to slot 0 when the end of the table is reached. Linear probing would be wasteful in Python because many programs use consecutive integers as keys, resulting in blocks of filled slots. Linear probing would frequently scan these blocks, resulting in poor performance. Instead, Python uses a more complicated pattern:
/* Starting slot */
slot = hash;
/* Initial perturbation value */
perturb = hash;
while ( slot = (5*slot) + 1 + perturb; perturb >>= 5; } In the C code, 5*slot is written using bit shifts and addition as (slot<<2) + slot. The perturbation factor perturb starts out as the full hash code; its bits are then progressively shifted downward 5 bits at a time. This shift ensures that every bit in the hash code will affect the probed slot index fairly quickly. Eventually the perturbation factor becomes zero, and the pattern becomes simply slot=(5*slot)+1. This eventually generates every integer between 0 and ma_mask, so the search is guaranteed to eventually find either the key (on a search operation) or an empty slot (on an insert operation). The shift value of 5 bits was chosen by experiment; 5 bits minimized collisions slightly better than 4 or 6 bits, though the difference wasn't significant. Earlier versions of this code used more complicated operations such as multiplication or division, but though these versions had excellent collision statistics, the calculation ran slightly more slowly. (The extensive comments in Objects/dictobject.c discuss the history of this optimization in more detail.) Python's Dictionary Implementation: Being All Things to All People > Resizing 18.4. Resizing The size of a dictionary's hash table needs to be adjusted as keys are added. The code aims to keep the table two-thirds full; if a dictionary is holding n keys, the table must have at least n/ (2/3) slots. This ratio is a trade-off: filling the table more densely results in more collisions when searching for a key, but uses less memory and therefore fits into cache better. Experiments have been tried where the 2/3 ratio is adjusted depending on the size of the dictionary, but they've shown poor results; every insert operation has to check whether the dictionary needed to be resized, and the complexity that the check adds to the insert operation slows things down. 18.4.1. Determining the New Table Size When a dictionary needs to be resized, how should the new size be determined? For small-