Beautiful Code [228]
Second, real-life benchmarking is critical; only that way can you discover what's really worth doing.
18.1. Inside the Dictionary
Dictionaries are represented by a C structure, PyDictObject, defined in Include/dictobject.h. Here's a schematic of the structure representing a small dictionary mapping "aa", "bb", "cc", …, "mm" to the integers 1 to 13:
Code View: Scroll / Show All
int ma_fill 13
int ma_used 13
int ma_mask 31
PyDictEntry ma_table[]:
[0]: aa, 1 hash(aa) == -1549758592, -1549758592 & 31 = 0
[1]: ii, 9 hash(ii) == -1500461680, -1500461680 & 31 = 16
[2]: null, null
[3]: null, null
[4]: null, null
[5]: jj, 10 hash(jj) == 653184214, 653184214 & 31 = 22
[6]: bb, 2 hash(bb) == 603887302, 603887302 & 31 = 6
[7]: null, null
[8]: cc, 3 hash(cc) == -1537434360, -1537434360 & 31 = 8
[9]: null, null
[10]: dd, 4 hash(dd) == 616211530, 616211530 & 31 = 10
[11]: null, null
[12]: null, null
[13]: null, null
[14]: null, null
[15]: null, null
[16]: gg, 7 hash(gg) == -1512785904, -1512785904 & 31 = 16
[17]: ee, 5 hash(ee) == -1525110136, -1525110136 & 31 = 8
[18]: hh, 8 hash(hh) == 640859986, 640859986 & 31 = 18
[19]: null, null
[20]: null, null
[21]: kk, 11 hash(kk) == -1488137240, -1488137240 & 31 = 8
[22]: ff, 6 hash(ff) == 628535766, 628535766 & 31 = 22
[23]: null, null
[24]: null, null
[25]: null, null
[26]: null, null
[27]: null, null
[28]: null, null
[29]: ll, 12 hash(ll) == 665508394, 665508394 & 31 = 10
[30]: mm, 13 hash(mm) == -1475813016, -1475813016 & 31 = 8
[31]: null, null
The ma_prefix in the field names comes from the word mapping, Python's term for data types that provide key/value lookups. The fields in the structure are:
ma_used
Number of slots occupied by keys (in this case, 13).
ma_fill
Number of slots occupied by keys or by dummy entries (also 13).
ma_mask
Bitmask representing the size of the hash table. The hash table contains ma_mask+1 slots—in this case, 32. The number of slots in the table is always a power of 2, so this value is always of the form 2n–1 for some n, and therefore consists of n set bits.
ma_table
Pointer to an array of PyDictEntry structures. PyDictEntry contains pointers to:
The key object
The value object
A cached copy of the key's hash code
The hash value is cached for the sake of speed. When searching for a key, the exact hash values can be quickly compared before performing a slower, full equality comparison of the keys. Resizing a dictionary also requires the hash value for each key, so caching the value saves having to rehash all the keys when resizing.
We don't keep track directly of the number of slots in the table, but derive it instead as needed from ma_mask. When looking up the entry for a key, slot = hash & mask is used to figure out the initial slot for a particular hash value. For instance, the hash function for the first entry generated a hash of –1549758592, and –1549758592 mod 31 is 0, so the entry is stored in slot 0.
Because the mask is needed so often, we store it instead of the number of slots. It's easy to calculate the number of slots by adding 1, and we never need to do so in the most speed-critical sections of code.
ma_fill and ma_used are updated as objects are added and deleted. ma_used is the number of keys present in the dictionary; adding a new key increases it by 1, and deleting a key decreases it by 1. To delete a key, we make the appropriate slot point to a dummy key; ma_fill therefore remains the same when a key is deleted, but may increase by 1 when a new key is added. (ma_fill is never decremented, but will be given a new value when a dictionary is resized.)
Python's Dictionary Implementation: Being All Things to All People > Special Accommodations
18.2.