HashMap Hash Function

September 02, 2019 4 minutes

Assuming that you are already aware about the concept of hashing and hash functions in general, this post tries to look into the details of how the hash function is implemented in java8 HashMap.

Let’s take a quick look at the below mentioned snippet from java8 (1.8.0_221) - HashMap class:

// Computes key.hashCode() and spreads (XORs) higher bits of hash
// to lower.  Because the table uses power-of-two masking, sets of
// hashes that vary only in bits above the current mask will
// always collide. 
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

From the above mentioned code, we can see that the hash method:

  1. Returns $0$ as the hashCode for null key(s).
  2. Recomputes the hashCode by performing an XOR operation on the original hashCode returned by the key and its right shifted value, otherwise.

Can you guess what is the need for doing this XOR operation? and what is the significance of Unsigned Right Shift here? Hint - pay attention to the javadoc on the method.

To understand the above mentioned points, we need to look at the way HashMap identifies the index to be used for storing the entries (Node<K,V> in this case):

//from putVal method
if ((p = tab[i = (n - 1) & hash]) == null){
     tab[i] = newNode(hash, key, value, null);
}

To be able to store entries in the table (an array in essence), we should be able to map those to some index in the range $[0, n-1]$, where n is the size of the array.

The main reason that the hashCode from the key is not used as an index as-is, is that it can be a very large number outside this range (think of poorly overridden hashCode method). So we need a way to restrict it to the above mentioned range. The simplest approach that comes to mind is to modulo the hashCode with the size of the table(n). This will ensure that the resultant is always between the range $[0,n-1]$.

As the table size is always a power of $2$, java8 uses bitwise operators to perform the modulo operation. In other words, instead of using $(hash \% n)$, the modulo operation is performed using $(n-1) \& hash$.

x % n == x & (n-1) $\iff$ n is power of 2 and x > 0

So now, as we understand that the hash value returned by hash method will be used to calculate the index on which the key-value pair will map, we come back to our original question that why the hash is calculated the way its mentioned above.

To answer this, let’s take an example of two numbers (or hashcodes) that differ only in their higher order bits and have same lower order bits:

  • $4 : 00000100$
  • $68: 01000100$

Also, lets assume we have the table of size $8(00001000)$.

Now when the modulo operation is performed to identify the index on which these two entries should map, both of these numbers map to the same location:

$$ 4 \% 8 == 4 \& (8-1) = 4 \\
68 \% 8 == 68 \& (8-1) = 4 $$

But why did that happen?

This happened because of the following:

  1. n being a power of 2, only one of its bit will be set. Ex: 8(00001000)
  2. And hence, n-1 will have all of its bits set upto the only set bit in n. Ex: 7(00000111)

This means that, any bit in hashCode which is at a higher order than the highest set bit in n-1 will not participate in the & operation ($hash \& (n-1)$) as all other bits are already un-set in n-1.

Due to this, both 4 and 68 are mapped to same index as the higher order (different) bit from 68(0$1$000100) does not even participate in the modulo calculation using & operation as the highest set bit in 7(00000$1$11) is at a lower order.

This actually becomes the main reason for transforming the given hashCode using h ^ (h >>> 16). It shifts the higher order bits in the hashCode, to the right and spreads the effect to the lower bits using XOR operation so that those actually participate in the index calculation logic and eventually helps in avoiding collisions.

Example

Consider the two numbers and table size as 8:

  • 393219: 1100000000000000011
  • 262147: 1000000000000000011

As we can see, the above mentioned two numbers differ in higher order bits only.

Now $393219 \% 8 == 262147 \% 8 = 3$, i.e. there is a collision as both these map to index 3 (if only the modulo operation is used). But on the other hand, the same calculated using the hash method implementation ((h ^ (h >>> 16)) & (n-1)) maps these to two different indeces:

$$ 393219 \hat{} (393219 \>>> 16) \& 7 = 5 \\
262147 \hat{} (262147 \>>> 16) \& 7 = 7 $$

comments powered by Disqus