Problem

Clojure’s hashing strategy for numbers, sequences/vectors, sets, and maps mimics Java’s.  In Clojure, however, it is far more common than in Java to use longs, vectors, sets, maps, and compound objects built from them (e.g., a map from vectors of longs to sets) as keys in other hash maps.  It appears that Java’s hash strategy is not well tuned for this kind of usage.  Clojure’s hashes for longs, vectors, sets, and maps each suffer from weaknesses that can multiply together to create a crippling number of collisions.

For example, Paul Butcher wrote a simple Clojure program that produces a set of solutions to a chess problem.  Each solution in the set is itself a set of vectors of the form [piece-keyword [row-int col-int]].  Clojure 1.5.1's hash function hashed about 20 million different solutions to only about 20 thousand distinct hash values, an average of about 1,000 solutions per hash value.  This causes PersistentHashSet and PersistentHashMap to fall back to long linear searches when testing set/map membership or adding new elements/keys.  There is nothing intentionally pathological about these values – they simply happened to expose this behavior in a dramatic way.  Others have come across similarly bad performance without any obvious reason why, and some of those cases are likely due to this same root cause.
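
To see how little it takes to trigger collisions, here are a few small cases that all collide under the 1.5.1 algorithms.  The values shown are what Clojure 1.5.1 produces, and follow directly from the formulas involved:

    ;; Vectors hash like java.util.List:
    ;;   h = 1; for each element e: h = 31*h + hash(e)
    (hash [1 0])   ;=> 992   (31*(31*1 + 1) + 0)
    (hash [0 31])  ;=> 992   (31*(31*1 + 0) + 31)

    ;; Sets hash to the sum of their elements' hashes, so any two
    ;; sets whose element hashes have the same sum collide:
    (hash #{1 2})  ;=> 3
    (hash #{0 3})  ;=> 3

    ;; Maps sum hash(key) ^ hash(val) over their entries, which is
    ;; symmetric in key and value:
    (hash {0 1})   ;=> 1
    (hash {1 0})   ;=> 1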

Proposed solutions

Mark Engelberg's document about Clojure's hash function, its behavior, and potential improvements is here:

    https://docs.google.com/document/d/10olJzjNHXn1Si1LsSvjNqF_X1O7TPZLuv3Uo_duY5o

A summary of his proposed hash function modifications is:

  • Change the hash of integers that fit within a long to the return value of longHashMunge (see the Longs section of the doc for more details).
  • Change the current multiplier of 31 used for vectors, sequences, and queues to a different constant such as -1640531535 or 524287 (see the Vectors section).  This also applies to Cons and LazySeq.
  • For sets, instead of adding together the hash value of each element, add together the return value of a function xorShift32 applied to the hash value of each element (see the Sets section; a sketch of xorShift32 follows this list).
  • For maps and records, instead of adding together hash(key) ^ hash(val) for each key/value pair, add together hash(key) ^ xorShift32(hash(val)) (see the Maps section).
  • No change for any other types, e.g. strings, keywords, symbols, floats, doubles, BigInts outside of long range, BigDecimal, etc.
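
For reference, xorShift32 is a single xorshift mixing round over a 32-bit hash value.  The shift triple (13, 17, 5) used below is Marsaglia's classic choice and is an assumption for illustration; see Mark's document for the exact constants proposed.  A minimal Clojure sketch, emulating 32-bit arithmetic with masked longs:

    (defn xor-shift-32
      "One xorshift mixing round over a 32-bit hash value.
      Shift triple (13, 17, 5) assumed for illustration."
      [h]
      (let [m 0xffffffff                                ; work in the low 32 bits
            h (bit-and h m)
            h (bit-and (bit-xor h (bit-shift-left h 13)) m)
            h (bit-xor h (bit-shift-right h 17))        ; h >= 0 here, so this shift is logical
            h (bit-and (bit-xor h (bit-shift-left h 5)) m)]
        (unchecked-int h)))

    ;; The proposed set hash then sums the rehashed element hashes
    ;; instead of the raw ones:
    (defn proposed-set-hasheq
      [s]
      (unchecked-int (reduce + (map (comp xor-shift-32 hash) s))))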

Below is a link to a modified version of Paul Butcher's N-queens solver, with extra code for printing stats with several different hash functions.  The README has instructions for retrieving and installing locally a version of Clojure modified with one of Mark's proposed alternate hash functions.  After that is a link to a patch that implements the proposal above.

    https://github.com/jafingerhut/chess-clojure

    http://dev.clojure.org/download/attachments/8192765/better-hashing-2013-11-18.diff

Here is a summary of results: elapsed times for several programs, and how spread out the hash values are.  Below the results are a few details on how these measurements were made.  All times are elapsed times.  "meas" means raw measurements of elapsed time, listed in ascending order so you can easily see the min and max; "avg" is the average of multiple elapsed times.

Each benchmark problem below is followed by results for the five hash implementations compared, labeled as follows:

  • [alpha3]: the Clojure 1.6.0-alpha3 hash (same as Clojure 1.5.1)
  • [Jan 29]: Clojure 1.6.0 just after this commit on Jan 29 2014
  • [Jan 30]: Clojure 1.6.0 just after this commit on Jan 30 2014
  • [proposal]: Mark Engelberg's 2013-11-18 proposed hash
  • [Murmur3]: the Murmur3 hash

Paul Butcher's N-queens problem with a 6x6 board.  180,568 solutions hash to N distinct hash values, an average of A solutions per hash value (max M).

  • [alpha3]: 6.7 min (~33 times slower than with the 2013-11-18 proposed hash); N=301, A=599.9 (M=2,492)
  • [Jan 29]: 12.8 sec; N=180,566, A=1.0 (M=2)
  • [Jan 30]: 12.5 sec; N=180,561, A=1.0 (M=2)
  • [proposal]: 12.2 sec; N=180,566, A=1.0 (M=2)
  • [Murmur3]: 11.8 sec; N=180,567, A=1.0 (M=2)

Same problem with a 6x9 board.  20,136,752 solutions hash to N distinct hash values, an average of A solutions per hash value (max M).

  • [alpha3]: > 8 hours (did not wait for it to finish); N=17,936, A=1,122.7 (M=81,610)
  • [Jan 29]: 12.8 min; N=20,089,842, A=1.0 (M=3)
  • [Jan 30]: 11.9 min; N=20,089,436, A=1.0 (M=3)
  • [proposal]: 11.4 min; N=20,089,766, A=1.0 (M=4)
  • [Murmur3]: 12.0 min; N=20,089,390, A=1.00 (M=3)

Hash all ints in the range -1,000,000 to 1,000,000 inclusive.  2,000,001 values hash to N distinct hash values, an average of A integers per hash value (max M).  (Only three implementations have recorded results for this row and the four that follow.)

  • [alpha3]: N=1,000,001, A=2.0 (M=2)
  • [Jan 29]: N=1,999,801, A=1.0 (M=2)
  • [Jan 30]: N=1,999,801, A=1.0 (M=2)

Hash all 2-element vectors with the first element in the range [0,999] and the second element in the same range.  1,000,000 values hash to N distinct hash values, an average of A values per hash value (max M).

  • [alpha3]: N=31,969, A=31.3 (M=33)
  • [Jan 29]: N=999,846, A=1.0 (M=2)
  • [Jan 30]: N=999,900, A=1.0 (M=2)

Hash all 2-element vectors with the first element in the range [-500,499] and the second element in the same range.  1,000,000 values hash to N distinct hash values, an average of A values per hash value (max M).

  • [alpha3]: N=15,969, A=62.6 (M=68)
  • [Jan 29]: N=999,860, A=1.0 (M=2)
  • [Jan 30]: N=999,883, A=1.0 (M=2)

All sets that are subsets of #{0 1 2 3 ... 19}.  1,048,576 sets hash to N distinct hash values, an average of A sets per hash value (max M).

  • [alpha3]: N=191, A=5,489.9 (M=16,440)
  • [Jan 29]: N=1,048,451, A=1.0 (M=2)
  • [Jan 30]: N=1,048,208, A=1.0 (M=2)

All maps {i j} where i is in the range [0,999] and j is in the same range.  1,000,000 maps hash to N distinct hash values, an average of A maps per hash value (max M).

  • [alpha3]: N=1,024, A=976.6 (M=1,000)
  • [Jan 29]: N=999,728, A=1.0 (M=2)
  • [Jan 30]: N=999,900, A=1.0 (M=2)

Compile Clojure 1.6.0-alpha3 source with "time ant jar", so no tests run.

  • [alpha3]: avg 20.1 sec; meas: 19.6, 19.8, 19.8, 20.2, 20.9
  • [Jan 29]: avg 20.6 sec (2.5% longer); meas: 19.8, 20.4, 20.7, 20.8, 21.3
  • [Jan 30]: avg 20.6 sec (2.5% longer); meas: 20.3, 20.5, 20.5, 20.8, 20.9
  • [proposal]: avg 20.0 sec; meas: 19.6, 19.7, 19.9, 20.2, 20.7
  • [Murmur3]: avg 21.4 sec (6.5% longer); meas: 21.3, 21.4, 21.4, 21.4, 21.6

Compile Clojure 1.6.0-alpha3 source with "time ant", which includes running tests, but with generative test duration reduced to 1.0 sec.

  • [alpha3]: avg 48.0 sec; meas: 47.3, 47.4, 47.6, 48.3, 49.5.  120,353 unique values hash to 113,405 distinct hash values, an average of 1.06 values per hash value.
  • [Jan 29]: avg 47.7 sec; meas: 47.2, 47.4, 47.6, 47.9, 48.2
  • [Jan 30]: avg 47.3 sec; meas: 46.8, 47.2, 47.4, 47.5, 47.8
  • [proposal]: avg 48.0 sec; meas: 47.1, 47.6, 47.7, 48.2, 49.5.  119,811 unique values hash to 114,329 distinct hash values, an average of 1.05 values per hash value.
  • [Murmur3]: avg 48.7 sec (1.5% longer); meas: 47.8, 48.5, 48.9, 49.1, 49.3

Calc hashes of all integers in (range 1000000000) and return the sum, using:

    (time (hash-range-n 1000000000))

(A plausible definition of hash-range-n is sketched after the measurement notes below.)

  • [alpha3]: avg 29.8 sec; meas: 29.6, 29.6, 29.9, 30.0, 30.1
  • [Jan 29]: avg 47.9 sec (60.7% longer); meas: 47.9, 47.9, 47.9, 47.9, 48.0
  • [Jan 30]: avg 44.3 sec (48.7% longer); meas: 43.2, 43.3, 45.0, 45.0, 45.0
  • [proposal]: avg 38.7 sec (30% longer); meas: 38.6, 38.6, 38.7, 38.7, 38.7.  Verified that the hash values of the first 500 million integers (those in (range 500000000)) are all different.
  • [Murmur3]: avg 30.2 sec (1.3% longer); meas: 29.9, 30.2, 30.2, 30.2, 30.2

Calc hashes of the 30,001 vectors [] [0] [0 1], etc. up to [0 1 ... 29999] and return the sum, using:

    (let [cs (doall (reductions conj [] (range 30000)))]
      (time (total-hash cs)))

(A plausible definition of total-hash is sketched after the measurement notes below.)

  • [alpha3]: avg 10.7 sec; meas: 10.6, 10.6, 10.7, 10.7, 10.7.  30,000 unique hash values, avg 1.00, max 2.
  • [Jan 29]: avg 16.0 sec (49.5% longer); meas: 15.9, 16.0, 16.1, 16.1, 16.1.  All 30,001 hash values unique.
  • [Jan 30]: avg 14.3 sec (33.6% longer); meas: 14.1, 14.3, 14.3, 14.3, 14.3.  All 30,001 hash values unique.
  • [proposal]: avg 10.8 sec (pretty much the same); meas: 10.5, 10.7, 10.7, 10.7, 11.6.  30,000 unique hash values, avg 1.00, max 2.
  • [Murmur3]: avg 11.4 sec (6.5% longer); meas: 11.3, 11.3, 11.4, 11.4, 11.4.  All 30,001 hash values unique.

Calc hashes of the 30,001 sets #{} #{0} #{0 1}, etc. up to #{0 1 ... 29999} and return the sum, using:

    (let [cs (doall (reductions conj #{} (range 30000)))]
      (time (total-hash cs)))

  • [alpha3]: avg 70.2 sec; meas: 66.4, 69.1, 69.4, 71.3, 74.8.  30,000 unique hash values, avg 1.00, max 2.
  • [Jan 29]: avg 86.3 sec (22.9% longer); meas: 83.3, 84.1, 86.3, 88.2, 89.5.  All 30,001 hash values unique.
  • [Jan 30]: avg 81.8 sec (16.5% longer); meas: 78.8, 79.1, 80.1, 81.2, 89.6.  All 30,001 hash values unique.
  • [proposal]: avg 71.4 sec (1.7% longer); meas: 70.9, 71.0, 71.2, 72.0, 72.1.  29,308 unique hash values, avg 1.02, max 2.
  • [Murmur3]: avg 71.6 sec (2.0% longer); meas: 67.9, 72.0, 72.5, 72.7, 73.1.  All 30,001 hash values unique.

Calc hashes of the 30,001 maps {} {0 1} {0 1, 1 2}, etc. up to {0 1, 1 2, ..., 29999 30000} and return the sum, using:

    (let [cs (doall (reductions (fn [m i] (assoc m i (inc i))) {} (range 30000)))]
      (time (total-hash cs)))

  • [alpha3]: avg 71.7 sec; meas: 68.4, 69.0, 69.1, 75.7, 76.3.  All 30,001 hash values unique.
  • [Jan 29]: avg 163.4 sec (127.9% longer); meas: 159.2, 159.2, 164.3, 166.5, 167.5.  All 30,001 hash values unique.
  • [Jan 30]: avg 124.6 sec (73.8% longer); meas: 115.7, 125.5, 126.6, 127.4, 127.8.  All 30,001 hash values unique.
  • [proposal]: avg 78.5 sec (9.5% longer); meas: 74.1, 75.0, 77.0, 81.2, 85.4.  All 30,001 hash values unique.
  • [Murmur3]: avg 89.3 sec (24.5% longer); meas: 83.9, 87.8, 89.8, 91.2, 93.7.  All 30,001 hash values unique.

Notes on measurements:

Only the N-queens measurements used Leiningen.  The compilation measurements used the ant commands shown.  The rest were measured by evaluating the expressions shown after starting a JVM with the command line:

    java -cp clojure.jar clojure.main

with the version of clojure.jar corresponding to the labeled hash implementation.  Five measurements were taken for each; the raw measurements and their average are given.  Each individual run is intended to be long enough to avoid any concerns about misleading measurements from microbenchmarks.

Benchmark environment: hardware is a Mac Pro 2,1 with two 3 GHz quad-core Intel Xeons and 32 GB RAM.  OS is Mac OS X 10.6.8.  JVM is the Apple/Oracle-supplied version 1.6.0_65, 64-bit Server VM.
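
The definitions of hash-range-n and total-hash referenced above were linked rather than inlined, and the links are not preserved here.  The following minimal versions are plausible reconstructions, assumed for illustration: each simply sums hash values, which matches the "return the sum" descriptions and keeps the JIT from discarding the hash computations.

    ;; Plausible reconstruction; the original definition was linked, not shown.
    (defn hash-range-n
      "Sum of the hashes of the integers 0 through n-1."
      [n]
      (reduce + (map hash (range n))))

    ;; Plausible reconstruction; the original definition was linked, not shown.
    (defn total-hash
      "Sum of the hashes of each collection in cs."
      [cs]
      (reduce + (map hash cs)))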

Open questions

Possible small changes to the proposal for consideration:

  • Add in some magic constant to the hash of all sets and maps (a different constant for sets than for maps), so that the empty set and empty map have non-0 hash values, and thus nested data structures like #{{}} and {#{} #{}} will hash to non-0 values that are different from each other.
  • Consider doing something besides the current hash(numerator) ^ hash(denominator) for Ratio values.  Perhaps hash(numerator) + xorShift32(hash(denominator))?  That would avoid the hash of 3/7 and 7/3 being the same, and also avoid the current behavior that hash(3/7) "undoes" the longHashMunge proposed for both the numerator and denominator long values.  (A sketch follows below.)
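
A minimal sketch of that Ratio alternative, reusing the illustrative xor-shift-32 from the proposal summary above (the name ratio-hasheq is hypothetical):

    (defn ratio-hasheq
      "Asymmetric Ratio hash: hash(numerator) + xorShift32(hash(denominator)).
      Hypothetical illustration of the open question above."
      [r]
      (unchecked-add-int (hash (numerator r))
                         (xor-shift-32 (hash (denominator r)))))

    ;; With this scheme 3/7 and 7/3 no longer hash alike, because the
    ;; denominator's hash is mixed before the combination.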

Tradeoffs and alternatives

These are discussed throughout Mark's document.  A few of them are called out below.

  • Nearly all of the proposals involve additional operations on ints or longs.  This is expected to require little additional elapsed time in most hash computations, given that the most significant cost in hashing collections is usually traversing the parts of the data structure, especially if that involves cache misses.  Measurements are given above for one proposed set of hash function changes.
  • Murmur3 is widely used, but does not lend itself well to incremental updates of hash calculations of collections.

A follow-up writeup

I (Alex Miller) have spent a bunch of time looking at this.  It probably bears some discussion before updating this page, so I will simply append this new section for now.

https://docs.google.com/a/thinkrelevance.com/document/d/1DT2uXlAwH5NstgYSeqbOXb_8K6Gwnw99QksCaUKpCjU/edit

For me, the key was to restate the problem as: "Slightly different collections frequently produce identical hash codes."  The primary case where this is an issue is sets that contain collections of similar values, especially numbers.

A summary of my conclusions:

  • DO NOT change long hashing.  Not worth the effort or the differences from Java.
  • DO change the vector hashcode algorithm, either by changing the multiplier constant, rehashing the items, or both.  (Currently I lean towards only rehashing the items.)
  • DO change the set hashcode algorithm to rehash the items before summing.
  • DO change the map hashcode algorithm to remove the symmetric inclusion of keys and values.  My preferred change would be to treat each key/value pair as a 2-element vector and re-use whatever algorithm we use for that.
  • DO NOT use xorShift32 for the rehash – its reliance on xor can easily cause collisions when summing hashcodes in sets.
  • DO use Murmur3 as the rehash algorithm – it's fast, extremely well tested, commonly used, and has great avalanche and bit-dispersion properties.  Note, this is not for hashing the entire collection, but for rehashing individual element hash codes (a sketch of this follows the list).
  • DO NOT change the hashCode() of collections; change only the hasheq() algorithms.
  • CONSIDER rehashing keys in PersistentHashMap.hash() for storage and retrieval – this would yield better hash keys for all users of persistent hash maps and sets, not just for numbers, but also for keywords, strings, symbols, and whatever else you put in them.  This could yield shallower trees and better average performance on hash lookup for large trees.  Proving this would likely require instrumentation inside PHM and/or developing some more intensive performance studies.
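
For concreteness, "rehashing individual element hash codes" with Murmur3 could use its standard 32-bit finalization mix (fmix32), shown below as a Clojure sketch that emulates 32-bit ints with masked longs.  This illustrates the idea, not the committed implementation:

    (defn murmur3-fmix32
      "Murmur3's 32-bit finalization mix (fmix32), applied to an
      element's hash code before it is combined into a collection hash."
      [h]
      (let [m 0xffffffff
            h (bit-and h m)                             ; work in the low 32 bits
            h (bit-xor h (bit-shift-right h 16))
            h (bit-and (unchecked-multiply h 0x85ebca6b) m)
            h (bit-xor h (bit-shift-right h 13))
            h (bit-and (unchecked-multiply h 0xc2b2ae35) m)
            h (bit-xor h (bit-shift-right h 16))]
        (unchecked-int h)))

Unlike the pure xor/shift structure of xorShift32, fmix32 mixes with multiplications as well, which is why it holds up better under the summing used by the set and map hashes.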

Tracking updates to other libraries for Clojure 1.6 hash changes

Clojure 1.6.0-beta1 is out as of this writing, and the hash method has been decided upon.  Now comes updating Clojure contrib libraries and third-party Clojure libraries that implement their own customized collections, so that custom sets and maps (at least those intended to be clojure.core/= to built-in Clojure sets, vectors, or maps) hash consistently with the built-ins under Clojure 1.6.

The list below gives some places that could use updating, and the status of changes.

Notes: In Clojure 1.5.1 and 1.4.0 (and maybe also 1.3.0?), (.hashCode x) and (hash x) were calculated similarly, and had the same value for many values of x, including many sets, maps, and vectors.  However, they were not always the same, because (.hashCode x) uses .hashCode to hash each set/vector element or map key/value pair, whereas (hash x) uses hash to do so, and .hashCode and hash are intentionally different for Integers, Shorts, Bytes, and some other types.
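
For example, under Clojure 1.5.1 (the values follow from Integer.hashCode versus hashing the same value as a long):

    ;; Integer.hashCode returns the int itself:
    (.hashCode (int -1))   ;=> -1
    ;; but (hash x) hashes the value as a long, folding in the high
    ;; 32 bits: (int)(-1 ^ (-1 >>> 32)) = 0
    (hash (int -1))        ;=> 0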

In Clojure 1.6.0, (.hashCode x) should remain consistent with Java's .hashCode for all sets, maps, and vectors, whereas (hash x) will almost always differ from (.hashCode x).  Unit tests should include special cases comparing the hash of your custom collection against Clojure's corresponding built-in collection, including the empty set, vector, and/or map, because those no longer hash to the simple 0/1 values they did before.  Also test the result of (empty my-custom-non-empty-collection).  Consider using the collection-check library.
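
A minimal clojure.test sketch of that advice, assuming a hypothetical custom set constructor my-sorted-set whose collections are meant to be clojure.core/= to built-in sets:

    (ns myns.hash-test
      (:require [clojure.test :refer [deftest is]]))

    (declare my-sorted-set)   ; hypothetical constructor for a custom set type

    (deftest custom-set-hash-consistency
      ;; collections that are clojure.core/= must have equal (hash ...) values
      (is (= (hash #{}) (hash (my-sorted-set))))
      (is (= (hash #{1 2 3}) (hash (my-sorted-set 1 2 3))))
      ;; (empty coll) should hash like the built-in empty set too
      (is (= (hash #{}) (hash (empty (my-sorted-set 1 2 3))))))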

Project/Library: what needs updating, and status.

  • Tests in Clojure itself: Some tests were disabled when the hash changes were first made, with this commit.
    Status: DONE (except one test still disabled).  Most of them were updated to be independent of Clojure's hash function with this commit, but the following line at or near line 311 in file test/clojure/test_clojure/java_interop.clj is still commented out:

        ;;(test-to-passed-array-for hash-set)

  • core.cache: Some tests failed due to hash-dependent sequence orders.
    Status: DONE.  Patch for ticket CCACHE-33 committed.

  • core.match: Some tests failed due to hash-dependent sequence orders.
    Status: DONE.  Fix for ticket MATCH-94 committed.

  • core.matrix: Mike Anderson says it should be ready for Clojure 1.6 with no updates needed.
    Status: DONE.  No changes needed.

  • core.rrb-vector: Hash calculation needs updating.
    Status: DONE.  Fix for ticket CRRBV-3 committed.

  • data.avl: Hash calculation needs updating.
    Status: DONE.  Fix for ticket DAVL-3 committed in commits 1, 2, and 3.

  • data.finger-tree: Hash calculation for finger trees needs updating for 1.6.
    Status: DONE.  Fix for ticket DFINGER-2 committed.

  • immutable-bitset: Hash calculation for bitsets needs updating for 1.6.
    Status: DONE.  Fixed with this commit.  Note that this data structure is a special case: it can only contain Long values, so in Clojure 1.5.1 and earlier it was guaranteed that (.hashCode immutable-bitset) and (hash immutable-bitset) were equal for all immutable-bitsets.  This is not true in general for other collections, because (.hashCode x) and (hash x) are in general different for Integers, Shorts, Bytes, etc.

  • instaparse: Mark Engelberg says this relies on incremental hashing, and will need some rethinking on how to update.
    Status: Some changes have been made on branch "1.6" in instaparse's GitHub repository and tested for correctness and performance.  Progress reports here, here.

  • ordered: Namespaces flatland.ordered.map and flatland.ordered.set appear to need updates.

  • potemkin: A map implementation needs updating.
    Status: DONE.  Fixed with this commit.

  • tools.namespace: Some tests failed due to hash-dependent sequence orders.
    Status: DONE.  Ticket TNS-16 created, fix committed.

  • tools.nrepl: Some tests failed due to hash-dependent sequence orders.
    Status: DONE.  Fixed with this commit.

  • useful: flatland.useful.deftype/defmap probably needs updating.  The namespace flatland.useful.map appears not to need any updates, as it uses Clojure maps but does not customize them at a low level that overrides the hash calculation.