Sunday, December 26, 2010

A new look at collisions

Quite some time ago, I investigated collision effects while working on the first version of MMC.

This was my second investigation, the first one being completed while working on the simpler HC (Hash Chain), and its conclusion was : it's not worth increasing the Head Table beyond a certain level, since the improved dispatching ability is offset by decreasing cache efficiency.

MMC made me take a second look, since its natural behavior is meant to reduce, if not eliminate, collisions. I ended up with a diagram which proved me wrong : MMC was merely reducing the effect of collisions, not eliminating them.
I was also struck by the very linear behavior of the collision counter, especially evident on a logarithmic scale. In effect, it proved something simple : doubling the hash size halves the number of collisions, which means the hash formula is about as good as it can be.

Nonetheless, this second study did not change my conclusion, which was that beyond a certain level, cache effects matter more than the gains from reduced collisions.

This was before working on large search window sizes. Indeed, since my earlier implementations were too slow, working on large search windows was simply out of reach. This has all changed with the latest versions of MMC.

Invited by Piotr, I tried MMC with a much larger window size, starting at 16MB.
The algorithm was slow as hell, but something struck me : the collision rate was extremely high, sometimes beyond 80% of all comparisons in the search loop. This is way out of control, and can alone explain the poor performance.

Hence, a third look at this phenomenon.

A "collision" happen when a hashed sequence, of length MINMATCH, get the same hash value as another different sequence. If the hash formula is relatively well distributed, it should happen once in every "hash-size" position. In fact, this is a worst case situation, since compressible data, by definition, have a non-random distribution pattern, meaning that many sequences are either extremely rare if not impossible.
On the other hand, worst case scenario do sometimes happen, typically when compressing an already compressed file, or part of file.
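
To make this concrete, here is a minimal sketch of how a collision can be detected inside the comparison loop. The names (MINMATCH, isCollision) are illustrative, not the actual MMC source :

    #include <stdint.h>
    #include <string.h>

    #define MINMATCH 4   /* assumed minimum match length */

    /* A candidate is compared here because it shares the current
       position's hash. If its first MINMATCH bytes differ, the hash
       lied : this comparison is a collision, i.e. wasted work. */
    static int isCollision(const uint8_t* current, const uint8_t* candidate)
    {
        return memcmp(current, candidate, MINMATCH) != 0;
    }

Counting how often this test fires, relative to the total number of comparisons, gives the collision percentages reported in the table further below.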

With collisions sometimes representing up to 80% of comparisons, quite a lot of effort was definitely being wasted at this stage.

This is the result of keeping the hash table identical across all window sizes, in an effort to keep the numbers "comparable" (thus avoiding attributing to a better MMC method an improvement which would in fact come from a larger hash distribution).

This, however, made me forget that hash size is also part of the performance equation.
A quick solution to this problem is simply to enlarge the initial hash table, which is in charge of dispatching the sequences. With a good hash formula, doubling the hash size should halve the collisions. And it works just as it should.
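
As an illustration, a classic multiplicative hash shows why : keeping the top hashLog bits of the product means each additional bit doubles the number of buckets, and therefore halves the expected collision rate. The Knuth-style constant below is the usual one, not necessarily MMC's exact formula :

    #include <stdint.h>
    #include <string.h>

    /* Illustrative multiplicative hash over the first 4 bytes;
       not necessarily MMC's exact formula. */
    static uint32_t hashSequence(const uint8_t* p, unsigned hashLog)
    {
        uint32_t sequence;
        memcpy(&sequence, p, sizeof(sequence));              /* read 4 bytes */
        return (sequence * 2654435761U) >> (32 - hashLog);   /* keep top hashLog bits */
    }

In the table below, "Hash 128K" corresponds to hashLog=17, and "Hash 8M" to hashLog=23.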

Hence, I came back to a 4MB dictionary to run some tests and measure the effects. Here are the results : the first figure is the percentage of comparisons which are collisions, and the second one is the speed.

Search Window 4M :

File         Calgary          Firefox          Enwik8           OpenOffice
Hash 128K    13% - 8.1 MB/s   60% - 2.7 MB/s    6% - 3.0 MB/s   66% - 2.8 MB/s
Hash 256K     7% - 8.0 MB/s   42% - 3.8 MB/s    3% - 2.9 MB/s   49% - 4.4 MB/s
Hash 512K     4% - 8.0 MB/s   27% - 4.9 MB/s    2% - 2.9 MB/s   33% - 6.2 MB/s
Hash 1M       2% - 8.0 MB/s   16% - 5.8 MB/s    1% - 2.9 MB/s   19% - 7.6 MB/s
Hash 2M       1% - 8.0 MB/s    9% - 6.3 MB/s   .5% - 2.9 MB/s   11% - 8.5 MB/s
Hash 4M      .5% - 7.9 MB/s    5% - 6.7 MB/s   .2% - 2.9 MB/s    6% - 9.2 MB/s
Hash 8M      .3% - 7.7 MB/s    2% - 6.7 MB/s   .1% - 2.9 MB/s    3% - 9.3 MB/s
Now this is pretty clear : the "fixed size" hash table was holding back MMC performance at larger search sizes. With an increased budget, the final speed becomes relatively acceptable.

But there is more to it : as could be expected, files which did not exhibit many collisions (Calgary, Enwik8) gain nothing, and even lose some performance, most probably due to cache effects.

On the other hand, files which suffer heavily from collisions (Firefox, OpenOffice) see major speed gains from the increased hash size, with OpenOffice showing impressive results.

This is in stark contrast with my earlier conclusion, which was probably biased due to :
1) using a small search window, where larger hash sizes have a notably negative performance impact
2) using Enwik8 and Calgary as my main test files. I thought Firefox was an exception, but maybe it was the other way round.

Obviously, different files feature different distributions.

Which leads us to the question : which "Head" size is best ?
I started with a fixed size of 17 bits, but it now seems a better rule is to correlate this size with the window size.
This, in effect, means that the head table is no longer a "small additional table" : its budget now scales with the dictionary size.

As far as I remember, 7zip uses 2x the dictionary size for its head table. Since each head element is 4 bytes (a 32-bit pointer or index), this means the number of elements is half the dictionary size (ex : 2M entries for a 4M window).

Judging by the result in the above table, this looks just right.
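
Translated into code, such a rule could look like the following sketch (hypothetical names, assuming 4-byte head entries) :

    #include <stdint.h>
    #include <stdlib.h>

    /* 7zip-style budget : head table bytes = 2 x window size.
       With 4-byte entries, this is half as many elements as window
       bytes, i.e. headLog = windowLog - 1.
       Ex : windowLog=22 (4M window) -> 2M entries -> 8MB table. */
    static uint32_t* allocHeadTable(unsigned windowLog, unsigned* headLog)
    {
        *headLog = windowLog - 1;
        return (uint32_t*)calloc((size_t)1 << *headLog, sizeof(uint32_t));
    }

Using calloc keeps the table zero-initialized, a natural "no previous position" marker for every bucket.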
