Xxhash: Add Meow hash in the benchmarks

Created on 21 Oct 2018  ·  8 Comments  ·  Source: Cyan4973/xxHash

All 8 comments

meow_hash entirely depends on the presence of hardware-accelerated AES instructions.

I suspect it won't even compile in the absence of these instructions, as the code is pretty short and I don't see any software fallback path. Besides, all the AES instructions I could find are invoked directly through Intel intrinsics, so it's not portable across architectures, irrespective of hardware capabilities.

It's not necessarily a "bad" thing: after all, they get great speed on long inputs in return.

But the main point is: by relying on such a specialized hardware instruction set, portability is off the feature list.

As a side effect, it would not even run on the platform used for xxHash benchmarks.

Maybe I should spend some time documenting this.

This is no longer true; you should revisit this.

Yeah, reading the announcements, meow has improved a lot in a few revisions.
It even announces compatibility with ARM.

I'll need some time to get a deeper look into it.

I just took a cursory look, and couldn't find a software fallback path. Maybe it's there and I just missed it.
Also, I don't know yet whether different AES hardware modules are guaranteed to produce exactly the same hash value, or whether there is more to it (such as additional parameters that may be preset differently and require machine-specific instructions to make uniform). It's one thing to compute and consume a hash locally, in which case the exact representation does not matter. It's another to write/serialize the value and expect it to be read/reproduced exactly on any other system. That part of interoperability is a bit more tricky.
But once again, maybe it's there, and I just need some time to look into it.
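
One small piece of the serialization concern above can be sketched in code: if a hash value is written out, the byte order has to be fixed rather than left to the host CPU. The snippet below is only an illustration under that assumption. The helper names `to_canonical` and `from_canonical` are hypothetical, and the value is a placeholder, not a real hash. It does not address whether different AES units compute the same value in the first place.

```python
import struct

def to_canonical(hash64: int) -> bytes:
    """Serialize a 64-bit hash as 8 bytes, most significant byte first."""
    return struct.pack(">Q", hash64)

def from_canonical(data: bytes) -> int:
    """Read back a 64-bit hash written by to_canonical()."""
    return struct.unpack(">Q", data)[0]

h = 0x0123456789ABCDEF          # placeholder value, not a real hash
wire = to_canonical(h)          # same bytes on little- and big-endian hosts
assert from_canonical(wire) == h
print(wire.hex())               # 0123456789abcdef
```

Fixing the byte order explicitly means the stored bytes no longer depend on the endianness of the machine that wrote them.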

For reference, this is against my latest version, on a 2.0 GHz 2nd-gen Core i7 (Sandy Bridge):

Clang 7.0.1
Flags: -O3 -march=native

./xxhsum 0.6.5 (64-bits little endian), by Yann Collet
Sample of 100 KB...
XXH32               :     102400 ->    54765 it/s ( 5348.2 MB/s)
XXH32 unaligned     :     102400 ->    53782 it/s ( 5252.1 MB/s)
XXH64               :     102400 ->   104882 it/s (10242.4 MB/s)
XXH64 unaligned     :     102400 ->   105935 it/s (10345.2 MB/s)
XXH32a              :     102400 ->    78027 it/s ( 7619.8 MB/s)
XXH32a unaligned    :     102400 ->    75624 it/s ( 7385.2 MB/s)
XXH64a              :     102400 ->    77204 it/s ( 7539.5 MB/s)
XXH64a unaligned    :     102400 ->    76209 it/s ( 7442.3 MB/s)
XXH auto            :     102400 ->   105179 it/s (10271.4 MB/s)
XXH auto unaligned  :     102400 ->   101546 it/s ( 9916.6 MB/s)
XXH32 auto          :     102400 ->   107646 it/s (10512.3 MB/s)
XXH32 auto unaligne :     102400 ->   104364 it/s (10191.8 MB/s)
XXH64 auto          :     102400 ->   107443 it/s (10492.4 MB/s)
XXH64 auto unaligne :     102400 ->   105843 it/s (10336.3 MB/s)
meow auto           :     102400 ->   214435 it/s (20940.9 MB/s)
meow auto unaligned :     102400 ->   213289 it/s (20829.0 MB/s)

With -march=penryn -mno-sse4.2 -mno-avx and the C implementation:

./xxhsum 0.6.5 (64-bits little endian), by Yann Collet
Sample of 100 KB...
XXH32               :     102400 ->    54106 it/s ( 5283.8 MB/s)
XXH32 unaligned     :     102400 ->    53303 it/s ( 5205.4 MB/s)
XXH64               :     102400 ->   107245 it/s (10473.1 MB/s)
XXH64 unaligned     :     102400 ->   107774 it/s (10524.8 MB/s)
XXH32a              :     102400 ->    78394 it/s ( 7655.6 MB/s)
XXH32a unaligned    :     102400 ->    79666 it/s ( 7779.9 MB/s)
XXH64a              :     102400 ->    78907 it/s ( 7705.7 MB/s)
XXH64a unaligned    :     102400 ->    78179 it/s ( 7634.6 MB/s)
XXH auto            :     102400 ->   108878 it/s (10632.6 MB/s)
XXH auto unaligned  :     102400 ->   105124 it/s (10266.0 MB/s)
XXH32 auto          :     102400 ->   105819 it/s (10333.9 MB/s)
XXH32 auto unaligne :     102400 ->   103970 it/s (10153.3 MB/s)
XXH64 auto          :     102400 ->   110021 it/s (10744.2 MB/s)
XXH64 auto unaligne :     102400 ->   107109 it/s (10459.9 MB/s)
meow auto           :     102400 ->    15962 it/s ( 1558.8 MB/s)
meow auto unaligned :     102400 ->    16022 it/s ( 1564.6 MB/s)
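
To make the gap between the two runs explicit, here is a small sketch that recomputes the MB/s column from the it/s column and derives the slowdown of the C path. It assumes "MB" means 2**20 bytes, which appears consistent with the figures printed above; the function name `throughput_mb_s` is made up for this example.

```python
MIB = 1024 * 1024  # assumption: the MB/s column uses 2**20 bytes per MB

def throughput_mb_s(sample_bytes: int, iterations_per_s: float) -> float:
    """Convert an 'it/s' figure into MB/s for a fixed sample size."""
    return sample_bytes * iterations_per_s / MIB

SAMPLE = 102400  # 100 KB sample, as reported in the output above

meow_aes = throughput_mb_s(SAMPLE, 214435)  # "meow auto", -march=native run
meow_c = throughput_mb_s(SAMPLE, 15962)     # "meow auto", C implementation run
slowdown = meow_aes / meow_c

print(f"meow (AES path): {meow_aes:.1f} MB/s")
print(f"meow (C path):   {meow_c:.1f} MB/s")
print(f"fallback slowdown: {slowdown:.1f}x")
```

On these numbers the C implementation is roughly 13x slower than the hardware AES path, while the XXH rows are essentially unchanged between the two runs.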

> This is no longer true you should revisit this

...aaand in Meow 0.5 it's true again; in the latest version they again only have x64 AES code path :)

So yeah, Meow Hash is kinda specific to x86_64 and Nehalem+, or it has to switch to a slow af fallback version.

Meanwhile, XXH3 has vectorized code paths for all x86_64 (including Core 2 and friends), Pentium 4+ (which is required for Windows 7+), ARMv7-A w/ NEON (available on most Androids and all iOS devices except the original iPhone), ARM64 (all recent iPhones and Androids), and VSX on POWER9 (many servers and supercomputers have these, but it is a niche market). And if that wasn't enough, the scalar version is still very fast even on 32-bit targets, provided they have a decent multiplier.

I'm starting to gather some more benchmark results, as there were several requests of this kind over the years.
Starting with this one, which has been a recurrent request.

meowHash is very fast indeed, scoring approximately the same speed as XXH3 for large data (~100 KB), and hence a bit faster than XXH128.
https://github.com/Cyan4973/xxHash/wiki/Performance-comparison

That being said, meowHash is really designed for large data. When it comes to _small_ data, performance is not comparable, as the algorithm incurs a large fixed cost per call, which is poorly amortized on small inputs.
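
The amortization argument can be sketched with a toy cost model: time per call = fixed cost + length / peak bandwidth. The two parameter sets below are hypothetical and are NOT measurements of meowHash or XXH3; they only show how a higher peak speed can still lose on short inputs.

```python
def effective_throughput(n_bytes: int, fixed_cost_ns: float,
                         peak_bytes_per_ns: float) -> float:
    """Bytes per nanosecond achieved when every call pays a fixed cost."""
    total_ns = fixed_cost_ns + n_bytes / peak_bytes_per_ns
    return n_bytes / total_ns

# Hypothetical hashes, invented for illustration only.
HEAVY = dict(fixed_cost_ns=60.0, peak_bytes_per_ns=20.0)  # fast peak, costly setup
LIGHT = dict(fixed_cost_ns=5.0, peak_bytes_per_ns=10.0)   # slower peak, cheap setup

for n in (8, 64, 1024, 102400):
    heavy = effective_throughput(n, **HEAVY)
    light = effective_throughput(n, **LIGHT)
    winner = "heavy" if heavy > light else "light"
    print(f"{n:>7} bytes: heavy={heavy:6.2f} B/ns  light={light:6.2f} B/ns  -> {winner}")
```

With these made-up parameters the crossover sits around 1100 bytes: below it the cheap-setup hash wins, above it the high-peak hash wins.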

A worrying development is that there is a discrepancy in "how to represent and interpret results".
For large data, it's rather simple: a sorted table can be enough, giving an instant idea of the ranking.
For small data though, it seems more complex.
So far, I've used graphs, providing a speed result per different length and scenario. By virtue of featuring _more_ information, it's necessarily more difficult to interpret.

As a consequence, it's unclear whether a reader, after finding the first performance figure in the prominent top table, will bother looking below for the additional graphs. That first figure may then guide their choice while they miss information that is important for their use case.

Another issue is about the graphs themselves. A sorted table can be extended, relatively easily. There are limits, sure, but generally speaking, adding a contender adds just a line.
However, in a graph, each contender is represented by a line. Lines tend to overlap, there are only so many colors to pick from, etc. Very quickly, if the number of represented contenders is large, it's just a big mess.
So, if small data performance needs a graph, that severely limits the number of contenders that can be represented.

Adding a paragraph containing "raw" benchmark data, while commendable, is a poor substitute.

Therefore, I was considering "summarizing" the small data test outcome by creating a single "performance number" that can be used in a ranked table, to give at least a hint of general performance on small data. Interested readers would still have to look further down for more accurate information (graphs), but at least the notion that some algorithms are more suitable for small data than others can be conveyed in a relatively simple manner.
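
As one possible way to build such a number (the thread does not say which aggregation would be used), here is a sketch that collapses per-length results into a geometric mean and sorts on it. The hash names and figures are hypothetical placeholders, and the choice of geometric mean is an assumption, not the method adopted for the wiki.

```python
from statistics import geometric_mean

# Hypothetical small-input results: millions of hashes per second,
# keyed by input length in bytes. Not real measurements.
results = {
    "hash_a": {4: 150.0, 16: 140.0, 64: 90.0, 256: 40.0},
    "hash_b": {4: 60.0, 16: 60.0, 64: 55.0, 256: 45.0},
    "hash_c": {4: 200.0, 16: 120.0, 64: 50.0, 256: 15.0},
}

def small_data_score(per_length: dict) -> float:
    """Collapse per-length speeds into one number (geometric mean)."""
    return geometric_mean(list(per_length.values()))

# Highest score first, ready to drop into a sorted table.
ranking = sorted(results, key=lambda name: small_data_score(results[name]),
                 reverse=True)

for name in ranking:
    print(f"{name}: {small_data_score(results[name]):.1f} M hashes/s")
```

A geometric mean keeps one very fast length from dominating the score, but any single number hides the shape of the curve, which is why the graphs would still be needed.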

There is still the issue that graphs can only represent a limited number of contenders, and the one that a reader wants to observe may not be part of the selected list (even when raw data is available).

I was wondering if there would be a better way to represent graphs. Something more dynamic than a prepared screenshot, allowing a user to select which contenders should be visible.

Meow hash has been added to the comparison:
https://github.com/Cyan4973/xxHash/wiki/Performance-comparison

Adding an entry to the table is fine.
But I still need a better way to draw graphs for a large number of candidates.
