xxHash: Combining hashes of strings

Created on 7 Oct 2017  ·  6 Comments  ·  Source: Cyan4973/xxHash

I'm on 64-bit PowerPC and would like a 32-bit hash result over a sequence of non-contiguous strings.

Is there any loss in hash quality if I hash a sequence of non-contiguous strings using XXH64 and simply pass the result of each hash as the seed of the next XXH64 call? Also, I would only take the lower 32 bits of the final result as my single 32-bit hash value representing the sequence of strings.

Hashes that are expected to be equal will always be computed over the exact same sequence of strings. In other words, I have no need for the final hash of the sequence "STRING1", "STRING2" to be the same as the final hash of "STRIN", "G1STRING2".

My current code uses CRC32 and does the above (passing the intermediate result into the next string's CRC as a seed).

Thanks.

Is there any loss in hash quality if I am hashing a sequence of non contiguous strings using XXH64 and simply passing the result of each hash as the seed of the next XXH64 call?

There's a huge loss.

Please take a look inside XXH64_state_t, XXH64_update_endian and XXH64_digest_endian.
You can see that XXH64 reads and writes the XXH64_state_t member variables v1, v2, v3, v4 and total_len to compute the hash value. These members are of type unsigned long long (64-bit unsigned integers).

We can say:
(1) The total size of XXH64_state_t is larger than 64 bits.
(2) Once the state is reduced (digested) to a single 64-bit value, a large part of that information is lost.

Therefore, chaining digests will not give you the quality you expect. Please consider using the streaming functions.

As suggested by @t-mat, the streaming functions were indeed designed exactly for this scenario.

I assume you are concerned about speed.
I don't think you would gain anything by passing the result from one hash to the next.
Maintaining a streaming context does have a cost, but so does finalising a hash. I expect that finalising a hash after every few bytes would actually cost more than maintaining a streaming context, where finalisation is performed only once, when XXH64_digest() is invoked.

My goal is to provide a 'drop in' replacement for CRC32 to improve speed, with minimal code change at the many calling sites. The streaming approach is technically not applicable, because the hash of a stream broken up into 10-byte chunks will be the same as the hash of the exact same stream broken into 100-byte chunks. The 'structure' of the discontiguous strings is relevant and should contribute to the hash result.

The input is a string which has 'indirect' references to other strings embedded within it at arbitrary locations (only one level deep). I need to hash the indirect strings and the subsets of the base string (not including the pointers to the indirect strings).

I saw that XXH64 is faster on 64-bit hardware and will perform better than XXH32, so I'm taking 32 bits of the result as the hash.

I'm content with a 32-bit level of hash quality, so it seemed reasonable to use the 32-bit hash of each string as the seed when incorporating the next string.

If your existing program structure allows it, prefer passing the full 64 bits between two consecutive hashes, and perform the 32-bit extraction only at the end. This will maximize hash quality.
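The advice above could be sketched roughly as follows: carry the whole 64-bit result into the next call as its seed, and truncate to 32 bits only once. To keep the sketch compilable without the library, it takes the hash as a function pointer and ships a hypothetical stand-in, toy_hash64, which is NOT XXH64 and is not meant for production. XXH64(input, length, seed) has the same shape and could be passed instead, possibly through a thin wrapper if the integer typedefs differ on your platform. The names hash64_fn and chain_hash32 are also assumptions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Shape of a seeded 64-bit hash function, such as XXH64(input, length, seed). */
typedef uint64_t (*hash64_fn)(const void *data, size_t len, uint64_t seed);

/* Hypothetical stand-in so the sketch builds on its own. NOT XXH64. */
static uint64_t toy_hash64(const void *data, size_t len, uint64_t seed)
{
    const unsigned char *p = (const unsigned char *)data;
    uint64_t h = seed ^ 0xcbf29ce484222325ULL;   /* FNV-1a offset basis mixed with the seed */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;                   /* FNV-1a 64-bit prime */
    }
    h ^= (uint64_t)len;
    /* Final avalanche step (splitmix64-style finaliser). */
    h ^= h >> 30; h *= 0xbf58476d1ce4e5b9ULL;
    h ^= h >> 27; h *= 0x94d049bb133111ebULL;
    h ^= h >> 31;
    return h;
}

/* Chain the hash over `count` NUL-terminated strings, passing all 64 bits
 * of each result as the seed of the next call. Only the final value is
 * reduced to 32 bits. */
static uint32_t chain_hash32(hash64_fn hash, const char *const *parts,
                             size_t count, uint64_t seed)
{
    uint64_t h = seed;
    for (size_t i = 0; i < count; i++)
        h = hash(parts[i], strlen(parts[i]), h);
    return (uint32_t)h;   /* low 32 bits, extracted once at the end */
}
```

Unlike a streaming state, this chaining makes the split points part of the result, so different partitions of the same bytes are expected to hash differently.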

Are you suggesting modifying the code to allow a 64-bit seed? Currently, rather than simply extracting the low 32 bits to seed the next step, I'm XORing the high 32 bits and the low 32 bits together so that all 64 bits 'contribute' to the 32-bit carry-over. Unfortunately, it is not feasible (without copying data around) to append or prepend the intermediate 64-bit value to the next string, as I don't have control over that storage; it is handed to me.
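For reference, the XOR folding described above can be sketched in a few lines. The function name fold64to32 is hypothetical.

```c
#include <stdint.h>

/* Fold a 64-bit hash into 32 bits by XORing its high and low halves,
 * so that every input bit influences the 32-bit result. */
static uint32_t fold64to32(uint64_t h)
{
    return (uint32_t)(h >> 32) ^ (uint32_t)h;
}
```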

Never mind, I see now that XXH64 does allow a 64 bit seed value.
