xxHash as checksum for error detection

Created on 16 Jul 2019  ·  11 Comments  ·  Source: Cyan4973/xxHash

One of the things I continually fail to understand is how a hash can be a checksum (or whether it should be), because traditionally hashes are optimized for strings and for hash tables (DJB2 comes to mind as an old fan favorite). However, within the PIC/Arduino crowd, people still use old, simple 8-bit checksums, 16-bit Fletcher sums, or 16-bit CRCs. But you don't see much talk about newer checksums being better choices on today's supposedly faster processors. Are the types of computations used in hashes too slow (despite CRC actually being slower in software than xxHash)? Or is something inherently bad about hash functions when reduced to 8 or 16 bits?

I've read that xxHash was first designed as a checksum for LZ4, and that some other hash functions, such as SeaHash, were designed for checksumming in a file system. For comparison, CRC-32 has been widely used, including in some compression formats (like ZIP/RAR). I have also read that CRC has certain mathematical properties which prove Hamming-distance guarantees for error detection with specific polynomials, etc. But in many ways hashes are more chaotic, more like pseudo-random number generators, and unlikely to have such guarantees, at least that is how it seems to me.

This leaves me wondering whether one is better off using CRC-32 for integrity purposes rather than a fast, simple-to-implement hash, even if it takes a speed hit. Does xxHash have any error-detection guarantees (let's assume the 32-bit version)? And if not, how does it compete with CRC-32 as a checksum or error-detection code? What are the tradeoffs?

question

Most helpful comment

CRC algorithms generally feature error-detection guarantees expressed in terms of Hamming distance: given a limited number of bit flips, the CRC will _necessarily_ generate a different result.

This can prove handy in situations where errors are presumed to only flip a few bits.
Just a few decades ago, this was relatively common, especially in transmission scenarios, as the underlying signal was still very "raw", so undetected bit flips on the physical layer were a thing.

This property, however, has a distribution cost. Once the "safe" Hamming distance is crossed, the probability of collision becomes worse. This is a logical consequence of the pigeonhole principle. How much worse? Well, it depends on the exact CRC, but I expect a good CRC to be in the 3x - 7x worse range. That may sound like a lot, but it's just a 2-3 bit reduction in dispersion, so it's not that terrible. Still.

Contrast that with a hash featuring an "ideal" distribution property: any change to the input, whether it's a single bit flip or a completely different content, has exactly a 1 / 2^n probability of generating a collision. It's simpler to grasp: the risk of collision is always the same. In a direct comparison with CRC, the collision rate is worse when the Hamming distance is small (since it's > 0), but better when the Hamming distance is large.

Fast forward to today, and the situation has radically changed. We have layers upon layers of error detection and correction logic above the physical media. We can't just extract a single bit from a flash block, nor read a single bit from a Bluetooth channel; it doesn't make sense any more. These protocols embed stateful block logic, which is more complex, more resilient, constantly compensating for the permanent noise of the physical layer. When they fail, that's not going to produce a single flip: rather, an entire data region will be completely garbled, turned into all zeroes, or even random noise.

In this new environment, the bet is that errors, when they happen, are no longer in the "bit flip" category. In that case, the distribution properties of CRC become a liability: a pure hash actually has a lower probability of producing a collision.

This is when considering checksumming only.
An additional property of an "ideal" hash is that any segment of bits extracted from it still features this ideal collision probability (1 / 2^k for a k-bit segment), which is very important when the hash is used as a source of bits for other structures, such as a hash table or a Bloom filter. In contrast, CRCs do not provide such a guarantee: some of the bits end up being very predictable or highly correlated, and when extracting them for hashing purposes, the dispersion is much worse.

One could say that checksumming and hashing are simply 2 different domains and should not be confused. Indeed, that's the theory. The problem is, this disregards convenience and field practice. Having one "mixer" for multiple purposes is convenient, and programmers will settle on one and forget about it. I can't count how many times I've seen crc32 used for hashing, just because it felt "random enough". It's easy to say that it should not happen, but it does, constantly.
In this environment, proposing a solution designed to work well for both use cases makes sense, as it matches user expectations.

within the PIC/Arduino crowd, people still use old, simple 8-bit checksums, 16-bit Fletcher sums, or 16-bit CRCs.
(...)
is something inherently bad about hash functions when reduced to 8 or 16 bits?

Due to the properties of a good hash, it's indeed possible to extract 8 or 16 bits from it, resulting in a 1/256 or 1/65536 collision probability. I see no concern with this.

We just have to accept the latency of _history_. People are used to a certain way of doing things, such as using specific checksum functions for checksumming. Even if there are more recent and potentially better solutions out there, habits do not change fast.

All 11 comments

CRC algorithms generally feature error-detection guarantees expressed in terms of Hamming distance: given a limited number of bit flips, the CRC will _necessarily_ generate a different result.

This can prove handy in situations where errors are presumed to only flip a few bits.
Just a few decades ago, this was relatively common, especially in transmission scenarios, as the underlying signal was still very "raw", so undetected bit flips on the physical layer were a thing.

This property, however, has a distribution cost. Once the "safe" Hamming distance is crossed, the probability of collision becomes worse. This is a logical consequence of the pigeonhole principle. How much worse? Well, it depends on the exact CRC, but I expect a good CRC to be in the 3x - 7x worse range. That may sound like a lot, but it's just a 2-3 bit reduction in dispersion, so it's not that terrible. Still.
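
To make the "2-3 bits" figure concrete, a rough rule of thumb: a k-fold increase in collision probability is equivalent to losing about log2(k) bits of effective hash width.

```latex
k\text{-fold higher collision probability} \;\Longleftrightarrow\; \log_2(k) \text{ fewer effective output bits},
\qquad \log_2(3) \approx 1.6, \quad \log_2(7) \approx 2.8
```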

Contrast that with a hash featuring an "ideal" distribution property: any change to the input, whether it's a single bit flip or a completely different content, has exactly a 1 / 2^n probability of generating a collision. It's simpler to grasp: the risk of collision is always the same. In a direct comparison with CRC, the collision rate is worse when the Hamming distance is small (since it's > 0), but better when the Hamming distance is large.

Fast forward to today, and the situation has radically changed. We have layers upon layers of error detection and correction logic above the physical media. We can't just extract a single bit from a flash block, nor read a single bit from a Bluetooth channel; it doesn't make sense any more. These protocols embed stateful block logic, which is more complex, more resilient, constantly compensating for the permanent noise of the physical layer. When they fail, that's not going to produce a single flip: rather, an entire data region will be completely garbled, turned into all zeroes, or even random noise.

In this new environment, the bet is that errors, when they happen, are no longer in the "bit flip" category. In that case, the distribution properties of CRC become a liability: a pure hash actually has a lower probability of producing a collision.

This is when considering checksumming only.
An additional property of an "ideal" hash is that any segment of bits extracted from it still features this ideal collision probability (1 / 2^k for a k-bit segment), which is very important when the hash is used as a source of bits for other structures, such as a hash table or a Bloom filter. In contrast, CRCs do not provide such a guarantee: some of the bits end up being very predictable or highly correlated, and when extracting them for hashing purposes, the dispersion is much worse.
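
As a quick illustration of what "extracting a segment of bits" can look like in practice, here is a minimal sketch using the one-shot XXH64() function (the table and Bloom-filter sizes are arbitrary, chosen just for the example):

```c
#include <inttypes.h>
#include <stdio.h>
#include <string.h>
#include "xxhash.h"   /* one-shot XXH64() */

int main(void)
{
    const char key[] = "example key";
    uint64_t h = XXH64(key, strlen(key), 0);   /* last argument is the seed */

    /* With an ideally-distributed hash, any slice of the 64 bits is itself
     * well distributed, so one hash value can feed several structures: */
    uint32_t table_slot = (uint32_t)(h & 0xFFFF);         /* low 16 bits -> 65536-slot hash table */
    uint32_t bloom_pos1 = (uint32_t)((h >> 16) & 0x3FF);  /* next 10 bits -> Bloom filter bit #1 */
    uint32_t bloom_pos2 = (uint32_t)((h >> 26) & 0x3FF);  /* next 10 bits -> Bloom filter bit #2 */

    printf("hash=%016" PRIx64 " slot=%u bloom=(%u,%u)\n",
           h, table_slot, bloom_pos1, bloom_pos2);
    return 0;
}
```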

One could say that checksumming and hashing are simply 2 different domains and should not be confused. Indeed, that's the theory. The problem is, this disregards convenience and field practice. Having one "mixer" for multiple purposes is convenient, and programmers will settle on one and forget about it. I can't count how many times I've seen crc32 used for hashing, just because it felt "random enough". It's easy to say that it should not happen, but it does, constantly.
In this environment, proposing a solution designed to work well for both use cases makes sense, as it matches user expectations.

within the PIC/Arduino crowd, people still use old, simple 8-bit checksums, 16-bit Fletcher sums, or 16-bit CRCs.
(...)
is something inherently bad about hash functions when reduced to 8 or 16 bits?

Due to the properties of a good hash, it's indeed possible to extract 8 or 16 bits from it, resulting in a 1/256 or 1/65536 collision probability. I see no concern with this.
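
For the 8/16-bit case, such a truncation could look like the following minimal sketch built on XXH32 (the helper names are hypothetical; which bits are kept doesn't matter for a well-distributed hash):

```c
#include <stdint.h>
#include <stddef.h>
#include "xxhash.h"

/* Hypothetical helpers: fold a 32-bit xxHash result into a small checksum. */
static uint16_t checksum16(const void* data, size_t len)
{
    return (uint16_t)(XXH32(data, len, 0) & 0xFFFF);  /* ~1/65536 chance of missing a change */
}

static uint8_t checksum8(const void* data, size_t len)
{
    return (uint8_t)(XXH32(data, len, 0) & 0xFF);     /* ~1/256 chance of missing a change */
}
```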

We just have to accept the latency of _history_. People are used to a certain way of doing things, such as using specific checksum functions for checksumming. Even if there are more recent and potentially better solutions out there, habits do not change fast.

We know that CRC-32 with HD=6 can detect any 5-bit error, can detect burst errors in messages up to 2^31 bits (roughly 256 MiB), and detects all single-bit errors.
In xxHash, how many bits need to be changed before we can no longer detect an error?
I know that in a hash even a 1-bit change makes a totally different digest, but how can we calculate the number of bits needed to create a collision?
Another question, about something you mentioned:

In contrast, CRCs do not provide such a guarantee: some of the bits end up being very predictable or highly correlated, and when extracting them for hashing purposes, the dispersion is much worse.

Why is that bad? If bits are somehow correlated, can we use this correlation to repair the damaged file?

In xxHash, how many bits need to be changed so that we can't detect an error (or collisions)

1 bit.

xxHash is a non-cryptographic hash function, not a checksum. It does not provide any Hamming-distance or collision-resistance guarantees.

While it can do a decent job at it, it is not the best choice. It is like using pliers as a wrench. Sure, it will get the job done, but it is less effective than a real wrench and may cause damage.

Do you mean we can fool the decoder by changing just 1 bit of data?
It can create collisions? (That's scary.) And the probability of such a collision depends on the digest size?

Do you mean we can fool the decoder by changing just 1 bit of data?

Considering hash values produced from 2 different contents, the probability of a collision is always 1 / 2^64 (for good-quality 64-bit hash algorithms, such as xxh64 or xxh3), whatever the amount of modification between those 2 contents.

It indeed means that, at least in theory, a change of a single bit might be able to generate a collision.

Now, keep in mind that this probability is 1 / 2^64, which is pretty much infinitesimal.
People are relatively bad at understanding this topic. They sometimes conflate it with "small", which tends to be perceived as a tangible amount in most people's minds, like ~10%.

In order to get a more accurate idea, I like to refer to this table:
collision probabilities

i.e.: 1 / 2^64 is much, much lower than the chance of receiving a comet directly on one's head, and lower than winning the national lottery _3 times in a row_ (which would rather be suspicious). This is actually much closer to "null". Most components in any system are susceptible to breaking with a _much_ higher probability than that, starting with other sources of software bugs.
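
To see how this scales over many exposures, a rough union-bound estimate (a back-of-the-envelope sketch) over N independent corruption events, each checked by a 64-bit hash:

```latex
P(\text{at least one undetected corruption among } N \text{ independent events}) \;\le\; N \cdot 2^{-64},
\qquad \text{e.g. } N = 10^9 \;\Rightarrow\; P \lesssim 5.4 \times 10^{-11}
```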

If one wants to find such a theoretical 1-bit collision, one would need a pretty gigantic input and a pretty unbelievable amount of power to brute-force a solution to this problem. Even then, a solution might not exist: considering the nature of the arithmetic employed, I wouldn't be surprised if it were impossible to generate a collision with a single-bit modification. My own bet is that it takes at least 2 bits, and even then, they would need to be placed at specific positions in the input to stand a chance of achieving the desired objective, making such a modification no longer "random", hence out of scope for a non-cryptographic hash.

But I think the quoted text already explains all this: real CRCs have guarantees on Hamming distance. This can be useful, and it made a lot more sense in the past, when signals were susceptible to being modified at the bit level, and when checksums themselves were rather small (16 or 32 bits). A 16-bit hash algorithm would have a non-negligible chance of missing a collision involving very few bit flips, making it less suitable for the task. But with 64-bit or even 128-bit hashes now on the table, I believe this is no longer a dangerous option, simply because the chances of collision, hence of undetected corruption, are unfathomably small.

Some of the bits end up being very predictable or highly correlated, and when extracting them for hashing purposes, the dispersion is much worse.

Why is that bad?

Hashes generally have more uses than just checksum replacement.
Actually, in most cases, the purpose of a hash algorithm is to provide some random bits in order to designate a position in a hash table. If some of these bits become predictable depending on the content, the distribution of positions in the hash table will no longer be "random", and as a consequence, some cells will overflow faster than others, negatively impacting the performance of the hash table.

If bits are somehow correlated can we use this correlation to repair the damaged file?

This is a completely different topic.
One doesn't repair a damaged input with a CRC. A CRC can only detect an error, not repair it.

Self-repairing data is a topic of its own, and uses different (and much more computationally involved) techniques, such as convolutional codes. This is much more complex.

If one wants to find such a theoretical 1-bit collision, one would need a pretty gigantic input and a pretty unbelievable amount of power to brute-force a solution to this problem. Even then, a solution might not exist: considering the nature of the arithmetic employed, I wouldn't be surprised if it were impossible to generate a collision with a single-bit modification. My own bet is that it takes at least 2 bits, and even then, they would need to be placed at specific positions in the input to stand a chance of achieving the desired objective, making such a modification no longer "random", hence out of scope for a non-cryptographic hash.

#261 demonstrates a single-bit collision in the 64-bit variant of XXH3, although it is seed-dependent and very unlikely to happen on random inputs.

Indeed; to be more complete, my previous statement was rather directed at the large-size section, for inputs > 240 bytes.

But yes, in the limited range of sizes where the input has to exactly match a 64-bit segment of the secret, a single-bit collision becomes possible. It felt acceptable because a collision based on a precise 64-bit input at a precise location is still within this 1 / 2^64 territory, presuming the input is random (i.e. not engineered to produce a collision). Moreover, the secret to match can be made effectively _secret_, so an external attacker would have to brute-force their way to finding it.

Could you confirm these scenarios?

  1. LZ4 has a content checksum for each data block. Consider that we have 10^6 packets of 1 GB files:
    the probability of collision is 1/2^32 ≈ 0.0000000002. It means that if we send about 10^10 packets (a giant amount) we may have about 2 collisions (based on the pigeonhole principle), but if we send fewer than that, the probability of seeing a collision drops significantly, so we can detect essentially all errors. The decoder calculates the checksum and compares it with the packet checksum. Is this correct?
  2. The content checksum is calculated over the compressed file. It is optional, and we can use it to detect out-of-order data or to reduce the probability of collision (before using the content checksum, the probability of a collision is 1/2^32, and with it, it would be 1/2^64).

The described scenario is:

  • LZ4-compressed data being corrupted (which is already a pretty rare event by itself)
  • Corruption remaining undetected due to hash collision.

The LZ4 frame format uses XXH32, a 32-bit hash, as a companion checksum.
It's applied by default to the full frame's content, more precisely the frame's _uncompressed_ content (after decompression), thus validating both the transmission and the decoding process.
It can also optionally be applied to each block, in which case it checksums the _compressed_ content of each block.
Both can be combined.

To remain undetected, a corruption must successfully pass ___all___ applicable checksums.
Presuming that the block checksum is enabled, and presuming that the corruption is entirely located within a single block, the chances of passing both the block and the frame checksum are indeed 1 / 2^32 x 1 / 2^32 = 1 / 2^64.

Now, if corrupted data resides in multiple blocks, either because there are multiple corruption events, or because a single large corrupted section covers consecutive blocks, it's _even harder for the corruption to remain undetected_: it would have to generate a collision with each one of the block checksums involved.
For example, presuming the corruption is spread across 2 blocks, the chances for it to remain undetected are 1 / 2^32 x 1 / 2^32 x 1 / 2^32 = 1 / 2^96 (two block checksums plus the frame checksum). Astronomically small.

Not counting the fact that each corrupted block may be detected as undecodable by the decompression process, which comes on top of that.

So yes, combining block and frame checksums dramatically increases the chances of detecting corruption events.
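
For reference, enabling both checksums programmatically looks roughly like the following minimal sketch with the lz4frame API (assuming a liblz4 recent enough to expose blockChecksumFlag; error handling kept minimal, payload arbitrary):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <lz4frame.h>

int main(void)
{
    const char src[] = "payload protected by both a frame and per-block checksums";
    LZ4F_preferences_t prefs;
    memset(&prefs, 0, sizeof(prefs));

    /* XXH32 of the frame's uncompressed content (checked after decompression). */
    prefs.frameInfo.contentChecksumFlag = LZ4F_contentChecksumEnabled;
    /* Optional XXH32 of each block's compressed content. */
    prefs.frameInfo.blockChecksumFlag = LZ4F_blockChecksumEnabled;

    size_t bound = LZ4F_compressFrameBound(sizeof(src), &prefs);
    void* dst = malloc(bound);
    if (dst == NULL) return 1;

    size_t csize = LZ4F_compressFrame(dst, bound, src, sizeof(src), &prefs);
    if (LZ4F_isError(csize)) {
        fprintf(stderr, "compression failed: %s\n", LZ4F_getErrorName(csize));
        free(dst);
        return 1;
    }
    printf("compressed %zu bytes into %zu bytes (both checksums enabled)\n",
           sizeof(src), csize);
    free(dst);
    return 0;
}
```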

Is it possible to use LZ4 on the fly? I mean, compress/decompress the data as soon as you've got some of it; somewhat like pipelining.
What about xxHash?

Is it possible to use LZ4 on the fly? I mean, compress/decompress the data as soon as you've got some of it.

This is the streaming mode. Yes, it is possible.

What about xxHash?

The frame checksum is updated continuously, but only produces a result at the end of the frame. Therefore, there is no checksum result to compare against until an end-of-frame event (a stream may consist of multiple appended frames, but generally doesn't).

In contrast, block checksums are created at each block, so they are regularly produced and checked while streaming.

Both checksums employ XXH32.
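
On the xxHash side, continuous updating maps to the library's streaming (state-based) API; a minimal sketch with XXH32 and two arbitrary chunks:

```c
#include <stdio.h>
#include <string.h>
#include "xxhash.h"

int main(void)
{
    const char part1[] = "first chunk of a stream, ";
    const char part2[] = "second chunk of the same stream";

    /* Streaming API: feed data as it arrives. */
    XXH32_state_t* state = XXH32_createState();
    if (state == NULL) return 1;
    XXH32_reset(state, 0);                      /* 0 = seed */
    XXH32_update(state, part1, strlen(part1));
    XXH32_update(state, part2, strlen(part2));

    /* Digest of everything fed so far; in an LZ4 frame, the frame checksum
     * is only written and verified once the end of the frame is reached. */
    XXH32_hash_t h = XXH32_digest(state);
    XXH32_freeState(state);

    printf("XXH32 = %08x\n", (unsigned)h);
    /* Hashing the same bytes with a single one-shot XXH32() call yields the same value. */
    return 0;
}
```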
