Xxhash: 32-bit ARM very slow xxhash

Created on 2 Mar 2021 · 4 Comments · Source: Cyan4973/xxHash

Compiled with gcc 4.8.5 and tested on a dual-core 32-bit ARM Cortex-A9, all the xxhash algorithms are very slow and lose even to a crc32 implementation, regardless of the input size. Tested with both dev and v0.8.0; it makes no difference. The NEON path is activated.

lscpu:
Flags: half thumb fastmult vfp edsp neon vfpv3 tls vfpd32

It seems this code is not well tested on 32-bit ARM?

question


jtoivainen

Most helpful comment

It seems that compiler optimizations were not turned on correctly by the toolchain. When setting them explicitly for the build, the xxhash values are computed as fast as they should be. Thanks @Cyan4973 and @easyaspi314.

jtoivainen on 3 Mar 2021


All 4 comments

A classic issue with gcc on 32-bit ARM is that
it doesn't exploit (by default) the CPU's ability to perform unaligned memory accesses
(which I presume the A9 supports),
resulting in very slow memory accesses.

There are a few things that can be attempted here.

On the library side, multiple memory access methods are provided:
https://github.com/Cyan4973/xxHash/blob/dev/xxhash.h#L1117

The most aggressive is XXH_FORCE_MEMORY_ACCESS=2.
It can lead to incorrect binary generation depending on other settings and compiler capabilities.
But should it work, it will at least give a sense of what the speed should be.

A bit safer, XXH_FORCE_MEMORY_ACCESS=1 will generate correct code.
But the compiler needs to be aware that the cpu is able to access unaligned memory addresses.

This may require specifying some additional compilation flag.
For example, -march=armv7-a tells gcc that the arm chip is capable of accessing unaligned memory addresses,
resulting in much better code generation in combination with XXH_FORCE_MEMORY_ACCESS=1.
(Unfortunately, according to godbolt, this flag combined with the default XXH_FORCE_MEMORY_ACCESS=0 still gives bad performance on older versions of gcc. Newer versions, gcc-6 and later, fix this issue.)
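To make the three access methods concrete, here is a minimal sketch of what each XXH_FORCE_MEMORY_ACCESS value roughly corresponds to for a 32-bit read. The function and type names (read32_memcpy, read32_packed, read32_direct, packed_u32) are hypothetical and chosen for illustration; this is not a copy of the library's code.

```c
#include <stdint.h>
#include <string.h>

/* Method 0 (default, portable): copy the bytes into a local variable.
 * A compiler that knows the CPU handles unaligned loads can turn this
 * into a single load; otherwise it may emit slow byte-by-byte code. */
static inline uint32_t read32_memcpy(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

#if defined(__GNUC__)
/* Method 1 (GCC/Clang extension): read through a packed type, which
 * tells the compiler the address may be unaligned. The generated code
 * is correct, but only fast if the compiler targets an architecture
 * with unaligned access support. */
typedef struct { uint32_t v; } __attribute__((packed)) packed_u32;

static inline uint32_t read32_packed(const void *p)
{
    return ((const packed_u32 *)p)->v;
}
#endif

/* Method 2 (most aggressive): plain pointer cast. This is undefined
 * behavior in standard C when p is not suitably aligned for uint32_t,
 * so it can miscompile or trap; use it only to gauge peak speed. */
static inline uint32_t read32_direct(const void *p)
{
    return *(const uint32_t *)p;
}
```

Comparing the assembly of these three functions for the target (on godbolt, for instance) shows quickly whether the compiler believes unaligned loads are allowed.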

Not sure if it's an option, but clang also tends to generate better binary code for arm.

Finally, on the user side,
presuming that XXH_FORCE_ALIGN_CHECK=1 (https://github.com/Cyan4973/xxHash/blob/dev/xxhash.h#L1179),
providing aligned input (on 4-byte boundaries for XXH32)
should use the direct aligned memory access code path.
This one should not depend on the compiler detecting unaligned memory access capabilities,
since the input is detected as aligned to begin with.
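As a rough illustration of the user-side option, the sketch below checks whether an input pointer sits on a 4-byte boundary and, if not, copies the data into an aligned scratch buffer. The helper names (is_aligned_4, aligned_view) are hypothetical and not part of xxHash.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero when p sits on a 4-byte boundary (the XXH32 case above). */
static inline int is_aligned_4(const void *p)
{
    return ((uintptr_t)p & 3u) == 0;
}

/* Returns a 4-byte-aligned pointer to the same len bytes as src.
 * If src is already aligned it is returned unchanged; otherwise the
 * bytes are copied into scratch (a uint32_t array, hence aligned).
 * Returns NULL when the scratch buffer is too small. */
static inline const void *aligned_view(const void *src, size_t len,
                                       uint32_t *scratch,
                                       size_t scratch_bytes)
{
    if (is_aligned_4(src))
        return src;
    if (len > scratch_bytes)
        return NULL;
    memcpy(scratch, src, len);
    return scratch;
}
```

The returned pointer can then be passed to XXH32(ptr, len, seed). The copy has its own cost, so this is mostly useful as an experiment to confirm that alignment is the bottleneck; allocating the buffer aligned in the first place is the cheaper fix.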

XXH64 is likely to have poor performance on 32-bit arm.
But XXH32 should be fine.
The newer XXH3 sits in between: the baseline should be okay,
and with neon support, it is expected to beat XXH32 in speed.

_edit_ : fixed memory access value, as pointed out by @easyaspi314

Cyan4973 on 2 Mar 2021


If I am not mistaken, the correct general purpose flags for that CPU would be this:

-O2   -fomit-frame-pointer -march=armv7-a      -mfpu=neon     -mthumb        -munaligned-access
^duh  ^saves a register    ^sets arch version  ^enables neon  ^use thumb-2   ^force enable unaligned access
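Since the root cause in this thread turned out to be flags that never reached the compiler, a quick sanity check is to ask the compiler what it thinks it was given. The sketch below relies on macros that GCC and Clang predefine (`__OPTIMIZE__`, plus the ACLE macros `__ARM_FEATURE_UNALIGNED` and `__ARM_NEON`, with `__ARM_NEON__` as the older spelling). The helper function names are hypothetical.

```c
#include <stdio.h>

/* 1 when the translation unit was compiled with -O1 or higher. */
static inline int built_with_optimization(void)
{
#if defined(__OPTIMIZE__)
    return 1;
#else
    return 0;
#endif
}

/* 1 when the compiler assumes unaligned accesses are allowed
 * (e.g. -munaligned-access in effect on 32-bit ARM). */
static inline int unaligned_access_enabled(void)
{
#if defined(__ARM_FEATURE_UNALIGNED)
    return 1;
#else
    return 0;
#endif
}

/* 1 when NEON code generation is enabled (e.g. -mfpu=neon). */
static inline int neon_enabled(void)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return 1;
#else
    return 0;
#endif
}

/* Print a one-line summary of the effective build configuration. */
static inline void print_build_config(void)
{
    printf("optimized=%d unaligned=%d neon=%d\n",
           built_with_optimization(),
           unaligned_access_enabled(),
           neon_enabled());
}
```

Compiling this file with exactly the same command line the toolchain uses for xxhash and calling print_build_config() shows whether -O2, -mfpu=neon and -munaligned-access actually took effect.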

However, as Yann said, I would also recommend Clang as it tends to generate better code, especially with NEON.

Additionally, did you try a newer GCC version? GCC's ARM and AArch64 backends are rather mediocre, and they were pretty bad until recently.