Numpy: Use an aligned allocator for NumPy?

Created on 26 Nov 2014  ·  68 Comments  ·  Source: numpy/numpy

Regarding the f2py regression in NumPy 1.9 with failures on 32-bit Windows, the question is whether NumPy should start to use an allocator which gives guaranteed alignment.

https://github.com/scipy/scipy/issues/4168

01 - Enhancement

Most helpful comment

This feature would be very helpful to me. I am using an FPGA device (Altera A10GX) where the DMA controller requires 64-byte aligned data to be used, this speeds up my code by 40x(!!!). I suspect that @nachiket has the same problem as me. I wrote something similar to what @eamartin is using but this is a bit of a hack.

All 68 comments

Here is one example of an allocator that should work on all platforms. It is shamelessly based on this:

https://sites.google.com/site/ruslancray/lab/bookshelf/interview/ci/low-level/write-an-aligned-malloc-free-function

There are not many ways to do this, and similar code is floating around on the net, so extending it in this way is probably OK. (And besides, it does not implement realloc.)

Dropping this code into numpy/core/include/numpy/ndarraytypes.h should ensure that freshly allocated ndarrays are properly aligned on all platforms.

This platform-independent code could possibly be replaced with posix_memalign() on POSIX and _aligned_malloc() on Windows. However, combining posix_memalign() with realloc() is not possible, so implementing it ourselves is probably better.

#define NPY_MEMALIGN 32   /* 16 for SSE2, 32 for AVX, 64 for Xeon Phi */

static NPY_INLINE
void *PyArray_realloc(void *p, size_t n)
{
    void *p1, **p2, *base;
    /* Worst-case padding: alignment slack plus one void* slot to
       stash the base pointer just before the aligned block. */
    size_t old_offs, offs = NPY_MEMALIGN - 1 + sizeof(void*);
    if (NPY_UNLIKELY(p != NULL)) {
        /* The pointer returned by the underlying allocator is stored
           one slot before the aligned block. */
        base = *(((void**)p) - 1);
        if (NPY_UNLIKELY((p1 = PyMem_Realloc(base, n + offs)) == NULL))
            return NULL;
        if (NPY_LIKELY(p1 == base))
            return p;   /* block did not move: alignment unchanged */
        /* Block moved: recompute the aligned address and slide the
           payload to it. */
        p2 = (void**)(((Py_uintptr_t)(p1) + offs) & ~(NPY_MEMALIGN - 1));
        old_offs = (size_t)((Py_uintptr_t)p - (Py_uintptr_t)base);
        memmove(p2, (char*)p1 + old_offs, n);
    }
    else {
        if (NPY_UNLIKELY((p1 = PyMem_Malloc(n + offs)) == NULL))
            return NULL;
        p2 = (void**)(((Py_uintptr_t)(p1) + offs) & ~(NPY_MEMALIGN - 1));
    }
    *(p2 - 1) = p1;   /* remember the base pointer for free/realloc */
    return (void*)p2;
}

static NPY_INLINE
void *PyArray_malloc(size_t n)
{
    return PyArray_realloc(NULL, n);
}

static NPY_INLINE
void *PyArray_calloc(size_t n, size_t s)
{
    /* Note: malloc plus memset, not a true calloc (see the
       discussion of lazily zeroed pages below). */
    void *p;
    if (NPY_UNLIKELY((p = PyArray_realloc(NULL, n * s)) == NULL))
        return NULL;
    memset(p, 0, n * s);
    return p;
}

static NPY_INLINE
void PyArray_free(void *p)
{
    if (p != NULL) {
        void *base = *(((void**)p) - 1);
        PyMem_Free(base);
    }
}
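For illustration, here is a minimal smoke test for the wrappers above (a sketch only; it assumes the CPython PyMem API is usable in context, and test_aligned_alloc is a made-up name):

#include <assert.h>
#include <stdint.h>

/* Hypothetical smoke test: allocate, grow, and verify that the
   NPY_MEMALIGN-byte alignment guarantee holds across realloc. */
static void test_aligned_alloc(void)
{
    double *a = (double *)PyArray_malloc(100 * sizeof(double));
    assert(a != NULL);
    assert(((uintptr_t)a % NPY_MEMALIGN) == 0);

    a = (double *)PyArray_realloc(a, 100000 * sizeof(double));
    assert(a != NULL);
    assert(((uintptr_t)a % NPY_MEMALIGN) == 0);

    PyArray_free(a);
}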

I already have a branch which adds an aligned allocator, I'll dig it out.

By using something like this we throw away the option to use Python's tracemalloc framework, and also sparse memory (there is no aligned_calloc).
@njsmith would you be willing to engage with the Python devs again to add yet another allocator to their slots before 3.5 is released? They already added calloc just for us; it would be a shame if we now couldn't use it.

Presumably one could pass in alignment in the context data of PyMemAllocatorEx? But NumPy has to support Python versions from 2.6 and up, so doing this in Python 3.5 might not solve the problem.
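For concreteness, a rough sketch of what passing the alignment through the ctx slot could look like with the CPython 3.5 allocator API. aligned_malloc, aligned_calloc, aligned_realloc and aligned_free are placeholders for an implementation like the one above, not existing functions:

#include <Python.h>

/* Sketch: carry the desired alignment in the opaque ctx slot of
   PyMemAllocatorEx (CPython 3.5+) and route the PyMem domain
   through an aligned allocator. */
typedef struct { size_t alignment; } align_ctx;

static void *ctx_malloc(void *ctx, size_t n)
{
    return aligned_malloc(n ? n : 1, ((align_ctx *)ctx)->alignment);
}

static void *ctx_calloc(void *ctx, size_t nelem, size_t elsize)
{
    return aligned_calloc(nelem, elsize, ((align_ctx *)ctx)->alignment);
}

static void *ctx_realloc(void *ctx, void *p, size_t n)
{
    return aligned_realloc(p, n ? n : 1, ((align_ctx *)ctx)->alignment);
}

static void ctx_free(void *ctx, void *p)
{
    (void)ctx;
    aligned_free(p);
}

static align_ctx numpy_ctx = { 64 };
static PyMemAllocatorEx aligned_allocator = {
    &numpy_ctx, ctx_malloc, ctx_calloc, ctx_realloc, ctx_free
};

/* Installed with: PyMem_SetAllocator(PYMEM_DOMAIN_MEM, &aligned_allocator); */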

I do think engaging with the Python devs on this before 3.5 is a good idea, but I still am not convinced we have a good reason to use an aligned allocator in the near term. It cannot possibly be the case that struct { double, double } actually requires better-than-malloc alignment on win32 or SPARC, because if that were true then nothing would work.

The question with regard to f2py was what alignment Fortran would need, not the minimum requirement of C. Speed is also an issue: both indexing and SIMD work better if the data is properly aligned.

A reason for using an aligned allocator could indeed be speed, and ensuring SSE/AVX compatibility would remove the numerical jitter that comes from taking different code paths for differently aligned data.
f2py is older than the ISO C binding standard in Fortran, and the way it works is essentially the de facto standard way of interfacing Fortran with C, used extensively by everyone. In light of this experience, it's clear that the alignment provided by the system malloc is sufficient for the Fortran compilers that matter in practice for us.

@pitrou And 64-byte alignment is recommended for Xeon Phi. Take a look at the comment behind the definition of NPY_MEMALIGN in my code example.

The main complication in providing aligned allocation is that ATM we can either hook into the tracemalloc infrastructure xor do aligned allocation, and fixing this will require some coordination with CPython upstream (see #4663).


So the CPython issue is at http://bugs.python.org/issue18835.

Given the complications with realloc(), it might not be realistic to expect CPython to solve this in the 3.5 timeframe. NumPy should perhaps use its own aligned allocator wrapper instead (which should be able to defer to the PyMem API, and take advantage of tracemalloc, anyway).

Code for such an allocator is included above. I don't understand @juliantaylor's argument, but he probably understands this better than me.

I can understand what he meant about calloc, though. A calloc is not simply a malloc plus a memset to zero: the memset forces the OS to commit every page up front, whereas a real calloc can hand out lazily zeroed pages. AFAIK there is no PyMem_Calloc.

Actually CPython 3.5 has PyMem_Calloc and friends.
I think @juliantaylor was considering the implementation case of using OS functions (posix_memalign, etc.). But that doesn't sound necessary.
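Given PyMem_Calloc, an aligned calloc looks possible with the same base-pointer trick as above. A sketch (assuming the NPY_MEMALIGN layout from the earlier code; PyArray_calloc_aligned is a made-up name): the only write is the stored base pointer, so a large zeroed block can stay lazily mapped apart from that first page:

static NPY_INLINE
void *PyArray_calloc_aligned(size_t n, size_t s)
{
    /* Same header layout as PyArray_realloc above: alignment slack
       plus one void* slot to remember the base pointer. */
    size_t offs = NPY_MEMALIGN - 1 + sizeof(void *);
    void *base, **p;
    if (s != 0 && n > ((size_t)-1 - offs) / s)
        return NULL;  /* n * s + offs would overflow */
    base = PyMem_Calloc(1, n * s + offs);  /* CPython 3.5+ */
    if (base == NULL)
        return NULL;
    p = (void **)(((Py_uintptr_t)base + offs) & ~(Py_uintptr_t)(NPY_MEMALIGN - 1));
    p[-1] = base;  /* the only touched memory outside the payload */
    return (void *)p;
}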

By the way @sturlamolden, your snippet redefines PyArray_malloc and friends, but array allocation seems to use PyDataMem_NEW. Am I misunderstanding something?

Another thought is that aligned allocation may be wasteful for small arrays. Perhaps there should be a threshold below which standard allocation is used?
Also, should the alignment be configurable?
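As a sketch of the threshold idea (not NumPy code; NPY_SMALL_ALLOC is a made-up cutoff): the base pointer is stored unconditionally, so a single free routine works for both paths, at the cost of keeping the small header overhead even for small blocks:

#define NPY_SMALL_ALLOC 512  /* hypothetical cutoff, would need tuning */

static NPY_INLINE
void *npy_alloc_maybe_aligned(size_t n)
{
    /* Small blocks get only pointer-sized alignment; large ones get
       the full NPY_MEMALIGN. The stored base pointer keeps the free
       path identical for both. */
    size_t align = (n >= NPY_SMALL_ALLOC) ? NPY_MEMALIGN : sizeof(void *);
    size_t offs = align - 1 + sizeof(void *);
    void *p1 = PyMem_Malloc(n + offs);
    void **p2;
    if (p1 == NULL)
        return NULL;
    p2 = (void **)(((Py_uintptr_t)p1 + offs) & ~(Py_uintptr_t)(align - 1));
    p2[-1] = p1;
    return (void *)p2;
}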

The allocators are called PyArray_malloc and PyArray_free in NumPy 1.9. A lot has changed in NumPy 1.10.

Are you sure? PyArray_NewFromDescr_int() calls npy_alloc_cache() and npy_alloc_cache() calls PyDataMem_NEW().

Numpy has multiple allocation interfaces, and they don't have very obvious names. PyArray_malloc/free are used for "regular" allocations (e.g. object structs). Data buffers (ndarray->data pointers, temporary buffers inside ufuncs, etc.), however, are allocated via PyDataMem_NEW.


@njsmith Yeah, we should rationalize the allocation macros some day... I'd start with the one used to allocate dimensions for ndarray (IIRC).

I created PR #5457 with a patch. Feedback on the approach would be nice.

As far as I know there is currently no benefit to using an aligned allocator in numpy?


With Numba we determined that AVX vector instructions required a 32-byte alignment for optimal performance. If you compile Numpy with AVX enabled (requires specific compiler options, I guess), alignment should make a difference too.

Out of curiosity, do you have any real-world measurements? I ask b/c there are so many factors that play into these things (different overhead/speed trade-offs at different array sizes, details of memory allocators -- which also act differently at different array sizes -- etc.) that I find it hard to guess whether one ends up with like a 0.5% end-to-end speedup or a 50% end-to-end speedup or what.


FWIW, on my i5-4210U I see no significant difference between 16- and 32-byte aligned data in a simple load-add-store test; the minimum cycle count seems lower by 5%, but the median and 10th percentile are identical to within 1%.

Is that with AVX?

@seibert made some AVX measurements with Numba (i.e. just-in-time code generation with LLVM) using Numpy arrays, I think he'll try to run them again to get precise numbers :-)

Here is the benchmark with the latest Numba master:

http://nbviewer.ipython.org/gist/seibert/6957baddc067140e55fe

For float32 arrays (size=10000) executing an a + b * fabs(a) operation, we see a 40% difference running on an Intel Core i7-4820K CPU @ 3.70GHz (Ivy Bridge).

(We are a little puzzled that the LLVM 3.5 autovectorizer is not generating any loop peeling code to correct for alignment problems. I would not be surprised if gcc or clang correct for this.)

For other architectures we are planning to target (like AMD APUs that support the HSA standard), the alignment needs are more strict, with the OpenCL developers suggesting that we have 256 byte alignment for arrays for best performance.

I don't know, 40% seems like an excessive penalty; I see zero on an i5 Haswell (which, granted, has really poor AVX performance).
Could it be that the JIT compiler is creating two different versions of the loop?
Do you have an assembly-level profile of this (e.g. via perf record)?

Also, maybe turbo boost is kicking in on the second loop; disabling it or monitoring the appropriate PMU might be interesting.

Since we use this system for benchmarking, we have disabled TurboBoost in the BIOS.

The JIT is only running once (once Numba compiles a function for a given set of input types, it caches it), and the JIT is triggered prior to benchmarking by Cell 6 in the linked notebook.

I haven't used perf record before, but I'll look into it. How did you perform your benchmarks? (I would actually expect Haswell to do better than Ivy Bridge if mis-alignment is triggering some kind of inefficient use of the available L2 cache bandwidth.)

This whitepaper from Intel gives a little more information on the alignment issue at the bottom of page 6:

https://software.intel.com/sites/default/files/m/d/4/1/d/8/Practical_Optimization_with_AVX.pdf

The use of Intel AVX instructions on unaligned 32-byte vectors means that every second load will be across a cacheline split, since the cache line is 64 bytes. This doubles the cache line split rate compared to Intel SSE code that uses 16-byte vectors. A high cache-line split rate in memory-intensive code is extremely likely to cause performance degradation. For that reason, it is highly recommended to align the data to 32 bytes for use with Intel AVX.

Haswell has twice the L2 cache bandwidth of Sandy/Ivy Bridge, so it is possible the effect of misaligned arrays is not significant on Haswell...

my simple benchmark is this:
https://gist.github.com/juliantaylor/68e578d140f427ed80bb
would be interesting to see on that i7

Results on a Core i5-2500K (Sandy Bridge):
4644 6656 7704 10100

Please note I had to add a longer warmup phase at the start of the benchmark:

    for (i=0; i<10; i++)
        add(a, 1);

Is that an average/median/...? That looks quite significant; it seems Intel worked on it a lot for Haswell.

It's the near-stable output from the benchmark after some hundreds of runs in a loop.

I added a packed SSE2 implementation, here are the figures (i5-2500K):
4660 6492 7468 5108 10096
(32B aligned AVX, 16B (un)aligned AVX, 8B (un)aligned AVX, aligned SSE2, scalar)

Here is the updated source: https://gist.github.com/pitrou/892219a7d4c6d37de201

Near-stable output from a Core i5-4200U (laptop Haswell CPU):
4120 4152 4148 4260 7308

At least here, misaligned AVX isn't worse than SSE2 with similar alignment.

Here is from my MBP (quad-core i7-3635QM, 2.4 GHz, Ivy Bridge):
5060 4932 5820 5704 5040

I had to change avxintrin.h to immintrin.h and compile with Intel icc because gcc 4.8.1 refused to compile the code (and so did clang).

@sturlamolden, might it be possible that icc is vectorizing the scalar loop (the last result in the numbers)?

(note that the benchmark would probably put more pressure on the cache subsystem if the addition involved two separate input arrays)

I have no idea.

@sturlamolden This code was very useful for interfacing with some SSE2 code.

I had one issue: on Windows, all versions of VC (including 2015) don't like this line

memmove((void*)p2,p1+old_offs,n);

since they don't support pointer arithmetic on void*. As a short-term fix I cast it to char* to do the math. This probably isn't right - do you have a better idea of what would make it compile correctly on Windows?

My bad. Pointer arithmetic on void* is illegal C. Casting to char* is correct.
Updated the code example.

How do I get 64-byte aligned numpy arrays? The 'ALIGNED' addresses from https://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.require.html seem to align at some different length. Is there some user-configurable parameter for alignment length?

I made my own (Python) aligned allocator that works on (unaligned) memory provided by Numpy.

import numpy as np

def empty_aligned(n, align):
    """
    Get n bytes of memory with alignment align.
    """
    # Over-allocate by align - 1 bytes, then slice off the offset
    # needed to reach the next aligned address.
    a = np.empty(n + (align - 1), dtype=np.uint8)
    data_align = a.ctypes.data % align
    offset = 0 if data_align == 0 else (align - data_align)
    return a[offset : offset + n]


def test(n):
    # Check alignment and size for every alignment up to 1024.
    for i in range(1, 1024):
        b = empty_aligned(n, i)
        try:
            assert b.ctypes.data % i == 0
            assert b.size == n
        except AssertionError:
            print(i, b.ctypes.data % i, b.size)

Perhaps a Python workaround like this is a viable solution?

@eamartin This is about NumPy's internal C code and the interface code to Fortran generated by f2py (also C code). For obvious reasons, the implementation of NumPy cannot depend on NumPy.

You can use that trick for your Python projects, though.

@sturlamolden: It might help @nachiket and others in a similar situation though.

@sturlamolden : I didn't read closely enough to realize this was about C internals.

However, offering better alignment options in the Python interfaces would be valuable for developing Python interfaces between Numpy and native libraries that have alignment requirements on arguments.

I am not against offering better alignment options in the Python interfaces 🙂

The "Python aligned allocator" solution I suggested is a hack. I think offering alignment in the Python interfaces would be nice, but the right way to do that would be to handle alignment at the C level.

This feature would be very helpful to me. I am using an FPGA device (Altera A10GX) where the DMA controller requires 64-byte aligned data to be used, this speeds up my code by 40x(!!!). I suspect that @nachiket has the same problem as me. I wrote something similar to what @eamartin is using but this is a bit of a hack.

I definitely encourage 64-byte alignment:

  1. it is the cache line size
  2. it is suitable for any SIMD alignment up to AVX-512

Here we are almost 5 years later.

Any thoughts on making this (64-byte alignment in particular) a standard feature?..

This Cython code is now in NumPy. Of course, this doesn't change the default.

My 2 cents: an aligned allocator would help when interfacing with hardware devices and kernel-level calls. These interfaces might benefit from aligning the buffers to pages.

In merging randomgen, we gained PyArray_realloc_aligned and friends. Should we move these routines into numpy/core/include ?

That would certainly be useful, @mattip. Would it be possible to access this functionality from Python as well?
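For reference, this is roughly how those helpers might be consumed if promoted to a public header. The include path here is hypothetical, and the function names are as they appear in numpy/random/src/aligned_malloc/aligned_malloc.h; treat this as a sketch, not the actual API:

/* Hypothetical include path; today the header lives under
   numpy/random/src/aligned_malloc/. */
#include "aligned_malloc.h"

static int demo(void)
{
    /* Same calling conventions as malloc/realloc/free, but with the
       helpers' alignment guarantee on every returned pointer. */
    double *buf = (double *)PyArray_malloc_aligned(512 * sizeof(double));
    if (buf == NULL)
        return -1;
    buf = (double *)PyArray_realloc_aligned(buf, 4096 * sizeof(double));
    if (buf == NULL)
        return -1;  /* note: the original block leaks here in this sketch */
    PyArray_free_aligned(buf);
    return 0;
}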

In merging randomgen, we gained PyArray_realloc_aligned and friends. Should we move these routines into numpy/core/include ?

Ripped off my code, did ya? 😂

Probably got lost at some point in the move from randomgen to numpy. I think I had a record that this was yours.

On Sun, Nov 17, 2019, 10:29 Sturla Molden wrote:

It looks like I am no longer a contributor for the code I have written for NumPy 🧐:

https://github.com/numpy/numpy/blob/v1.17.2/numpy/_build_utils/src/apple_sgemv_fix.c

https://github.com/numpy/numpy/blob/v1.17.2/numpy/random/src/aligned_malloc/aligned_malloc.h



what was the original source of the code?

The link is dead, but it was adapted from an aligned malloc that looked like this:

https://tianrunhe.wordpress.com/2012/04/23/aligned-malloc-in-c/

It was a GitHub post from Sturla. There was no original code file.


This is for the aligned malloc.


Does everyone who contributes 50 lines of code to Numpy get a dedicated copyright header? I might go through my contributions and see if any apply :-)

It looks like I am no longer a contributor for the code I have written for NumPy 🧐:

You are and will always be :)

Does everyone who contributes 50 lines of code to Numpy get a dedicated copyright header? I might go through my contributions and see if any apply :-)

Nope. We try to avoid encoding such things inside the source code, since that will always be wildly incomplete and hard to maintain. We do ask people to list themselves in THANKS.txt; I'm looking at a better alternative to that, because that file often gives merge conflicts.
