Numpy: Use an aligned allocator for NumPy?

Created on 26 Nov 2014  ·  68 Comments  ·  Source: numpy/numpy

Regarding the f2py regression in NumPy 1.9 with failures on 32-bit Windows, the question is whether NumPy should start to use an allocator which gives guaranteed alignment.

https://github.com/scipy/scipy/issues/4168

01 - Enhancement

Most helpful comment

This feature would be very helpful to me. I am using an FPGA device (Altera A10GX) where the DMA controller requires 64-byte aligned data to be used, this speeds up my code by 40x(!!!). I suspect that @nachiket has the same problem as me. I wrote something similar to what @eamartin is using but this is a bit of a hack.

All 68 comments

Here is one example of an allocator that should work on all platforms. It is shamelessly based on this:

https://sites.google.com/site/ruslancray/lab/bookshelf/interview/ci/low-level/write-an-aligned-malloc-free-function

There are not many ways to do this, and similar code is floating around on the net, so extending it in this way is probably OK. (And besides, it does not implement realloc.)

Dropping this code into numpy/core/include/numpy/ndarraytypes.h should ensure that freshly allocated ndarrays are properly aligned on all platforms.

This platform-independent code could possibly be replaced with posix_memalign() on POSIX and _aligned_malloc() on Windows. However, combining posix_memalign() with realloc() is not possible, so implementing it ourselves is probably better.

#define NPY_MEMALIGN 32   /* 16 for SSE2, 32 for AVX, 64 for Xeon Phi */

static NPY_INLINE
void *PyArray_realloc(void *p, size_t n)
{
    void *p1, **p2, *base;
    /* Worst-case padding: alignment slack plus one void* slot to
       stash the base pointer just before the aligned block. */
    size_t old_offs, offs = NPY_MEMALIGN - 1 + sizeof(void*);
    if (NPY_UNLIKELY(p != NULL)) {
        /* The pointer returned by the underlying allocator is stored
           one slot before the aligned block. */
        base = *(((void**)p) - 1);
        if (NPY_UNLIKELY((p1 = PyMem_Realloc(base, n + offs)) == NULL))
            return NULL;
        if (NPY_LIKELY(p1 == base))
            return p;   /* block did not move: alignment unchanged */
        /* Block moved: recompute the aligned address and slide the
           payload to it. */
        p2 = (void**)(((Py_uintptr_t)(p1) + offs) & ~(NPY_MEMALIGN - 1));
        old_offs = (size_t)((Py_uintptr_t)p - (Py_uintptr_t)base);
        memmove(p2, (char*)p1 + old_offs, n);
    }
    else {
        if (NPY_UNLIKELY((p1 = PyMem_Malloc(n + offs)) == NULL))
            return NULL;
        p2 = (void**)(((Py_uintptr_t)(p1) + offs) & ~(NPY_MEMALIGN - 1));
    }
    *(p2 - 1) = p1;   /* remember the base pointer for free/realloc */
    return (void*)p2;
}

static NPY_INLINE
void *PyArray_malloc(size_t n)
{
    return PyArray_realloc(NULL, n);
}

static NPY_INLINE
void *PyArray_calloc(size_t n, size_t s)
{
    /* Note: malloc plus memset, not a true calloc (see the
       discussion of lazily zeroed pages below). */
    void *p;
    if (NPY_UNLIKELY((p = PyArray_realloc(NULL, n * s)) == NULL))
        return NULL;
    memset(p, 0, n * s);
    return p;
}

static NPY_INLINE
void PyArray_free(void *p)
{
    if (p != NULL) {
        void *base = *(((void**)p) - 1);
        PyMem_Free(base);
    }
}
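For illustration, here is a minimal smoke test for the wrappers above (a sketch only; it assumes the CPython PyMem API is usable in context, and test_aligned_alloc is a made-up name):

#include <assert.h>
#include <stdint.h>

/* Hypothetical smoke test: allocate, grow, and verify that the
   NPY_MEMALIGN-byte alignment guarantee holds across realloc. */
static void test_aligned_alloc(void)
{
    double *a = (double *)PyArray_malloc(100 * sizeof(double));
    assert(a != NULL);
    assert(((uintptr_t)a % NPY_MEMALIGN) == 0);

    a = (double *)PyArray_realloc(a, 100000 * sizeof(double));
    assert(a != NULL);
    assert(((uintptr_t)a % NPY_MEMALIGN) == 0);

    PyArray_free(a);
}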

I already have a branch which adds an aligned allocator, I'll dig it out.

By using something like this we throw away the option to use Python's tracemalloc framework, and also sparse memory (there is no aligned_calloc).
@njsmith would you be willing to engage with the Python devs again to add yet another allocator to their slots before 3.5 is released? They already added calloc just for us; it would be a shame if we now couldn't use it.

Presumably one could pass in alignment in the context data of PyMemAllocatorEx? But NumPy has to support Python versions from 2.6 and up, so doing this in Python 3.5 might not solve the problem.
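For concreteness, a rough sketch of what passing the alignment through the ctx slot could look like with the CPython 3.5 allocator API. aligned_malloc, aligned_calloc, aligned_realloc and aligned_free are placeholders for an implementation like the one above, not existing functions:

#include <Python.h>

/* Sketch: carry the desired alignment in the opaque ctx slot of
   PyMemAllocatorEx (CPython 3.5+) and route the PyMem domain
   through an aligned allocator. */
typedef struct { size_t alignment; } align_ctx;

static void *ctx_malloc(void *ctx, size_t n)
{
    return aligned_malloc(n ? n : 1, ((align_ctx *)ctx)->alignment);
}

static void *ctx_calloc(void *ctx, size_t nelem, size_t elsize)
{
    return aligned_calloc(nelem, elsize, ((align_ctx *)ctx)->alignment);
}

static void *ctx_realloc(void *ctx, void *p, size_t n)
{
    return aligned_realloc(p, n ? n : 1, ((align_ctx *)ctx)->alignment);
}

static void ctx_free(void *ctx, void *p)
{
    (void)ctx;
    aligned_free(p);
}

static align_ctx numpy_ctx = { 64 };
static PyMemAllocatorEx aligned_allocator = {
    &numpy_ctx, ctx_malloc, ctx_calloc, ctx_realloc, ctx_free
};

/* Installed with: PyMem_SetAllocator(PYMEM_DOMAIN_MEM, &aligned_allocator); */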

I do think engaging with the Python devs on this before 3.5 is a good idea, but I still am not convinced we have a good reason to use an aligned allocator in the near term. It cannot possibly be the case that struct { double, double } actually requires better-than-malloc alignment on win32 or SPARC, because if that were true then nothing would work.

The question with regard to f2py was what alignment Fortran would need, not the minimum requirement of C. Speed is also an issue: both indexing and SIMD work better if the data is properly aligned.

A reason for using an aligned allocator could indeed be speed, and ensuring SSE/AVX compatibility would remove the numerical jitter that comes from taking different code paths for differently aligned data.
f2py is older than the ISO C binding standard in Fortran, and the way it works is essentially the de facto standard way of interfacing Fortran with C, used extensively by everyone. In light of this experience, it's clear that the alignment provided by the system malloc is sufficient for the Fortran compilers that matter in practice for us.

@pitrou And 64-byte alignment is recommended for Xeon Phi. Take a look at the comment behind the definition of NPY_MEMALIGN in my code example.

The main complication in providing aligned allocation is that ATM we can either hook into the tracemalloc infrastructure xor do aligned allocation, and fixing this will require some coordination with CPython upstream (see #4663).


So the CPython issue is at http://bugs.python.org/issue18835.

Given the complications with realloc(), it might not be realistic to expect CPython to solve this in the 3.5 timeframe. NumPy should perhaps use its own aligned allocator wrapper instead (which should be able to defer to the PyMem API, and take advantage of tracemalloc, anyway).

Code for such an allocator is included above. I don't understand @juliantaylor's argument, but he probably understands this better than me.

I can understand what he meant about calloc, though. A calloc is not simply a malloc plus a memset to zero: the memset forces the OS to commit every page up front, whereas a real calloc can hand out lazily zeroed pages. AFAIK there is no PyMem_Calloc.

Actually CPython 3.5 has PyMem_Calloc and friends.
I think @juliantaylor was considering the implementation case of using OS functions (posix_memalign, etc.). But that doesn't sound necessary.
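Given PyMem_Calloc, an aligned calloc looks possible with the same base-pointer trick as above. A sketch (assuming the NPY_MEMALIGN layout from the earlier code; PyArray_calloc_aligned is a made-up name): the only write is the stored base pointer, so a large zeroed block can stay lazily mapped apart from that first page:

static NPY_INLINE
void *PyArray_calloc_aligned(size_t n, size_t s)
{
    /* Same header layout as PyArray_realloc above: alignment slack
       plus one void* slot to remember the base pointer. */
    size_t offs = NPY_MEMALIGN - 1 + sizeof(void *);
    void *base, **p;
    if (s != 0 && n > ((size_t)-1 - offs) / s)
        return NULL;  /* n * s + offs would overflow */
    base = PyMem_Calloc(1, n * s + offs);  /* CPython 3.5+ */
    if (base == NULL)
        return NULL;
    p = (void **)(((Py_uintptr_t)base + offs) & ~(Py_uintptr_t)(NPY_MEMALIGN - 1));
    p[-1] = base;  /* the only touched memory outside the payload */
    return (void *)p;
}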

By the way @sturlamolden, your snippet redefines PyArray_malloc and friends, but array allocation seems to use PyDataMem_NEW. Am I misunderstanding something?

Another thought is that aligned allocation may be wasteful for small arrays. Perhaps there should be a threshold below which standard allocation is used?
Also, should the alignment be configurable?
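As a sketch of the threshold idea (not NumPy code; NPY_SMALL_ALLOC is a made-up cutoff): the base pointer is stored unconditionally, so a single free routine works for both paths, at the cost of keeping the small header overhead even for small blocks:

#define NPY_SMALL_ALLOC 512  /* hypothetical cutoff, would need tuning */

static NPY_INLINE
void *npy_alloc_maybe_aligned(size_t n)
{
    /* Small blocks get only pointer-sized alignment; large ones get
       the full NPY_MEMALIGN. The stored base pointer keeps the free
       path identical for both. */
    size_t align = (n >= NPY_SMALL_ALLOC) ? NPY_MEMALIGN : sizeof(void *);
    size_t offs = align - 1 + sizeof(void *);
    void *p1 = PyMem_Malloc(n + offs);
    void **p2;
    if (p1 == NULL)
        return NULL;
    p2 = (void **)(((Py_uintptr_t)p1 + offs) & ~(Py_uintptr_t)(align - 1));
    p2[-1] = p1;
    return (void *)p2;
}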

The allocators are called PyArray_malloc and PyArray_free in NumPy 1.9. A lot has changed in NumPy 1.10.

Are you sure? PyArray_NewFromDescr_int() calls npy_alloc_cache() and npy_alloc_cache() calls PyDataMem_NEW().

Numpy has multiple allocation interfaces, and they don't have very obvious names. PyArray_malloc/free are used for "regular" allocations (e.g. object structs). Data buffers (ndarray->data pointers, temporary buffers inside ufuncs, etc.), however, are allocated via PyDataMem_NEW.


@njsmith Yeah, we should rationalize the allocation macros some day... I'd start with the one used to allocate dimensions for ndarray (IIRC).

I created PR #5457 with a patch. Feedback on the approach would be nice.

As far as I know there is currently no benefit to using an aligned allocator in numpy?


With Numba we determined that AVX vector instructions required a 32-byte alignment for optimal performance. If you compile Numpy with AVX enabled (requires specific compiler options, I guess), alignment should make a difference too.

Out of curiosity, do you have any real-world measurements? I ask b/c there are so many factors that play into these things (different overhead/speed trade-offs at different array sizes, details of memory allocators -- which also act differently at different array sizes -- etc.) that I find it hard to guess whether one ends up with like a 0.5% end-to-end speedup or a 50% end-to-end speedup or what.


FWIW, on my i5-4210U I see no significant difference between 16- and 32-byte aligned data in a simple load-add-store test; the minimum cycle count seems lower by 5%, but the median and 10th percentile are identical to within 1%.

Is that with AVX?

@seibert made some AVX measurements with Numba (i.e. just-in-time code generation with LLVM) using Numpy arrays, I think he'll try to run them again to get precise numbers :-)

Here is the benchmark with the latest Numba master:

http://nbviewer.ipython.org/gist/seibert/6957baddc067140e55fe

For float32 arrays (size=10000) executing an a + b * fabs(a) operation, we see a 40% difference running on an Intel Core i7-4820K CPU @ 3.70GHz (Ivy Bridge).

(We are a little puzzled that the LLVM 3.5 autovectorizer is not generating any loop peeling code to correct for alignment problems. I would not be surprised if gcc or clang correct for this.)

For other architectures we are planning to target (like AMD APUs that support the HSA standard), the alignment needs are more strict, with the OpenCL developers suggesting that we have 256 byte alignment for arrays for best performance.

I don't know, 40% seems like an excessive penalty; I see zero on an i5 Haswell (which, granted, has really poor AVX performance).
Could it be that the JIT compiler is creating two different versions of the loop?
Do you have an assembly-level profile of this (e.g. via perf record)?

Also, maybe turbo boost is kicking in on the second loop; disabling it or monitoring the appropriate PMU might be interesting.

Since we use this system for benchmarking, we have disabled TurboBoost in the BIOS.

The JIT is only running once (once Numba compiles a function for a given set of input types, it caches it), and the JIT is triggered prior to benchmarking by Cell 6 in the linked notebook.

I haven't used perf record before, but I'll look into it. How did you perform your benchmarks? (I would actually expect Haswell to do better than Ivy Bridge if mis-alignment is triggering some kind of inefficient use of the available L2 cache bandwidth.)

This whitepaper from Intel gives a little more information on the alignment issue at the bottom of page 6:

https://software.intel.com/sites/default/files/m/d/4/1/d/8/Practical_Optimization_with_AVX.pdf

The use of Intel AVX instructions on unaligned 32-byte vectors means that every second load will be across a cacheline split, since the cache line is 64 bytes. This doubles the cache line split rate compared to Intel SSE code that uses 16-byte vectors. A high cache-line split rate in memory-intensive code is extremely likely to cause performance degradation. For that reason, it is highly recommended to align the data to 32 bytes for use with Intel AVX.

Haswell has twice the L2 cache bandwidth of Sandy/Ivy Bridge, so it is possible the effect of misaligned arrays is not significant on Haswell...

my simple benchmark is this:
https://gist.github.com/juliantaylor/68e578d140f427ed80bb
would be interesting to see on that i7

Results on a Core i5-2500K (Sandy Bridge):
4644 6656 7704 10100

Please note I had to add a longer warmup phase at the start of the benchmark:

    for (i=0; i<10; i++)
        add(a, 1);

Is that an average/median/...? That looks quite significant; it seems Intel worked on it a lot for Haswell.

It's the near-stable output from the benchmark after some hundreds of runs in a loop.

I added a packed SSE2 implementation, here are the figures (i5-2500K):
4660 6492 7468 5108 10096
(32B aligned AVX, 16B (un)aligned AVX, 8B (un)aligned AVX, aligned SSE2, scalar)

Here is the updated source: https://gist.github.com/pitrou/892219a7d4c6d37de201

Near-stable output from a Core i5-4200U (laptop Haswell CPU):
4120 4152 4148 4260 7308

At least here, misaligned AVX isn't worse than SSE2 with similar alignment.

Here is from my MBP (quad-core i7-3635QM, 2.4 GHz, Ivy Bridge):
5060 4932 5820 5704 5040

I had to change avxintrin.h to immintrin.h and compile with Intel icc because gcc 4.8.1 refused to compile the code (and so did clang).

@sturlamolden, might it be possible that icc is vectorizing the scalar loop (the last result in the numbers)?

(note that the benchmark would probably put more pressure on the cache subsystem if the addition involved two separate input arrays)

I have no idea.

@sturlamolden This code was very useful for interfacing with some SSE2 code.

I had one issue: on Windows, all versions of VC (including 2015) don't like this line

memmove((void*)p2,p1+old_offs,n);

since they don't support pointer arithmetic on void*. As a short-term fix I cast it to char* to do the math. This probably isn't right - do you have a better idea of what would make it compile correctly on Windows?

My bad. Pointer arithmetic on void* is illegal C. Casting to char* is correct.
Updated the code example.

How do I get 64-byte aligned numpy arrays? The 'ALIGNED' addresses from https://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.require.html seem to align at some different length. Is there some user-configurable parameter for alignment length?

I made my own (Python) aligned allocator that works on (unaligned) memory provided by Numpy.

import numpy as np

def empty_aligned(n, align):
    """
    Get n bytes of memory with alignment align.
    """
    # Over-allocate by align - 1 bytes, then slice off the offset
    # needed to reach the next aligned address.
    a = np.empty(n + (align - 1), dtype=np.uint8)
    data_align = a.ctypes.data % align
    offset = 0 if data_align == 0 else (align - data_align)
    return a[offset : offset + n]


def test(n):
    # Check alignment and size for every alignment up to 1024.
    for i in range(1, 1024):
        b = empty_aligned(n, i)
        try:
            assert b.ctypes.data % i == 0
            assert b.size == n
        except AssertionError:
            print(i, b.ctypes.data % i, b.size)

Perhaps a Python workaround like this is a viable solution?

@eamartin This is about NumPy's internal C code and the interface code to Fortran generated by f2py (also C code). For obvious reasons, the implementation of NumPy cannot depend on NumPy.

You can use that trick for your Python projects, though.

@sturlamolden: It might help @nachiket and others in a similar situation though.

@sturlamolden : I didn't read closely enough to realize this was about C internals.

However, offering better alignment options in the Python interfaces would be valuable for developing Python interfaces between Numpy and native libraries that have alignment requirements on arguments.

I am not against offering better alignment options in the Python interfaces 🙂

The "Python aligned allocator" solution I suggested is a hack. I think offering alignment in the Python interfaces would be nice, but the right way to do that would be to handle alignment at the C level.

This feature would be very helpful to me. I am using an FPGA device (Altera A10GX) where the DMA controller requires 64-byte aligned data to be used, this speeds up my code by 40x(!!!). I suspect that @nachiket has the same problem as me. I wrote something similar to what @eamartin is using but this is a bit of a hack.

I definitely encourage 64-byte alignment:

  1. it is the cache line size
  2. it is suitable for any SIMD alignment up to AVX-512

Here we are almost 5 years later.

Any thoughts on making this (64-byte alignment in particular) a standard feature?..

This Cython code is now in NumPy. Of course, this doesn't change the default.

My 2 cents: an aligned allocator would help when interfacing with hardware devices and kernel-level calls. These interfaces might benefit from aligning the buffers to pages.

In merging randomgen, we gained PyArray_realloc_aligned and friends. Should we move these routines into numpy/core/include ?

That would certainly be useful, @mattip. Would it be possible to access this functionality from Python as well?
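For reference, this is roughly how those helpers might be consumed if promoted to a public header. The include path here is hypothetical, and the function names are as they appear in numpy/random/src/aligned_malloc/aligned_malloc.h; treat this as a sketch, not the actual API:

/* Hypothetical include path; today the header lives under
   numpy/random/src/aligned_malloc/. */
#include "aligned_malloc.h"

static int demo(void)
{
    /* Same calling conventions as malloc/realloc/free, but with the
       helpers' alignment guarantee on every returned pointer. */
    double *buf = (double *)PyArray_malloc_aligned(512 * sizeof(double));
    if (buf == NULL)
        return -1;
    buf = (double *)PyArray_realloc_aligned(buf, 4096 * sizeof(double));
    if (buf == NULL)
        return -1;  /* note: the original block leaks here in this sketch */
    PyArray_free_aligned(buf);
    return 0;
}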

In merging randomgen, we gained PyArray_realloc_aligned and friends. Should we move these routines into numpy/core/include ?

Ripped off my code, did ya? 😂

Probably got lost at some point in the move from randomgen to numpy. I think I had a record that this was yours.

On Sun, Nov 17, 2019, 10:29 Sturla Molden wrote:

It looks like I am no longer a contributor for the code I have written for NumPy 🧐:

https://github.com/numpy/numpy/blob/v1.17.2/numpy/_build_utils/src/apple_sgemv_fix.c

https://github.com/numpy/numpy/blob/v1.17.2/numpy/random/src/aligned_malloc/aligned_malloc.h



what was the original source of the code?

The link is dead, but it was adapted from an aligned malloc that looked like this:

https://tianrunhe.wordpress.com/2012/04/23/aligned-malloc-in-c/

It was a GitHub post from Sturla. There was no original code file.


This is for the aligned malloc.


Does everyone who contributes 50 lines of code to Numpy get a dedicated copyright header? I might go through my contributions and see if any apply :-)

It looks like I am no longer a contributor for the code I have written for NumPy 🧐:

You are and will always be :)

Does everyone who contributes 50 lines of code to Numpy get a dedicated copyright header? I might go through my contributions and see if any apply :-)

Nope. We try to avoid encoding such things inside the source code, since that will always be wildly incomplete and hard to maintain. We do ask people to list themselves in THANKS.txt; I'm looking at a better alternative to that, because that file often gives merge conflicts.
