Numpy: apply_along_axis cuts strings

Created on 7 Dec 2016  ·  16 comments  ·  Source: numpy/numpy

I'm trying to concatenate all elements of a row into a string as follows:

np.apply_along_axis(lambda x: " ".join(map(str, x)), 1, b)

b is

[[111,111,0,0,0], [111,111,111,111,111]]

However, the result of the line is:

['111 111 0 0 0', '111 111 111 1']

It looks like np.apply_along_axis is cutting the second string to be of the same length as the first one. If I put a longer sequence first, the result is correct:

['111 111 111 111 111', '111 111 0 0 0']

So I'm guessing this is a bug?


Summary 2019-04-30 by @seberg

np.apply_along_axis infers the output dtype from its first call to the function. This can be worked around, for example, by having the function return an array of the correct dtype.
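To make that concrete, here is a rough sketch (simplifying what apply_along_axis actually does internally) of why the second string gets clipped:

import numpy as np

b = np.array([[111, 111, 0, 0, 0], [111, 111, 111, 111, 111]])
func1d = lambda x: " ".join(map(str, x))

# Sketch of the allocation strategy: the output buffer takes its dtype
# from whatever the first call returns.
first = np.asarray(func1d(b[0]))            # '111 111 0 0 0' -> dtype '<U13'
out = np.empty(b.shape[0], dtype=first.dtype)
for i, row in enumerate(b):
    out[i] = func1d(row)                    # the 19-char string is clipped to 13 chars
print(out)                                  # ['111 111 0 0 0' '111 111 111 1']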

Actions:

  • np.apply_along_axis could/should get a dtype kwarg (or similar, compare also np.vectorize).
Labels: 01 - Enhancement · numpy.lib · Intermediate

Most helpful comment

Perhaps we should just add a dtype= argument to apply_along_axis

All 16 comments

(found this via #8363) @lukovnikov - the code fails because it was written with numerical arrays in mind, for which the computation on any part of an array can be expected to return the same type of output as any other part. I should note more generally that numpy arrays are not particularly good or efficient at handling strings, and unless you have a very complicated array, my guess is that you would be much better off just working with lists and python functions, especially since you are already using Python string functions to do the concatenation.
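For what it's worth, the plain-Python route suggested here might look like this for the original example (a sketch; no claim about relative speed):

import numpy as np

b = [[111, 111, 0, 0, 0], [111, 111, 111, 111, 111]]

# Plain list comprehension: no fixed-width string dtype, so nothing gets clipped.
joined = [" ".join(map(str, row)) for row in b]
print(joined)   # ['111 111 0 0 0', '111 111 111 111 111']

# If an ndarray is still wanted afterwards, object dtype preserves the full strings.
joined_arr = np.array(joined, dtype=object)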

@mhvk : I disagree completely with your statement. numpy might be written for numerical computations, but that doesn't mean we have to omit functionality for str arrays. After all, that is why we have specific dtypes for strings, unlike libraries like pandas.

This is a bug, and it should be patched unless you can come up with a more convincing argument than what you have provided.

@gfyoung - I'm not saying one should not try to solve the bug (though I think it is obvious that any solution had better not cause a huge performance regression for more typical cases), just explaining why the bug exists and suggesting that for strings one really is better off not using ndarray. Anyway, those are my 2¢.

@mhvk : Fair enough, though your response came across as if this wasn't really a concern of numpy. That is why I wanted to come down strongly to emphasize that this is something we should be trying to patch.

I had the same issue:

You can see it crops the 'g' of 'jpg' by simply referencing it. I figured it has something to do with the shape change. I ended up using lists + map instead.

[screenshot of the example]

I'm assuming that that's a deliberately contrived example, because you shouldn't be using apply_along_axis for simple indexing like that.

np.ma.apply_along_axis will work correctly here (for now - see #8511). Another option (which crashed prior to 1.13) is a manual cast to dtype object:

np.apply_along_axis(lambda x: np.array(x[0], object), 1, fnames)
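Applied to the example from the top of the thread, that object-dtype cast would look roughly like this (assuming NumPy >= 1.13, where returning a 0-d object array works):

np.apply_along_axis(lambda x: np.array(" ".join(map(str, x)), dtype=object), 1, b)
# -> array(['111 111 0 0 0', '111 111 111 111 111'], dtype=object)

Because the function now returns a 0-d object array, the inferred output dtype is object, so no fixed-width truncation can occur.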

@eric-wieser your comment above helped me solve a problem I've been having for a few months with numpy and string operations. 👍

Thanks

Perhaps we should just add a dtype= argument to apply_along_axis

@eric-wieser

I agree you should add that argument, if there are performance differences between the masked array version and the regular version. I just did a small test (not sure if it means anything) and here is a pic of the results:

[screenshot of timing results]

My use case: I'm storing parsed NLP data (strings) in numpy arrays and trying to get rid of all for loops and if-else clauses; I apply some text analytics functions using apply_along_axis. Any speed benefit would be awesome, as each information extraction could have 10-20 variations (from a document that could have tens to hundreds of information extractions, which come from a corpus of thousands of documents...per day).

EDIT

For anyone else experiencing this issue with numpy string operations:
I just explicitly set the dtype in the normal apply_along_axis instead of using the masked array approach, which is slower than normal apply_along_axis.

Detail on my string operations so you can see if it applies to you: My numpy array (named nump in the code below) has a shape of (26,1), and each element in the array is an information extraction from a sentence in a document. Each information extraction is a list of key/value pairs, and I am extracting the key/value pairs that represent NLP triples (subject, relation, object). The lambda function is passed over the array to combine each triple into a single sentence, which I will then test for Flesch-Kincaid reading ease since these are computer-generated sentences. The sentences were being truncated or set to some default length based on the dtype value. The original code that was truncating the strings was:

%timeit -n 500 np.apply_along_axis(lambda x: ("{} {} {}".format(x[0]['subject'],x[0]['relation'],x[0]['object'])),1,np.array(nump[0][0])[:,np.newaxis])

My speed was 116 µs ± 5.42 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)

Solution

Using @eric-wieser's comment above, I used the following to fix my problem and maintain speeds near the original:

  • Explicitly passing in the dtype argument
%timeit -n 500 np.apply_along_axis(lambda x: np.array("{} {} {}".format(x[0]['subject'],x[0]['relation'],x[0]['object']),dtype='S255'),1,np.array(nump[0][0])[:,np.newaxis])

My speed was 152 µs ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)

The masked array approach works without changing your code, but is slower.

  • Masked array with no dtype argument
%timeit -n 500 np.ma.apply_along_axis(lambda x: ("{} {} {}".format(x[0]['subject'],x[0]['relation'],x[0]['object'])),1,np.array(nump[0][0])[:,np.newaxis])

My speed was 925 µs ± 33.6 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)

and trying to get rid of all for loops

Note that apply_along_axis is just a python for loop, with a little help in allocating the output array for you - it's not unlikely that it's slower than the loop it replaces

Ah shucks; I was using numpy because I always thought it led to faster speeds. Sheesh.

fml

To avoid the cut when joining strings with np.apply_along_axis:

a = np.array(['sssssssss','ffffffffffffff'])
b = np.array(['cccccccccccc','iiiiiiiiiiiiiiiii'])

def join_txt(text): return np.asarray(" ".join(text),dtype=object)

np.apply_along_axis(join_txt,0,[a,b])

the result is

array(['sssssssss cccccccccccc', 'ffffffffffffff iiiiiiiiiiiiiiiii'], dtype=object)

@cerlymarco, how does the approach you propose compare to @linwoodc3's solution in terms of speed?

Perhaps we should just add a dtype= argument to apply_along_axis

Unfortunately this isn't possible without breaking someone. The current signature is:

def apply_along_axis(func1d, axis, arr, *args, **kwargs):

Today, users can call it as both:

def f1(x):
    return x
np.apply_along_axis(f1, 0, my_arr)

def f2(x, *, dtype):
    return x.astype(dtype)
np.apply_along_axis(f2, 0, my_arr, dtype=int)

If we make apply_along_axis take a dtype argument and not pass it on to f, then f2 will fail. If we make it take a dtype argument and pass it on to f, then f1 will fail.

What we could do is:

  • Emit a FutureWarning if 'dtype' in kwargs telling people to rename their arguments
  • Wait 2 years
  • Break any users still using something like f2 above
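In the meantime, a user-side wrapper can emulate the proposed dtype argument without touching numpy's signature. This is just a sketch of the workaround already described above; the helper name is made up and not part of numpy:

import numpy as np

def apply_along_axis_with_dtype(func1d, axis, arr, dtype, *args, **kwargs):
    # Hypothetical helper: wrap func1d so it always returns an array of the
    # requested dtype; apply_along_axis then infers that dtype instead of
    # guessing from the first result.
    def wrapped(x):
        return np.asarray(func1d(x, *args, **kwargs), dtype=dtype)
    return np.apply_along_axis(wrapped, axis, arr)

b = np.array([[111, 111, 0, 0, 0], [111, 111, 111, 111, 111]])
apply_along_axis_with_dtype(lambda x: " ".join(map(str, x)), 1, b, dtype=object)
# -> array(['111 111 0 0 0', '111 111 111 111 111'], dtype=object)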

Just a comment that I was trying to add a prefix and suffix to a filename using Numpy on a Pandas Series and ran into the same problem (I think); i.e., that the second string was cut to the same length as the first.

filenames = Series(['S1/C03/C03_R1/S1_C03_R1_PICT0239.JPG','S1/C03/C03_R1/S1_C03_R1_PICT0239.JPG'])
prefix = 'somepath'
np.char.add(prefix, filenames.astype(str))

array(['somepathS1/C03/C', 'somepathS1/C03/C'], dtype='<U16')
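One way to sidestep that truncation (a sketch, assuming the goal is simple prefixing) is to do the concatenation in Python or pandas and keep object dtype, so there is no fixed-width string type to overflow:

import numpy as np
from pandas import Series

filenames = Series(['S1/C03/C03_R1/S1_C03_R1_PICT0239.JPG',
                    'S1/C03/C03_R1/S1_C03_R1_PICT0239.JPG'])
prefix = 'somepath'

# Object dtype keeps the full strings; nothing is clipped.
np.array([prefix + f for f in filenames], dtype=object)

# or let pandas do the element-wise concatenation
# (Series.to_numpy needs pandas >= 0.24; .values works on older versions):
(prefix + filenames).to_numpy()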
