https://numpy.org/doc/1.18/reference/generated/numpy.vectorize.html
This tutorial mentions that vectorize's implementation is essentially a for loop. But as far as I know, a vectorized function will use SIMD, so is it accurate to say numpy.vectorize's implementation is essentially a for loop? And if so, is it faster than an unvectorized function only because its loop is implemented in C?
Many thanks in advance.
Yes. In the context of interpreted numerical array programming languages like Python (with numpy) and MATLAB™, we often use "vectorization" to refer to replacing explicit loops in the interpreted programming language with a function (or operator) that takes care of all of the looping logic internally. In numpy, the ufuncs implement this logic. This is unrelated to the usage of "vectorization" to refer to using SIMD CPU instructions that compute over multiple inputs concurrently, except that they both use a similar metaphor: they are like their "scalar" counterparts, but perform the computation over multiple input values with a single invocation.
With numpy.vectorize(), there is usually not a whole lot of speed benefit over the explicit Python for loop. The main point of it is to turn the Python function into a ufunc, which implements all of the broadcasting semantics and thus deals with any size of inputs. The Python function that's being "vectorized" still takes up most of the time, as does converting the raw value of each element to a Python object to pass to the function. You wouldn't expect np.vectorize(lambda x, y: x + y) to be as fast as the ufunc np.add, which is C both in the loop and in the contents of the loop.
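A small sketch of that difference (variable names here are illustrative, not from the thread): np.vectorize gives you ufunc-style broadcasting, but the addition itself still runs as a Python call per element, whereas np.add does both the loop and the addition in C.

```python
import numpy as np

# np.vectorize wraps a Python function so it behaves like a ufunc:
# inputs are broadcast against each other, but the wrapped function
# is still invoked once per output element at the Python level.
add_py = np.vectorize(lambda x, y: x + y)

a = np.arange(3)
print(add_py(a, 10))   # the scalar 10 is broadcast: [10 11 12]
print(np.add(a, 10))   # the C ufunc broadcasts the same way: [10 11 12]
```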
Thank you for your detailed explanation. But to be clear, let me take an example.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': range(1000000), 'b': range(1, 1000001)})
# method1
df.loc[:, 'c'] = df.apply(lambda x: x['a'] + x['b'], axis=1)
# method2
df.loc[:, 'c'] = np.vectorize(lambda x, y: x + y)(df['a'], df['b'])
# method3
df.loc[:, 'c'] = np.add(df['a'], df['b'])
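For what it's worth, all three methods compute identical values; they differ only in where the loop and the loop body run. A quick sanity check (with a smaller frame, just for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(1000), 'b': range(1, 1001)})

c1 = df.apply(lambda x: x['a'] + x['b'], axis=1)         # method1: Python loop
c2 = np.vectorize(lambda x, y: x + y)(df['a'], df['b'])  # method2: C loop, Python body
c3 = np.add(df['a'], df['b'])                            # method3: C loop, C body

# all three agree element-wise
assert (c1.to_numpy() == c2).all()
assert (c2 == c3.to_numpy()).all()
```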
So with your explanation, I guess:
method | loop in C | loop content in C | use SIMD
-- | -- | -- | --
1 | × | × | ×
2 | √ | × | ×
3 | √ | √ | √
Right?
np.add is faster than np.vectorize(lambda x, y: x + y) because it avoids converting C doubles into Python objects and the Python function call overhead. It's possible that it also uses SIMD instructions, depending on whether or not you have the AVX2 extensions, but that's not why it's faster.
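One way to see that per-element cost directly is to count how often the wrapped Python function actually runs (a toy sketch; passing otypes keeps np.vectorize from making an extra probing call to infer the output dtype):

```python
import numpy as np

calls = 0

def py_add(x, y):
    global calls
    calls += 1        # count Python-level invocations
    return x + y

# otypes pins the output dtype, so np.vectorize doesn't make an
# extra trial call just to determine it
v_add = np.vectorize(py_add, otypes=[int])

out = v_add(np.arange(5), np.arange(5))
print(calls)  # 5 — one Python call (with object conversion) per element
print(out)    # [0 2 4 6 8]
```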
I got it. Thanks.
You can use numba's vectorize to produce ufuncs that operate in parallel without Python overheads: https://numba.pydata.org/numba-doc/latest/user/vectorize.html