Pandas: Parallelization for embarrassingly parallel tasks

Created on 19 Dec 2013  ·  64 Comments  ·  Source: pandas-dev/pandas

I would like to promote the idea of using multiprocessing.Pool() to execute embarrassingly parallel tasks, e.g. applying a function to a large number of columns.
Obviously there is some setup overhead, so there will be a minimum number of columns above which this can beat the already fast Cython approach. I will add my performance tests to this issue later.
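
To make the idea concrete, here is a minimal sketch of column-wise fan-out with multiprocessing.Pool (apply_columns_parallel and its helper are hypothetical names for illustration, not an existing pandas API):

import numpy as np
import pandas as pd
from multiprocessing import Pool

def _apply_to_column(args):
    # top-level helper so it can be pickled by multiprocessing
    name, values, func = args
    return name, func(values)

def apply_columns_parallel(df, func, processes=4):
    # hypothetical helper: ship each column to a worker process and reassemble
    tasks = [(name, df[name].values, func) for name in df.columns]
    with Pool(processes=processes) as pool:
        results = pool.map(_apply_to_column, tasks)
    return pd.DataFrame(dict(results), index=df.index)

if __name__ == '__main__':
    # the __main__ guard matters on Windows, where workers are spawned, not forked
    df = pd.DataFrame(np.random.randn(100000, 50))
    out = apply_columns_parallel(df, np.sqrt, processes=4)
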
I was emailing @jreback about this and he added the following remarks/complications:

  • transferring numpy data is not that efficient - but still prob good enough
  • you can't transfer (pickle) lambda-type functions, so we may need to look at a library called dill (which solves this problem) - possibly msgpack could be slightly modified to do this instead (and it is already pretty efficient at transferring other types of objects); see the short pickling sketch after this list
  • could also investigate joblib - I think statsmodels uses it [ed: That seems to be correct, I read in their group about joblib]
  • I would create a new top level dir core/parallel for this type of stuff
  • the strategy in this link could be a good way to follow: http://stackoverflow.com/questions/17785275/share-large-read-only-numpy-array-between-multiprocessing-processes
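
As a quick illustration of the pickling point above, the stdlib pickler rejects lambdas while the third-party dill package serializes them by value (a sketch only; nothing here is wired into pandas or msgpack):

import pickle
import dill  # pip install dill

square = lambda x: x ** 2

try:
    pickle.dumps(square)       # the stdlib pickler refuses lambdas...
except (pickle.PicklingError, AttributeError) as exc:
    print("pickle failed:", exc)

payload = dill.dumps(square)   # ...while dill serializes them by value
restored = dill.loads(payload)
print(restored(4))             # -> 16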

links:
http://stackoverflow.com/questions/13065172/multiprocess-python-numpy-code-for-processing-data-faster
http://docs.cython.org/src/userguide/parallelism.html
http://distarray.readthedocs.org/en/v0.5.0/

Labels: Docs, Groupby, IO HDF5, Performance


All 64 comments

Maybe this would be something better done in a sandbox/parallel folder
until we settle on what's feasible to do? It would also be good to have
someone test perf on Windows as well; I've heard that there are very
different performance characteristics for Windows vs. Linux/OSX.

For Windows vs. Linux/OSX threading/multiprocessing, that is.

I think this could start with an optional keyword / option to enable parallel computation where useful. This solves the Windows/Linux issue because the default for parallel execution can differ (or have different thresholds, like we do for numexpr).

They could all be grouped under a par attribute like the str methods. What else is in mind here besides apply? Would sum, prod, etc. possibly benefit?

@michaelaye can you put up a simple example that would be nice for benchmarking?

Here's an implementation using joblib:

https://github.com/jreback/pandas/tree/parallel

and some timings. I had to use a pretty contrived function, actually:
you need to weigh the pickle time for the sub-frames against the function time.
The pickle time listed below is to disk, which is not the case when sending to sub-processes
(but it is still the limiting factor, I think). FYI, if the frame is already on disk (e.g. HDF), then this could have quite a substantial benefit.

In [1]: df1  = DataFrame(np.random.randn(20000, 1000))

In [2]: def f1(x):
   ...:         result = [ np.sqrt(x) for i in range(10) ]
   ...:         return result[-1]
   ...: 
In [8]: %timeit df1.to_pickle('test.p')
1 loops, best of 3: 1.77 s per loop
# reg apply
In [3]: %timeit df1.apply(f1)
1 loops, best of 3: 6.28 s per loop

# using 12 cores (6 real x 2 hyperthread)
In [4]: %timeit df1.apply(f1,engine='joblib')
1 loops, best of 3: 2.06 s per loop

# 1 core pass thru
In [5]: %timeit df1.apply(f1,engine=pd.create_parallel_engine(name='joblib',force=True,max_cpu=1))
1 loops, best of 3: 6.28 s per loop

In [6]: %timeit df1.apply(f1,engine=pd.create_parallel_engine(name='joblib',force=True,max_cpu=4))
1 loops, best of 3: 2.68 s per loop

In [7]: %timeit df1.apply(f1,engine=pd.create_parallel_engine(name='joblib',force=True,max_cpu=2))
1 loops, best of 3: 3.87 s per loop

pickle time outweighs the perf gains, function is too quick so no benefit here

In [8]: def f2(x):
   ...:     return np.sqrt(x)
   ...: 

In [6]: %timeit df1.apply(f2)
1 loops, best of 3: 981 ms per loop

In [7]: %timeit df1.apply(f2,engine='joblib')
1 loops, best of 3: 1.8 s per loop

So you need a sufficiently slow function on a single column to make this worthwhile
(it would be pretty easy to time a sample column and decide whether to go parallel or not);
right now it is just used on demand (user/option specified).

In [9]: %timeit f1(df1.icol(0))
100 loops, best of 3: 5.89 ms per loop

In [10]: %timeit f2(df1.icol(0))
1000 loops, best of 3: 639 µs per loop
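
For readers without the branch, a rough equivalent of the column-split idea using plain joblib might look like the following sketch (split_apply_joblib is a made-up helper, not the engine= API shown above). The same trade-off applies: the blocks still have to be pickled to the workers, so the function has to be expensive enough to pay for that.

import numpy as np
import pandas as pd
from joblib import Parallel, delayed

def f1(x):
    return [np.sqrt(x) for i in range(10)][-1]

def _apply_block(block, func):
    # each worker gets a block of columns and applies func column-by-column
    return block.apply(func)

def split_apply_joblib(df, func, n_jobs=4):
    # hypothetical helper: split by columns, apply in parallel, glue back together
    col_groups = np.array_split(np.arange(df.shape[1]), n_jobs)
    parts = Parallel(n_jobs=n_jobs)(
        delayed(_apply_block)(df.iloc[:, idx], func) for idx in col_groups)
    return pd.concat(parts, axis=1)

df1 = pd.DataFrame(np.random.randn(20000, 1000))
out = split_apply_joblib(df1, f1, n_jobs=4)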

Good write up Jeff. I think pickle time is a big factor but also the time to spawn a new process.

I would envision this working with a set of compute processes that are pre-launched on startup and wait to do work. For my future parallel work I will probably use IPython.parallel across distributed HDF5 data. The problem that I often run into is slowly growing memory consumption for Python processes that live too long.

On disk parallel access of HDF5 row chunks to speed up computation sounds great.

@dragoljub

yep...I don't think adding an IPython.parallel back end would be all that difficult (or other distributed types of backends). Just inherit and plug in.

HDF5 and groupby apply look like especially nice cases for enhancement with this.

Pls play around and give me feedback on the API (and even take a stab at a backend!)

http://docs.cython.org/src/userguide/parallelism.html

A Cython engine is also possible (though it needs a slight change in the setup to compile with OpenMP support),
and seems straightforward.

I've loaded a DataFrame with roughly 170M rows into memory (Python used ~35 GB RAM) and timed the same operation with 3 methods, running it overnight. The machine has 32 physical (64 hyperthreaded) cores and enough free RAM. While date conversion is a very cheap operation, it shows the overhead of these methods.

While the single-threaded way is the fastest, it's quite boring to see a single core continuously running at 100% while 63 are idling. Ideally I would want some kind of batching for parallel operations to reduce the overhead, e.g. always 100000 rows or something like batchsize=100000.

@interactive
def to_date(strdate) :
    return datetime.fromtimestamp(int(strdate)/1000)

%time res['milisecondsdtnormal']=res['miliseconds'].map(to_date)
#CPU times: user 14min 52s, sys: 2h 1min 30s, total: 2h 16min 22s
#Wall time: 2h 17min 5s

pool = Pool(processes=64)
%time res['milisecondsdtpool']=pool.map(to_date, res['miliseconds'])
#CPU times: user 21min 37s, sys: 2min 30s, total: 24min 8s
#Wall time: 5h 40min 50s

from IPython.parallel import Client
rc = Client() #local 64 engines
rc[:].execute("from datetime import datetime")
%time res['milisecondsipython'] = rc[:].map_sync(to_date, res['miliseconds'])
#CPU times: user 5h 27min 4s, sys: 1h 23min 50s, total: 6h 50min 54s
#Wall time: 10h 56min 18s

it's not at all clear what you are timing here; the way pool and ipython split this is exceedingly poor; they turn this type of task into a lot of scalar evaluations where the cost of transport is MUCH higher than the evaluation time.

the PR does exactly this type of batching

you need to have each processor execute a slice and work on it as a single task (one per processor), not distribute to the pool the way you are showing.
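
A sketch of that batching idea (chunk_map is a made-up helper, not the API from the PR): each worker receives one contiguous slice and loops over it locally, so only n_workers tasks cross the process boundary instead of one per row.

import numpy as np
import pandas as pd
from datetime import datetime
from multiprocessing import Pool

def _convert_chunk(values):
    # one task per worker: loop over a whole slice locally instead of
    # shipping one scalar at a time through the pool
    return [datetime.fromtimestamp(int(v) / 1000) for v in values]

def chunk_map(series, n_workers=8):
    chunks = np.array_split(series.values, n_workers)
    with Pool(processes=n_workers) as pool:
        parts = pool.map(_convert_chunk, chunks)
    return pd.Series(np.concatenate(parts), index=series.index)

# res['milisecondsdtpool'] = chunk_map(res['miliseconds'], n_workers=64)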

@michaelaye did you have a look at this branch? https://github.com/jreback/pandas/tree/parallel

Oh, this is exciting. I've been waiting for a parallel scatter/gather apply function using IPython.parallel. Please keep us up to date on any progress here.

Indeed! It would be a great feature. I have been using concurrent.futures and that makes things pretty easy; however, the cost of spooling up new processes still takes up a bunch of time. If we have IPython parallel kernels just waiting to do work with all the proper imports, passing data pointers to them and aggregating results would be fantastic.
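
For what it's worth, the spool-up cost can be amortized by creating the executor once and reusing it across calls; a minimal concurrent.futures sketch (parallel_sum and column_sums are made-up names for illustration; on Windows this would still need a __main__ guard):

import numpy as np
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

# create the pool once at startup; the worker processes stay alive between calls
executor = ProcessPoolExecutor(max_workers=8)

def column_sums(chunk):
    # must be a top-level function so it can be pickled for the workers
    return chunk.sum()

def parallel_sum(df, n_chunks=8):
    chunks = np.array_split(df, n_chunks)       # row-wise slices
    parts = executor.map(column_sums, chunks)   # reuses the same processes
    return sum(parts)                           # combine the partial column sums

df = pd.DataFrame(np.random.randn(100000, 20))
total = parallel_sum(df)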

@dragoljub you have hit the nail on the head. joblib is fine, but you often don't need to spawn processes that way; usually you have an engine hanging out there.

Do you have some code that I could hijack?

I don't think it would be very hard to add this using IPython.parallel; I just have never used it (as for what I do I often spawn relatively long-lived processes)

It may be overkill but I have a notebook on using IPython.parallel here. There are some quick examples.

https://github.com/jseabold/zorro

Their docs are also quite good

http://ipython.org/ipython-doc/dev/parallel/

thanks skipper....ok next thing....do you guys have some non-trivial examples for vbenching purposes? e.g. stuff that does actual work (and takes a non-trivial amount of time) that we can use for benchmarking? (needs to be relatively self-contained....though obviously could use say statsmodels :)

I happen to be running one such problem right now. :) I'm skipping apply in favor of joblib.Parallel map. Let me see if I can make it self contained.

Hmm, maybe it is too trivial. My actual use case takes much longer (20 obs ~ 1 s) and the data is quite big. Find the first occurrence of a word in some text. You can scale up n, make the "titles" longer, include unicode, etc. and it quickly becomes time consuming.

import numpy as np
import pandas as pd

n = 100

random_strings = pd.util.testing.makeStringIndex().tolist()
X = pd.DataFrame({'title': random_strings * n,
                  'year': np.random.randint(1400, 1800, size=10*n)})

def min_date(x):  # can't be a lambda for joblib/pickling
    # watch out for the global X
    return X.ix[X.title.str.contains('\\b{}\\b'.format(x))].year.min()

X.title.apply(min_date)

There are maybe some better examples in ipython/examples/parallel. There are also a couple in my notebook. E.g., parallel optimization, but I'm not sure it's a real use case of the scatter-gather apply I'm thinking of. Something like

def crazy_optimization_func(...):
    ....

df = pd.DataFrame(random_start_values)
df.apply(crazy_optimization_func, ...)

Where the DataFrame contains rows of random starting values and you iterate over the zero axis to do poor man's global optimization.
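
Concretely, that pattern might look like the following sketch, assuming scipy is available (rosen is just a stand-in objective):

import numpy as np
import pandas as pd
from scipy.optimize import minimize, rosen

# rows of random starting points; each row seeds one local optimization
starts = pd.DataFrame(np.random.uniform(-2, 2, size=(64, 5)))

# apply the optimizer row-by-row and keep the objective value reached
results = starts.apply(lambda row: minimize(rosen, row.values).fun, axis=1)
best_start = starts.loc[results.idxmin()]   # poor man's global optimum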

Some API inspiration. See aaply, adply, etc.

http://cran.r-project.org/web/packages/plyr/index.html

I forgot that Hadley kindly re-licensed all of his code to BSD-compatible, so you can take more than API inspiration if for some reason you're curious.

actually the api really is very easy:

df.apply(my_cool_function, engine='ipython')

from IPython.parallel import Client
df.apply(my_cool_function,engine=pd.create_parallel_engine(client=Client(profile='mpi')))

e.g. you just pass an engine (prob will allow you to simply pass a Client directly as an engine)

Great.

You might also allow just 'ipython' and use a default Client() call. If you start your IPython session/notebook with the correct profile then it should respect this and look in that directory for the setup code it needs. There was a bug here in some of the IPython 1.x but it should be fixed now.

https://github.com/ipython/ipython/issues/4238

passing engine='ipython' will create a default Client
also settable via an option, parallel.default_engine

and will only pass with a threshold number of rows (could have a function do that too)

Sounds awesome. Can't wait for this. Going to be a big feature.

What is the status of this? It seems awesome. Do you just need some functions for benchmarks? I can come up with something if that's helpful/ all that's needed.

How much time should a target function take (per row, say; that's what I always apply on)? 0.1 s? 1 s? 10 s? Any RAM limitations?

Well, it works for joblib and sort of with IPython.parallel; it needs some more work/time. I am also convinced that you need a pretty intensive task for this to be really useful, e.g. the creation/communication time is non-trivial.

I won't have time for this for a while, but my code is out there.

I'm really excited about this! I wrote my own function to split - map - concat but ran into some troubles with multiprocessing (http://stackoverflow.com/questions/26665809/multiprocessing-on-pandas-dataframe-confusing-behavior-based-on-input-size).

Any thoughts when this might get out?

+1 for parallel apply engine!

It has been a year since @jreback made a parallel version of pandas available, and it is not clear what is holding it back from becoming a feature of regular pandas (even in a "testing" form).

While it has been shown to work, there seems to be an understanding that more testing might be needed. Wouldn't making this feature available / accessible in regular pandas (even if only for testing purposes) facilitate more testing and bug reporting?

@wikiped
I could not get this to work from a testing perspective for generic UDFs (user-defined functions), and I ran out of time.

It's actually much easier/better to use http://blaze.pydata.org/docs/latest/index.html
for this type of parallelization, as it's pretty well supported nowadays.

This would be an add-on feature which actually requires quite a bit of work/testing/benchmarking to make it a part of pandas core.

You are welcome to do this if you wish.

@jreback
Thank you for sharing your perspective on this. I do wish I could help code-wise with this, but unfortunately it is above my skill level.

And thanks for linking blaze - I will need to look deeper into it - from a first glance it looks interesting, but it makes me feel lost with all the new concepts and the not (yet) rich docs/examples.

Linking this with #3202.

Has there been any progress on this? I would like to help if we need developers to work on this! A parallel datatable would be _really useful_ considering I have a 36-core machine using only 1 thread :( PySpark has a parallel implementation but it is really slow on a single machine.

There are a lot of interesting new tools in this space: dask, Ibis, SFrame. I would recommend taking a look at some of these. Extending pandas to natively support out of core computation would be great but it would be quite a lot of work and it's not on anyone's immediate agenda. Honestly it's probably best left to other projects.


Using pandas in a parallel way requires a fair amount of implementation work. Fortunately dask has already done this, and it works nicely with pandas (in fact, we released the GIL and created the CategoricalIndex to facilitate this). So I would recommend it for embarrassingly parallel (and in fact generalized parallel) computations.

I am going to change this issue to one of docs, as I think we should add a section to enhancingperf.rst to point the way to dask.

cc @mrocklin

Further, it may be desirable to provide an interface from the pandas side to dask (and possibly some of the out-of-core / parallel computation libraries), e.g. something along the lines of

df.apply(f, engine='dask'), which can dispatch the computation. However, this would not be a generalized mechanism, as often what you really want to do is a series of computations. So I am not sure what, if anything, we should do here.
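
In the meantime, the same effect can be had by going through dask.dataframe explicitly; a minimal sketch, assuming a reasonably recent dask (the meta argument describes the output columns so the graph can be built lazily):

import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'a': np.random.randn(1000000)})
ddf = dd.from_pandas(df, npartitions=8)

def f(part):
    # part is an ordinary pandas DataFrame holding one partition
    return part.assign(b=np.sqrt(np.abs(part['a'])))

result = ddf.map_partitions(f, meta={'a': 'f8', 'b': 'f8'}).compute()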

I'd be quite happy to work with people who are interested in parallelizing pandas-like computations with dask. I suspect that this will involve a bit of work, both in ensuring that dask.dataframe covers all the necessary pandas API and in ensuring that pandas releases the GIL in the appropriate places.

Making this work very well probably requires action by a few parties: a user with a driving application, a core pandas developer, and a core dask developer. Jeff and I have had good interactions on this in the past; it'd be good to add in some people with concrete use cases.

Is the expectation that df.apply(f, engine='dask') (or any other engine) _must_ return the same result as df.apply(f)? Or could dask return a dask graph? I'd be interested to see what returning a graph would allow you to do, and how hard it would be to implement? Would love to talk about this today @mrocklin

@TomAugspurger

IF we did this, then I think this would simply do the computation (in parallel) and return the result. Kind of a hey, just use multi-cores type of thing. My viewpoint is that this gives the average pandas users a big hammer, so instead of vectorizing (or using cython/numba), they IMMEDIATELY go towards parallel when it is not necessary.

So that is my hesitation here. (others have argued that you simply document/educate and provide the tools). My argument is that maybe we should NOT do this in pandas AT ALL, and if you are sophisticated enough to use parallel execution, then using dask mainline (or ibis et al.) is the correct way to go about this.

My views align with @jreback 's here. I think that if your data already lives in a pandas DataFrame then eager parallel execution probably provides most of the benefits that you should expect without anything unexpected.

Very reasonable.

On the other hand, for novice pandas users, using numba or even dask (and sometimes even vectorizing) is not so easy, while getting a performance boost without needing to do anything (aka a "use parallel everywhere" config option set to True) is a very nice benefit for everyone, even power users.

I think we could add an engine= argument

and allow say: numba, dask (if they are installed)

which gets most of the way there

Agree with the comments above :

I think that if your data already lives in a pandas DataFrame then eager parallel execution probably provides most of the benefits that you should expect without anything unexpected.

That's dead on.

Also, I very much like the idea of an engine= argument option. This would be a huge benefit for most end users, especially those using pandas as a core dependency in their own applications: immediate parallelism across .map and .apply with the inclusion of a dependency (a la dask) and/or a JIT (a la numba) and a simple configuration option.

Excited about this, and just amazing, all the work here. Kudos to you all.

Is it possible to integrate pandas with the @parallel decorator in ipyparallel, like the example that they have with numpy?

http://ipyparallel.readthedocs.org/en/latest/multiengine.html#remote-function-decorators
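
One hedged way to combine the two is to hand each engine a chunk of the frame rather than a row, using a DirectView map (the sketch below uses the newer ipyparallel import; substitute IPython.parallel on older versions, and process_chunk is just an illustrative function):

import numpy as np
import pandas as pd
import ipyparallel as ipp

rc = ipp.Client()          # assumes a running cluster, e.g. started with `ipcluster start -n 8`
dview = rc[:]

def process_chunk(chunk):
    # runs on an engine; chunk is an ordinary pandas DataFrame
    return chunk.sum()

df = pd.DataFrame(np.random.randn(100000, 20))
chunks = np.array_split(df, len(rc.ids))
parts = dview.map_sync(process_chunk, chunks)
total = sum(parts)         # combine the per-chunk column sums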

I think, theoretically speaking, even if pandas does not support parallel computing by default, users can still turn to mpi4py for parallel computation. It's just some more coding time if one already knows MPI.

let's just doc this to direct to dask

I tried out groupby and apply with pandas 0.17.1, and was surprised to see the function applied in parallel. I am confused: is this feature already added and enabled by default? I am not using dask.

@heroxbd well the GIL IS released during groupby operations. However, there is no native / inherent parallelism in groupby. Why do you think it's in parallel?

The evidence is that the load on all of my CPUs grows to 100% when I call groupby apply.
sph = tt.groupby('ev').apply(sphericity)

So if you do arithmetic ops inside sphericity (e.g. df + df) or the like, these defer to numexpr, which utilizes multiple cores by default. Can you provide a sketch of this function?

These are performance things that pandas does w/o the user being specifically aware.

@jreback Ah-ha. Here it is:
pmtl, ql = event['pmt'].values, event['q'].values
pl = pdir[pmtl] * ql[:, np.newaxis]
s = np.sum(pl[:, :, np.newaxis] * pl[:, np.newaxis, :], axis=0)
qs = np.sum(ql**2)
eig = np.sort(np.linalg.eigvals(s/qs))
return pd.Series({'S': (eig[0] + eig[1])*1.5, 'A': eig[0]*1.5})  # sphericity, aplanarity

Sorry for the noise. The parallel execution comes from OpenMP used by OpenBLAS, which in turn is used by NumPy.
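
For anyone hitting the same surprise: the implicit BLAS/OpenMP parallelism can be pinned with environment variables set before NumPy is imported (a sketch; which variable takes effect depends on the BLAS build):

import os
# cap implicit parallelism from the underlying BLAS/OpenMP runtime
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'

import numpy as np   # must come after the environment variables are set
import pandas as pd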

Closing this, as dask is the most appropriate tool for this. We could certainly add some usage within pandas to actually use dask, but those are separate issues (e.g. on a big enough groupby (number of groups), deferring to dask is a good thing).

Was there a reason, when releasing the GIL in pandas groupby operations, to only allow separate groupby and apply operations to happen concurrently, rather than computing independent group-level aggregations in parallel?

Once you have GIL-releasing groupby operations then other developers can use Pandas in parallel. It's actually quite tricky to intelligently write down the algorithms to handle all groupbys. I think that if someone wants to do this for Pandas that'd be great. It's a pretty serious undertaking though.

To do this with dask.dataframe

$ conda install dask
# or
$ pip install dask[dataframe]

import pandas as pd
df = ...  # your pandas code

import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=20)
dd_result = ddf.groupby(ddf.A).B.mean()  # or whatever
pd_result = dd_result.compute()  # uses as many threads as you have logical cores

Of course, this doesn't work on all groupbys (as mentioned before, this is incredibly difficult to do generally) but does work in many common cases.

@mrocklin thanks for the tip. How long would df = dd.from_pandas(df, npartitions=20) take on a dataframe with say 10M-100M rows? Is there a copy involved? Does dask support categorical columns?

There is a single copy involved (just to other pandas dataframes). We're effectively just doing dfs = [df.iloc[i: i + blocksize] for i in range(0, len(df), blocksize)]. We're also doing a sort on the index ahead of time which, if your algorithms don't require it, you can turn off with sort=False (though some groupbys, particularly groupby.applys, will definitely appreciate a sorted index).

Generally the from_pandas call is cheap relative to groupby calls. Copying data in memory generally runs fairly quickly.

@dragoljub

In particular, YMMV depending on such things as the number of groups you are dealing with and how you partition.

Small number of groups

In [7]: N = 10000000

In [8]: ngroups = 100

In [9]: df = pd.DataFrame({'A' : np.random.randn(N), 'B' : np.random.randint(0,ngroups,size=N)})

In [10]: %timeit df.groupby('B').A.mean()
10 loops, best of 3: 161 ms per loop
In [15]: ddf = dd.from_pandas(df, npartitions=8)

In [16]: %timeit ddf.groupby(ddf.B).mean().compute()
1 loop, best of 3: 223 ms per loop

Larger number of groups

In [17]: ngroups = 10000

In [18]: df = pd.DataFrame({'A' : np.random.randn(N), 'B' : np.random.randint(0,ngroups,size=N)})

In [19]: %timeit df.groupby('B').A.mean()
1 loop, best of 3: 591 ms per loop

In [21]: %timeit ddf.groupby(ddf.B).mean().compute()
1 loop, best of 3: 323 ms per loop

Can do even better if actually use our index

In [32]: ddf = dd.from_pandas(df.set_index('B'), npartitions=8)

In [33]: %timeit ddf.groupby(ddf.index).mean().compute()
1 loop, best of 3: 215 ms per loop

Note that these are pretty naive timings. This is a _single_ computation that is split into embarrassingly parallel tasks. Generally you would use dask for multiple steps in a computation. If your data _doesn't_ fit in memory, then dask can often help a lot more.

In an embarrassingly parallel calculation, I create many dataframes which must be dumped to disk (this is the desired output of the program). I tried doing the computation and the dumping (to HDF5) in parallel using joblib, and I run into trouble with the HDF writes. Note that at this point I am not worried so much about the performance of the parallel write.

An example which demonstrates the problem is in https://github.com/rbiswas4/ParallelHDFWriteProblem

The program demo_problem.py includes a function called worker(i) which creates a very simple dataframe based on the input i and appends it to an HDF file. I do this twice: once in serial, and once in parallel using joblib.

What I find is that the serial case always works, but the parallel case is not reproducible: sometimes it works without a problem and sometimes it crashes. The two log files https://github.com/rbiswas4/ParallelHDFWriteProblem/blob/master/demo.log.worked
and
https://github.com/rbiswas4/ParallelHDFWriteProblem/blob/master/demo.log_problem
are two cases where it worked and did not work, respectively.

Is there a better way to write to HDF files in parallel from pandas that I should use? Or is this a question for other fora like joblib and PyTables?

@rbiswas4 If you want to dump a bunch of data to disk in parallel, the easiest thing to do is to create a separate HDF5 file for each process. Your approach is certainly going to result in corrupted data -- see the pytables FAQ for more details (http://www.pytables.org/FAQ.html#can-pytables-be-used-in-concurrent-access-scenarios). You might also be interested in dask.
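
A sketch of the one-file-per-worker approach described above, using joblib (the file names and the make_frame helper are made up for illustration; to_hdf requires PyTables):

import numpy as np
import pandas as pd
from joblib import Parallel, delayed

def make_frame(i):
    # stand-in for the real per-task computation
    return pd.DataFrame({'x': np.random.randn(1000), 'task': i})

def worker(i):
    df = make_frame(i)
    # each process owns its own file, so there is no concurrent-write conflict
    df.to_hdf('out_part_{:04d}.h5'.format(i), key='df', mode='w')

Parallel(n_jobs=4)(delayed(worker)(i) for i in range(16))

# read back and combine later, in a single process
combined = pd.concat(pd.read_hdf('out_part_{:04d}.h5'.format(i), 'df') for i in range(16))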

Writing to separate files is something I have to avoid because I might end up creating too many files (inode limits). I suppose this means one of the following:

  • I should not be looking at hdf5 as a possible output format, but use a database.
  • Use a smaller number of partitions and hence fewer files. This is still quite inelegant
  • Consider some methodology by which I can write out the dataframes through a separate process (I am not sure how to split things)

It seems the first is the best bet. And yes, I intend to see whether I should use dask!

Thanks @shoyer

@rbiswas4, if you can write out raw HDF5 (via h5py) instead of PyTables, please have a look at SWMR,

https://www.hdfgroup.org/HDF5/docNewFeatures/NewFeaturesSwmrDocs.html

available in hdf5-1.10.
