Pandas: What is the most efficient way to iterate over Pandas's DataFrame row by row?

Created on 12 Jun 2015  ·  6 Comments  ·  Source: pandas-dev/pandas

I have tried df.iterrows(), but its performance is horrible. That is not surprising, given that iterrows() returns each row as a full Series with schema and metadata, not just the values (which are all I need).

The second method I have tried is for row in df.values, which is significantly faster. However, I have recently realized that df.values is not the internal data storage of the DataFrame, because df.values converts all dtypes to a common dtype. For example, one of my columns has dtype int64, but the dtype of df.values is all float64. So I suspect that df.values actually creates another copy of the internal data.

Another requirement is that the row iteration must return values that preserve the original dtype of each column.

Usage Question

All 6 comments

In Python, iterating over the rows is going to be (a lot) slower than doing vectorized operations.

The types are being converted in your second method because that's how numpy arrays (which is what df.values is) work. DataFrames are column-based, so a single DataFrame can hold multiple dtypes. Once you iterate row-wise, though, everything has to be upcast to a more general dtype that can hold every column; in your case the ints become float64.
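To make that concrete, here is a minimal sketch with made-up data (the column names and values are illustrative, not from this issue):

```python
import numpy as np
import pandas as pd

# A two-column frame: one int64 column, one float64 column.
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})
print(df.dtypes)  # a is int64, b is float64

# .values must produce a single homogeneous numpy array, so the
# ints are upcast to the common dtype float64 -- a new copy of the
# data, not a view of the DataFrame's internal storage.
arr = df.values
print(arr.dtype)  # float64
```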

If you describe your problem with a minimal working example, we might be able to help you vectorize it. You may also have luck on StackOverflow with the pandas tag.
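To illustrate the vectorization point, a small sketch with hypothetical data (a made-up sum of two columns, not your actual transformation):

```python
import pandas as pd

df = pd.DataFrame({"x": range(5), "y": range(5)})

# Row-by-row: a Python-level loop, slow for large frames.
slow = [row.x + row.y for row in df.itertuples()]

# Vectorized: a single call that runs in numpy's C loops.
fast = (df["x"] + df["y"]).tolist()

assert slow == fast
```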

Basically, I want to do the following:

row_handler = RowHandler(sample_df)  # learn how to handle row from sample data
transformed_data = []
for row in df.values:
    transformed_data.append(row_handler.handle(row))
return transformed_data

I don't own the RowHandler class and hence can only operate row by row.

Another similar example is machine learning, where you may have a model whose predict API operates only at the row level.

Still a bit too vague to be helpful. But if RowHandler is really out of your control, then you'll be out of luck as far as vectorizing goes. FWIW, all of scikit-learn's APIs operate on arrays (so multiple rows at once).

I don't see how it can be clearer. Yes, RowHandler is out of my control. What do you mean by out of luck? My question is for the most efficient way to iterate over rows while keeping the dtype of each element intact. Are you suggesting df.iterrows(), or something else?

sklearn is the exception, not the norm, in operating natively on pandas DataFrames. Not many machine learning libraries have APIs that accept a DataFrame directly.

I think df.itertuples() is what you're looking for -- it's way faster than iterrows:

In [10]: x = pd.DataFrame({'x': range(10000)})

In [11]: %timeit list(x.iterrows())
1 loops, best of 3: 383 ms per loop

In [12]: %timeit list(x.itertuples())
1000 loops, best of 3: 1.39 ms per loop
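As a quick sketch of why this also satisfies the dtype requirement (illustrative data, not from this issue): itertuples yields namedtuples whose elements keep their per-column types, rather than being upcast to one common dtype the way df.values is.

```python
import pandas as pd

# A mixed-dtype frame: int, float, and string columns.
df = pd.DataFrame({"n": [1, 2], "f": [0.5, 1.5], "s": ["a", "b"]})

# Each element of the namedtuple keeps its own column's type.
first = next(df.itertuples(index=False))
print(first)  # the int stays an int, the string stays a string
```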

Thanks @shoyer! That's what I need.
