I have tried the function df.iterrows(), but its performance is horrible. That's not surprising, given that iterrows() returns each row as a Series with the full schema and metadata, not just the values (which are all I need).
The second method I tried is for row in df.values, which is significantly faster. However, I have recently realized that df.values is not the internal data storage of the DataFrame, because df.values converts all columns to a common dtype. For example, one of my columns has dtype int64, but the dtype of df.values is float64 throughout. So I suspect that df.values actually creates another copy of the internal data.
Also, another requirement is that the row iteration must return the values with the original dtype of each column preserved.
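A minimal reproduction of the upcasting I'm describing (the column names here are made up, not my real data):

```python
import pandas as pd

# Mixed-dtype frame: one int64 column, one float64 column
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})
print(df.dtypes)  # a: int64, b: float64

# .values materializes a single NumPy array, so both columns are
# upcast to the common dtype float64 -- a copy, not a view
arr = df.values
print(arr.dtype)  # float64; the int64 column's dtype is lost
```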
In Python, iterating over the rows is going to be (a lot) slower than doing vectorized operations.
The types are being converted in your second method because that's how numpy arrays (which is what df.values is) work. DataFrames are column-based, so you can have a single DataFrame with multiple dtypes. Once you iterate row-wise, though, everything has to be upcast to a more general type that can hold everything. In your case the ints go to float64.
If you describe your problem with a minimal working example, we might be able to help you vectorize it. You may also have luck on StackOverflow with the pandas tag.
Basically, I want to do the following:
row_handler = RowHandler(sample_df)  # learn how to handle rows from sample data
transformed_data = []
for row in df.values:
    transformed_data.append(row_handler.handle(row))
return transformed_data
I don't own the RowHandler class and hence can only operate row by row.
Another, similar example comes up in machine learning, where you may have a model whose predict API operates at the row level only.
Still a bit too vague to be helpful. But if RowHandler
is really out of your control then you'll be out of luck. FWIW all of scikit-learn's APIs operate on arrays (so multiple rows).
I don't see how it can be clearer. Yes, RowHandler is out of my control. What do you mean by out of luck? My question is about the most efficient way to iterate over rows while keeping the dtype of each element intact. Are you suggesting df.iterrows(), or something else?
sklearn is an exception, not the norm, in that it operates natively on pandas DataFrames. Not many machine learning libraries have APIs that accept a DataFrame.
I think df.itertuples() is what you're looking for -- it's way faster than iterrows():
In [10]: x = pd.DataFrame({'x': range(10000)})
In [11]: %timeit list(x.iterrows())
1 loops, best of 3: 383 ms per loop
In [12]: %timeit list(x.itertuples())
1000 loops, best of 3: 1.39 ms per loop
Thanks @shoyer! That's what I need.