Case in point:
>>> df
RP/Rsum P.value
ID
A_23_P42353 17.8 0
A_23_P369994 15.91 0
A_33_P3262440 436.7 0.0005
A_32_P199429 18.97 0
A_23_P256724 22.24 0
A_33_P3394689 24.24 0
A_33_P3403117 27.14 0
A_24_P252364 28.56 0
A_23_P99515 31.82 0
A_24_P261750 31.46 0
>>> df.dtypes
RP/Rsum float64
P.value float64
>>> ids = pandas.Series(['51513', '9201', np.nan, np.nan, '8794', '6530', '7025', '4897', '84935', '11081'])
>>> df["test"] = ids
>>> df
RP/Rsum P.value test
ID
A_23_P42353 17.8 0 NaN
A_23_P369994 15.91 0 NaN
A_33_P3262440 436.7 0.0005 NaN
A_32_P199429 18.97 0 NaN
A_23_P256724 22.24 0 NaN
A_33_P3394689 24.24 0 NaN
A_33_P3403117 27.14 0 NaN
A_24_P252364 28.56 0 NaN
A_23_P99515 31.82 0 NaN
A_24_P261750 31.46 0 NaN
>>> df.dtypes
RP/Rsum float64
P.value float64
test object
This also happens with float objects and the like. I am not sure what triggers it.
I wonder if it's related to this issue I found also this morning:
>>> df = pandas.DataFrame(index=[1,2,3,4])
>>> df["test"] = pandas.Series(["B", "fdf", "344", np.nan])
>>> df["test2"] = ["B", "fdf", "344", np.nan]
>>> df
  test test2
1 fdf B
2 344 fdf
3 NaN 344
4 NaN nan
Looks like some kind of off-by-one error to me.
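It is not actually an off-by-one error; a minimal sketch (with made-up values) shows it is label alignment: the Series gets an implicit `RangeIndex` 0..3, while the DataFrame's index is `[1, 2, 3, 4]`, so only labels 1, 2, and 3 overlap:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(index=[1, 2, 3, 4])
s = pd.Series(["B", "fdf", "344", np.nan])  # implicit index 0, 1, 2, 3

df["test"] = s         # aligned on labels: label 1 gets s[1] == "fdf"
df["test2"] = list(s)  # a raw list is assigned positionally instead

print(df)
```

The apparent "shift" is just the one-position offset between the two indexes.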
Further digging shows that the culprit is the call to Series.reindex made when setting items:
>>> data
0 B
1 fdf
2 344
3 NaN
>>> df.index = ["A", "B", "C", "D"]
>>> data.reindex(df.index).values
array([nan, nan, nan, nan], dtype=object)
Even more digging shows that reindex is called on the index attribute, which gives a strange result:
>>> data.index.reindex(df.index)
(Index([A, B, C, D], dtype=object), array([-1, -1, -1, -1], dtype=int32))
These -1 are then translated to NaNs.
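A small sketch of that sentinel, using made-up data and the public `Index.get_indexer` (which produces the same indexer array that `Index.reindex` returns as its second element): positions where the target label cannot be found come back as -1, and reindexing fills those positions with NaN.

```python
import pandas as pd

data = pd.Series(["B", "fdf", "344"], index=[0, 1, 2])
target = pd.Index(["A", "B", "C"])

# No label in `target` exists in data's integer index,
# so every position in the indexer is the -1 "missing" sentinel.
indexer = data.index.get_indexer(target)
print(indexer)
```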
Updated the bug title with a more accurate description.
The Series is given an implicit 0, ..., N-1 index when you don't supply one, so this is exactly the behavior I would expect. If `data` were a raw ndarray or a list, this would not occur. So when you do:

df[col] = series

it conforms the series exactly to the index of `df`. That's a feature and not a bug :) so

df['test'] = ids.values

would work fine in your example.
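A quick sketch of the two assignments side by side (the row labels and values here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["r1", "r2", "r3"])
ids = pd.Series(["51513", "9201", "8794"])  # implicit RangeIndex 0, 1, 2

df["aligned"] = ids            # labels 0,1,2 never match r1,r2,r3 -> all NaN
df["positional"] = ids.values  # ndarray: assigned by position, no alignment

print(df)
```

Stripping the index with `.values` is what turns the label-aligned insertion into a positional one.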
In that case perhaps it should be documented somewhere, if it isn't already. In the meantime I'll adjust my own code as you suggested, thanks.
http://pandas.sourceforge.net/dsintro.html#column-selection-addition-deletion
When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the DataFrame’s index:
In [180]: df['one_trunc'] = df['one'][:2]
In [181]: df
Out[181]:
one flag foo one_trunc
a 1 False bar 1
b 2 False bar 2
c 3 True bar NaN
d NaN False bar NaN
You can insert raw ndarrays but their length must match the length of the DataFrame’s index.
What is the idea behind conforming a Series to the DataFrame's index when it is inserted with a different index?
When creating a DataFrame from Series objects, the resulting index is the union of all the individual Series indexes. So why is the same idea not used for df['new_column'] = series?
So you try to add data, but ignore all values that do not match the DataFrame index?
If _index extension_ existed, one could always do df['new_column'] = series.reindex(df.index) when one does not want the index extended (the current behavior)?
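To make the equivalence concrete, a small sketch with invented labels: assigning a Series is the same as explicitly reindexing it to the frame's index first, which is why the label 'c' is silently dropped.

```python
import pandas as pd

df = pd.DataFrame(index=["a", "b"])
s = pd.Series([1, 2, 3], index=["a", "b", "c"])

df["implicit"] = s                   # conformed to df.index automatically
df["explicit"] = s.reindex(df.index) # the same conformation, spelled out

print(df)  # label 'c' from s does not appear in either column
```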
In [256]: df = pandas.DataFrame({'A': pandas.Series(['foo', 'bar'], index=['a', 'b']),
.....: 'B': pandas.Series([10, 20], index=['b', 'c'])})
In [257]: df
Out[257]:
A B
a foo NaN
b bar 10.000
c NaN 20.000
In [258]: df['C'] = pandas.Series(range(3), index=['a', 'c', 'd'])
In [259]: df
Out[259]:
A B C
a foo NaN 0.000
b bar 10.000 NaN
c NaN 20.000 1.000
In the example above I would expect a row 'd' in the DataFrame.
Well, I think the basic idea is that DataFrame is a "fixed length dict-like container of Series". When you construct a DataFrame with a dict of Series without an explicit index, there is no obvious index other than the union of them all.
I can see the argument for implicitly extending the index, but there are tradeoffs either way.
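If extending is the behavior you want, one way to opt in today (a workaround sketch, not a built-in pandas feature) is to grow the frame to the union of the two indexes before assigning:

```python
import pandas as pd

df = pd.DataFrame({"A": pd.Series(["foo", "bar"], index=["a", "b"])})
s = pd.Series(range(3), index=["a", "c", "d"])

# Extend the frame's index to cover every label in s, then assign;
# rows added by the union get NaN in the existing columns.
df = df.reindex(df.index.union(s.index))
df["C"] = s
print(df)
```

This keeps the default alignment semantics while still preserving all of the incoming Series' labels.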