Pandas: when adding a Series with a different index to a DataFrame, the resulting column is all NaNs

Created on 6 Dec 2011  ·  9 Comments  ·  Source: pandas-dev/pandas

Case in point:

>>> df
               RP/Rsum  P.value
ID                             
A_23_P42353    17.8     0      
A_23_P369994   15.91    0      
A_33_P3262440  436.7    0.0005 
A_32_P199429   18.97    0      
A_23_P256724   22.24    0      
A_33_P3394689  24.24    0      
A_33_P3403117  27.14    0      
A_24_P252364   28.56    0      
A_23_P99515    31.82    0      
A_24_P261750   31.46    0 

>>> df.dtypes
RP/Rsum    float64
P.value    float64

>>> ids = pandas.Series(['51513', '9201', np.nan, np.nan, '8794', '6530', '7025', '4897', '84935', '11081'])
>>> df["test"] = ids
>>> df
               RP/Rsum  P.value  test
ID                                   
A_23_P42353    17.8     0        NaN 
A_23_P369994   15.91    0        NaN 
A_33_P3262440  436.7    0.0005   NaN 
A_32_P199429   18.97    0        NaN 
A_23_P256724   22.24    0        NaN 
A_33_P3394689  24.24    0        NaN 
A_33_P3403117  27.14    0        NaN 
A_24_P252364   28.56    0        NaN 
A_23_P99515    31.82    0        NaN 
A_24_P261750   31.46    0        NaN 
>>> df.dtypes
RP/Rsum    float64
P.value    float64
test       object

This also happens with float values and the like. I am not sure what the trigger is.

All 9 comments

I wonder if it's related to this issue I also found this morning:

>>> df = pandas.DataFrame(index=[1,2,3,4])
>>> df["test"] = pandas.Series(["B", "fdf", "344", np.nan])
>>> df["test2"] = ["B", "fdf", "344", np.nan]
>>> df
   test  test2
1  fdf   B    
2  344   fdf  
3  NaN   344  
4  NaN   nan  

Looks like some kind of off-by-one error to me.

Further digging points to the call to Series.reindex when setting items as the culprit:

>>> data 
0    B
1    fdf
2    344
3    NaN

>>> df.index = ["A", "B", "C", "D"]
>>> data.reindex(df.index).values
array([nan, nan, nan, nan], dtype=object)

Even more digging shows that reindex on the index attribute is called, and it gives a strange result:

>>> data.index.reindex(df.index)
(Index([A, B, C, D], dtype=object), array([-1, -1, -1, -1], dtype=int32))

These -1s are then translated to NaNs.
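
A small sketch of what those -1s mean (not from the thread; it uses Index.get_indexer, which backs reindexing, and the exact internals may differ between pandas versions):

import numpy as np
import pandas

# A Series with the default integer index 0..3.
data = pandas.Series(["B", "fdf", "344", np.nan])

# None of the requested labels exist in data.index, so the positional
# indexer is -1 for every slot.
target = pandas.Index(["A", "B", "C", "D"])
print(data.index.get_indexer(target))   # [-1 -1 -1 -1]

# reindex fills the -1 slots with NaN, which is the all-NaN column seen
# when the Series is assigned into a frame indexed by ["A", "B", "C", "D"].
print(data.reindex(target).values)      # [nan nan nan nan]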

Updated the bug title with a more accurate description.

The Series is given an implicit 0, ..., N-1 index when you don't supply one, so this is exactly the behavior I would expect. If the data were a raw ndarray or a list, this would not occur. So the fact that when you do:

df[col] = series

it conforms the series exactly to the index of df is a feature and not a bug :) so

df['test'] = ids.values

would work fine in your example
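
A self-contained sketch of that point, with made-up data rather than the original frame, showing the label alignment and the .values workaround:

import numpy as np
import pandas

df = pandas.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])
ids = pandas.Series(["51513", "9201", "8794"])    # implicit index 0, 1, 2

# Assignment aligns on index labels; 0, 1, 2 are not in ["a", "b", "c"],
# so the new column comes out as all NaN.
df["aligned"] = ids

# Workaround 1: drop the index and assign the raw values positionally.
df["by_position"] = ids.values

# Workaround 2: relabel the Series with the DataFrame's own index.
df["relabelled"] = pandas.Series(ids.values, index=df.index)

print(df)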

In that case perhaps it should be documented somewhere, if it isn't already. In the meantime I'll adjust my own code as you suggested, thanks.

http://pandas.sourceforge.net/dsintro.html#column-selection-addition-deletion

When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the DataFrame’s index:

In [180]: df['one_trunc'] = df['one'][:2]

In [181]: df
Out[181]: 
   one  flag   foo  one_trunc
a  1    False  bar  1        
b  2    False  bar  2        
c  3    True   bar  NaN      
d  NaN  False  bar  NaN      

You can insert raw ndarrays but their length must match the length of the DataFrame’s index.
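
A runnable version of the two rules quoted above (a sketch: the one/one_trunc names follow the doc snippet, while the bad column is a hypothetical illustration of the length check, which current pandas reports as a ValueError):

import numpy as np
import pandas

df = pandas.DataFrame({"one": [1.0, 2.0, 3.0, np.nan]}, index=list("abcd"))

# A shorter Series is conformed to df.index: labels "c" and "d" get NaN.
df["one_trunc"] = df["one"][:2]

# A raw ndarray has no index to align on, so its length must match.
try:
    df["bad"] = np.array([1.0, 2.0])     # length 2 vs. 4 rows
except ValueError as err:
    print("rejected:", err)

print(df)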

What is the idea behind the fact that, when inserting a Series that does not have the same index as the DataFrame, it is conformed to the DataFrame's index?

When creating a DataFrame from Series, the resulting index covers all of the individual Series indexes. So why is this idea not used for df['new_column'] = series?
So you try to add data, but all values that do not match the DataFrame index are ignored?
If _index extension_ existed, one could always do df['new_column'] = series.reindex(df.index) when one does not want to extend the index (the current behavior)?

In [256]: df = pandas.DataFrame({'A': pandas.Series(['foo', 'bar'], index=['a', 'b']),
   .....:                        'B': pandas.Series([10, 20], index=['b', 'c'])})

In [257]: df
Out[257]:
   A    B
a  foo  NaN
b  bar  10.000
c  NaN  20.000

In [258]: df['C'] = pandas.Series(range(3), index=['a', 'c', 'd'])

In [259]: df
Out[259]:
   A    B       C
a  foo  NaN     0.000
b  bar  10.000  NaN
c  NaN  20.000  1.000

In the example above I would expect a row 'd' in the DataFrame.
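
One way to get that row 'd' explicitly is to widen the frame to the union of the two indexes before assigning; a sketch (this reindex step is a manual workaround, not something assignment does for you):

import pandas

df = pandas.DataFrame({"A": pandas.Series(["foo", "bar"], index=["a", "b"]),
                       "B": pandas.Series([10, 20], index=["b", "c"])})
new = pandas.Series(range(3), index=["a", "c", "d"])

# df["C"] = new on its own would silently drop the value at "d".
# Reindexing to the union of both indexes first keeps it.
df = df.reindex(df.index.union(new.index))
df["C"] = new
print(df)     # rows a, b, c, d; cells with no data are NaN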

Well, I think the basic idea is that a DataFrame is a "fixed-length dict-like container of Series". When you construct a DataFrame with a dict of Series without an explicit index, there is no obvious index other than the union of them all.

I can see the argument for implicitly extending the index, but there are tradeoffs either way.
