Case in point:
>>> df
RP/Rsum P.value
ID
A_23_P42353 17.8 0
A_23_P369994 15.91 0
A_33_P3262440 436.7 0.0005
A_32_P199429 18.97 0
A_23_P256724 22.24 0
A_33_P3394689 24.24 0
A_33_P3403117 27.14 0
A_24_P252364 28.56 0
A_23_P99515 31.82 0
A_24_P261750 31.46 0
>>> df.dtypes
RP/Rsum float64
P.value float64
>>> ids = pandas.Series(['51513', '9201', np.nan, np.nan, '8794', '6530', '7025', '4897', '84935', '11081'])
>>> df["test"] = ids
>>> df
RP/Rsum P.value test
ID
A_23_P42353 17.8 0 NaN
A_23_P369994 15.91 0 NaN
A_33_P3262440 436.7 0.0005 NaN
A_32_P199429 18.97 0 NaN
A_23_P256724 22.24 0 NaN
A_33_P3394689 24.24 0 NaN
A_33_P3403117 27.14 0 NaN
A_24_P252364 28.56 0 NaN
A_23_P99515 31.82 0 NaN
A_24_P261750 31.46 0 NaN
>>> df.dtypes
RP/Rsum float64
P.value float64
test object
This also happens with float objects and the like. I am not sure what triggers it.
I wonder if it's related to this issue I found also this morning:
>>> df = pandas.DataFrame(index=[1,2,3,4])
>>> df["test"] = pandas.Series(["B", "fdf", "344", np.nan])
>>> df["test2"] = ["B", "fdf", "344", np.nan]
>>> df
  test test2
1 fdf B
2 344 fdf
3 NaN 344
4 NaN nan
Looks like some kind of off-by-one error to me.
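It is not actually an off-by-one error; a minimal sketch (with made-up values) shows it is label alignment: the Series gets an implicit `RangeIndex` 0..3, while the DataFrame's index is `[1, 2, 3, 4]`, so only labels 1, 2, and 3 overlap:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(index=[1, 2, 3, 4])
s = pd.Series(["B", "fdf", "344", np.nan])  # implicit index 0, 1, 2, 3

df["test"] = s         # aligned on labels: label 1 gets s[1] == "fdf"
df["test2"] = list(s)  # a raw list is assigned positionally instead

print(df)
```

The apparent "shift" is just the one-position offset between the two indexes.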
Further digging shows that the culprit is the call to Series.reindex made when setting items:
>>> data
0 B
1 fdf
2 344
3 NaN
>>> df.index = ["A", "B", "C", "D"]
>>> data.reindex(df.index).values
array([nan, nan, nan, nan], dtype=object)
Even more digging shows that reindex is called on the index attribute, which gives a strange result:
>>> data.index.reindex(df.index)
(Index([A, B, C, D], dtype=object), array([-1, -1, -1, -1], dtype=int32))
These -1 are then translated to NaNs.
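A small sketch of that sentinel, using made-up data and the public `Index.get_indexer` (which produces the same indexer array that `Index.reindex` returns as its second element): positions where the target label cannot be found come back as -1, and reindexing fills those positions with NaN.

```python
import pandas as pd

data = pd.Series(["B", "fdf", "344"], index=[0, 1, 2])
target = pd.Index(["A", "B", "C"])

# No label in `target` exists in data's integer index,
# so every position in the indexer is the -1 "missing" sentinel.
indexer = data.index.get_indexer(target)
print(indexer)
```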
Updated the bug title with a more accurate description.
The Series is given an implicit 0, ..., N-1 index when you don't supply one, so this is exactly the behavior I would expect. If `data` were a raw ndarray or a list, this would not occur. So when you do:

df[col] = series

it conforms the series exactly to the index of `df`. That's a feature and not a bug :) so

df['test'] = ids.values

would work fine in your example.
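A quick sketch of the two assignments side by side (the row labels and values here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["r1", "r2", "r3"])
ids = pd.Series(["51513", "9201", "8794"])  # implicit RangeIndex 0, 1, 2

df["aligned"] = ids            # labels 0,1,2 never match r1,r2,r3 -> all NaN
df["positional"] = ids.values  # ndarray: assigned by position, no alignment

print(df)
```

Stripping the index with `.values` is what turns the label-aligned insertion into a positional one.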
In that case perhaps it should be documented somewhere, if it isn't already. In the meantime I'll adjust my own code as you suggested, thanks.
http://pandas.sourceforge.net/dsintro.html#column-selection-addition-deletion
When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the DataFrame’s index:
In [180]: df['one_trunc'] = df['one'][:2]
In [181]: df
Out[181]:
one flag foo one_trunc
a 1 False bar 1
b 2 False bar 2
c 3 True bar NaN
d NaN False bar NaN
You can insert raw ndarrays but their length must match the length of the DataFrame’s index.
What is the idea behind conforming a Series to the DataFrame's index when it is inserted with a different index?
When creating a DataFrame from Series objects, the resulting index is the union of all the individual Series indexes. So why is the same idea not used for df['new_column'] = series?
So you try to add data, but ignore all values that do not match the DataFrame index?
If _index extension_ existed, one could always do df['new_column'] = series.reindex(df.index) when one does not want the index extended (the current behavior)?
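To make the equivalence concrete, a small sketch with invented labels: assigning a Series is the same as explicitly reindexing it to the frame's index first, which is why the label 'c' is silently dropped.

```python
import pandas as pd

df = pd.DataFrame(index=["a", "b"])
s = pd.Series([1, 2, 3], index=["a", "b", "c"])

df["implicit"] = s                   # conformed to df.index automatically
df["explicit"] = s.reindex(df.index) # the same conformation, spelled out

print(df)  # label 'c' from s does not appear in either column
```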
In [256]: df = pandas.DataFrame({'A': pandas.Series(['foo', 'bar'], index=['a', 'b']),
.....: 'B': pandas.Series([10, 20], index=['b', 'c'])})
In [257]: df
Out[257]:
A B
a foo NaN
b bar 10.000
c NaN 20.000
In [258]: df['C'] = pandas.Series(range(3), index=['a', 'c', 'd'])
In [259]: df
Out[259]:
A B C
a foo NaN 0.000
b bar 10.000 NaN
c NaN 20.000 1.000
In the example above I would expect a row 'd' in the DataFrame.
Well, I think the basic idea is that DataFrame is a "fixed length dict-like container of Series". When you construct a DataFrame with a dict of Series without an explicit index, there is no obvious index other than the union of them all.
I can see the argument for implicitly extending the index, but there are tradeoffs either way.
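If extending is the behavior you want, one way to opt in today (a workaround sketch, not a built-in pandas feature) is to grow the frame to the union of the two indexes before assigning:

```python
import pandas as pd

df = pd.DataFrame({"A": pd.Series(["foo", "bar"], index=["a", "b"])})
s = pd.Series(range(3), index=["a", "c", "d"])

# Extend the frame's index to cover every label in s, then assign;
# rows added by the union get NaN in the existing columns.
df = df.reindex(df.index.union(s.index))
df["C"] = s
print(df)
```

This keeps the default alignment semantics while still preserving all of the incoming Series' labels.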