Code Sample
# My code
df.loc[0, 'column_name'] = 'foo bar'
This code in pandas 0.20.1 (see the versions below) raises SettingWithCopyWarning and suggests:
"Try using .loc[row_indexer,col_indexer] = value instead".
I am already doing so, so this looks like a small bug. I use Jupyter.
Thank you! :)
pd.show_versions()
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Windows
OS-release: 8.1
machine: AMD64
processor: Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 35.0.2
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
@NadiaRom Can you provide a full example? It's hard to say for sure, but I suspect that df came from an operation that may return a view or a copy. For example:
In [8]: df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [4, 5]})
In [9]: df1 = df[['A', 'B']]
In [10]: df1.loc[0, 'A'] = 5
/Users/taugspurger/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/indexing.py:180: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
/Users/taugspurger/Envs/pandas-dev/bin/ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
#!/Users/taugspurger/Envs/pandas-dev/bin/python3.6
So we're updating df1 correctly. The ambiguity is whether or not df will be updated as well. I think a similar thing is happening to you, but without a reproducible example it's hard to say for sure.
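The ambiguity described above can be checked directly. The following is a minimal sketch reusing the same made-up frame; it assumes that selecting a list of columns returns new data rather than a view, which is the behavior I'm aware of but is not stated in this thread. The warning is silenced only to keep the output clean.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [4, 5]})
df1 = df[["A", "B"]]  # subset taken from df (assumed here to be a copy)

with warnings.catch_warnings():
    # Silence SettingWithCopyWarning for this demonstration only
    warnings.simplefilter("ignore")
    df1.loc[0, "A"] = 5

print(df1.loc[0, "A"])  # value in the subset after the assignment
print(df.loc[0, "A"])   # value in the parent frame
```

Comparing the two printed values shows whether the assignment reached df or stayed in df1.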
@TomAugspurger Here is the code. In general, I never assign values in pandas without .loc
df = pd.read_csv('df_unicities.tsv', sep='\t')
df.replace({'|': '--'}, inplace=True)
df_c = df.loc[df.encountry == country, : ]
df_c['sort'] = (df_c.encities_ua == 'all').astype(int) # new column
df_c['sort'] += (df_c.encities_foreign == 'all').astype(int)
df_c.sort_values(by='sort', inplace=True)
# ---end of chunk, everything is fine ---
if df_c.encities_foreign.str.contains('all').sum() < len(df_c):
    df_c.loc[df_c.encities_foreign.str.contains('all'), 'encities_foreign'] = 'other'
    df_c.loc[df_c.cities_foreign.str.contains('всі'), 'cities_foreign'] = 'інші'
else:
    df_c.loc[df_c.encities_foreign.str.contains('all'), 'encities_foreign'] = country
    df_c.loc[df_c.cities_foreign.str.contains('всі'), 'cities_foreign'] = df_c.country.iloc[0]
if df_c.encities_ua.str.contains('all').sum() < len(df_c):
    df_c.loc[df_c.encities_ua.str.contains('all'), 'encities_ua'] = 'other'
    df_c.loc[df_c.cities_ua.str.contains('всі'), 'cities_ua'] = 'інші'
else:
    df_c.loc[df_c.encities_ua.str.contains('all'), 'encities_ua'] = 'Ukraine'
    df_c.loc[df_c.cities_ua.str.contains('всі'), 'cities_ua'] = 'Україна'
# Warning after it
Thank you for the rapid answer!
The issue here is that you're first slicing your dataframe with .loc when you create df_c, and then attempting to assign values to that slice.
df_c = df.loc[df.encountry == country, :]
Pandas isn't 100% sure if you want to assign values to just your df_c slice, or have them propagate all the way back up to the original df. To avoid this, when you first assign df_c, tell pandas that it is its own dataframe (and not a slice) by using
df_c = df.loc[df.encountry == country, :].copy()
Doing this will get rid of the warning. I'll tack on a brief example to help explain the above, since I've noticed a lot of users get confused by this aspect of pandas.
>>> import pandas as pd
>>> df = pd.DataFrame({'A':[1,2,3,4,5], 'B':list('QQQCC')})
>>> df
A B
0 1 Q
1 2 Q
2 3 Q
3 4 C
4 5 C
>>> df.loc[df['B'] == 'Q', 'new_col'] = 'hello'
>>> df
A B new_col
0 1 Q hello
1 2 Q hello
2 3 Q hello
3 4 C NaN
4 5 C NaN
So the above works as we expect! Now let's try an example that mirrors what you attempted to do with your data.
>>> df = pd.DataFrame({'A':[1,2,3,4,5], 'B':list('QQQCC')})
>>> df_q = df.loc[df['B'] == 'Q']
>>> df_q
A B
0 1 Q
1 2 Q
2 3 Q
>>> df_q.loc[df['A'] < 3, 'new_col'] = 'hello'
/Users/riddellcd/anaconda/lib/python3.6/site-packages/pandas/core/indexing.py:337: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self.obj[key] = _infer_fill_value(value)
>>> df_q
A B new_col
0 1 Q hello
1 2 Q hello
2 3 Q NaN
Looks like we hit the same warning! But it changed df_q as we expected. This is because df_q is a slice of df, so even though we're using .loc[] on df_q, pandas is warning us that it won't propagate the changes up to df. To avoid this, we need to be more explicit and declare that df_q is its own dataframe, separate from df.
Let's start again from df_q, but use .copy() this time.
>>> df_q = df.loc[df['B'] == 'Q'].copy()
>>> df_q
A B
0 1 Q
1 2 Q
2 3 Q
Let's try to assign our value now!
>>> df_q.loc[df['A'] < 3, 'new_col'] = 'hello'
>>> df_q
A B new_col
0 1 Q hello
1 2 Q hello
2 3 Q NaN
This works without a warning because we've told pandas that df_q is separate from df.
If you do in fact want these changes to df_c to propagate up to df, that's another point entirely, and I'll answer if you want.
@CRiddler Great, thank you!
As you mentioned, chained .loc has never returned unexpected results. As I understand it, .copy() assures pandas that we treat the selected df_sliced_once as a separate object and do not intend to change the initial full df. Please correct me if I mixed something up.
The documentation is here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy and @CRiddler has a nice explanation. In general, you should NOT use inplace at all.
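One way to follow the advice about inplace is to reassign the result of each operation instead of mutating the frame. This is only a sketch on made-up data (the column names and values are hypothetical and merely echo the thread), not the original poster's file.

```python
import pandas as pd

# Made-up data for illustration only
df = pd.DataFrame({"city": ["b", "a", "c"], "sort": [2, 0, 1]})

# Instead of df.sort_values(by="sort", inplace=True), keep the returned frame
df_sorted = df.sort_values(by="sort")

# Instead of df.replace(..., inplace=True), keep the returned frame
df_clean = df_sorted.replace({"a": "A"})

print(df_clean)
```

Each step returns a new object, so the original df is left as it was and every intermediate result has its own name.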
"If you in fact do want these changes to df_c to propagate up to df thats another point entirely and will answer if you want."
@CRiddler Thanks, your answer is better than the ones on Stack Overflow. Could you add when you would want to propagate to the initial dataframe, or give an indication of how it is done?
@persep In general I don't like turning issues into Stack Overflow threads for help, but this issue seems to have gotten a fair bit of attention since I last posted, so I'll go ahead and post my method of tackling this type of problem in pandas. I typically don't subset the dataframe into separate variables. Instead, I store masks in variables, combine the masks as needed, and set values based on those masks, so that the changes happen in the original dataframe and not in some copy floating around.
Original data:
>>> import pandas as pd
>>> df = pd.DataFrame({'A':[1,2,3,4,5], 'B':list('QQQCC')})
>>> df
A B
0 1 Q
1 2 Q
2 3 Q
3 4 C
4 5 C
Remember that creating a temporary dataframe will NOT propagate changes.
As shown in the previous example, this makes changes only to df_q and raises a pandas warning (not copied here). It does NOT propagate any changes to df:
>>> df_q = df.loc[df["B"] == "Q"]
>>> df_q.loc[df["A"] < 3, "new_column"] = "hello"
# df remains unchanged because we only made changes to `df_q`
>>> df
A B
0 1 Q
1 2 Q
2 3 Q
3 4 C
4 5 C
To my knowledge, there is no way to use the same code as above and force changes to propagate back to the original dataframe.
However, if we change our thinking a bit and work with masks instead of full-on subsets, we can achieve the desired result. While this isn't necessarily "propagating" changes to the original dataframe from a subset, we are ensuring that any changes we make happen in the original dataframe df. To do this, we create the masks first, then apply them when we want to make a change to that subset of df:
>>> q_mask = df["B"] == "Q"
>>> a_mask = df["A"] < 3
# Combine masks (in this case we used "&") to achieve what a nested subset would look like
# In the same step we add in our item assignment. Instructing pandas to create a new column in `df` and assign
# the value "hello" to the rows in `df` where `q_mask` & `a_mask` overlap.
>>> df.loc[q_mask & a_mask, "new_col"] = "hello"
# Successful "propagation" of new values to the original dataframe
>>> df
A B new_col
0 1 Q hello
1 2 Q hello
2 3 Q NaN
3 4 C NaN
4 5 C NaN
Lastly, if we ever want to see what df_q would look like, we can always subset it from the original dataframe using our q_mask:
>>> df.loc[q_mask, :]
A B new_col
0 1 Q hello
1 2 Q hello
2 3 Q NaN
While this isn't necessarily "propagating" changes from df_q to df, we achieve the same result. Actual propagation would need to be done explicitly and would be less efficient than just working with masks.
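For completeness, explicit propagation could look roughly like the sketch below. It is one possible approach rather than the method described in this thread: it edits an independent copy and then writes the new column back, relying on pandas aligning a Series on the index during column assignment. It assumes df has a unique index.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": list("QQQCC")})

# Work on an explicit, independent copy of the subset
df_q = df.loc[df["B"] == "Q"].copy()
df_q.loc[df_q["A"] < 3, "new_col"] = "hello"

# Write the column back to df. Assignment aligns on the index,
# so rows of df that are not present in df_q receive NaN.
df["new_col"] = df_q["new_col"]

print(df)
```

This needs an extra copy and an extra assignment, which is why working with masks on df directly is usually the simpler route.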
@CRiddler Thanks, you've been very helpful