Pandas: 'column' not in index, but hell it is. Seems like a bug...

Created on 18 Aug 2017  ·  24 Comments  ·  Source: pandas-dev/pandas

I have a dataframe called delivery and when I print(delivery.columns) I get the following:

Index(['Complemento_endereço', 'cnpj', 'Data_fundação', 'Número',
   'Razão_social', 'CEP', 'situacao_cadastral', 'situacao_especial', 'Rua',
   'Nome_Fantasia', 'last_revenue_normalized', 'last_revenue_year',
   'Telefone', 'email', 'Capital_Social', 'Cidade', 'Estado',
   'Razão_social', 'name_bairro', 'Natureza_Jurídica', 'CNAE', '#CNAE',
   'CNAEs_secundários', 'Pessoas', 'percent'],
  dtype='object')

Well, we can clearly see that there is a column 'Rua'.

Also, if I print(delivery.Rua) I get a proper result:

82671                         R JUDITE MELO DOS SANTOS
817797                                R DOS GUAJAJARAS
180081           AV MARCOS PENTEADO DE ULHOA RODRIGUES
149373                                 AL MARIA TEREZA
455511                               AV RANGEL PESTANA
...

Even if I write "if 'Rua' in delivery.columns: print('here I am')" it does print the 'here I am'. So 'Rua' is in fact there.

Well, in the immediate line after I have this code:

delivery=delivery.set_index('cnpj')[['Razão_social','Nome_Fantasia','Data_fundação','CEP','Estado','Cidade','Bairro','Rua','Número','Complemento_endereço','Telefone','email','Capital_Social', 'CNAE', '#CNAE', 'Natureza_Jurídica','Pessoas' ]]

And voilà, I get this weird error:

Traceback (most recent call last):
  File "/file.py", line 45, in <module>
    'Telefone','email','Capital_Social', 'CNAE', '#CNAE', 'Natureza_Jurídica','Pessoas' ]]
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/frame.py", line 1991, in __getitem__
    return self._getitem_array(key)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/frame.py", line 2035, in _getitem_array
    indexer = self.ix._convert_to_indexer(key, axis=1)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/indexing.py", line 1214, in _convert_to_indexer
    raise KeyError('%s not in index' % objarr[mask])
KeyError: "['Rua'] not in index"

Can someone help? I tried stackoverflow but no one could help. I'm starting to think I'm crazy and 'Rua' is an illusion of my troubled mind.

ADDITIONAL INFO

I'm using this code right before the error line:

delivery=pd.DataFrame()

for i in selection.index:
    sample=groups.get_group(selection['#CNAE'].loc[i]).sample(selection['samples'].loc[i])
    delivery=pd.concat((delivery,sample)).sort_values('Capital_Social',ascending=False)


print(delivery.columns)
print(delivery.Rua)
print(delivery.set_index('cnpj').columns)

delivery=delivery.set_index('cnpj')[['Razão_social','Nome_Fantasia','Data_fundação','CEP','Estado','Cidade','Bairro','Rua','Número','Complemento_endereço',
                                 'Telefone','email','Capital_Social', 'CNAE', '#CNAE', 'Natureza_Jurídica','Pessoas' ]]

EDIT

New weird stuff:
I gave up and deleted 'Rua' from that last piece of code, hoping that it would work. To my surprise, I had the same problem, but now with the column 'Número'.

delivery=delivery.set_index('cnpj')[['Razão_social','Nome_Fantasia','Data_fundação','CEP','Estado','Cidade','Bairro','Número','Complemento_endereço',
                                                 'Telefone','email','Capital_Social', 'CNAE', '#CNAE', 'Natureza_Jurídica' ]]

KeyError: "['Número'] not in index"

EDIT 2

And then I gave up on 'Número' and took it out. Then the same problem happened with 'Complemento_endereço'. Then I deleted 'Complemento_endereço', and it happened with 'Telefone', and so on.

EDIT 3

Here is the output of pd.show_versions():

INSTALLED VERSIONS

commit: None
python: 3.5.0.final.0
python-bits: 64
OS: Darwin
OS-release: 16.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 18.2
Cython: None
numpy: 1.11.0
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: None
sphinx: None
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.3
pymysql: 0.7.11.None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None
None

Indexing

All 24 comments

@abutremutante : Thanks for reporting this! It does look really weird, but we can't replicate it at this point because we can't run your code. Could you provide a complete code sample for us?

Also, if you could provide the output of pd.show_versions() in your initial issue box, that would be great.

Hi there! Thanks for answering.
I'd rather not make it public on GitHub. Could I send it to you by email?

Preferably not, as anyone who wants to take up this issue would need to see the code. Can you try replicating with a different table (or DataFrame) that doesn't contain sensitive information?

Tried here:

import pandas as pd
import FindCos.FindCos_Functions as find #that's a file where I write some functions
import datetime
import pdb

target=find.get_full_basics(business='select * from sqltable;',test_mode=False)

CNAEs=['23.30-3-01','26','27','49.30-2-03','37.02-9-00','46.45','47.73','46.44-3-01']
hired_cos=200

# selecting items from CNAEs

selection=pd.DataFrame()
for i in CNAEs:
    x=target.loc[target['#CNAE'].str.startswith(i) == True]
    selection=pd.concat((selection,x))

# FILTERING

selection=selection.loc[selection['Capital_Social'] < 100000000].loc[selection['situacao_cadastral'] == 'ATIVA']\
    .loc[selection['situacao_especial'].isnull() == True].loc[selection['Natureza_Juridica'] != 'EMPRESA INDIVIDUAL DE RESP.LIMITADA (DE NATUREZA EMPRESARIA)']\
    .loc[selection['Natureza_Juridica'] != 'EMPRESARIO (INDIVIDUAL)']\
    .loc[selection['Estado'] != 'PA'].loc[selection['Estado'] != 'AM']\
    .loc[selection['Estado'] != 'RR'].loc[selection['Estado'] != 'AC'].loc[selection['Estado'] != 'RO'].loc[selection['Estado'] != 'AP']\
    .loc[selection['Estado'] != 'TO']

# DUPLICATION CONTROL

lista=['file.csv']
selection=find.exclude_business(selection,lista)

# CHECKING PROFILE

groups=selection.groupby('#CNAE')
selection['percent']=groups['#CNAE'].transform('size')/len(selection)
selection=selection[['#CNAE','percent']].drop_duplicates().sort_values('percent',ascending=False)
selection['samples']=round(((hired_cos*1.05)*selection['percent']))

delivery=pd.DataFrame()
for i in selection.index:
    sample=groups.get_group(selection['#CNAE'].loc[i]).sample(selection['samples'].loc[i])
    delivery=pd.concat((delivery,sample)).sort_values('Capital_Social',ascending=False)#.rename(columns={'Capital_Social':'Score_Tamanho'})

# MAKING SURE THAT RUA REALLY EXISTS

print(delivery.columns)
print(delivery.Rua)
print(delivery.set_index('cnpj').columns)
delivery=delivery.rename(columns={'Rua':'Rua'})
if 'Rua' in delivery.columns:
    print('here I am')

# PROBLEM LINE

delivery=delivery.set_index('cnpj')[['cnpj','Razao_social','Nome_Fantasia','Data_fundacao','CEP','Estado','Cidade','Bairro','Rua','Numero','Complemento_endereco','Telefone','email','Capital_Social','CNAE','#CNAE','Natureza_Juridica']]

@abutremutante : Thanks, but unfortunately, this code is not replicable for us. We can't run import FindCos.FindCos_Functions. Try to just create DataFrame from scratch and replicate the issue.

Also, if you could provide the output of pd.show_versions() in your initial issue box, that would be great.

@gfyoung: I added the pd.show_versions() output to the initial issue box.
Regarding the dataframe, it is pretty long. I made a CSV from its first 10 rows, right here:

target=find.get_full_basics(business='select * from sqltable limit 10;',test_mode=False)
target.to_csv('target10items.csv')

I'm attaching it here.
target10items.csv.zip

1) Can you replicate your issue using this smaller DataFrame?
2) I notice you're using a very old version of pandas (we're at 0.20.3 right now). Can you try upgrading and see if that resolves your issue?

'Bairro' is not in your output for print(delivery.columns) but is in the list you provide after set_index. It's a little suspicious that 'Bairro' appears immediately before 'Rua' in that list. Maybe there's an issue in the error message selecting the missing column?

Okay, I think the issue is that 'Bairro' is actually the missing key, but pandas 0.18.1 had a bug where the error message displays the wrong item as the missing key.

Using the following code

import pandas as pd
import numpy as np

cols = pd.Index(['Complemento_endereço', 'cnpj', 'Data_fundação', 'Número',
   'Razão_social', 'CEP', 'situacao_cadastral', 'situacao_especial', 'Rua',
   'Nome_Fantasia', 'last_revenue_normalized', 'last_revenue_year',
   'Telefone', 'email', 'Capital_Social', 'Cidade', 'Estado',
   'Razão_social', 'name_bairro', 'Natureza_Jurídica', 'CNAE', '#CNAE',
   'CNAEs_secundários', 'Pessoas', 'percent'],
  dtype='object')
delivery = pd.DataFrame(np.random.random(size=(5, len(cols))), columns=cols)

delivery = delivery.set_index('cnpj')[['Razão_social','Nome_Fantasia','Data_fundação','CEP','Estado','Cidade','Bairro','Rua','Número','Complemento_endereço','Telefone','email','Capital_Social', 'CNAE', '#CNAE', 'Natureza_Jurídica','Pessoas' ]]

On pandas 0.18.1, I get the following error:

KeyError: "['Rua'] not in index"

However, on pandas 0.20.3, I get the corrected error:

KeyError: "['Bairro'] not in index"
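
For anyone stuck on an older pandas, one way to avoid relying on the error message is to work out the missing labels yourself before indexing. The following is only a minimal sketch: `missing_labels` is a hypothetical helper (not part of pandas), and the two lists are copied from the column names posted in this issue. With a real frame you could pass `delivery.columns` as the first argument.

```python
def missing_labels(available, wanted):
    """Return the labels in `wanted` that are absent from `available`, in order."""
    available = set(available)
    return [label for label in wanted if label not in available]


# Column labels as printed in the issue (print(delivery.columns)).
available = ['Complemento_endereço', 'cnpj', 'Data_fundação', 'Número',
             'Razão_social', 'CEP', 'situacao_cadastral', 'situacao_especial', 'Rua',
             'Nome_Fantasia', 'last_revenue_normalized', 'last_revenue_year',
             'Telefone', 'email', 'Capital_Social', 'Cidade', 'Estado',
             'Razão_social', 'name_bairro', 'Natureza_Jurídica', 'CNAE', '#CNAE',
             'CNAEs_secundários', 'Pessoas', 'percent']

# Labels requested after set_index('cnpj') in the failing line.
wanted = ['Razão_social', 'Nome_Fantasia', 'Data_fundação', 'CEP', 'Estado',
          'Cidade', 'Bairro', 'Rua', 'Número', 'Complemento_endereço',
          'Telefone', 'email', 'Capital_Social', 'CNAE', '#CNAE',
          'Natureza_Jurídica', 'Pessoas']

missing = missing_labels(available, wanted)
print(missing)  # ['Bairro']
```

Running a check like this before the selection points at 'Bairro' regardless of which label the installed pandas version reports.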

You nailed it, @jschendel!

Thanks a lot @gfyoung

Thank you so much.

Closing, as it seems that your issue has been resolved.

Hi,
I don't see an actual fix for the problem here. @gfyoung, why did you close this? I still have this problem. No complaint, I'm just so tired of this error.

@wangxuesong29 do you have a minimal example? http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

I've got the same problem as you. I've observed that the error occurs if I edit the .csv data in OpenOffice. Instead, I downloaded the data from the Internet again and edited it in the plain Notepad++ editor. Then it works normally. I know this solution may not help in your case, but maybe you should change the text editor or program you use to edit .csv files.

ERROR:
pandas version 0.23.4
I have the same problem, running the same code as above.

After running the code I get:

'Bairro' not in index

CODE:
`import pandas as pd
import numpy as np

cols = pd.Index(['Complemento_endereço', 'cnpj', 'Data_fundação', 'Número',
'Razão_social', 'CEP', 'situacao_cadastral', 'situacao_especial', 'Rua',
'Nome_Fantasia', 'last_revenue_normalized', 'last_revenue_year',
'Telefone', 'email', 'Capital_Social', 'Cidade', 'Estado',
'Razão_social', 'name_bairro', 'Natureza_Jurídica', 'CNAE', '#CNAE',
'CNAEs_secundários', 'Pessoas', 'percent'],
dtype='object')
delivery = pd.DataFrame(np.random.random(size=(5, len(cols))), columns=cols)

delivery = delivery.set_index('cnpj')[['Razão_social','Nome_Fantasia','Data_fundação','CEP','Estado','Cidade','Bairro','Rua','Número','Complemento_endereço','Telefone','email','Capital_Social', 'CNAE', '#CNAE', 'Natureza_Jurídica','Pessoas' ]]
`

This is for those who landed here from a Google search, trying to see what's wrong.

If you are working from a CSV or XLSX file, make 100% sure none of your column names has a space at the start or end.

When importing a CSV, I noticed there was an issue getting a column. If you export the df to a CSV and open it in Excel, it's impossible to see leading or trailing whitespace. You have to open it with Notepad or Notepad++.

Again, for those who landed here from a Google search: make sure all leading and trailing whitespace is removed from the column header names in your CSV, XLSX, or any other file you load into a dataframe.
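
The whitespace check can also be done from pandas itself rather than in a text editor. This is a rough sketch under stated assumptions: the CSV content below is made up for illustration, and `Index.str.strip()` is used to clean the labels. Printing each label with `repr()` makes stray spaces visible.

```python
import io

import pandas as pd

# Hypothetical CSV whose header cells carry stray spaces.
raw = "id, name ,city \n1,Ana,Recife\n2,Rui,Belém\n"
df = pd.read_csv(io.StringIO(raw))

# Keep the labels as loaded; repr() shows the quotes, so padding is visible.
before = list(df.columns)
print([repr(label) for label in before])

# Strip leading and trailing whitespace from every column label.
df.columns = df.columns.str.strip()
cleaned = list(df.columns)
print(cleaned)
```

After the strip, `df['name']` works, whereas before it only `df[' name ']` would have matched.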

I am also having this error.
The columns are named correctly, but when I use seaborn with my CSV file I get the error (my column is not in the index).

import seaborn as sns
import pandas as pd
Data = pd.read_csv('test.csv',delimiter=',') 
sns.lmplot(x='predLabel', y='trueLabel', data=Data)

the error message:
KeyError: "['predLabel' 'trueLabel'] not in index"

I also have the same issue.
The columns are named correctly, but when I use seaborn with my CSV file I get the error (my column is not in the index).

import seaborn as sns
import pandas as pd
df = pd.read_csv('lawma1.csv', index_col =[0, 1], delimiter=', ')
sns.lmplot(x='WEEK1', y='FLEET', data=df).savefig('law.png')

the error message:
KeyError: "['FLEET'] not in index"

I had this error because I had a dot "." at the end of a column name; it worked after I removed it.
podaci = pd.read_csv('data/fifa19a.csv', names=['id', 'ime', 'godine', 'ocjena', 'potencijal.', 'bodovi', 'stopalo', 'placa_tis_eur', 'cijena_mil_eur'])
The dot was at the end of the "potencijal." column name.
It could be a bug.

I am facing the same problem. I even used print(X.columns) and it showed 'exposure_end' in the index, but when I used it in centroids_new=X.groupby(["clusters"]).mean()[["exposure_end","Duration"]] it raises the error 'exposure_end' not in index. Please help, I have been stuck here for the past two hours.

I found the solution to my problem.
I was using the statement above, centroids_new=X.groupby(["clusters"]).mean()[["exposure_end","Duration"]]. I called x.mean(axis=1) above this statement and then used the statement
centroids_new=X.groupby(["clusters"]).mean()[["exposure_end","Duration"]] without the mean, and it worked fine. I wasn't able to pass axis in the original statement because it wasn't working with groupby, so I had to do it in two steps.
The main reason it was happening was that the axis wasn't set to 1.
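
The thread never shows what X looked like, so the following is only one plausible explanation, not a confirmed diagnosis: if 'exposure_end' was stored as a non-numeric column, a numeric-only groupby mean leaves it out of the result (older pandas releases dropped such columns by default), and selecting it afterwards fails. The frame below is a hypothetical stand-in for X. The sketch converts the column and selects the columns before aggregating.

```python
import pandas as pd

# Hypothetical stand-in for X; exposure_end holds strings, not numbers.
X = pd.DataFrame({
    "clusters": [0, 0, 1, 1],
    "exposure_end": ["12", "18", "30", "40"],
    "Duration": [10.0, 20.0, 30.0, 50.0],
})

# A numeric-only mean leaves the string column out of the result.
means = X.groupby("clusters").mean(numeric_only=True)
print(list(means.columns))

# Convert the column, then select before aggregating.
X["exposure_end"] = pd.to_numeric(X["exposure_end"])
centroids_new = X.groupby("clusters")[["exposure_end", "Duration"]].mean()
print(centroids_new)
```

Selecting the columns on the groupby object, rather than on the aggregated result, means a mistyped or missing label fails before any aggregation runs.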

I found a solution to the problem, works perfectly for me.

Check whether your CSV file is separated by ',' or ';'. In my case, my data was separated by ',' but I was using ';'.

So I simply used:

Df = pd.read_csv(r'C:\Users\user\Desktop\data.csv', sep=",")
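
To see why a separator mismatch produces a "not in index" error, here is a small sketch with made-up, semicolon-separated data (an assumption for illustration only). With the wrong `sep`, the whole header line becomes one column label, so none of the expected names exist. Passing `sep=None` with `engine="python"` asks pandas to sniff the delimiter instead.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated data.
raw = "id;name;city\n1;Ana;Recife\n2;Rui;Belém\n"

# Wrong separator: the header is read as a single label.
wrong = pd.read_csv(io.StringIO(raw), sep=",")
print(list(wrong.columns))

# Matching separator: three separate columns.
right = pd.read_csv(io.StringIO(raw), sep=";")
print(list(right.columns))

# Let the Python engine detect the delimiter.
sniffed = pd.read_csv(io.StringIO(raw), sep=None, engine="python")
print(list(sniffed.columns))
```

Printing `list(df.columns)` right after loading is a cheap way to catch this before any column selection.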

"column " and "column"

are two different things; the first has a trailing space. So simply include the space wherever the column name actually has one.

E.g.:
df["column "] worked for me
