Pandas: df.duplicated and drop_duplicates raise TypeError with set and list values.

Created on 22 Mar 2016  ·  3 comments  ·  Source: pandas-dev/pandas

IN:

import pandas as pd
df = pd.DataFrame([[{'a', 'b'}], [{'b','c'}], [{'b', 'a'}]])
df

OUT:

    0
0   {a, b}
1   {c, b}
2   {a, b}

IN:

df.duplicated()

OUT:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-77-7cc63ba1ed41> in <module>()
----> 1 df.duplicated()

venv/lib/python3.5/site-packages/pandas/util/decorators.py in wrapper(*args, **kwargs)
     89                 else:
     90                     kwargs[new_arg_name] = new_arg_value
---> 91             return func(*args, **kwargs)
     92         return wrapper
     93     return _deprecate_kwarg

venv/lib/python3.5/site-packages/pandas/core/frame.py in duplicated(self, subset, keep)
   3100 
   3101         vals = (self[col].values for col in subset)
-> 3102         labels, shape = map(list, zip(*map(f, vals)))
   3103 
   3104         ids = get_group_index(labels, shape, sort=False, xnull=False)

TypeError: type object argument after * must be a sequence, not map

I expect:

0    False
1    False
2     True
dtype: bool

pd.show_versions() output:

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.3.0-1-amd64
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: ru_RU.UTF-8

pandas: 0.18.0
nose: None
pip: 1.5.6
setuptools: 18.8
Cython: None
numpy: 1.10.4
scipy: None
statsmodels: None
xarray: None
IPython: 4.1.2
sphinx: None
patsy: None
dateutil: 2.5.1
pytz: 2016.1
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: None
Labels: Bug, Missing-data

All 3 comments

I guess you are using list-like values inside the cells of a frame. This is quite inefficient and not generally supported. Pull requests to fix this are accepted in any event.

Current pandas gives a slightly different TypeError (TypeError: unhashable type: 'set'), which gets to the point: how would you deduplicate sets or lists? Unlike tuples and primitive types, they are not hashable (though sets can be converted to frozensets, which are), so you have to come up with a deduplication strategy.

In any case, since you're dealing with an object dtype, there is no guarantee that the next row won't contain a set or a list, so deduplication only gets harder from there. Pandas therefore treats each value individually and processes them for as long as they are hashable. Try a column with three tuples and it will work; change the last one to a set and it will fail on that very value, as the sketch below shows.
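
A minimal, runnable illustration of that behaviour (column values chosen for illustration; the exact error message depends on the pandas version):

import pandas as pd

# Tuples are hashable, so duplicated() works fine:
df_ok = pd.DataFrame([[(1, 2)], [(2, 3)], [(1, 2)]])
df_ok.duplicated()  # 0 False, 1 False, 2 True

# Replace the last tuple with a set and the same call raises:
df_bad = pd.DataFrame([[(1, 2)], [(2, 3)], [{1, 2}]])
df_bad.duplicated()  # TypeError: unhashable type: 'set'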

So I'm not sure there's a solid implementation that would work here, given that lists aren't hashable at all. There could potentially be a fix for sets, converting them to frozensets upon hash-table insertion, but that does seem hacky and arbitrary.
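
As a user-side workaround for sets specifically, mapping them to frozensets before deduplicating works, since frozensets are hashable and compare equal regardless of element order. A minimal sketch for the single-column frame from the report:

import pandas as pd

df = pd.DataFrame([[{'a', 'b'}], [{'b', 'c'}], [{'b', 'a'}]])

# frozenset({'a', 'b'}) == frozenset({'b', 'a'}), so row 2 is flagged:
dup = df[0].map(frozenset).duplicated()
# 0    False
# 1    False
# 2     True
# dtype: bool

df[~dup]  # drops the duplicated row, keeping the original sets intact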

How about ignoring unhashable columns for the purposes of dropping duplicates?
Like adding a kwarg 'unhashable_type' whose default is 'raise' (which behaves as it does now), but which can be set to 'ignore' (at the risk of dropping rows which aren't entirely duplicated).
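
That kwarg does not exist in pandas; a rough user-side approximation of the proposed 'ignore' behaviour could look like the following (hypothetical helper, with the caveat noted above that rows differing only in their unhashable columns get dropped):

import pandas as pd

def drop_duplicates_ignoring_unhashable(df):
    # Collect the columns whose values are all hashable.
    hashable_cols = []
    for col in df.columns:
        try:
            df[col].map(hash)  # probe every value for hashability
            hashable_cols.append(col)
        except TypeError:
            pass  # sets, lists, dicts, ... end up here
    if not hashable_cols:
        return df  # nothing safe to deduplicate on
    # Deduplicate on the hashable columns only.
    return df.drop_duplicates(subset=hashable_cols)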
