Pandas: When using to_sql(), continue if duplicate primary keys are detected?

Created on 13 Apr 2017  ·  19 Comments  ·  Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

df.to_sql('TableNameHere', engine, if_exists='append', chunksize=900, index=False)

Problem description

I am trying to append a large DataFrame to a SQL table. Some of the rows in the DataFrame are duplicates of those in the SQL table, some are not. But to_sql() completely stops executing if even one duplicate is detected.

It would make sense for to_sql(if_exists='append') to merely warn the user which rows had duplicate keys and just continue to add the new rows, not completely stop executing. For large datasets, you will likely have duplicates but want to ignore them.

Maybe add an argument to ignore duplicates and keep executing? Perhaps an additional if_exists option like 'append_skipdupes'?
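For context, here is a minimal sketch of the pre-filtering currently needed to get this behaviour by hand (illustrative assumptions: a single key column named id and a table small enough to read the existing keys back):

import pandas as pd

# Read back the keys already present in the target table
# (assumes a single key column `id`; for very large tables this gets expensive).
existing = pd.read_sql('SELECT id FROM TableNameHere', engine)

# Keep only the rows whose key is not in the table yet, then append as usual.
new_rows = df[~df['id'].isin(existing['id'])]
new_rows.to_sql('TableNameHere', engine, if_exists='append', chunksize=900, index=False)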

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: English_United States.1252

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.12.0
scipy: None
statsmodels: None
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.5
boto: None
pandas_datareader: None

Labels: Enhancement, IO SQL

All 19 comments

This should support the "on duplicate update" mode as well.

@rosstripi I think the idea to have this would certainly be accepted, but AFAIK the main bottleneck is an implementation for this using sql/sqlalchemy in a flavor-agnostic way. Some exploration of how this could be done is certainly welcome!
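One possible avenue (a sketch only, not existing pandas behaviour): since pandas 0.24, to_sql accepts a method callable, so a dialect-specific "skip duplicates" insert can be plugged in per flavor. The example below assumes PostgreSQL, SQLAlchemy, and a unique constraint on an assumed column named id; MySQL would need INSERT IGNORE / ON DUPLICATE KEY instead, which is exactly the flavor-specific part that makes a generic solution hard.

from sqlalchemy.dialects.postgresql import insert

def insert_do_nothing(table, conn, keys, data_iter):
    # to_sql `method` callable: build an INSERT ... ON CONFLICT DO NOTHING
    # so rows that violate the unique constraint are silently skipped.
    rows = [dict(zip(keys, row)) for row in data_iter]
    stmt = (
        insert(table.table)
        .values(rows)
        .on_conflict_do_nothing(index_elements=["id"])  # assumed key column
    )
    conn.execute(stmt)

df.to_sql('TableNameHere', engine, if_exists='append', chunksize=900,
          index=False, method=insert_do_nothing)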

Hi, did you figure out any workaround for this? Please let me know.

Any update on this implementation?

I am now facing this problem with PostgreSQL and SQLAlchemy and would love to have that "on duplicate update".

Thanks for the work

A workaround would be to remove the unique index in the database:

sqlquery = "ALTER TABLE `DATABASE`.`TABLE` DROP INDEX `idx_name`"

Afterwards
df.to_sql('TableNameHere', engine, if_exists='append', chunksize=900, index=False)
can be executed.

Then let your MySQL server add the index again and drop the duplicates:
sqlquery = "ALTER IGNORE TABLE `DATABASE`.`TABLE` ADD UNIQUE INDEX `idx_name` (`column_name1` ASC, `column_name2` ASC, `column_name3` [ASC | DESC])"

Depending on your specific application, this can be helpful.
Anyway, an if_exists option like append_skipdupes would be much better.
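A rough end-to-end sketch of that round trip with SQLAlchemy (database, table, index and column names are placeholders, and ALTER IGNORE TABLE is only available on older MySQL versions and MariaDB):

from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:pwd@localhost/DATABASE")  # placeholder DSN

with engine.begin() as conn:
    # 1. Drop the unique index so duplicate rows no longer abort the insert.
    conn.execute(text("ALTER TABLE `DATABASE`.`TABLE` DROP INDEX `idx_name`"))

# 2. Append everything, duplicates included.
df.to_sql('TABLE', engine, if_exists='append', chunksize=900, index=False)

with engine.begin() as conn:
    # 3. Re-create the unique index; IGNORE silently removes the duplicate rows
    #    (MySQL <= 5.6 / MariaDB only).
    conn.execute(text(
        "ALTER IGNORE TABLE `DATABASE`.`TABLE` "
        "ADD UNIQUE INDEX `idx_name` (`column_name1` ASC, `column_name2` ASC)"
    ))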

append_skipdupes would be the perfect way to handle this.

yes, append_skipdupes +1

Agreed that it would be good to be able to deal with this with options in df.to_sql().

Here's the workaround I use in sqlite:

CREATE TABLE IF NOT EXISTS my_table_name (
    some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
    ...

Then, when I insert duplicates, they get silently ignored, and the non-duplicates are processed correctly. In my case, the data are (i.e. should be) static, so I don't need to update. It's just that the form of the data feed is such that I'll get duplicates that are ignorable.
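A minimal self-contained sketch of that pattern (table and column names are made up for illustration):

import sqlite3
import pandas as pd

conn = sqlite3.connect("example.db")

# Create the table up front with the conflict clause, so that
# to_sql(if_exists='append') inserts into it instead of creating its own schema.
conn.execute("""
    CREATE TABLE IF NOT EXISTS my_table_name (
        some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
        value TEXT
    )
""")
conn.commit()

df = pd.DataFrame({"some_kind_of_id": [1, 1, 2], "value": ["a", "a", "b"]})

# The duplicate id 1 is silently ignored by SQLite; the other rows are inserted.
df.to_sql("my_table_name", conn, if_exists="append", index=False)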

Another workaround with MariaDB and MySQL:
df.to_csv("test.csv")
then use:
LOAD DATA INFILE 'test.csv' IGNORE INTO TABLE mytable or
LOAD DATA INFILE 'test.csv' REPLACE INTO TABLE mytable.

LOAD DATA is much faster than INSERT.

Complete code:

from pathlib import Path
from sqlalchemy import text

csv_path = str(Path(application_path) / "tmp" / "tmp.csv").replace("\\", "\\\\")
df.to_csv(csv_path, index=False, sep='\t', quotechar="'", na_rep=r'\N')
rq = """LOAD DATA LOCAL INFILE '{file_path}' REPLACE INTO TABLE {db}.{db_table}
        FIELDS TERMINATED BY '\\t'
        LINES TERMINATED BY '\\r\\n'
        IGNORE 1 LINES
        ({col});
        """.format(db=db,
                   file_path=csv_path,
                   db_table=table_name,
                   col=",".join(df.columns.tolist()))

# Run the statement; the MySQL connection must be created with LOCAL INFILE enabled.
with engine.begin() as conn:
    conn.execute(text(rq))

I believe this is being addressed in #29636 with the upsert_ignore argument, which addresses #14553.

append_skipdupes +1

+1 for append_skipdupes

Agree 'append_skipdupes' should be added.

Yes, please. 'append_skipdupes' should be added, and not only for the primary key column. If there are duplicates on other unique columns, it should skip appending those new duplicate rows as well.

+1 for append_skipdupes

append_skipdupes +1

append_skipdupes +1

+1 for append_skipdupes

In the meantime you can use pangres: https://pypi.org/project/pangres/
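For reference, a rough sketch of how that library can be used here (assumptions: pangres exposes upsert() with an if_row_exists parameter and uses the DataFrame index as the primary key; check the pangres documentation for the exact signature of your version):

import pandas as pd
from sqlalchemy import create_engine
from pangres import upsert

engine = create_engine("postgresql://user:pwd@localhost/mydb")  # placeholder DSN

# pangres maps the DataFrame index to the primary key of the target table.
df = pd.DataFrame({"value": ["a", "b"]},
                  index=pd.Index([1, 2], name="some_kind_of_id"))

# Insert new rows and silently skip rows whose key already exists
# ("update" would overwrite them instead).
upsert(engine, df=df, table_name="my_table_name", if_row_exists="ignore")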
