Pandas: When using to_sql(), continue if duplicate primary keys are detected?

Created on 13 Apr 2017  ·  19 Comments  ·  Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

df.to_sql('TableNameHere', engine, if_exists='append', chunksize=900, index=False)

Problem description

I am trying to append a large DataFrame to a SQL table. Some of the rows in the DataFrame are duplicates of those in the SQL table, some are not. But to_sql() completely stops executing if even one duplicate is detected.

It would make sense for to_sql(if_exists='append') to merely warn the user which rows had duplicate keys and just continue to add the new rows, not completely stop executing. For large datasets, you will likely have duplicates but want to ignore them.

Maybe add an argument to ignore duplicates and keep executing? Perhaps an additional if_exists option like 'append_skipdupes'?
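For context, here is a minimal sketch of the pre-filtering currently needed to get this behaviour by hand (illustrative assumptions: a single key column named id and a table small enough to read the existing keys back):

import pandas as pd

# Read back the keys already present in the target table
# (assumes a single key column `id`; for very large tables this gets expensive).
existing = pd.read_sql('SELECT id FROM TableNameHere', engine)

# Keep only the rows whose key is not in the table yet, then append as usual.
new_rows = df[~df['id'].isin(existing['id'])]
new_rows.to_sql('TableNameHere', engine, if_exists='append', chunksize=900, index=False)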

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: English_United States.1252

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.12.0
scipy: None
statsmodels: None
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.5
boto: None
pandas_datareader: None

Labels: Enhancement, IO SQL

All 19 comments

This should support the "on duplicate update" mode as well.

@rosstripi I think the idea to have this would certainly be accepted, but AFAIK the main bottleneck is an implementation for this using sql/sqlalchemy in a flavor-agnostic way. Some exploration of how this could be done is certainly welcome!
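One possible avenue (a sketch only, not existing pandas behaviour): since pandas 0.24, to_sql accepts a method callable, so a dialect-specific "skip duplicates" insert can be plugged in per flavor. The example below assumes PostgreSQL, SQLAlchemy, and a unique constraint on an assumed column named id; MySQL would need INSERT IGNORE / ON DUPLICATE KEY instead, which is exactly the flavor-specific part that makes a generic solution hard.

from sqlalchemy.dialects.postgresql import insert

def insert_do_nothing(table, conn, keys, data_iter):
    # to_sql `method` callable: build an INSERT ... ON CONFLICT DO NOTHING
    # so rows that violate the unique constraint are silently skipped.
    rows = [dict(zip(keys, row)) for row in data_iter]
    stmt = (
        insert(table.table)
        .values(rows)
        .on_conflict_do_nothing(index_elements=["id"])  # assumed key column
    )
    conn.execute(stmt)

df.to_sql('TableNameHere', engine, if_exists='append', chunksize=900,
          index=False, method=insert_do_nothing)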

Hi, did you figure out any workaround for this? Please let me know.

Any update on this implementation?

I am now facing this problem with PostgreSQL and SQLAlchemy and would love to have that "on duplicate update".

Thanks for the work

A workaround would be to remove the unique index in the database:

sqlquery = "ALTER TABLE `DATABASE`.`TABLE` DROP INDEX `idx_name`"

Afterwards
df.to_sql('TableNameHere', engine, if_exists='append', chunksize=900, index=False)
can be executed.

Then let your MySQL server add the index again and drop the duplicates:
sqlquery = "ALTER IGNORE TABLE `DATABASE`.`TABLE` ADD UNIQUE INDEX `idx_name` (`column_name1` ASC, `column_name2` ASC, `column_name3` [ASC | DESC])"

Depending on your specific application, this can be helpful.
Anyway, an if_exists option like append_skipdupes would be much better.
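A rough end-to-end sketch of that round trip with SQLAlchemy (database, table, index and column names are placeholders, and ALTER IGNORE TABLE is only available on older MySQL versions and MariaDB):

from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:pwd@localhost/DATABASE")  # placeholder DSN

with engine.begin() as conn:
    # 1. Drop the unique index so duplicate rows no longer abort the insert.
    conn.execute(text("ALTER TABLE `DATABASE`.`TABLE` DROP INDEX `idx_name`"))

# 2. Append everything, duplicates included.
df.to_sql('TABLE', engine, if_exists='append', chunksize=900, index=False)

with engine.begin() as conn:
    # 3. Re-create the unique index; IGNORE silently removes the duplicate rows
    #    (MySQL <= 5.6 / MariaDB only).
    conn.execute(text(
        "ALTER IGNORE TABLE `DATABASE`.`TABLE` "
        "ADD UNIQUE INDEX `idx_name` (`column_name1` ASC, `column_name2` ASC)"
    ))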

append_skipdupes would be the perfect way to handle this.

yes, append_skipdupes +1

Agreed that it would be good to be able to deal with this with options in df.to_sql().

Here's the workaround I use in sqlite:

CREATE TABLE IF NOT EXISTS my_table_name (
    some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
    ...

Then, when I insert duplicates, they get silently ignored, and the non-duplicates are processed correctly. In my case, the data are (i.e. should be) static, so I don't need to update. It's just that the form of the data feed is such that I'll get duplicates that are ignorable.
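A minimal self-contained sketch of that pattern (table and column names are made up for illustration):

import sqlite3
import pandas as pd

conn = sqlite3.connect("example.db")

# Create the table up front with the conflict clause, so that
# to_sql(if_exists='append') inserts into it instead of creating its own schema.
conn.execute("""
    CREATE TABLE IF NOT EXISTS my_table_name (
        some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
        value TEXT
    )
""")
conn.commit()

df = pd.DataFrame({"some_kind_of_id": [1, 1, 2], "value": ["a", "a", "b"]})

# The duplicate id 1 is silently ignored by SQLite; the other rows are inserted.
df.to_sql("my_table_name", conn, if_exists="append", index=False)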

Another workaround with MariaDB and MySQL:
df.to_csv("test.csv")
then use:
LOAD DATA INFILE 'test.csv' IGNORE INTO TABLE mytable or
LOAD DATA INFILE 'test.csv' REPLACE INTO TABLE mytable.

LOAD DATA is much faster than INSERT.

Complete code:

from pathlib import Path
from sqlalchemy import text

csv_path = str(Path(application_path) / "tmp" / "tmp.csv").replace("\\", "\\\\")
df.to_csv(csv_path, index=False, sep='\t', quotechar="'", na_rep=r'\N')
rq = """LOAD DATA LOCAL INFILE '{file_path}' REPLACE INTO TABLE {db}.{db_table}
        FIELDS TERMINATED BY '\\t'
        LINES TERMINATED BY '\\r\\n'
        IGNORE 1 LINES
        ({col});
        """.format(db=db,
                   file_path=csv_path,
                   db_table=table_name,
                   col=",".join(df.columns.tolist()))

# Run the statement; the MySQL connection must be created with LOCAL INFILE enabled.
with engine.begin() as conn:
    conn.execute(text(rq))

I believe this is being addressed in #29636 with the upsert_ignore argument, which addresses #14553.

append_skipdupes +1

+1 for append_skipdupes

Agree 'append_skipdupes' should be added.

Yes, please. 'append_skipdupes' should be added, and not only for the primary key column. If there are duplicates on other unique columns, it should skip appending those new duplicate rows as well.

+1 for append_skipdupes

append_skipdupes +1

append_skipdupes +1

+1 for append_skipdupes

In the meantime you can use pangres: https://pypi.org/project/pangres/
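For reference, a rough sketch of how that library can be used here (assumptions: pangres exposes upsert() with an if_row_exists parameter and uses the DataFrame index as the primary key; check the pangres documentation for the exact signature of your version):

import pandas as pd
from sqlalchemy import create_engine
from pangres import upsert

engine = create_engine("postgresql://user:pwd@localhost/mydb")  # placeholder DSN

# pangres maps the DataFrame index to the primary key of the target table.
df = pd.DataFrame({"value": ["a", "b"]},
                  index=pd.Index([1, 2], name="some_kind_of_id"))

# Insert new rows and silently skip rows whose key already exists
# ("update" would overwrite them instead).
upsert(engine, df=df, table_name="my_table_name", if_row_exists="ignore")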
