Suppose you have an existing SQL table called `person_age`, where `id` is the primary key:

| id | age |
| --- | --- |
| 1 | 18 |
| 2 | 42 |
and you also have new data in a DataFrame called `extra_data`:

| id | age |
| --- | --- |
| 2 | 44 |
| 3 | 95 |
Then it would be useful to have an option on `extra_data.to_sql()` that allows passing the DataFrame to SQL with an INSERT or UPDATE option on the rows, based on the primary key. In this case, the `id=2` row would get updated to `age=44` and the `id=3` row would get added:

| id | age |
| --- | --- |
| 1 | 18 |
| 2 | 44 |
| 3 | 95 |
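The desired result can be sketched in memory with `combine_first`, purely as an illustration of the semantics (this does not replace the SQL-side upsert being requested):

```python
import pandas as pd

sql_df = pd.DataFrame({"id": [1, 2], "age": [18, 42]}).set_index("id")
extra_data = pd.DataFrame({"id": [2, 3], "age": [44, 95]}).set_index("id")

# extra_data wins wherever the primary key clashes; all other rows are kept
merged = extra_data.combine_first(sql_df).astype(int)
print(merged["age"].to_dict())  # {1: 18, 2: 44, 3: 95}
```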
Could this perhaps be done with `merge` from SQLAlchemy? I looked at the pandas `sql.py` source code to come up with a solution, but I couldn't follow it. (Apologies for mixing `sqlalchemy` and `sqlite3` below.)
```python
import pandas as pd
from sqlalchemy import create_engine
import sqlite3

conn = sqlite3.connect('example.db')
c = conn.cursor()
c.execute('''DROP TABLE IF EXISTS person_age;''')
c.execute('''
CREATE TABLE person_age
(id INTEGER PRIMARY KEY ASC, age INTEGER NOT NULL)
''')
conn.commit()
conn.close()

##### Create original table
engine = create_engine("sqlite:///example.db")
sql_df = pd.DataFrame({'id': [1, 2], 'age': [18, 42]})
sql_df.to_sql('person_age', engine, if_exists='append', index=False)

#### Extra data to insert/update
extra_data = pd.DataFrame({'id': [2, 3], 'age': [44, 95]})
extra_data.set_index('id', inplace=True)

#### extra_data.to_sql() with a row update-or-insert option would produce:
expected_df = pd.DataFrame({'id': [1, 2, 3], 'age': [18, 44, 95]})
expected_df.set_index('id', inplace=True)
```
This would be nice functionality, but the main problem is that we want it to be database-flavor independent and based on SQLAlchemy core (not the SQLAlchemy ORM) for inclusion in pandas itself, which will make this difficult to implement.
Yeah, I think this is out of scope for pandas since upserts aren't supported by all db engines.
While an `INSERT OR UPDATE` isn't supported by all engines, an `INSERT OR REPLACE` can be made engine-agnostic by deleting rows from the target table for the set of primary keys in the DataFrame index, followed by an insert of all rows in the DataFrame. You'd want to do this in a transaction.
@TomAugspurger Could we add the upsert option for supported db engines and throw an error for unsupported ones?
I'd like to see this as well. I am caught between using pure SQL and SQLAlchemy (I haven't gotten this to work yet; I think it has something to do with how I pass the dicts). I use psycopg2 COPY to bulk insert, but I would love to use pd.to_sql for tables where values might change over time and I don't mind it inserting a bit slower.

```python
insert_values = df.to_dict(orient='records')
insert_statement = sqlalchemy.dialects.postgresql.insert(table).values(insert_values)
upsert_statement = insert_statement.on_conflict_do_update(
    constraint='fact_case_pkey',
    set_=df.to_dict(orient='dict')
)
```
And pure SQL:

```python
def create_update_query(df, table=FACT_TABLE):
    """Builds an INSERT ... ON CONFLICT DO UPDATE query for the given DataFrame"""
    columns = ', '.join([f'{col}' for col in DATABASE_COLUMNS])
    constraint = ', '.join([f'{col}' for col in PRIMARY_KEY])
    placeholder = ', '.join([f'%({col})s' for col in DATABASE_COLUMNS])
    updates = ', '.join([f'{col} = EXCLUDED.{col}' for col in DATABASE_COLUMNS])
    query = f"""INSERT INTO {table} ({columns})
    VALUES ({placeholder})
    ON CONFLICT ({constraint})
    DO UPDATE SET {updates};"""
    query = ' '.join(query.split())
    return query

def load_updates(df, connection=DATABASE):
    """Upserts a DataFrame row by row into Postgres

    :param df: The dataframe which is written to the database
    :param connection: Refers to a PostgresHook
    """
    conn = connection.get_conn()
    cursor = conn.cursor()
    df1 = df.where((pd.notnull(df)), None)
    insert_values = df1.to_dict(orient='records')
    for row in insert_values:
        cursor.execute(create_update_query(df), row)
        conn.commit()
    cursor.close()
    del cursor
    conn.close()
```
@ldacey this style worked for me (`insert_statement.excluded` is an alias to the row of data that violated the constraint):

```python
insert_values = merged_transactions_channels.to_dict(orient='records')
insert_statement = sqlalchemy.dialects.postgresql.insert(orders_to_channels).values(insert_values)
upsert_statement = insert_statement.on_conflict_do_update(
    constraint='orders_to_channels_pkey',
    set_={'channel_owner': insert_statement.excluded.channel_owner}
)
```
@cdagnino This snippet might not work in the case of composite keys; that scenario has to be taken care of as well. I'll try to find a way to do the same.
One way to solve this update issue is to use SQLAlchemy's `bulk_update_mappings`. This function takes in a list of dictionary values and updates each row based on the table's primary key.

```python
session.bulk_update_mappings(
    Table,
    pandas_df.to_dict(orient='records')
)
```
I agree with @neilfrndes; we shouldn't let a nice feature like this go unimplemented just because some DBs don't support it. Is there any chance this feature might happen?

Probably, if someone makes a PR. On further consideration, I don't think I'm opposed to this on the principle that some databases don't support it. However, I'm not too familiar with the sql code, so I'm not sure what the best approach is.
One possibility is to provide some examples for upserts using the `method` callable if this PR is introduced: https://github.com/pandas-dev/pandas/pull/21401

For postgres that would look something like (untested):

```python
from sqlalchemy.dialects import postgresql

def pg_upsert(table, conn, keys, data_iter):
    for row in data_iter:
        row_dict = dict(zip(keys, row))
        # `table` here is a pandas SQLTable; `table.table` is the underlying
        # SQLAlchemy Table object
        stmt = postgresql.insert(table.table).values(**row_dict)
        upsert_stmt = stmt.on_conflict_do_update(
            index_elements=table.index,
            set_=row_dict)
        conn.execute(upsert_stmt)
```

Something similar could be done for mysql.
For postgres I am using `execute_values`. In my case, my query is a jinja2 template that flags whether I should do an update set or do nothing. This has been quite quick and flexible. Not as fast as using COPY or `copy_expert`, but it works well.

```python
from psycopg2.extras import execute_values

df = df.where((pd.notnull(df)), None)
tuples = [tuple(x) for x in df.values]

with pg_conn:
    with pg_conn.cursor() as cur:
        execute_values(cur=cur,
                       sql=insert_query,
                       argslist=tuples,
                       template=None)
```
@danich1 can you please give an example of how this would work? I tried to look into `bulk_update_mappings` but I got really lost and couldn't make it work.
@cristianionescu92 An example would be this:
I have a table called User with the following fields: id and name.
| id | name |
| --- | --- |
| 0 | John |
| 1 | Joe |
| 2 | Harry |
I have a pandas data frame with the same columns but updated values:
| id | name |
| --- | --- |
| 0 | Chris |
| 1 | James |
Let's also assume that we have a session variable open to access the database. By calling this method:

```python
session.bulk_update_mappings(
    User,
    <pandas dataframe above>.to_dict(orient='records')
)
```

Pandas will convert the table into a list of dictionaries `[{id: 0, name: "Chris"}, {id: 1, name: "James"}]` that SQL will use to update the rows of the table. So the final table will look like:
| id | name |
| --- | --- |
| 0 | Chris |
| 1 | James |
| 2 | Harry |
Hi @danich1, and thanks a lot for your response. I figured out the mechanics of how the update would work myself. Unfortunately I don't know how to work with a session; I am quite a beginner.
Let me show you what I am doing:
```python
import pypyodbc
from timeit import default_timer as timer
# clean_df_db_dups and to_sql_newrows are 2 functions I found on GitHub
# (unfortunately I cannot remember the link). clean_df_db_dups excludes from a
# dataframe the rows which already exist in an SQL table by checking several
# key columns, and to_sql_newrows inserts the new rows into SQL.
from to_sql_newrows import clean_df_db_dups, to_sql_newrows
from sqlalchemy import create_engine

engine = create_engine("engine_connection_string")

# Write data to SQL
Tablename = 'Dummy_Table_Name'
Tablekeys = Tablekeys_string
dftoupdateorinsertinSQL = random_dummy_dataframe

# Connect to SQL Server db using pypyodbc
cnxn = pypyodbc.connect("Driver={SQL Server};"
                        "Server=ServerName;"
                        "Database=DatabaseName;"
                        "uid=userid;pwd=password")

start = timer()
newrowsdf = clean_df_db_dups(dftoupdateorinsertinSQL, Tablename, engine, dup_cols=Tablekeys)
newrowsdf.to_sql(Tablename, engine, if_exists='append', index=False, chunksize=140)
end = timer()

tablesize = (len(newrowsdf.index))
print('inserted %r rows' % (tablesize))
```
The above code basically excludes from a dataframe the rows which I already have in SQL and only inserts the new rows. What I need is to also update the rows which already exist. Can you please help me understand what I should do next?
**Motivation for a better `to_sql`**

Having `to_sql` integrate better with database practices is increasingly valuable as data science grows and gets mixed with data engineering. `upsert` is one of the most wanted features, in particular because many people find that the workaround is to use `replace` instead, which drops the table, and with it all the views and constraints.

The alternative I've seen among more experienced users is to stop using pandas at this stage, and this tends to propagate upstream and makes the pandas package lose retention among experienced users. Is this the direction pandas wants to go?

I understand we want `to_sql` to remain database-agnostic as much as possible and use core SQLAlchemy. A method that truncates or deletes instead of a true upsert would still add a lot of value, though.
**Integration with the pandas product vision**

A lot of the above debate happened before the introduction of the `method` argument (as mentioned by @kjford, with `psql_insert_copy`) and the possibility to pass a callable.

I'd happily contribute to either the core pandas functionality or, failing that, documentation on solutions / best practice for achieving upsert functionality within pandas, such as here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method
What is the preferred way forward for Pandas core dev / product managers?
I think we're open to an implementation that's engine-specific. The proposal to use `method='upsert'` seems reasonable, but at this point I think we need someone to come up with a clear design proposal.
I have a similar requirement where I want to update existing data in a MySQL table from multiple CSVs over time.
I thought I could use `df.to_sql()` to insert the new data into a newly created temporary table and then run a MySQL query to control how to append/update data in the existing table.
MySQL Reference: https://stackoverflow.com/questions/2472229/insert-into-select-from-on-duplicate-key-update?answertab=active#tab-top
Disclaimer: I started using Python and Pandas only a few days ago.
Hey Pandas folk: I have had this same issue, needing to frequently update my local database with records that I ultimately load and manipulate in pandas. I built a simple library to do this - it’s basically a stand-in for df.to_sql and pd.read_sql_table that uses the DataFrame index as a primary key by default. Uses sqlalchemy core only.
https://pypi.org/project/pandabase/0.2.1/
https://github.com/notsambeck/pandabase
This tool is fairly opinionated, probably not appropriate to include in Pandas as-is. But for my specific use case it solves the problem... if there is interest in massaging this to make it fit in Pandas I am happy to help.
For now, the following works (in the limited case of current-ish pandas and SQLAlchemy, a named index as primary key, SQLite or Postgres back end, and supported datatypes):

```
pip install pandabase
pandabase.to_sql(df, table_name, con_string, how='upsert')
```
Working on a general solution to this with cvonsteg. Planning to come back with a proposed design in October.
@TomAugspurger As suggested, @rugg2 and I have come up with the following design proposal for an `upsert` option in `to_sql()`.

Two new values to be added as possible `method` arguments in the `to_sql()` method:

1) `upsert_update`: on row match, update the row in the database (for knowingly updating records; represents most use cases)
2) `upsert_ignore`: on row match, do not update the row in the database (for cases where datasets have overlap and you do not want to override data in tables)
```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("connection string")
df = pd.DataFrame(...)
df.to_sql(
    name='table_name',
    con=engine,
    if_exists='append',
    method='upsert_update'  # (or 'upsert_ignore')
)
```
To implement this, the `SQLTable` class would receive 2 new private methods containing the upsert logic, which would be called from the `SQLTable.insert()` method:

```python
def insert(self, chunksize=None, method=None):
    # set insert method
    if method is None:
        exec_insert = self._execute_insert
    elif method == "multi":
        exec_insert = self._execute_insert_multi
    # new upsert methods <<<
    elif method == "upsert_update":
        exec_insert = self._execute_upsert_update
    elif method == "upsert_ignore":
        exec_insert = self._execute_upsert_ignore
    # >>>
    elif callable(method):
        exec_insert = partial(method, self)
    else:
        raise ValueError("Invalid parameter 'method': {}".format(method))
    ...
```
We propose the following implementation, with the rationale outlined below (all points are open for discussion):

- Implement the upsert as a `DELETE` followed by an `INSERT`, executed in a transaction, since there is no flavour-agnostic `upsert` in SQLAlchemy core and implementations can vary across flavours
- For `upsert_ignore`, these operations would obviously be skipped on matching records
- Match rows on primary keys only: if rows were matched against a column without a `UNIQUE` constraint, then it is plausible that multiple rows may match the upsert condition. In this case, no upsert should be performed, as it is ambiguous which record should be updated. To enforce this from pandas, each row would need to be individually assessed to check that only 1 or 0 rows match before it is inserted. While this functionality is reasonably straightforward to implement, it would require a read and a write operation per record (plus a delete if a clash is found), which feels highly inefficient for larger datasets.

@TomAugspurger, if the `upsert` proposal designed with @cvonsteg suits you, we will proceed with the implementation in code (incl. tests) and raise a pull request.

Let us know if you would like to proceed differently.
Reading through the proposal is on my todo list. I'm a bit behind on my email right now.
I personally don't have anything against it, so I think a PR is welcome. One implementation across all DBMSs using SQLAlchemy core is certainly how this should start if I am reading your points correctly, and same with just primary keys.
Always easier to start small and focused and expand from there
need this feature badly.
The PR @cvonsteg and I wrote should now provide the functionality: it's down to reviews now!
This functionality would be absolutely glorious! I'm not too versed in the vocabulary of github; does the comment by @rugg2 that the functionality is "down to reviews now" mean that it's down to the pandas team to review it? And if it's approved, does that mean it will become available via a new version of pandas that we can install, or would we be required to apply the commit manually ourselves via git? (I have had problems with this through conda so if that's the case I'd like to get up to speed by the time this functionality is ready). Thank you!!
@pmgh2345 - yep, as you said, "down to reviews now" means a pull request has been raised and is under review from the core devs. You can see the PR mentioned above (#29636). Once it is approved, you could technically fork the branch with the updated code and compile your own local version of pandas with the functionality built in. However, I would personally recommend waiting until it's been merged into master and released, and then just pip installing the newest version of pandas.
> PR we wrote with cvonsteg should now give the functionality: down to reviews now!
It might be worth adding a new parameter to the `to_sql` method, rather than using `if_exists`. The reason being, `if_exists` is checking for table existence, not row existence.

@cvonsteg originally proposed using `method=`, which would avoid the ambiguity of having two meanings for `if_exists`.
```python
df.to_sql(
    name='table_name',
    con=engine,
    if_exists='append',
    method='upsert_update'  # (or 'upsert_ignore')
)
```
@brylie We could add a new parameter, that's true, but as you know each new parameter makes an API more clunky. There is a tradeoff.

If we have to choose among the current parameters: as you said, we initially thought of using the `method` argument, but after more thought we realised that both (1) the usage and (2) the logic better fit the `if_exists` argument.

1) From an API usage point of view: the user will want to choose both `method="multi"` or `None` on the one hand, and "upsert" on the other. However, there aren't equivalently strong use cases for using the "upsert" functionality at the same time as `if_exists="append"` or `if_exists="replace"`, if any.

2) From a logic point of view

Let me know if I understood your point well, and please shout if you think the current implementation under review (PR #29636) would be a net negative!
Yep, you understand my point. The current implementation is a net positive but slightly diminished by ambiguous semantics.
I still maintain that `if_exists` should continue to refer to only one thing: table existence. Having ambiguity in the parameters negatively impacts readability and might lead to convoluted internal logic. Whereas adding a new parameter, like `upsert=True`, is clear and explicit.
Hello!

If you want to see a non-agnostic implementation for doing upserts, I have an example in my library pangres. It handles PostgreSQL and MySQL using SQLAlchemy functions specific to those database types. As for SQLite (and other database types allowing a similar upsert syntax), it uses a compiled regular SQLAlchemy Insert.

I share this thinking it might give a few ideas to collaborators (I am aware, though, that we want this to be SQL-type agnostic, which makes a lot of sense). Also, perhaps a speed comparison would be interesting when the PR of @cvonsteg goes through. Please mind I am not a long-time SQLAlchemy expert or such!
I really want this feature. I agree that a `method='upsert_update'` option is a good idea.
Is this still planned? Pandas really need this feature
Yes, this is still planned, and we're almost there! Code is written, but there is one test that doesn't pass. Help welcomed!
https://github.com/pandas-dev/pandas/pull/29636
Hello! Is the functionality ready or is something still missing? If something is still missing, please let me know if I can help with anything!
Any news?))
Coming from the Java world, I never thought this simple missing functionality might turn my codebase upside down.
Hi everyone,
I've looked into how upserts are implemented in SQL across dialects and found a number of techniques that can inform design decisions here. But first, I want to warn against using DELETE ... INSERT logic. If there are foreign keys or triggers, other records across the database will end up being deleted or otherwise messed up. In MySQL, REPLACE does the same damage. I've actually created hours of work for myself fixing data because I used REPLACE. So, that said, here are the techniques implemented in SQL:
Dialect | Technique
-- | --
MySQL | INSERT ... ON DUPLICATE KEY UPDATE
PostgreSQL | INSERT ... ON CONFLICT
SQLite | INSERT ... ON CONFLICT
Db2 | MERGE
SQL Server | MERGE
Oracle | MERGE
SQL:2016 | MERGE
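As a concrete illustration of the SQLite row from the table above, the native `INSERT ... ON CONFLICT DO UPDATE` syntax (available in SQLite 3.24+, which ships with current Python builds) can be exercised directly from the stdlib `sqlite3` module; table and values below are just the `person_age` example from this thread:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person_age (id INTEGER PRIMARY KEY, age INTEGER NOT NULL)")
conn.executemany("INSERT INTO person_age VALUES (?, ?)", [(1, 18), (2, 42)])

# Native upsert: insert, or update age when the primary key already exists.
# `excluded` refers to the row that would have been inserted.
conn.executemany(
    "INSERT INTO person_age (id, age) VALUES (?, ?) "
    "ON CONFLICT(id) DO UPDATE SET age = excluded.age",
    [(2, 44), (3, 95)],
)

rows = conn.execute("SELECT id, age FROM person_age ORDER BY id").fetchall()
print(rows)  # [(1, 18), (2, 44), (3, 95)]
```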
With wildly varying syntax, I understand the temptation to use DELETE ... INSERT to make the implementation dialect agnostic. But there's another way: we can imitate the logic of the MERGE statement using a temp table and basic INSERT and UPDATE statements. The SQL:2016 MERGE syntax is as follows:
```sql
MERGE INTO target_table
USING source_table
ON search_condition
WHEN MATCHED THEN
    UPDATE SET col1 = value1, col2 = value2, ...
WHEN NOT MATCHED THEN
    INSERT (col1, col2, ...)
    VALUES (value1, value2, ...);
```
Borrowed from the Oracle Tutorial and adjusted to conform to the SQL Wikibook.
Since every dialect supported by SQLAlchemy supports temp tables, a safer, dialect-agnostic approach to doing an upsert would be to, in a single transaction:

1. Create a temp table shaped like the target table
2. Insert the DataFrame into the temp table
3. `INSERT` the rows from the temp table whose keys do not exist in the target table
4. `UPDATE` (via a join with the temp table) the target rows whose keys match
5. Drop the temp table

Besides being a dialect-agnostic technique, it also has the advantage of being open to extension, by allowing the end user to choose how to insert or how to update the data, as well as on what key to join the data. While the syntax of temp tables and update joins might differ slightly between dialects, they should be supported everywhere.
Below is a proof of concept I wrote for MySQL:

```python
import uuid
import pandas as pd
from sqlalchemy import create_engine

# This proof of concept uses this sample database
# https://downloads.mysql.com/docs/world.sql.zip

# Arbitrary, unique temp table name to avoid possible collision
source = str(uuid.uuid4()).split('-')[-1]
# Table we're doing our upsert against
target = 'countrylanguage'

db_url = 'mysql://<{user: }>:<{passwd: }>.@<{host: }>/<{db: }>'

df = pd.read_sql(f'SELECT * FROM `{target}`;', db_url)

# Change for UPDATE, 5.3->5.4
df.at[0, 'Percentage'] = 5.4
# Change for INSERT
df = df.append(
    {'CountryCode': 'ABW', 'Language': 'Arabic', 'IsOfficial': 'F', 'Percentage': 0.0},
    ignore_index=True
)

# List of PRIMARY or UNIQUE keys
key = ['CountryCode', 'Language']

# Do all of this in a single transaction
engine = create_engine(db_url)
with engine.begin() as con:
    # Create temp table like target table to stage data for upsert
    con.execute(f'CREATE TEMPORARY TABLE `{source}` LIKE `{target}`;')
    # Insert dataframe into temp table
    df.to_sql(source, con, if_exists='append', index=False, method='multi')
    # INSERT where the key doesn't match (new rows)
    con.execute(f'''
        INSERT INTO `{target}`
        SELECT * FROM `{source}`
        WHERE (`{'`, `'.join(key)}`) NOT IN (SELECT `{'`, `'.join(key)}` FROM `{target}`);
    ''')
    # Create a doubled list of tuples of non-key columns to template the update statement
    non_key_columns = [(i, i) for i in df.columns if i not in key]
    # Whitespace for aesthetics
    whitespace = '\n\t\t\t'
    # Do an UPDATE ... JOIN to set all non-key columns of target to equal source
    con.execute(f'''
        UPDATE `{target}` `t`
        JOIN `{source}` `s` ON `t`.`{"` AND `t`.`".join(["`=`s`.`".join(i) for i in zip(key, key)])}`
        SET `t`.`{f"`,{whitespace}`t`.`".join(["`=`s`.`".join(i) for i in non_key_columns])}`;
    ''')
    # Drop our temp table
    con.execute(f'DROP TABLE `{source}`;')
```
Here, I make a few assumptions (notably, that the DataFrame's columns match the target table's schema and that the primary or unique key columns are known upfront). Despite the assumptions, I hope my MERGE-inspired technique informs efforts to build a flexible, robust upsert option.
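The same staged-merge pattern can be sketched in a dialect-portable way with the stdlib `sqlite3` module (hypothetical `target`/`staging` table names; a correlated subquery is used for the update join so it works on old SQLite versions too):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, age INTEGER)")
conn.executemany("INSERT INTO target VALUES (?, ?)", [(1, 18), (2, 42)])

with conn:  # single transaction
    # 1. Stage incoming rows in a temp table (this is where df.to_sql would write)
    conn.execute("CREATE TEMPORARY TABLE staging (id INTEGER PRIMARY KEY, age INTEGER)")
    conn.executemany("INSERT INTO staging VALUES (?, ?)", [(2, 44), (3, 95)])
    # 2. Update target rows whose key already exists
    conn.execute("""
        UPDATE target
        SET age = (SELECT s.age FROM staging s WHERE s.id = target.id)
        WHERE id IN (SELECT id FROM staging)
    """)
    # 3. Insert staged rows whose key is new
    conn.execute("""
        INSERT INTO target SELECT * FROM staging
        WHERE id NOT IN (SELECT id FROM target)
    """)
    # 4. Drop the staging table
    conn.execute("DROP TABLE staging")

rows = conn.execute("SELECT * FROM target ORDER BY id").fetchall()
print(rows)  # [(1, 18), (2, 44), (3, 95)]
```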
I think this is useful functionality, even though it currently seems out of scope; it is intuitive to expect such a common feature when adding rows to a table.

Please think again about adding this function: it is very useful for adding rows to an existing table.
Alas, pangres is limited to Python 3.7+. In my case (I am forced to use an old Python 3.4), it is not a viable solution.
Thanks, @GoldstHa - that is really helpful input. I will attempt to create a POC for the MERGE-like implementation
Given the issues with the `DELETE`/`INSERT` approach, and the potential blocker on the @GoldstHa `MERGE`-like approach for MySQL DBs, I've done a bit more digging. I have scratched together a proof of concept using the SQLAlchemy update functionality, which looks promising. I will attempt to implement it properly this week in the pandas codebase, ensuring that this approach works across all DB flavours.

There have been some good discussions around the API and how an upsert should actually be called (i.e. via the `if_exists` argument, or via an explicit `upsert` argument). This will be clarified soon. For now, this is the pseudocode proposal for how the functionality would work using the SQLAlchemy upsert statement:
1. Identify primary key(s) and existing pkey values from the DB table (if no primary key constraints are identified but upsert is called, return an error)
2. Make a temp copy of the incoming DataFrame
3. Identify records in the incoming DataFrame with matching primary keys
4. Split the temp DataFrame into records which have a primary key match and records which don't

```
if upsert:
    Update the DB table using `update` for only the rows which match
else:
    Ignore rows from DataFrame with matching primary key values
finally:
    Append remaining DataFrame rows with non-matching values in the primary key column to the DB table
```
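The split step in this proposal can be sketched with plain pandas (a hypothetical `existing_keys` set stands in for the primary key values read from the DB):

```python
import pandas as pd

existing_keys = {1, 2}  # pretend these ids were read from the DB table
df = pd.DataFrame({"id": [2, 3], "age": [44, 95]}).set_index("id")

mask = df.index.isin(list(existing_keys))
to_update = df[mask]    # primary key matches: UPDATE (or skip for upsert_ignore)
to_insert = df[~mask]   # no match: plain append/INSERT

print(to_update.index.tolist())  # [2]
print(to_insert.index.tolist())  # [3]
```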