Suppose you have an existing SQL table called `person_age`, where `id` is the primary key:

| id | age |
| --- | --- |
| 1 | 18 |
| 2 | 42 |
and you also have new data in a DataFrame called `extra_data`:

| id | age |
| --- | --- |
| 2 | 44 |
| 3 | 95 |
Then it would be useful to have an option on `extra_data.to_sql()` that allows passing the DataFrame to SQL with an INSERT or UPDATE option on the rows, based on the primary key. In this case, the `id=2` row would get updated to `age=44` and the `id=3` row would get added:

| id | age |
| --- | --- |
| 1 | 18 |
| 2 | 44 |
| 3 | 95 |
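The desired result can be sketched in memory with `combine_first`, purely as an illustration of the semantics (this does not replace the SQL-side upsert being requested):

```python
import pandas as pd

sql_df = pd.DataFrame({"id": [1, 2], "age": [18, 42]}).set_index("id")
extra_data = pd.DataFrame({"id": [2, 3], "age": [44, 95]}).set_index("id")

# extra_data wins wherever the primary key clashes; all other rows are kept
merged = extra_data.combine_first(sql_df).astype(int)
print(merged["age"].to_dict())  # {1: 18, 2: 44, 3: 95}
```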
Could this perhaps be done with `merge` from SQLAlchemy? I looked at the pandas `sql.py` source code to come up with a solution, but I couldn't follow it. (Apologies for mixing `sqlalchemy` and `sqlite3` below.)
```python
import pandas as pd
from sqlalchemy import create_engine
import sqlite3

conn = sqlite3.connect('example.db')
c = conn.cursor()
c.execute('''DROP TABLE IF EXISTS person_age;''')
c.execute('''
CREATE TABLE person_age
(id INTEGER PRIMARY KEY ASC, age INTEGER NOT NULL)
''')
conn.commit()
conn.close()

##### Create original table
engine = create_engine("sqlite:///example.db")
sql_df = pd.DataFrame({'id': [1, 2], 'age': [18, 42]})
sql_df.to_sql('person_age', engine, if_exists='append', index=False)

#### Extra data to insert/update
extra_data = pd.DataFrame({'id': [2, 3], 'age': [44, 95]})
extra_data.set_index('id', inplace=True)

#### extra_data.to_sql() with a row update-or-insert option would produce:
expected_df = pd.DataFrame({'id': [1, 2, 3], 'age': [18, 44, 95]})
expected_df.set_index('id', inplace=True)
```
This would be nice functionality, but the main problem is that we want it to be database-flavor independent and based on SQLAlchemy core (not the SQLAlchemy ORM) for inclusion in pandas itself, which will make this difficult to implement.
Yeah, I think this is out of scope for pandas since upserts aren't supported by all db engines.
While an `INSERT OR UPDATE` isn't supported by all engines, an `INSERT OR REPLACE` can be made engine-agnostic by deleting rows from the target table for the set of primary keys in the DataFrame index, followed by an insert of all rows in the DataFrame. You'd want to do this in a transaction.
@TomAugspurger Could we add the upsert option for supported db engines and throw an error for unsupported ones?
I'd like to see this as well. I am caught between using pure SQL and SQLAlchemy (I haven't gotten this to work yet; I think it has something to do with how I pass the dicts). I use psycopg2 COPY to bulk insert, but I would love to use pd.to_sql for tables where values might change over time and I don't mind it inserting a bit slower.

```python
insert_values = df.to_dict(orient='records')
insert_statement = sqlalchemy.dialects.postgresql.insert(table).values(insert_values)
upsert_statement = insert_statement.on_conflict_do_update(
    constraint='fact_case_pkey',
    set_=df.to_dict(orient='dict')
)
```
And pure SQL:

```python
def create_update_query(df, table=FACT_TABLE):
    """Builds an INSERT ... ON CONFLICT DO UPDATE query for the given DataFrame"""
    columns = ', '.join([f'{col}' for col in DATABASE_COLUMNS])
    constraint = ', '.join([f'{col}' for col in PRIMARY_KEY])
    placeholder = ', '.join([f'%({col})s' for col in DATABASE_COLUMNS])
    updates = ', '.join([f'{col} = EXCLUDED.{col}' for col in DATABASE_COLUMNS])
    query = f"""INSERT INTO {table} ({columns})
    VALUES ({placeholder})
    ON CONFLICT ({constraint})
    DO UPDATE SET {updates};"""
    query = ' '.join(query.split())
    return query

def load_updates(df, connection=DATABASE):
    """Upserts a DataFrame row by row into Postgres

    :param df: The dataframe which is written to the database
    :param connection: Refers to a PostgresHook
    """
    conn = connection.get_conn()
    cursor = conn.cursor()
    df1 = df.where((pd.notnull(df)), None)
    insert_values = df1.to_dict(orient='records')
    for row in insert_values:
        cursor.execute(create_update_query(df), row)
        conn.commit()
    cursor.close()
    del cursor
    conn.close()
```
@ldacey this style worked for me (`insert_statement.excluded` is an alias to the row of data that violated the constraint):

```python
insert_values = merged_transactions_channels.to_dict(orient='records')
insert_statement = sqlalchemy.dialects.postgresql.insert(orders_to_channels).values(insert_values)
upsert_statement = insert_statement.on_conflict_do_update(
    constraint='orders_to_channels_pkey',
    set_={'channel_owner': insert_statement.excluded.channel_owner}
)
```
@cdagnino This snippet might not work in the case of composite keys; that scenario has to be taken care of as well. I'll try to find a way to do the same.
One way to solve this update issue is to use SQLAlchemy's `bulk_update_mappings`. This function takes in a list of dictionary values and updates each row based on the table's primary key.

```python
session.bulk_update_mappings(
    Table,
    pandas_df.to_dict(orient='records')
)
```
I agree with @neilfrndes; we shouldn't let a nice feature like this go unimplemented just because some DBs don't support it. Is there any chance this feature might happen?

Probably, if someone makes a PR. On further consideration, I don't think I'm opposed to this on the principle that some databases don't support it. However, I'm not too familiar with the sql code, so I'm not sure what the best approach is.
One possibility is to provide some examples for upserts using the `method` callable if this PR is introduced: https://github.com/pandas-dev/pandas/pull/21401

For postgres that would look something like (untested):

```python
from sqlalchemy.dialects import postgresql

def pg_upsert(table, conn, keys, data_iter):
    for row in data_iter:
        row_dict = dict(zip(keys, row))
        # `table` here is a pandas SQLTable; `table.table` is the underlying
        # SQLAlchemy Table object
        stmt = postgresql.insert(table.table).values(**row_dict)
        upsert_stmt = stmt.on_conflict_do_update(
            index_elements=table.index,
            set_=row_dict)
        conn.execute(upsert_stmt)
```

Something similar could be done for mysql.
For postgres I am using `execute_values`. In my case, my query is a jinja2 template that flags whether I should do an update set or do nothing. This has been quite quick and flexible. Not as fast as using COPY or `copy_expert`, but it works well.

```python
from psycopg2.extras import execute_values

df = df.where((pd.notnull(df)), None)
tuples = [tuple(x) for x in df.values]

with pg_conn:
    with pg_conn.cursor() as cur:
        execute_values(cur=cur,
                       sql=insert_query,
                       argslist=tuples,
                       template=None)
```
@danich1 can you please give an example of how this would work? I tried to look into `bulk_update_mappings` but I got really lost and couldn't make it work.
@cristianionescu92 An example would be this:
I have a table called User with the following fields: id and name.
| id | name |
| --- | --- |
| 0 | John |
| 1 | Joe |
| 2 | Harry |
I have a pandas data frame with the same columns but updated values:
| id | name |
| --- | --- |
| 0 | Chris |
| 1 | James |
Let's also assume that we have a session variable open to access the database. By calling this method:

```python
session.bulk_update_mappings(
    User,
    <pandas dataframe above>.to_dict(orient='records')
)
```

Pandas will convert the table into a list of dictionaries `[{id: 0, name: "Chris"}, {id: 1, name: "James"}]` that SQL will use to update the rows of the table. So the final table will look like:
| id | name |
| --- | --- |
| 0 | Chris |
| 1 | James |
| 2 | Harry |
Hi @danich1, and thanks a lot for your response. I figured out the mechanics of how the update would work myself. Unfortunately I don't know how to work with a session; I am quite a beginner.
Let me show you what I am doing:
```python
import pypyodbc
from timeit import default_timer as timer
# clean_df_db_dups and to_sql_newrows are 2 functions I found on GitHub
# (unfortunately I cannot remember the link). clean_df_db_dups excludes from a
# dataframe the rows which already exist in an SQL table by checking several
# key columns, and to_sql_newrows inserts the new rows into SQL.
from to_sql_newrows import clean_df_db_dups, to_sql_newrows
from sqlalchemy import create_engine

engine = create_engine("engine_connection_string")

# Write data to SQL
Tablename = 'Dummy_Table_Name'
Tablekeys = Tablekeys_string
dftoupdateorinsertinSQL = random_dummy_dataframe

# Connect to SQL Server db using pypyodbc
cnxn = pypyodbc.connect("Driver={SQL Server};"
                        "Server=ServerName;"
                        "Database=DatabaseName;"
                        "uid=userid;pwd=password")

start = timer()
newrowsdf = clean_df_db_dups(dftoupdateorinsertinSQL, Tablename, engine, dup_cols=Tablekeys)
newrowsdf.to_sql(Tablename, engine, if_exists='append', index=False, chunksize=140)
end = timer()

tablesize = (len(newrowsdf.index))
print('inserted %r rows' % (tablesize))
```
The above code basically excludes from a dataframe the rows which I already have in SQL and only inserts the new rows. What I need is to also update the rows which already exist. Can you please help me understand what I should do next?
**Motivation for a better `to_sql`**

Having `to_sql` integrate better with database practices is increasingly valuable as data science grows and gets mixed with data engineering. `upsert` is one of the most wanted features, in particular because many people find that the workaround is to use `replace` instead, which drops the table, and with it all the views and constraints.

The alternative I've seen among more experienced users is to stop using pandas at this stage, and this tends to propagate upstream and makes the pandas package lose retention among experienced users. Is this the direction pandas wants to go?

I understand we want `to_sql` to remain database-agnostic as much as possible and use core SQLAlchemy. A method that truncates or deletes instead of a true upsert would still add a lot of value, though.
**Integration with the pandas product vision**

A lot of the above debate happened before the introduction of the `method` argument (as mentioned by @kjford, with `psql_insert_copy`) and the possibility to pass a callable.

I'd happily contribute to either the core pandas functionality or, failing that, documentation on solutions / best practice for achieving upsert functionality within pandas, such as here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method
What is the preferred way forward for Pandas core dev / product managers?
I think we're open to an implementation that's engine-specific. The proposal to use `method='upsert'` seems reasonable, but at this point I think we need someone to come up with a clear design proposal.
I have a similar requirement where I want to update existing data in a MySQL table from multiple CSVs over time.
I thought I could use `df.to_sql()` to insert the new data into a newly created temporary table and then run a MySQL query to control how to append/update data in the existing table.
MySQL Reference: https://stackoverflow.com/questions/2472229/insert-into-select-from-on-duplicate-key-update?answertab=active#tab-top
Disclaimer: I started using Python and Pandas only a few days ago.
Hey Pandas folk: I have had this same issue, needing to frequently update my local database with records that I ultimately load and manipulate in pandas. I built a simple library to do this - it’s basically a stand-in for df.to_sql and pd.read_sql_table that uses the DataFrame index as a primary key by default. Uses sqlalchemy core only.
https://pypi.org/project/pandabase/0.2.1/
https://github.com/notsambeck/pandabase
This tool is fairly opinionated, probably not appropriate to include in Pandas as-is. But for my specific use case it solves the problem... if there is interest in massaging this to make it fit in Pandas I am happy to help.
For now, the following works (in the limited case of current-ish pandas and SQLAlchemy, a named index as primary key, SQLite or Postgres back end, and supported datatypes):

```
pip install pandabase
pandabase.to_sql(df, table_name, con_string, how='upsert')
```
Working on a general solution to this with cvonsteg. Planning to come back with a proposed design in October.
@TomAugspurger As suggested, @rugg2 and I have come up with the following design proposal for an `upsert` option in `to_sql()`.

Two new values to be added as possible `method` arguments in the `to_sql()` method:

1) `upsert_update`: on row match, update the row in the database (for knowingly updating records; represents most use cases)
2) `upsert_ignore`: on row match, do not update the row in the database (for cases where datasets have overlap and you do not want to override data in tables)
```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("connection string")
df = pd.DataFrame(...)
df.to_sql(
    name='table_name',
    con=engine,
    if_exists='append',
    method='upsert_update'  # (or 'upsert_ignore')
)
```
To implement this, the `SQLTable` class would receive 2 new private methods containing the upsert logic, which would be called from the `SQLTable.insert()` method:

```python
def insert(self, chunksize=None, method=None):
    # set insert method
    if method is None:
        exec_insert = self._execute_insert
    elif method == "multi":
        exec_insert = self._execute_insert_multi
    # new upsert methods <<<
    elif method == "upsert_update":
        exec_insert = self._execute_upsert_update
    elif method == "upsert_ignore":
        exec_insert = self._execute_upsert_ignore
    # >>>
    elif callable(method):
        exec_insert = partial(method, self)
    else:
        raise ValueError("Invalid parameter 'method': {}".format(method))
    ...
```
We propose the following implementation, with the rationale outlined below (all points are open for discussion):

- Implement the upsert as a `DELETE` followed by an `INSERT`, executed in a transaction, since there is no flavour-agnostic `upsert` in SQLAlchemy core and implementations can vary across flavours
- For `upsert_ignore`, these operations would obviously be skipped on matching records
- Match rows on primary keys only: if rows were matched against a column without a `UNIQUE` constraint, then it is plausible that multiple rows may match the upsert condition. In this case, no upsert should be performed, as it is ambiguous which record should be updated. To enforce this from pandas, each row would need to be individually assessed to check that only 1 or 0 rows match before it is inserted. While this functionality is reasonably straightforward to implement, it would require a read and a write operation per record (plus a delete if a clash is found), which feels highly inefficient for larger datasets.

@TomAugspurger, if the `upsert` proposal designed with @cvonsteg suits you, we will proceed with the implementation in code (incl. tests) and raise a pull request.

Let us know if you would like to proceed differently.
Reading through the proposal is on my todo list. I'm a bit behind on my email right now.
I personally don't have anything against it, so I think a PR is welcome. One implementation across all DBMSs using SQLAlchemy core is certainly how this should start if I am reading your points correctly, and same with just primary keys.
Always easier to start small and focused and expand from there
need this feature badly.
The PR @cvonsteg and I wrote should now provide the functionality: it's down to reviews now!
This functionality would be absolutely glorious! I'm not too versed in the vocabulary of github; does the comment by @rugg2 that the functionality is "down to reviews now" mean that it's down to the pandas team to review it? And if it's approved, does that mean it will become available via a new version of pandas that we can install, or would we be required to apply the commit manually ourselves via git? (I have had problems with this through conda so if that's the case I'd like to get up to speed by the time this functionality is ready). Thank you!!
@pmgh2345 - yep, as you said, "down to reviews now" means a pull request has been raised and is under review from the core devs. You can see the PR mentioned above (#29636). Once it is approved, you could technically fork the branch with the updated code and compile your own local version of pandas with the functionality built in. However, I would personally recommend waiting until it's been merged into master and released, and then just pip installing the newest version of pandas.
> PR we wrote with cvonsteg should now give the functionality: down to reviews now!
It might be worth adding a new parameter to the `to_sql` method, rather than using `if_exists`. The reason being, `if_exists` is checking for table existence, not row existence.

@cvonsteg originally proposed using `method=`, which would avoid the ambiguity of having two meanings for `if_exists`.
```python
df.to_sql(
    name='table_name',
    con=engine,
    if_exists='append',
    method='upsert_update'  # (or 'upsert_ignore')
)
```
@brylie We could add a new parameter, that's true, but as you know each new parameter makes an API more clunky. There is a tradeoff.

If we have to choose among the current parameters: as you said, we initially thought of using the `method` argument, but after more thought we realised that both (1) the usage and (2) the logic better fit the `if_exists` argument.

1) From an API usage point of view: the user will want to choose both `method="multi"` or `None` on the one hand, and "upsert" on the other. However, there aren't equivalently strong use cases for using the "upsert" functionality at the same time as `if_exists="append"` or `if_exists="replace"`, if any.

2) From a logic point of view

Let me know if I understood your point well, and please shout if you think the current implementation under review (PR #29636) would be a net negative!
Yep, you understand my point. The current implementation is a net positive but slightly diminished by ambiguous semantics.
I still maintain that `if_exists` should continue to refer to only one thing: table existence. Having ambiguity in the parameters negatively impacts readability and might lead to convoluted internal logic. Whereas adding a new parameter, like `upsert=True`, is clear and explicit.
Hello!

If you want to see a non-agnostic implementation for doing upserts, I have an example in my library pangres. It handles PostgreSQL and MySQL using SQLAlchemy functions specific to those database types. As for SQLite (and other database types allowing a similar upsert syntax), it uses a compiled regular SQLAlchemy Insert.

I share this thinking it might give a few ideas to collaborators (I am aware, though, that we want this to be SQL-type agnostic, which makes a lot of sense). Also, perhaps a speed comparison would be interesting when the PR of @cvonsteg goes through. Please mind I am not a long-time SQLAlchemy expert or such!
I really want this feature. I agree that a `method='upsert_update'` option is a good idea.
Is this still planned? Pandas really need this feature
Yes, this is still planned, and we're almost there! Code is written, but there is one test that doesn't pass. Help welcomed!
https://github.com/pandas-dev/pandas/pull/29636
Hello! Is the functionality ready or is something still missing? If something is still missing, please let me know if I can help with anything!
Any news?))
Coming from the Java world, I never thought this simple missing functionality might turn my codebase upside down.
Hi everyone,
I've looked into how upserts are implemented in SQL across dialects and found a number of techniques that can inform design decisions here. But first, I want to warn against using DELETE ... INSERT logic. If there are foreign keys or triggers, other records across the database will end up being deleted or otherwise messed up. In MySQL, REPLACE does the same damage. I've actually created hours of work for myself fixing data because I used REPLACE. So, that said, here are the techniques implemented in SQL:
Dialect | Technique
-- | --
MySQL | INSERT ... ON DUPLICATE KEY UPDATE
PostgreSQL | INSERT ... ON CONFLICT
SQLite | INSERT ... ON CONFLICT
Db2 | MERGE
SQL Server | MERGE
Oracle | MERGE
SQL:2016 | MERGE
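As a concrete illustration of the SQLite row from the table above, the native `INSERT ... ON CONFLICT DO UPDATE` syntax (available in SQLite 3.24+, which ships with current Python builds) can be exercised directly from the stdlib `sqlite3` module; table and values below are just the `person_age` example from this thread:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person_age (id INTEGER PRIMARY KEY, age INTEGER NOT NULL)")
conn.executemany("INSERT INTO person_age VALUES (?, ?)", [(1, 18), (2, 42)])

# Native upsert: insert, or update age when the primary key already exists.
# `excluded` refers to the row that would have been inserted.
conn.executemany(
    "INSERT INTO person_age (id, age) VALUES (?, ?) "
    "ON CONFLICT(id) DO UPDATE SET age = excluded.age",
    [(2, 44), (3, 95)],
)

rows = conn.execute("SELECT id, age FROM person_age ORDER BY id").fetchall()
print(rows)  # [(1, 18), (2, 44), (3, 95)]
```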
With wildly varying syntax, I understand the temptation to use DELETE ... INSERT to make the implementation dialect agnostic. But there's another way: we can imitate the logic of the MERGE statement using a temp table and basic INSERT and UPDATE statements. The SQL:2016 MERGE syntax is as follows:
```sql
MERGE INTO target_table
USING source_table
ON search_condition
WHEN MATCHED THEN
    UPDATE SET col1 = value1, col2 = value2, ...
WHEN NOT MATCHED THEN
    INSERT (col1, col2, ...)
    VALUES (value1, value2, ...);
```
Borrowed from the Oracle Tutorial and adjusted to conform to the SQL Wikibook.
Since every dialect supported by SQLAlchemy supports temp tables, a safer, dialect-agnostic approach to doing an upsert would be to, in a single transaction:

1. Create a temp table shaped like the target table
2. Insert the DataFrame into the temp table
3. `INSERT` the rows from the temp table whose keys do not exist in the target table
4. `UPDATE` (via a join with the temp table) the target rows whose keys match
5. Drop the temp table

Besides being a dialect-agnostic technique, it also has the advantage of being open to extension, by allowing the end user to choose how to insert or how to update the data, as well as on what key to join the data. While the syntax of temp tables and update joins might differ slightly between dialects, they should be supported everywhere.
Below is a proof of concept I wrote for MySQL:

```python
import uuid
import pandas as pd
from sqlalchemy import create_engine

# This proof of concept uses this sample database
# https://downloads.mysql.com/docs/world.sql.zip

# Arbitrary, unique temp table name to avoid possible collision
source = str(uuid.uuid4()).split('-')[-1]
# Table we're doing our upsert against
target = 'countrylanguage'

db_url = 'mysql://<{user: }>:<{passwd: }>.@<{host: }>/<{db: }>'

df = pd.read_sql(f'SELECT * FROM `{target}`;', db_url)

# Change for UPDATE, 5.3->5.4
df.at[0, 'Percentage'] = 5.4
# Change for INSERT
df = df.append(
    {'CountryCode': 'ABW', 'Language': 'Arabic', 'IsOfficial': 'F', 'Percentage': 0.0},
    ignore_index=True
)

# List of PRIMARY or UNIQUE keys
key = ['CountryCode', 'Language']

# Do all of this in a single transaction
engine = create_engine(db_url)
with engine.begin() as con:
    # Create temp table like target table to stage data for upsert
    con.execute(f'CREATE TEMPORARY TABLE `{source}` LIKE `{target}`;')
    # Insert dataframe into temp table
    df.to_sql(source, con, if_exists='append', index=False, method='multi')
    # INSERT where the key doesn't match (new rows)
    con.execute(f'''
        INSERT INTO `{target}`
        SELECT * FROM `{source}`
        WHERE (`{'`, `'.join(key)}`) NOT IN (SELECT `{'`, `'.join(key)}` FROM `{target}`);
    ''')
    # Create a doubled list of tuples of non-key columns to template the update statement
    non_key_columns = [(i, i) for i in df.columns if i not in key]
    # Whitespace for aesthetics
    whitespace = '\n\t\t\t'
    # Do an UPDATE ... JOIN to set all non-key columns of target to equal source
    con.execute(f'''
        UPDATE `{target}` `t`
        JOIN `{source}` `s` ON `t`.`{"` AND `t`.`".join(["`=`s`.`".join(i) for i in zip(key, key)])}`
        SET `t`.`{f"`,{whitespace}`t`.`".join(["`=`s`.`".join(i) for i in non_key_columns])}`;
    ''')
    # Drop our temp table
    con.execute(f'DROP TABLE `{source}`;')
```
Here, I make a few assumptions (notably, that the DataFrame's columns match the target table's schema and that the primary or unique key columns are known upfront). Despite the assumptions, I hope my MERGE-inspired technique informs efforts to build a flexible, robust upsert option.
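The same staged-merge pattern can be sketched in a dialect-portable way with the stdlib `sqlite3` module (hypothetical `target`/`staging` table names; a correlated subquery is used for the update join so it works on old SQLite versions too):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, age INTEGER)")
conn.executemany("INSERT INTO target VALUES (?, ?)", [(1, 18), (2, 42)])

with conn:  # single transaction
    # 1. Stage incoming rows in a temp table (this is where df.to_sql would write)
    conn.execute("CREATE TEMPORARY TABLE staging (id INTEGER PRIMARY KEY, age INTEGER)")
    conn.executemany("INSERT INTO staging VALUES (?, ?)", [(2, 44), (3, 95)])
    # 2. Update target rows whose key already exists
    conn.execute("""
        UPDATE target
        SET age = (SELECT s.age FROM staging s WHERE s.id = target.id)
        WHERE id IN (SELECT id FROM staging)
    """)
    # 3. Insert staged rows whose key is new
    conn.execute("""
        INSERT INTO target SELECT * FROM staging
        WHERE id NOT IN (SELECT id FROM target)
    """)
    # 4. Drop the staging table
    conn.execute("DROP TABLE staging")

rows = conn.execute("SELECT * FROM target ORDER BY id").fetchall()
print(rows)  # [(1, 18), (2, 44), (3, 95)]
```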
I think this is useful functionality, even though it currently seems out of scope; it is intuitive to expect such a common feature when adding rows to a table.

Please think again about adding this function: it is very useful for adding rows to an existing table.
Alas, pangres is limited to Python 3.7+. In my case (I am forced to use an old Python 3.4), it is not a viable solution.
Thanks, @GoldstHa - that is really helpful input. I will attempt to create a POC for the MERGE-like implementation
Given the issues with the `DELETE`/`INSERT` approach, and the potential blocker on the @GoldstHa `MERGE`-like approach for MySQL DBs, I've done a bit more digging. I have scratched together a proof of concept using the SQLAlchemy update functionality, which looks promising. I will attempt to implement it properly this week in the pandas codebase, ensuring that this approach works across all DB flavours.

There have been some good discussions around the API and how an upsert should actually be called (i.e. via the `if_exists` argument, or via an explicit `upsert` argument). This will be clarified soon. For now, this is the pseudocode proposal for how the functionality would work using the SQLAlchemy upsert statement:
1. Identify primary key(s) and existing pkey values from the DB table (if no primary key constraints are identified but upsert is called, return an error)
2. Make a temp copy of the incoming DataFrame
3. Identify records in the incoming DataFrame with matching primary keys
4. Split the temp DataFrame into records which have a primary key match and records which don't

```
if upsert:
    Update the DB table using `update` for only the rows which match
else:
    Ignore rows from DataFrame with matching primary key values
finally:
    Append remaining DataFrame rows with non-matching values in the primary key column to the DB table
```
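The split step in this proposal can be sketched with plain pandas (a hypothetical `existing_keys` set stands in for the primary key values read from the DB):

```python
import pandas as pd

existing_keys = {1, 2}  # pretend these ids were read from the DB table
df = pd.DataFrame({"id": [2, 3], "age": [44, 95]}).set_index("id")

mask = df.index.isin(list(existing_keys))
to_update = df[mask]    # primary key matches: UPDATE (or skip for upsert_ignore)
to_insert = df[~mask]   # no match: plain append/INSERT

print(to_update.index.tolist())  # [2]
print(to_insert.index.tolist())  # [3]
```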