Pandas: Use multi-row inserts for massive speedups on to_sql over high latency connections

Created on 1 Dec 2014  ·  48 Comments  ·  Source: pandas-dev/pandas

I have been trying to insert ~30k rows into a MySQL database using pandas-0.15.1, oursql-0.9.3.1 and sqlalchemy-0.9.4. Because the machine is across the Atlantic from me, calling data.to_sql was taking >1 hr to insert the data. On inspecting with Wireshark, the issue is that it is sending an insert for every row, then waiting for the ACK before sending the next, and, long story short, the ping times are killing me.

However, following the instructions from SQLAlchemy, I changed

def _execute_insert(self, conn, keys, data_iter):
    data = [dict((k, v) for k, v in zip(keys, row)) for row in data_iter]
    conn.execute(self.insert_statement(), data)

to

def _execute_insert(self, conn, keys, data_iter):
    data = [dict((k, v) for k, v in zip(keys, row)) for row in data_iter]
    conn.execute(self.insert_statement().values(data))

and the entire operation completes in less than a minute. (To save you a click, the difference is between multiple calls to insert into foo (columns) values (rowX) and one massive insert into foo (columns) values (row1), (row2), (row3).) Given how often people are likely to use pandas to insert large volumes of data, this feels like a huge win that would be great to include more widely.

Some challenges:

  • Not every database supports multirow inserts (SQLite and SQLServer didn't in the past, though they do now). I don't know how to check for this via SQLAlchemy
  • The MySQL server I was using didn't allow me to insert the data all in one go, I had to set the chunksize (5k worked fine, but I guess the full 30k was too much). If we made this the default insert, most people would have to add a chunk size (which might be hard to calculate, as it might be determined by the maximum packet size of the server).

The easiest way to do this would be to add a multirow= boolean parameter (default False) to the to_sql function and leave the user responsible for setting the chunksize, but perhaps there's a better way?
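For illustration, a call might then look something like this (a sketch only: multirow is the proposed keyword, not an existing option, and the chunksize is just the value that happened to work against my server):

# hypothetical usage of the proposed keyword; chunksize must keep each
# multi-row INSERT under the server's maximum packet size
data.to_sql('foo', engine, if_exists='append', index=False,
            chunksize=5000, multirow=True)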

Thoughts?

IO SQL Performance

Most helpful comment

We've figured out how to monkey patch - might be useful to someone else. Have this code before importing pandas.

from pandas.io.sql import SQLTable

def _execute_insert(self, conn, keys, data_iter):
    print "Using monkey-patched _execute_insert"
    data = [dict((k, v) for k, v in zip(keys, row)) for row in data_iter]
    conn.execute(self.insert_statement().values(data))

SQLTable._execute_insert = _execute_insert

All 48 comments

This seems reasonable. Thanks for investigating this!

For the implementation, it will depend on how sqlalchemy deals with database flavors that do not support this (I can't test this at the moment, but it seems that sqlalchemy raises an error, e.g. http://stackoverflow.com/questions/23886764/multiple-insert-statements-in-mssql-with-sqlalchemy). Also, if it has the consequence that a lot of people will have to set chunksize, it is indeed not a good idea to enable this by default (unless we set chunksize to a value by default).
So adding a keyword seems maybe better.

@artemyk @mangecoeur @hayd @danielballan

Apparently SQLAlchemy has a flag dialect.supports_multivalues_insert (see e.g. http://pydoc.net/Python/SQLAlchemy/0.8.3/sqlalchemy.sql.compiler/ , possibly called supports_multirow_insert in other versions, https://www.mail-archive.com/[email protected]/msg202880.html ).

Since this has the potential to speed up inserts a lot, and we can check for support easily, I'm thinking maybe we could do it by default, and also set chunksize to a default value (e.g. 16kb chunks... not sure what's too big in most situations). If the multirow insert fails, we could throw an exception suggesting lowering the chunksize?
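Roughly the kind of check I have in mind (a sketch; the attribute name differs between SQLAlchemy versions, so it is probed defensively, and the URL is a placeholder):

from sqlalchemy import create_engine

engine = create_engine('mysql+oursql://user:pwd@host/db')  # placeholder URL
dialect = engine.dialect
# the flag is supports_multivalues_insert in some versions,
# supports_multirow_insert in others
if getattr(dialect, 'supports_multivalues_insert',
           getattr(dialect, 'supports_multirow_insert', False)):
    use_multirow = True   # emit one INSERT ... VALUES (...), (...), ... per chunk
else:
    use_multirow = False  # fall back to the current per-row inserts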

Now I just need to persuade the SQLAlchemy folks to set supports_multivalues_insert to true on SQL Server >2005 (I hacked it into the code and it works fine, but it's not on by default).

On a more on-topic note, I think the chunksize could be tricky. On my mysql setup (which I probably configured to allow large packets), I can set chunksize=5000, on my SQLServer setup, 500 was too large, but 100 worked fine. However, it's probably true that most of the benefits from this technique come from going from inserting 1 row at a time to 100, rather than 100 to 1000.

What if chunksize=None meant "Adaptively choose a chunksize"? Attempt something like 5000, 500, 50, 1. Users could turn this off by specifying a chunksize. If the overhead from these attempts is too large, I like @maxgrenderjones's suggestion: chunksize=10 is a better default than chunksize=1.
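Purely to illustrate the idea (a sketch; the sizes and the exception handling are placeholders, since a real implementation would need to know which backend errors actually mean "statement too big"):

def to_sql_adaptive(df, name, engine, sizes=(5000, 500, 50, 1)):
    # Try progressively smaller chunk sizes until the backend accepts the insert.
    for size in sizes:
        try:
            df.to_sql(name, engine, if_exists='replace', chunksize=size)
            return size
        except Exception:  # placeholder: should catch the backend's "packet too large" error
            continue
    raise ValueError('no chunksize worked for %r' % name)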

On that last comment "chunksize=10 is a better default than chunksize=1" -> that is not fully true, I think. The current situation is to do _one_ execute statement that consists of multiple single-row insert statements (which is not a chunksize of 1), while chunksize=10 would mean doing a lot of execute statements, each with one multi-row insert.
And I don't know whether this is necessarily faster; much depends on the situation. For example, with the current code and with a local sqlite database:

In [4]: engine = create_engine('sqlite:///:memory:') #, echo='debug')

In [5]: df = pd.DataFrame(np.random.randn(50000, 10))

In [6]: %timeit df.to_sql('test_default', engine, if_exists='replace')
1 loops, best of 3: 956 ms per loop

In [7]: %timeit df.to_sql('test_default', engine, if_exists='replace', chunksize=10)
1 loops, best of 3: 2.23 s per loop

But of course this does not use the multi-row feature.

We've figured out how to monkey patch - might be useful to someone else. Have this code before importing pandas.

from pandas.io.sql import SQLTable

def _execute_insert(self, conn, keys, data_iter):
    print "Using monkey-patched _execute_insert"
    data = [dict((k, v) for k, v in zip(keys, row)) for row in data_iter]
    conn.execute(self.insert_statement().values(data))

SQLTable._execute_insert = _execute_insert
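With the patch in place the to_sql call itself does not change; you only need a chunksize small enough for your server's packet limit, e.g.:

# one multi-row INSERT is now sent per chunk; 5000 worked for our MySQL
# server, your limit may differ
df.to_sql('foo', engine, if_exists='append', index=False, chunksize=5000)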

Maybe we can just start with adding this feature through a new multirow=True keyword (with a default of False for now), and then we can later always see if we can enable it by default?

@maxgrenderjones @nhockham interested to do a PR to add this?

@jorisvandenbossche I think it's risky to start adding keyword arguments to address specific performance profiles. If you can guarantee that it's faster in all cases (if necessary by having it determine the best method based on the inputs) then you don't need a flag at all.

Different DB-setups may have different performance optimizations (different DB perf profiles, local vs network, big memory vs fast SSD, etc, etc), if you start adding keyword flags for each it becomes a mess.

I would suggest creating subclasses of SQLDatabase and SQLTable to address performance-specific implementations; they would be used through the object-oriented API. Perhaps a "backend switching" method could be added, but frankly using the OO API is very simple, so this is probably overkill for what is already a specialized use case.

I created such a sub-class for loading large datasets to Postgres (it's actually much faster to save the data to CSV and then use Postgres's built-in, non-standard COPY FROM command than to use inserts, see https://gist.github.com/mangecoeur/1fbd63d4758c2ba0c470#file-pandas_postgres-py). To use it you just do PgSQLDatabase(engine, <args>).to_sql(frame, name, <kwargs>)
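The gist has the full subclass; the core of the trick is roughly the following (a simplified sketch using psycopg2's copy_expert with an in-memory CSV buffer, not the exact code from the gist; the target table is assumed to exist already):

import io

def pg_copy_from(df, table_name, engine):
    # Serialize the frame to CSV in memory, then stream it into Postgres
    # with COPY, which avoids per-row INSERT round trips entirely.
    buf = io.StringIO()
    df.to_csv(buf, index=False, header=False)
    buf.seek(0)
    raw = engine.raw_connection()
    try:
        with raw.cursor() as cur:
            cur.copy_expert(
                'COPY {} FROM STDIN WITH (FORMAT CSV)'.format(table_name), buf)
        raw.commit()
    finally:
        raw.close()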

Just for reference, I tried running the code by @jorisvandenbossche (Dec 3rd post) using the multirow feature. It's quite a bit slower. So the speed trade-offs here are not trivial:

In [4]: engine = create_engine('sqlite:///:memory:') #, echo='debug')

In [5]: df = pd.DataFrame(np.random.randn(50000, 10))

In [6]: 

In [6]: %timeit df.to_sql('test_default', engine, if_exists='replace')
1 loops, best of 3: 1.05 s per loop

In [7]: 

In [7]: from pandas.io.sql import SQLTable

In [8]: 

In [8]: def _execute_insert(self, conn, keys, data_iter):
   ...:         data = [dict((k, v) for k, v in zip(keys, row)) for row in data_iter]
   ...:         conn.execute(self.insert_statement().values(data))
   ...:     

In [9]: SQLTable._execute_insert = _execute_insert

In [10]: 

In [10]: reload(pd)
Out[10]: <module 'pandas' from '/usr/local/lib/python2.7/site-packages/pandas/__init__.pyc'>

In [11]: 

In [11]: %timeit df.to_sql('test_default', engine, if_exists='replace', chunksize=10)
1 loops, best of 3: 9.9 s per loop

Also, I agree that adding keyword parameters is risky. However, the multirow feature seems pretty fundamental. Also, 'monkey-patching' is probably not more robust to API changes than keyword parameters.

It's as I suspected. Monkey patching isn't the solution I was suggesting - rather that we ship a number of performance-oriented subclasses that the informed user could use through the OO interface (to avoid loading the functional API with too many options)

As per the initial ticket title, I don't think this approach is going to be preferable in all cases, so I wouldn't make it the default. However, without it, pandas to_sql is unusable for me, so it's important enough for me to continue to request the change. (It's also become the first thing I change when I upgrade my pandas version.) As for sensible chunksize values, I don't think there is one true n, as the packet size will depend on how many columns there are (and what's in them) in hard-to-predict ways. Unfortunately SQLServer fails with an error message that looks totally unrelated (but isn't) if you set the chunksize too high (which is probably why multirow inserts aren't turned on except with a patch in SQLAlchemy), but it works fine with mysql. Users may need to experiment to determine what value of n is likely to result in an acceptably large packet size (for whatever their backing database is). Having pandas choose n is likely to land us way further down in the implementation details than we want to be (i.e. the opposite direction from the maximum-possible-abstraction SQLAlchemy approach).

In short, my recommendation would be to add it as a keyword, with some helpful commentary about how to use it. This wouldn't be the first time a keyword was used to select an implementation (see: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.apply.html) but that perhaps isn't the best example, as I haven't the first idea about what raw= means, even having read the explanation!

I have noticed that it also consumes a huge amount of memory. A 1.6+ GB DataFrame with some 700,000 rows and 301 columns requires almost 34 GB during the insert! That is way over the top inefficient. Any ideas on why that might be the case? Here is a screen clip:

[screenshot: memory usage during the insert]

Hi guys,
any progress on this issue?

I am trying to insert around 200K rows using to_sql, but it takes forever and consumes a huge amount of memory! Using chunksize helps with the memory, but the speed is still very slow.

My impression, looking at the MSSQL DB trace, is that the insertion is actually performed one row at a time.

The only viable approach for now is to dump to a CSV file on a shared folder and use BULK INSERT. But it is very annoying and inelegant!

@andreacassioli You can use odo to insert a DataFrame into an SQL database through an intermediary CSV file. See Loading CSVs into SQL Databases.

I don't think you can come even close to BULK INSERT performance using ODBC.

@ostrokach thank you, indeed I am using csv files now. If I could get close, I would trade a bit of time for simplicity!

I thought this might help somebody:
http://docs.sqlalchemy.org/en/latest/faq/performance.html#i-m-inserting-400-000-rows-with-the-orm-and-it-s-really-slow

@indera pandas does not use the ORM, only sqlalchemy Core (which is what the doc entry there suggests to use for large inserts)

Is there any consensus on how to work around this in the meantime? I'm inserting several million rows into postgres and it takes forever. Is CSV / odo the way to go?

@russlamb a practical way to solve this problem is simply to bulk upload. This is somewhat db-specific though, so odo has solutions for postgresql (and maybe mysql) I think. For something like sqlserver you have to 'do this yourself' (IOW you have to write it).

For sqlserver I used the FreeTDS driver (http://www.freetds.org/software.html and https://github.com/mkleehammer/pyodbc ) with SQLAlchemy entities which resulted in very fast inserts (20K rows per data frame):

import logging
from urllib import parse

import sqlalchemy as db
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()
logger = logging.getLogger(__name__)

DB_POOL = None  # module-level engine cache
# DB_HOST, DB_PORT, DB_NAME, DB_USER, DB_PASS are assumed to be defined in config


class DemographicEntity(Base):
    __tablename__ = 'DEMOGRAPHIC'

    patid = db.Column("PATID", db.Text, primary_key=True)
    # alternative definition:
    # patid = db.Column("PATID", db.Text, primary_key=True, autoincrement=False, nullable=True)
    birth_date = db.Column("BIRTH_DATE", db.Date)
    birth_time = db.Column("BIRTH_TIME", db.Text(5))
    sex = db.Column("SEX", db.Text(2))


def get_db_url(db_host, db_port, db_name, db_user, db_pass):
    params = parse.quote(
        "Driver={{FreeTDS}};Server={};Port={};"
        "Database={};UID={};PWD={};"
        .format(db_host, db_port, db_name, db_user, db_pass))
    return 'mssql+pyodbc:///?odbc_connect={}'.format(params)


def get_db_pool():
    """
    Create the database engine connection.
    @see http://docs.sqlalchemy.org/en/latest/core/engines.html

    :return: Engine object which can either be used directly
            to interact with the database, or can be passed to
            a Session object to work with the ORM.
    """
    global DB_POOL

    if DB_POOL is None:
        url = get_db_url(db_host=DB_HOST, db_port=DB_PORT, db_name=DB_NAME,
                         db_user=DB_USER, db_pass=DB_PASS)
        DB_POOL = db.create_engine(url,
                                   pool_size=10,
                                   max_overflow=5,
                                   pool_recycle=3600)

    try:
        DB_POOL.execute("USE {db}".format(db=DB_NAME))
    except db.exc.OperationalError:
        logger.error('Database {db} does not exist.'.format(db=DB_NAME))

    return DB_POOL


def save_frame(df):
    # A single execute() with a list of dicts lets the driver use executemany,
    # which FreeTDS/pyodbc turns into fast batched inserts.
    db_pool = get_db_pool()
    records = df.to_dict(orient='records')
    return db_pool.execute(DemographicEntity.__table__.insert(), records)
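Usage is then roughly (assuming DB_HOST and friends are configured elsewhere and the DEMOGRAPHIC table already exists; the file name is hypothetical):

import pandas as pd

df = pd.read_csv('demographics.csv')  # hypothetical input; columns must match the entity
save_frame(df)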

Is CSV / odo the way to go?

This solution will almost always be faster I think, regardless of the multi-row / chunksize settings.

But, @russlamb, it is always interesting to hear whether such a multi-row keyword would be an improvement in your case. See eg https://github.com/pandas-dev/pandas/issues/8953#issuecomment-76139975 on a way to easily test this out.

I think there is agreement that we want to have a way to specify this (without necessarily changing the default). So if somebody wants to make a PR for this, that is certainly welcome.
There was only some discussion on how to add this ability (new keyword vs subclass using OO api).

@jorisvandenbossche The document I linked above mentions "Alternatively, the SQLAlchemy ORM offers the Bulk Operations suite of methods, which provide hooks into subsections of the unit of work process in order to emit Core-level INSERT and UPDATE constructs with a small degree of ORM-based automation."

What I am suggesting is to implement a sqlserver specific version for to_sql which under the hood uses the SQLAlchemy ORMs for speedups as in the code I posted above.

This was proposed before. The way to go is to implement a pandas SQL class optimised for a backend. I posted a gist in the past for using the Postgres COPY FROM command, which is much faster. However something similar is now available in odo, and built in a more robust way. There isn't much point IMHO in duplicating work from odo.


Also, I noticed you mentioned SQLAlchemy Core instead. Unless something has changed a lot, only SQLAlchemy Core is used in any case, no ORM. If you want to speed up more than using Core, you have to go to a lower level, with DB-specific optimisation.


Is this getting fixed/taken care of? As of now inserting pandas dataframes into a SQL db is extremely slow unless it's a toy dataframe. Let's decide on a solution and push it forward?

@dfernan As mentioned above, you may want to look at odo. Using an intermediary CSV file will always be orders of magnitude faster than going through sqlalchemy, no matter what kind of improvements happen here...

@ostrokach, I'm unconvinced that odo's behavior is what the typical Pandas user wants. Multi-row inserts over ODBC are probably fast enough for most analysts.

To speak for myself, I just spent a few hours switching from the monkey patch above to odo. Plain pandas runtime was 10+ hours, RBAR (row by agonizing row). The monkey patch runs in 2 hours on the same data set.
The odo/CSV route was faster, as expected, but not by enough to make it worth the effort. I fiddled with CSV conversion issues I didn't care much about, all in the name of avoiding the monkey patch. I'm importing 250K rows from ~10 mysql and PG DBs into a common area in Postgres, for NLP analysis.

I'm intimately familiar with the bulk loading approaches odo espouses. I've used them for years where I'm starting with CSV data. There are key limitations to them:

  1. For the df->CSV->Postgres case, shell access and an scp step are needed to get the CSV onto the PG host. It looks like @mangecoeur has gotten around this with a stream to STDIN.
  2. For my purpose (250K rows of comments, with lots of special cases in the text content) I struggled to get CSV parameters right. I didn't want performance gains badly enough to keep investing in this.

I switched back to the patch so I could get on with the analysis work.

I agree with @jorisvandenbossche, @maxgrenderjones. An option (not a default) to choose this would be immensely useful. @artemyk's point about dialect.supports_multivalues_insert might even make this a reasonable default.

I'm glad to submit a PR if that'd move this forward.

Just to add my experience with odo: it has not worked for MS SQL bulk inserts because of a known issue with encoding. IMHO multi-row inserts are a good practical solution for most people.

@markschwarz an option to enable this to work faster would be very welcome!

Tracing the queries using sqlite, I do seem to be getting multi-row inserts when using chunksize:

2017-09-28 00:21:39,007 INFO sqlalchemy.engine.base.Engine INSERT INTO country_hsproduct_year (location_id, product_id, year, export_rca, import_value, cog, export_value, distance, location_level, product_level) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
2017-09-28 00:21:39,007 INFO sqlalchemy.engine.base.Engine ((75, 1237, 1996, 1.7283086776733398, 273487116.0, 0.0, 514320160.0, 0.5413745641708374, 'country', '4digit'), (75, 1237, 1997, 1.7167805433273315, 312047528.0, 0.0, 592372864.0, 0.5314807891845703, 'country', '4digit'), (75, 1237, 1998, 1.2120152711868286, 341676961.0, 0.0, 468860608.0, 0.5472233295440674, 'country', '4digit'), (75, 1237, 1999, 1.236651062965393, 334604240.0, 0.0, 440722336.0, 0.5695921182632446, 'country', '4digit'), (75, 1237, 2000, 1.189828872680664, 383555023.0, 0.0, 426384832.0, 0.5794379711151123, 'country', '4digit'), (75, 1237, 2001, 0.9920380115509033, 374157144.0, 0.3462945520877838, 327031392.0, 0.6234743595123291, 'country', '4digit'), (75, 1237, 2002, 1.0405025482177734, 471456583.0, 0.0, 377909376.0, 0.6023964285850525, 'country', '4digit'), (75, 1237, 2003, 1.147829532623291, 552441401.0, 0.0, 481313504.0, 0.5896202325820923, 'country', '4digit')  ... displaying 10 of 100000 total bound parameter sets ...  (79, 1024, 2015, 0.0, None, 0.8785018920898438, 0.0, 0.9823430776596069, 'country', '4digit'), (79, 1025, 1995, 0.0, None, 0.5624096989631653, 0.0, 0.9839603304862976, 'country', '4digit'))

(without the monkey patch, that is)

Interestingly, with the monkey patch, it breaks when I give it a chunksize of 10^5, but not 10^3. The error is "too many SQL variables" on sqlite.
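That fits SQLite's cap on bound parameters per statement (SQLITE_MAX_VARIABLE_NUMBER, historically 999 by default): a multi-row insert uses one parameter per column per row, so a chunk size that stays under the cap can be estimated roughly like this (a sketch, assuming the multi-row patch is active):

max_vars = 999                   # default limit in older SQLite builds; newer ones allow more
n_cols = len(df.columns) + 1     # +1 if the index is written as a column
safe_chunksize = max_vars // n_cols
df.to_sql('test', engine, if_exists='replace', chunksize=safe_chunksize)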

@makmanalp, I haven't traced on PG to check for that behavior, yet, but I nearly always set chunksize on insert. In my example above, I randomly set it to 5 values between 200-5000. I didn't see drastic elapsed time differences among those choices, without the monkey patch. With the patch, elapsed time dropped ~80%.

https://github.com/pandas-dev/pandas/issues/8953#issuecomment-76139975

Is this monkey patch still working? I tried it on MS SQL Server but didn't see an improvement. Also, it threw an exception:

(pyodbc.Error) ('07002', '[07002] [Microsoft][SQL Server Native Client 11.0]COUNT field incorrect or syntax error (0) (SQLExecDirectW)')

@hangyao I think that patch is implementation-specific, one of those things that python DBAPI leaves to the DBAPI driver on how to handle. So it could be faster or not. RE: the syntax error, unsure about that.

I am thinking about adding the function below to the /io/sql.py file, at line 507 within the _engine_builder function, in a new IF clause on line 521, making the 'new' _engine_builder the snippet below. I've tested it briefly on my environment and it works great for MSSQL databases, achieving >100x speed-ups. I haven't tested it on other databases yet.

The thing withholding me from making a PR is that I think it's more effort to make it neat and safe than to just insert it like below; this might not always be the desired spec, and adding a boolean switch that turns this setting on/off (e.g. fast_executemany=True) in to_sql seemed like a bit too large an endeavor to just do without asking, I reckon.

So my questions are:

  • Does the below function work and also increase the INSERT speed for PostgreSQL?

  • Does pandas even want this snippet in its source? If so:

  • Is it desired to add this function to the default sql.py functionality or is there a better place to add this?

Love to hear some comments.

Source for the answer: https://stackoverflow.com/questions/48006551/speeding-up-pandas-dataframe-to-sql-with-fast-executemany-of-pyodbc/48861231#48861231

# Requires: from sqlalchemy import event
# 'engine' below is the SQLAlchemy engine created elsewhere.
@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True


def _engine_builder(con):
    """
    Returns a SQLAlchemy engine from a URI (if con is a string)
    else it just returns con without modifying it.
    """
    global _SQLALCHEMY_INSTALLED
    if isinstance(con, string_types):
        try:
            import sqlalchemy
        except ImportError:
            _SQLALCHEMY_INSTALLED = False
        else:
            con = sqlalchemy.create_engine(con)

    # Attach the hook to the engine being returned (assumes con is an Engine).
    @event.listens_for(con, 'before_cursor_execute')
    def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
        if executemany:
            cursor.fast_executemany = True

    return con
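(Side note: recent SQLAlchemy releases expose this directly for the mssql+pyodbc dialect, so if patching pandas internals is unappealing the same effect can be had at engine creation time; a sketch, assuming SQLAlchemy 1.3+ and a placeholder connection string:)

from sqlalchemy import create_engine

# fast_executemany is a create_engine() option of the mssql+pyodbc dialect
# in SQLAlchemy 1.3+
engine = create_engine('mssql+pyodbc:///?odbc_connect=<params>',
                       fast_executemany=True)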

@tsktsktsk123 There has recently been a PR merged related to this: https://github.com/pandas-dev/pandas/pull/19664. I didn't yet look in detail to your post, and it's certainly not exactly the same (it uses supports_multivalues_insert attribute of the sqlalchemy engine), but just to make sure you are aware of it in case this already helps as well.

That's great news! I haven't looked into the PR but will compare it this weekend and get back with the results. Thanks for the heads up.

I was just trying 0.23.0 RC2 (on postgresql) and instead of seeing a performance boost my script got significantly slower. The DB query got much faster, but measuring the time for to_sql() it actually became up to 1.5 times slower (like from 7 to 11 seconds)...

Not sure the slow down comes from this PR though as I just tested the RC.

Anyone else has experienced same problem?

@schettino72 How much data were you inserting?

Around 30K rows with 10 columns. But really almost anything I try is slower (the SQL execution itself is faster, but the overall to_sql() call is slower). It is creating a huge SQL statement where there is value interpolation for EVERY value. Something like

 %(user_id_m32639)s, %(event_id_m32639)s, %(colx_m32639)s,

I found d6tstack much simpler to use: it's a one-liner, d6tstack.utils.pd_to_psql(df, cfg_uri_psql, 'benchmark', if_exists='replace'), and it's much faster than df.to_sql(). It supports postgres and mysql. See https://github.com/d6t/d6tstack/blob/master/examples-sql.ipynb

I've been using the Monkey Patch Solution:

from pandas.io.sql import SQLTable

def _execute_insert(self, conn, keys, data_iter):
    print "Using monkey-patched _execute_insert"
    data = [dict((k, v) for k, v in zip(keys, row)) for row in data_iter]
    conn.execute(self.insert_statement().values(data))

SQLTable._execute_insert = _execute_insert

for some time now, but now I'm getting an error:

TypeError: insert_statement() missing 2 required positional arguments: 'data' and 'conn'

Is anyone else getting this? I'm on Python 3.6.5 (Anaconda) and pandas==0.23.0

Is this getting fixed? Currently, df.to_sql is extremely slow and can't be used at all for many practical use cases. The odo project seems to have been abandoned already.
I have the following use cases in financial time series where df.to_sql is pretty much not usable:
1) copying historical CSV data to a postgres database - can't use df.to_sql and had to go with custom code around psycopg2's copy_from functionality
2) streaming data (coming in batches of ~500-3000 rows per second) to be dumped to a postgres database - again, df.to_sql performance is pretty disappointing as it takes too much time to insert these natural batches of data into postgres.
The only place where I find df.to_sql useful now is to create tables automatically!!! - which is not the use case it was designed for.
I am not sure if other people share the same concern, but this issue needs some attention for "dataframes-to-database" interfaces to work smoothly.
Looking forward to it.
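For what it's worth, recent pandas releases (0.24+) do expose this via the method argument of to_sql: method='multi' sends one multi-row INSERT per chunk, and method can also be a callable with signature (table, conn, keys, data_iter), which is the hook the docs suggest for wiring in Postgres COPY. A minimal sketch of the former (the table name is a placeholder):

# pandas >= 0.24: one multi-row INSERT per chunk
df.to_sql('prices', engine, if_exists='append', index=False,
          method='multi', chunksize=1000)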

Hey, I'm getting this error when I try to perform a multi-insert to a SQLite database:

This is my code:
df.to_sql("financial_data", con=conn, if_exists="append", index=False, method="multi")

and I get this error:

Traceback (most recent call last):

  File "<ipython-input-11-cf095145b980>", line 1, in <module>
    handler.insert_financial_data_from_df(data, "GOOG")

  File "C:\Users\user01\Documents\Code\FinancialHandler.py", line 110, in insert_financial_data_from_df
    df.to_sql("financial_data", con=conn, if_exists="append", index=False, method="multi")

  File "C:\Users\user01\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py", line 2531, in to_sql
    dtype=dtype, method=method)

  File "C:\Users\user01\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 460, in to_sql
    chunksize=chunksize, dtype=dtype, method=method)

  File "C:\Users\user01\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 1547, in to_sql
    table.insert(chunksize, method)

  File "C:\Users\user01\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 686, in insert
    exec_insert(conn, keys, chunk_iter)

  File "C:\Users\user01\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 609, in _execute_insert_multi
    conn.execute(self.table.insert(data))

TypeError: insert() takes exactly 2 arguments (1 given)

Why is this happening? I'm using Python 3.7.3 (Anaconda), pandas 0.24.2 and sqlite3 2.6.0.

Thank you very much in advance!

@jconstanzo can you open this as a new issue?
And if possible, can you try to provide a reproducible example? (eg a small example dataframe that can show the problem)

@jconstanzo Having the same issue here. Using method='multi' (in my case, in combination with chunksize) seems to trigger this error when you try to insert into a SQLite database.

Unfortunately I can't really provide an example dataframe because my dataset is huge, that's the reason I'm using method and chunksize in the first place.

I'm sorry for the delay. I just opened an issue for this problem: https://github.com/pandas-dev/pandas/issues/29921
