I need to programmatically insert tens of millions of records into a Postgres database. Presently I am executing thousands of insert statements in a single "query".
Is there a better way to do this, some bulk insert statement I don't know about?
It mostly depends on the (other) activity in the database. Operations like this can effectively freeze the entire database for other sessions. Another consideration is the data model and the presence of constraints, triggers, etc.
My first approach is always: create a (temp) table with a structure similar to the target table (create table tmp AS select * from target where 1=0), and start by reading the file into the temp table. Then I check what can be checked: duplicates, keys that already exist in the target, etc.
Then I just do an "insert into target select * from tmp" or similar.
If this fails, or takes too long, I abort it and consider other methods (temporarily dropping indexes/constraints, etc.).
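A minimal sketch of that workflow, assuming a hypothetical target table named target with a key column id and a server-readable CSV file:

```sql
-- staging table with the same structure as the target, but no rows
CREATE TEMP TABLE tmp AS SELECT * FROM target WHERE 1 = 0;

-- bulk-load the file into the staging table (file path and format are assumptions)
COPY tmp FROM '/path/to/data.csv' WITH (FORMAT csv);

-- example checks: duplicates within the load, and keys already present in the target
SELECT id, count(*) FROM tmp GROUP BY id HAVING count(*) > 1;
SELECT tmp.id FROM tmp JOIN target ON target.id = tmp.id;

-- if the checks pass, move everything over in one statement
INSERT INTO target SELECT * FROM tmp;
```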
I just encountered this issue and would recommend csvsql for bulk imports to Postgres. To perform a bulk insert you'd simply run createdb and then use csvsql, which connects to your database and creates individual tables for an entire folder of CSVs.
There is an alternative to using COPY, which is the multirow VALUES syntax that Postgres supports. From the documentation:
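```sql
-- multirow VALUES example from the PostgreSQL INSERT documentation
INSERT INTO films (code, title, did, date_prod, kind) VALUES
    ('B6717', 'Tampopo', 110, '1985-02-10', 'Comedy'),
    ('HG120', 'The Dinosaur', 111, '1985-02-10', 'Comedy');
```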
The above code inserts two rows, but you can extend it arbitrarily, until you hit the maximum number of prepared statement tokens (it might be $999, but I'm not 100% sure about that). Sometimes one cannot use COPY, and this is a worthy replacement for those situations.
You can use COPY table FROM ... WITH BINARY, which is "somewhat faster than the text and CSV formats". Only do this if you have millions of rows to insert and if you are comfortable with binary data. Example recipes for this exist in Python, using psycopg2 with binary input.
The UNNEST function with arrays can be used along with the multirow VALUES syntax. I think this method is slower than using COPY, but it is useful to me when working with psycopg and Python (a Python list passed to cursor.execute becomes a pg ARRAY).
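A sketch of the idea, where the table and column names are placeholders:

```sql
-- each array supplies the values for one column; the arrays must have equal length
INSERT INTO tablename (fieldname1, fieldname2, fieldname3)
VALUES (
    unnest(array[1, 2, 3]),
    unnest(array[100, 200, 300]),
    unnest(array['a', 'b', 'c'])
);
```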
Without VALUES, the same thing can be done with a subselect and an additional existence check.
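A sketch, again with placeholder names, that skips rows whose key already exists in the target table:

```sql
INSERT INTO tablename (fieldname1, fieldname2, fieldname3)
SELECT * FROM (
    SELECT unnest(array[1, 2, 3])       AS fieldname1,
           unnest(array[100, 200, 300]) AS fieldname2,
           unnest(array['a', 'b', 'c']) AS fieldname3
) AS tmp
WHERE NOT EXISTS (
    -- treat fieldname1 as the key for the existence check
    SELECT 1 FROM tablename t WHERE t.fieldname1 = tmp.fieldname1
);
```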
The same syntax can be used for bulk updates.
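For example (again, the table, the key column, and the new values are placeholders):

```sql
-- parallel arrays of keys and new values drive the update
UPDATE tablename
SET fieldname3 = tmp.new_value
FROM (
    SELECT unnest(array[1, 2, 3])       AS id,
           unnest(array['x', 'y', 'z']) AS new_value
) AS tmp
WHERE tablename.fieldname1 = tmp.id;
```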
One way to speed things up is to explicitly perform multiple inserts or copies within a transaction (say 1000). Postgres's default behavior is to commit after each statement, so by batching the commits you can avoid some overhead. As the guide in Daniel's answer says, you may have to disable autocommit for this to work. Also note the comment at the bottom suggesting that increasing the size of wal_buffers to 16 MB may also help.
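A sketch of the batching idea in plain SQL (hypothetical table; from a client library you would normally turn off autocommit and commit every N statements instead of writing BEGIN/COMMIT by hand):

```sql
BEGIN;
INSERT INTO mytable (id, val) VALUES (1, 'a');
INSERT INTO mytable (id, val) VALUES (2, 'b');
-- ... on the order of a thousand statements per batch ...
COMMIT;
```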