I see plenty of examples of importing a CSV into a PostgreSQL db, but what I need is an efficient way to import 500,000 CSVs into a single PostgreSQL db. Each CSV is a bit over 500KB (so a grand total of approx 272GB of data).
The CSVs are identically formatted and there are no duplicate records (the data was generated programmatically from a raw data source). I have been searching and will continue to search online for options, but I would appreciate any direction on getting this done in the most efficient manner possible. I do have some experience with Python, but will dig into any other solution that seems appropriate.
Thanks!
If you start by reading the PostgreSQL guide "Populating a Database" you'll see several pieces of advice:
- Load the data in a single transaction.
- Use COPY if at all possible.
- Remove indexes, foreign key constraints, etc. before loading the data and restore them afterwards.
PostgreSQL's COPY statement already supports the CSV format:
COPY table (column1, column2, ...) FROM '/path/to/data.csv' WITH (FORMAT CSV)
so it looks as if you are best off not using Python at all, or using Python only to generate the required sequence of COPY statements.
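For example, here is a minimal sketch of that approach (the directory, output file, table and column names are all placeholders): it writes one big SQL script containing a COPY per CSV, wrapped in a single transaction. Note that COPY ... FROM 'file' reads the file on the database server, so the server process must be able to see the CSVs.

import pathlib

csv_dir = pathlib.Path("/path/to/csvs")  # assumption: all 500,000 CSVs live in this directory
with open("load_all.sql", "w") as out:
    out.write("BEGIN;\n")  # load everything in a single transaction
    for csv_file in sorted(csv_dir.glob("*.csv")):
        out.write(f"COPY table (column1, column2, column3) FROM '{csv_file}' WITH (FORMAT CSV);\n")
    out.write("COMMIT;\n")

You can then run the generated script with psql -d yourdb -f load_all.sql.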
Nice chunk of data you have there. I'm not 100% sure about Postgres, but MySQL at least provides SQL commands to feed a CSV directly into a table. This bypasses the usual per-insert checks and is therefore more than an order of magnitude faster than ordinary INSERT operations.
So probably the fastest way to go is to write a simple Python script that tells your Postgres server which CSV files, and in which order, to hungrily devour into its endless tables.
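One possible sketch of such a script (assuming the psycopg2 driver; the connection string, table and column names are placeholders) streams every file through the client with COPY ... FROM STDIN, so the server never needs direct filesystem access to the CSVs and everything commits as one transaction:

import glob
import psycopg2

conn = psycopg2.connect("dbname=yourdb user=you")  # placeholder connection string
with conn, conn.cursor() as cur:  # commits once when the block exits successfully
    for path in sorted(glob.glob("/path/to/csvs/*.csv")):
        with open(path) as f:
            cur.copy_expert("COPY table (column1, column2, column3) FROM STDIN WITH (FORMAT CSV)", f)
conn.close()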
I use PHP and Postgres: I read the CSV file with PHP and build a string in the following format:
{ {line1 column1, line1 column2, line1 column3} , {line2 column1, line2 column2, line2 column3} }
Then I do the whole import in a single transaction by passing that string as a parameter to a PostgreSQL function.
Inside the function I can check all the records, the formatting, the amount of data, etc., and I get 500,000 records imported in about 3 minutes.
To read the data inside the PostgreSQL function:
-- function name and signature are illustrative
CREATE OR REPLACE FUNCTION schema.import_rows(p_dados varchar[])
RETURNS void AS $$
DECLARE
  d varchar[];
BEGIN
  -- p_dados is a two-dimensional array; SLICE 1 yields one row (sub-array) per iteration
  FOREACH d SLICE 1 IN ARRAY p_dados
  LOOP
    INSERT INTO schema.table (
      column1,
      column2,
      column3
    )
    VALUES (
      d[1],
      d[2]::INTEGER, -- explicit conversion to INTEGER
      d[3]::BIGINT   -- explicit conversion to BIGINT
    );
  END LOOP;
END;
$$ LANGUAGE plpgsql;
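The answer above builds the array string in PHP, but the same pattern can be driven from Python. A rough sketch (assuming psycopg2 and the illustrative schema.import_rows function above): psycopg2 adapts a list of lists into a two-dimensional array parameter, and the whole call runs in a single transaction.

import csv
import psycopg2

conn = psycopg2.connect("dbname=yourdb user=you")  # placeholder connection string
with conn, conn.cursor() as cur:  # single transaction, committed on success
    with open("data.csv", newline="") as f:
        rows = list(csv.reader(f))  # list of rows, each a list of strings -> 2-D array
    cur.execute("SELECT schema.import_rows(%s::varchar[])", (rows,))
conn.close()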