I have tens of millions of rows to transfer from multidimensional array files into a PostgreSQL database. My tools are Python and psycopg2. The most efficient way to bulk insert data is using copy_from. However, my data are mostly 32-bit floating point numbers (real or float4), so I'd rather not convert from real → text → real. Here is an example database DDL:
CREATE TABLE num_data
(
id serial PRIMARY KEY NOT NULL,
node integer NOT NULL,
ts smallint NOT NULL,
val1 real,
val2 double precision
);
Here is where I'm at with Python using strings (text):
# Just one row of data
num_row = [23253, 342, -15.336734, 2494627.949375]
import psycopg2
# Python3:
from io import StringIO
# Python2, use: from cStringIO import StringIO
conn = psycopg2.connect("dbname=mydb user=postgres")
curs = conn.cursor()
# Convert floating point numbers to text, write to COPY input
cpy = StringIO()
cpy.write('\t'.join([repr(x) for x in num_row]) + '\n')
# Insert data; database converts text back to floating point numbers
cpy.seek(0)
curs.copy_from(cpy, 'num_data', columns=('node', 'ts', 'val1', 'val2'))
conn.commit()
Is there an equivalent that could work using a binary mode? I.e., keep the floating point numbers in binary? Not only would this preserve the floating point precision, but it could be faster.
(Note: to see the same precision as the example, use SET extra_float_digits='2'.)
Here is my version, based on Mike's version. It's very ad-hoc, but there are two pros: readline support and an example of the hstore binary format.
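As a rough sketch of the hstore part only, assuming hstore's binary send format (a 32-bit pair count followed by length-prefixed keys and values, with -1 marking a NULL value); pack_hstore is an illustrative name, not the original helper:

from struct import pack

def pack_hstore(mapping):
    # Pack a Python dict as an hstore datum for binary COPY:
    # int32 pair count, then per pair a length-prefixed key and a
    # length-prefixed value (length -1 for NULL), all UTF-8 encoded.
    buf = pack('!i', len(mapping))
    for key, value in mapping.items():
        k = key.encode('utf-8')
        buf += pack('!i', len(k)) + k
        if value is None:
            buf += pack('!i', -1)
        else:
            v = value.encode('utf-8')
            buf += pack('!i', len(v)) + v
    return buf

# Within a COPY tuple, write the datum prefixed by its total byte length:
# datum = pack_hstore({'a': '1', 'b': None})
# cpy.write(pack('!i', len(datum)) + datum)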
Here is the binary equivalent of COPY FROM for Python 3:
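A minimal sketch of this approach, assuming the num_data schema from the question and psycopg2's copy_expert for the WITH BINARY variant (the loop structure here is illustrative, not necessarily the exact original). The header, per-tuple field counts, per-field length words, and trailer follow PostgreSQL's binary COPY layout, all in network byte order:

from io import BytesIO
from struct import pack
import psycopg2

# One row for the columns (node, ts, val1, val2); "id" takes its serial default
num_row = [23253, 342, -15.336734, 2494627.949375]

conn = psycopg2.connect("dbname=mydb user=postgres")
curs = conn.cursor()

cpy = BytesIO()
# File header: 11-byte signature, 32-bit flags (0), 32-bit header extension length (0)
cpy.write(pack('!11sii', b'PGCOPY\n\377\r\n\0', 0, 0))

# Each tuple: 16-bit field count, then a 32-bit length word and the raw datum
# for every field
field_formats = ('i', 'h', 'f', 'd')   # integer, smallint, real, double precision
field_sizes = (4, 2, 4, 8)
cpy.write(pack('!h', len(num_row)))
for value, fmt, size in zip(num_row, field_formats, field_sizes):
    cpy.write(pack('!i' + fmt, size, value))

# File trailer: a 16-bit word holding -1
cpy.write(pack('!h', -1))

cpy.seek(0)
curs.copy_expert("COPY num_data (node, ts, val1, val2) FROM STDIN WITH BINARY", cpy)
conn.commit()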
Update

I rewrote the above approach to writing the files for COPY. My data in Python is in NumPy arrays, so it makes sense to use them directly. Here is some example data with 1M rows, 7 columns:
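A sketch of generating such data, assuming the seven float32 columns are named s0 through s6 (the names used in the comparison at the end) and sit alongside node and ts keys in a NumPy record array; the exact random method used originally may differ:

import numpy as np

num_rows, num_cols = 1000000, 7
# Record array: node/ts keys plus seven float32 sample columns s0..s6
dtype = [('node', 'i4'), ('ts', 'i2')] + [('s%d' % i, 'f4') for i in range(num_cols)]
data = np.empty(num_rows, dtype=dtype)
data['node'] = np.random.randint(0, 5000, num_rows)
data['ts'] = np.random.randint(0, 32000, num_rows)
for i in range(num_cols):
    data['s%d' % i] = np.random.rand(num_rows)   # cast to float32 on assignment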
On my database, I have two tables: one that looks like the following, and another similar table named num_data_text:
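Assuming a layout parallel to num_data above, with the seven real columns s0 through s6 that appear in the comparison at the end (the table name num_data_b is illustrative):

CREATE TABLE num_data_b
(
id serial PRIMARY KEY NOT NULL,
node integer NOT NULL,
ts smallint NOT NULL,
s0 real,
s1 real,
s2 real,
s3 real,
s4 real,
s5 real,
s6 real
);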
Here are some simple helper functions to prepare the data for COPY (both text and binary formats) by using the information in the NumPy record array:
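A sketch of such helpers, with illustrative names prepare_text and prepare_binary; the dtype of each field drives the struct format used for the binary buffer:

from io import BytesIO, StringIO
from struct import pack

def prepare_text(dat):
    # Build a tab-separated text buffer for copy_from
    cpy = StringIO()
    for row in dat:
        cpy.write('\t'.join(str(x) for x in row.tolist()) + '\n')
    return cpy

def field_format(dtype):
    # Map a NumPy field dtype to a (struct format, byte size) pair for binary COPY
    return {('i', 4): ('i', 4), ('i', 2): ('h', 2),
            ('f', 4): ('f', 4), ('f', 8): ('d', 8)}[(dtype.kind, dtype.itemsize)]

def prepare_binary(dat):
    # Build a binary COPY buffer: file header, per-tuple field count,
    # length-prefixed fields, and file trailer
    fields = [field_format(dat.dtype[name]) for name in dat.dtype.names]
    cpy = BytesIO()
    cpy.write(pack('!11sii', b'PGCOPY\n\377\r\n\0', 0, 0))
    for row in dat:
        cpy.write(pack('!h', len(fields)))
        for value, (fmt, size) in zip(row.tolist(), fields):
            cpy.write(pack('!i' + fmt, size, value))
    cpy.write(pack('!h', -1))
    return cpy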
This is how I'm using the helper functions to benchmark the two COPY format methods:
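A sketch of that driver code, reusing the conn and curs objects from the question; time_pgcopy is an illustrative name, and the binary table name follows the DDL sketch above:

import time

def time_pgcopy(dat, table, binary):
    # Time buffer preparation and the COPY into the database
    t0 = time.time()
    cpy = prepare_binary(dat) if binary else prepare_text(dat)
    t1 = time.time()
    cpy.seek(0)
    cols = ', '.join(dat.dtype.names)
    if binary:
        curs.copy_expert('COPY %s (%s) FROM STDIN WITH BINARY' % (table, cols), cpy)
    else:
        curs.copy_from(cpy, table, columns=dat.dtype.names)
    conn.commit()
    t2 = time.time()
    print('%s: buffer prepared in %.2f s, database copy in %.2f s'
          % (table, t1 - t0, t2 - t1))

time_pgcopy(data, 'num_data_text', binary=False)
time_pgcopy(data, 'num_data_b', binary=True)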
The output of the last two time_pgcopy commands shows that both the NumPy → file and file → database steps are way faster with the binary approach. The obvious difference is how Python prepares the COPY file, which is really slow for text. Generally speaking, the binary format loads into the database in 2/3 of the time of the text format for this schema.
Lastly, I compared the values in both tables within the database to see whether the numbers differed. About 1.46% of the rows have different values for column s0, and this fraction increases to 6.17% for s6 (probably related to the random method that I used). The non-zero absolute differences between all 70M 32-bit float values range between 9.3132257e-010 and 7.6293945e-006. These small differences between the text and binary loading methods are due to the loss of precision from the float → text → float conversions required by the text format method.