-->

python postgres can I fetchall() 1 million rows?

2019-01-23 15:14发布

问题:

I am using psycopg2 module in python to read from postgres database, I need to some operation on all rows in a column, that has more than 1 million rows.

I would like to know would cur.fetchall() fail or cause my server to go down? (since my RAM might not be that big to hold all that data)

q="SELECT names from myTable;"
cur.execute(q)
rows=cur.fetchall()
for row in rows:
    doSomething(row)

what is the smarter way to do this?

回答1:

fetchall() fetches up to the arraysize limit, so to prevent a massive hit on your database you can either fetch rows in manageable batches, or simply step through the cursor till its exhausted:

row = cur.fetchone()
while row:
   # do something with row
   row = cur.fetchone()


回答2:

Consider using server side cursor:

When a database query is executed, the Psycopg cursor usually fetches all the records returned by the backend, transferring them to the client process. If the query returned an huge amount of data, a proportionally large amount of memory will be allocated by the client.

If the dataset is too large to be practically handled on the client side, it is possible to create a server side cursor. Using this kind of cursor it is possible to transfer to the client only a controlled amount of data, so that a large dataset can be examined without keeping it entirely in memory.

Here's an example:

cursor.execute("DECLARE super_cursor BINARY CURSOR FOR SELECT names FROM myTable")
while True:
    cursor.execute("FETCH 1000 FROM super_cursor")
    rows = cursor.fetchall()

    if not rows:
        break

    for row in rows:
        doSomething(row)


回答3:

The solution Burhan pointed out reduces the memory usage for large datasets by only fetching single rows:

row = cursor.fetchone()

However, I noticed a significant slowdown in fetching rows one-by-one. I access an external database over an internet connection, that might be a reason for it.

Having a server side cursor and fetching bunches of rows proved to be the most performant solution. You can change the sql statements (as in alecxe answers) but there is also pure python approach using the feature provided by psycopg2:

cursor = conn.cursor('name_of_the_new_server_side_cursor')
cursor.execute(""" SELECT * FROM table LIMIT 1000000 """)

while True:
    rows = cursor.fetchmany(5000)
    if not rows:
        break

    for row in rows:
        # do something with row
        pass

you find more about server side cursors in the psycopg2 wiki



回答4:

Here is the code to use for simple server side cursor with the speed of fetchmany management.

The principle is to use named cursor in Psycopg2 and give it a good itersize to load many rows at once like fetchmany would do but with a single loop of for rec in cursor that does an implicit fetchnone().

With this code I make queries of 150 millions rows from multi-billion rows table within 1 hour and 200 meg ram.