How to efficiently UPDATE a column in a large Post

2019-02-26 04:25发布

问题:

I have a large table with ca. 10 million rows in PostgreSQL 9.4 database. It looks somewhat like this:

gid | number1 | random | result | ...
 1  |    2    |  NULL  |  NULL  | ...
 2  |   15    |  NULL  |  NULL  | ...
... |   ...   |   ...  |  ...   | ...

Now I would like to update the columns random and result as a function of number1. That means that at least random needs to be produced in a script outside of the database. Since I have limited RAM, I wonder how I could do that efficiently using psycopg2. I believe I face two problems: How to fetch the data without using too much RAM and how to get it back in there. The simpleton approach would look like this:

curs.execute("""SELECT gid1, number1 FROM my_table;""")
data = curs.fetchall()

result = []
for i in data:
    result.append((create_random(i[1]), i[0]))
curs.executemany("""UPDATE my_table
                    SET random = %s
                    WHERE gid = %s;""",
                 results)
curs.execute("""UPDATE my_table
                SET result = number1 * random;""")

However, this will certainly quickly deplete all my memory and take forever to UPDATE my_table.

What would be a smarter strategy? The database is being accessed exclusively and can be locked. The PostgreSQL random function is unfortunately not suitable for my case.

回答1:

unnest the data to make it all at once:

def create_random(i):
    return random() * i

curs.execute("select gid, number from t;")
data = curs.fetchall()

results = []
for i in data:
    results.append((create_random(i[1]), i[0]))

curs.execute("""
    update t
    set
        rnd = s.rnd,
        result = number * s.rnd
    from unnest(%s) s(rnd numeric, gid integer)
    where t.gid = s.gid;
""", (results,))

con.commit()

Table t:

create table t (
    gid integer,
    number integer,
    rnd float,
    result float
);