So I have a python script that goes through roughly 350,000 data objects, and depending on some tests, it needs to update a row which represents each one of those objects in a MySQl db. I'm also using pymysql as I've had the least trouble with it especially when sending over large select queries (select statements with where column IN (....)
clause that can contain 100,000+ values).
Since each update for each row can be different, each update statement is different. For example, for one row we might want to update first_name
but for another row we want to leave first_name
untouched and we want to update last_name
.
This is why I don't want to use the cursor.executemany()
method which takes in one generic update statement and you then feed it the values however as I mentioned, each update is different so having one generic update statement doesn't really work for my case. I also don't want to send over 350,000 update statements individually over the wire. Is there anyway I can package all of my update statements together and send them at once?
I tried having them all in one query and using the cursor.execute()
method but it doesn't seem to update all the rows.
SQL #1: CREATE TABLE t
with whatever columns you might need to change. Make all of them NULL
(as opposed to NOT NULL
).
SQL #2: Do a bulk INSERT
(or LOAD DATA
) of all the changes needed. Eg, if changing only first_name
, fill in id
and first_name
, but have the other columns NULL
.
SQL #3-14:
UPDATE real_table
JOIN t ON t.id = real_table.id
SET real_table.first_name = t.first_name
WHERE t.first_name IS NOT NULL;
# ditto for each other column.
All SQLs except #1 will be time-consuming. And, since UPDATE
needs to build a undo log, it could timeout or otherwise be problematical. See a discussion of chunking if necessary.
If necessary, use functions such as COALESCE()
, GREATEST()
, IFNULL()
, etc.
Mass UPDATEs
usually imply poor schema design.
(If Ryan jumps in with an 'Answer' instead of just a 'Comment', he should probably get the 'bounty'.)
Your best performance will be if you can encode your "tests" into the SQL logic itself, so you can boil everything down to a handful of UPDATE statements. Or at least get as many as possible done that way, so that fewer rows need to be updated individually.
For example:
UPDATE tablename set firstname = [some logic]
WHERE [logic that identifies which rows need the firstname updated];
You don't describe much about your tests, so it's hard to be sure. But you can typically get quite a lot of logic into your WHERE clause with a little bit of work.
Another option would be to put your logic into a stored procedure. You'll still be doing 350,000 updates, but at least they aren't all "going over the wire". I would use this only as a last resort, though; business logic should be kept in the application layer whenever possible, and stored procedures make your application less portable.