I have a table with 10 columns, and that table holds thousands to millions of rows. In some scenarios I want to update more than 10K records at a time. Currently my code works sequentially, like:
    for i in (primary key ids for all records to be updated)
        executeupdate(i)
What I thought is that instead of running the same query 10K times, I will put all the ids into one string and run a single update query, like:

    executeupdate(all ids)
The actual DB queries would look like this. Suppose I have primary key ids like:

    10001, 10002, 10003, 10004, 10005

In the first case my queries will be:
    update tab1 set status='xyz' where id='10001'
    update tab1 set status='xyz' where id='10002'
    update tab1 set status='xyz' where id='10003'
    update tab1 set status='xyz' where id='10004'
    update tab1 set status='xyz' where id='10005'
And my bulk update query will be:

    update tab1 set status='xyz' where id in ('10001','10002','10003','10004','10005')
So my question is: will I get any performance improvement (in execution time) by doing the bulk update, or will the total execution time be about the same, since an index scan and update still has to happen for each record either way?
Note: I am using DB2 9.5 as the database.
Thanks.
You will definitely see a performance improvement, because you will reduce the number of roundtrips.
However, this approach does not scale very well; thousands of IDs in one statement can get tricky. Also, there is a limit on the size of your query (possibly 64k). You could consider 'paging' through your table and updating, say, 100 records per update statement.
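A rough JDBC sketch of that paging idea (the tab1/status/id names come from the question, the chunk size of 100 is just the figure mentioned above, and an already-open connection is assumed):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.Collections;
    import java.util.List;

    // Update the ids in chunks of `chunkSize` so no single statement grows too large.
    static void updateInChunks(Connection conn, List<String> ids, String status, int chunkSize)
            throws SQLException {
        for (int from = 0; from < ids.size(); from += chunkSize) {
            List<String> chunk = ids.subList(from, Math.min(from + chunkSize, ids.size()));
            // Build "update tab1 set status = ? where id in (?, ?, ..., ?)" for this chunk.
            String placeholders = String.join(",", Collections.nCopies(chunk.size(), "?"));
            String sql = "update tab1 set status = ? where id in (" + placeholders + ")";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, status);                 // new status value
                for (int i = 0; i < chunk.size(); i++) {
                    ps.setString(i + 2, chunk.get(i));   // bind each id after the status parameter
                }
                ps.executeUpdate();                      // one round trip per chunk, not per row
            }
        }
    }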
In general, a "bulk" update will be faster, regardless of database. Of course, you can test the performance of the two, and report back.
Each call to update requires a bunch of overhead in terms of processing the query and setting up locks on tables/pages/rows. Doing a single update consolidates this overhead.

The downside to a single update is that, while it might be faster overall, it might lock underlying resources for longer periods of time. For instance, the single updates might take 10 milliseconds each, for an elapsed time of 10 seconds for 1,000 of them; however, no resource is locked for more than 10 milliseconds. The bulk update might take only 5 seconds, but the resources would be locked for more of that period.

To speed these updates up, be sure that id is indexed.

I should note: this is a general principle. I have not specifically tested single versus multiple update performance on DB2.
If you are using .NET (and there's probably a similar option in other languages like Java), there is an option you can use on your DB2Connection class called BeginChain, which will greatly improve performance.

Basically, when you have the chain option activated, your DB2 client will keep all of the commands in a queue. When you call EndChain, the queue will be sent to the server at once, and processed at one time.

The documentation says that this should perform much better than non-chained UPDATEs/INSERTs/DELETEs (and this is what we've seen in my shop), but there are some differences you might need to be aware of: ExecuteNonQuery will return -1 when chaining is active.

Additionally, performance can be improved further by using a single query with parameter markers instead of separate individual queries (assuming status can change as well; otherwise, you might just use a literal).
Edit for comment: I'm not sure if the confusion is in using parameter markers (which are just placeholders for values in a query; see the link for more details), or in the actual usage of chaining. If it is the second, then here is some example code (I didn't verify that it works, so use at your own risk :) ):
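The chaining calls above are specific to the IBM .NET provider; the closest Java analogue I'm aware of is JDBC statement batching with parameter markers. A rough, unverified sketch (table and column names taken from the question, connection assumed to be open):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    // Queue one parameterized update per id, then send the whole batch to the
    // server in a single executeBatch() call instead of one round trip per row.
    static void batchUpdateStatus(Connection conn, List<String> ids, String status)
            throws SQLException {
        String sql = "update tab1 set status = ? where id = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (String id : ids) {
                ps.setString(1, status);
                ps.setString(2, id);
                ps.addBatch();       // queued locally, not yet executed
            }
            ps.executeBatch();       // the queued statements are sent and executed as one batch
        }
    }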
One other aspect I would like to point out is the commit interval. If a single update statement updates a few hundred thousand rows, the transaction log also grows accordingly, and the statement can become slower. I have seen the total time go down when using ETL tools like Informatica, which fired sets of update statements per record followed by a commit, compared to a single update statement written to do it all in one go. This was counter-intuitive for me.
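To illustrate what controlling the commit interval might look like from JDBC (this is only a sketch; it assumes auto-commit is turned off and a hypothetical commitEvery threshold such as 1,000 rows, with table and column names from the question):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    // Commit every `commitEvery` rows so each transaction (and its log space)
    // stays small instead of covering the whole multi-hundred-thousand-row update.
    static void updateWithPeriodicCommits(Connection conn, List<String> ids, String status,
                                          int commitEvery) throws SQLException {
        conn.setAutoCommit(false);
        String sql = "update tab1 set status = ? where id = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int pending = 0;
            for (String id : ids) {
                ps.setString(1, status);
                ps.setString(2, id);
                ps.executeUpdate();
                if (++pending >= commitEvery) {
                    conn.commit();   // release log space and locks held so far
                    pending = 0;
                }
            }
            conn.commit();           // commit whatever is left over
        }
    }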
I came here with the same question a week back. Then I faced a situation where I had to update a table with around 3,500 rows in a MySQL database through JDBC. I updated the same table twice: once through a for loop, iterating over a collection of objects, and once using a bulk update query. The bulk update won by a huge margin.
Why this Difference?
To answer this, let's look at how a query actually gets executed in a DBMS.
Unlike in procedural languages, you tell the DBMS what to do, but not how to do it. The DBMS then parses the query, optimizes it (chooses an execution plan), and executes it.
Now, when you update a table row by row, every query you execute goes through parsing, optimization and execution separately. If you instead build one longer query in a loop and then execute it once, it is parsed only once. The amount of time you save by using a batch update in place of the iterative approach grows almost linearly with the number of rows you update.
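If you want to reproduce the comparison yourself, a rough JDBC timing harness could look like the sketch below (table and column names from the question; the actual numbers will depend entirely on your schema, data and network):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.Collections;
    import java.util.List;

    // Time the row-by-row loop against a single IN-list update for the same ids.
    static void compare(Connection conn, List<String> ids, String status) throws SQLException {
        long t0 = System.nanoTime();
        try (PreparedStatement ps = conn.prepareStatement(
                "update tab1 set status = ? where id = ?")) {
            for (String id : ids) {            // one statement execution per row
                ps.setString(1, status);
                ps.setString(2, id);
                ps.executeUpdate();
            }
        }
        long loopMs = (System.nanoTime() - t0) / 1_000_000;

        String in = String.join(",", Collections.nCopies(ids.size(), "?"));
        t0 = System.nanoTime();
        try (PreparedStatement ps = conn.prepareStatement(
                "update tab1 set status = ? where id in (" + in + ")")) {
            ps.setString(1, status);
            for (int i = 0; i < ids.size(); i++) {
                ps.setString(i + 2, ids.get(i));
            }
            ps.executeUpdate();                // one statement for all rows
        }
        long bulkMs = (System.nanoTime() - t0) / 1_000_000;

        System.out.println("loop: " + loopMs + " ms, bulk: " + bulkMs + " ms");
    }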
A few tips that might come in handy while updating data in your database: