I have a table with 10 columns, and that table holds thousands to millions of rows. In some scenarios I want to update more than 10K records at a time. Currently my code works sequentially, like:
    for i in (primary key ids for all records to be updated)
        executeupdate(i)
What I thought is that instead of running the same query 10K times, I will put all the ids into one string and run a single update query, like:

    executeupdate(all ids)
The actual DB queries would look like this. Suppose I have primary key ids like:

    10001, 10002, 10003, 10004, 10005

In the first case my queries will be:
    update tab1 set status='xyz' where id='10001'
    update tab1 set status='xyz' where id='10002'
    update tab1 set status='xyz' where id='10003'
    update tab1 set status='xyz' where id='10004'
    update tab1 set status='xyz' where id='10005'
And my bulk update query will be:

    update tab1 set status='xyz' where id in ('10001','10002','10003','10004','10005')
So my question is: will I get any performance improvement (in execution time) by doing the bulk update, or will the total execution time be about the same, since an index scan and update still has to happen for each record either way?
Note: I am using DB2 9.5 as the database.
Thanks.
You will definitely see a performance improvement, because you will reduce the number of roundtrips.
However, this approach does not scale very well; thousands of IDs in one statement can get tricky. Also, there is a limit on the size of your query (possibly 64k). You could consider 'paging' through your table and updating, say, 100 records per update statement.
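A rough JDBC sketch of that paging idea (the tab1/status/id names come from the question, the chunk size of 100 is just the figure mentioned above, and an already-open connection is assumed):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.Collections;
    import java.util.List;

    // Update the ids in chunks of `chunkSize` so no single statement grows too large.
    static void updateInChunks(Connection conn, List<String> ids, String status, int chunkSize)
            throws SQLException {
        for (int from = 0; from < ids.size(); from += chunkSize) {
            List<String> chunk = ids.subList(from, Math.min(from + chunkSize, ids.size()));
            // Build "update tab1 set status = ? where id in (?, ?, ..., ?)" for this chunk.
            String placeholders = String.join(",", Collections.nCopies(chunk.size(), "?"));
            String sql = "update tab1 set status = ? where id in (" + placeholders + ")";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, status);                 // new status value
                for (int i = 0; i < chunk.size(); i++) {
                    ps.setString(i + 2, chunk.get(i));   // bind each id after the status parameter
                }
                ps.executeUpdate();                      // one round trip per chunk, not per row
            }
        }
    }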
In general, a "bulk" update will be faster, regardless of database. Of course, you can test the performance of the two, and report back.
Each call to update requires a bunch of overhead in terms of processing the query and setting up locks on tables/pages/rows. Doing a single update consolidates this overhead.

The downside to a single update is that, while it might be faster overall, it might lock underlying resources for longer periods of time. For instance, the single updates might take 10 milliseconds each, for an elapsed time of 10 seconds for 1,000 of them; however, no resource is locked for more than 10 milliseconds. The bulk update might take only 5 seconds, but the resources would be locked for more of that period.

To speed these updates up, be sure that id is indexed.

I should note: this is a general principle. I have not specifically tested single versus multiple update performance on DB2.
If you are using .NET (and there's probably a similar option in other languages like Java), there is an option you can use on your DB2Connection class called BeginChain, which will greatly improve performance.

Basically, when you have the chain option activated, your DB2 client will keep all of the commands in a queue. When you call EndChain, the queue will be sent to the server at once, and processed at one time.

The documentation says that this should perform much better than non-chained UPDATEs/INSERTs/DELETEs (and this is what we've seen in my shop), but there are some differences you might need to be aware of: ExecuteNonQuery will return -1 when chaining is active.

Additionally, performance can be improved further by using a single query with parameter markers instead of separate individual queries (assuming status can change as well; otherwise, you might just use a literal).
Edit for comment: I'm not sure if the confusion is in using parameter markers (which are just placeholders for values in a query; see the link for more details), or in the actual usage of chaining. If it is the second, then here is some example code (I didn't verify that it works, so use at your own risk :) ):
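The chaining calls above are specific to the IBM .NET provider; the closest Java analogue I'm aware of is JDBC statement batching with parameter markers. A rough, unverified sketch (table and column names taken from the question, connection assumed to be open):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    // Queue one parameterized update per id, then send the whole batch to the
    // server in a single executeBatch() call instead of one round trip per row.
    static void batchUpdateStatus(Connection conn, List<String> ids, String status)
            throws SQLException {
        String sql = "update tab1 set status = ? where id = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (String id : ids) {
                ps.setString(1, status);
                ps.setString(2, id);
                ps.addBatch();       // queued locally, not yet executed
            }
            ps.executeBatch();       // the queued statements are sent and executed as one batch
        }
    }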
One other aspect I would like to point out is the commit interval. If a single update statement updates a few hundred thousand rows, the transaction log also grows accordingly, and the statement can become slower. I have seen the total time go down when using ETL tools like Informatica, which fired sets of update statements per record followed by a commit, compared to a single update statement written to do it all in one go. This was counter-intuitive for me.
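To illustrate what controlling the commit interval might look like from JDBC (this is only a sketch; it assumes auto-commit is turned off and a hypothetical commitEvery threshold such as 1,000 rows, with table and column names from the question):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    // Commit every `commitEvery` rows so each transaction (and its log space)
    // stays small instead of covering the whole multi-hundred-thousand-row update.
    static void updateWithPeriodicCommits(Connection conn, List<String> ids, String status,
                                          int commitEvery) throws SQLException {
        conn.setAutoCommit(false);
        String sql = "update tab1 set status = ? where id = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int pending = 0;
            for (String id : ids) {
                ps.setString(1, status);
                ps.setString(2, id);
                ps.executeUpdate();
                if (++pending >= commitEvery) {
                    conn.commit();   // release log space and locks held so far
                    pending = 0;
                }
            }
            conn.commit();           // commit whatever is left over
        }
    }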
I came here with the same question a week back. Then I faced a situation where I had to update a table with around 3,500 rows in a MySQL database through JDBC. I updated the same table twice: once through a for loop, iterating over a collection of objects, and once using a bulk update query. The bulk update won by a huge margin.
Why this Difference?
To answer this, let's look at how a query actually gets executed in a DBMS.
Unlike in procedural languages, you tell the DBMS what to do, but not how to do it. The DBMS then parses the query, optimizes it (chooses an execution plan), and executes it.
Now, when you update a table row by row, every query you execute goes through parsing, optimization and execution separately. If you instead build one longer query in a loop and then execute it once, it is parsed only once. The amount of time you save by using a batch update in place of the iterative approach grows almost linearly with the number of rows you update.
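If you want to reproduce the comparison yourself, a rough JDBC timing harness could look like the sketch below (table and column names from the question; the actual numbers will depend entirely on your schema, data and network):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.Collections;
    import java.util.List;

    // Time the row-by-row loop against a single IN-list update for the same ids.
    static void compare(Connection conn, List<String> ids, String status) throws SQLException {
        long t0 = System.nanoTime();
        try (PreparedStatement ps = conn.prepareStatement(
                "update tab1 set status = ? where id = ?")) {
            for (String id : ids) {            // one statement execution per row
                ps.setString(1, status);
                ps.setString(2, id);
                ps.executeUpdate();
            }
        }
        long loopMs = (System.nanoTime() - t0) / 1_000_000;

        String in = String.join(",", Collections.nCopies(ids.size(), "?"));
        t0 = System.nanoTime();
        try (PreparedStatement ps = conn.prepareStatement(
                "update tab1 set status = ? where id in (" + in + ")")) {
            ps.setString(1, status);
            for (int i = 0; i < ids.size(); i++) {
                ps.setString(i + 2, ids.get(i));
            }
            ps.executeUpdate();                // one statement for all rows
        }
        long bulkMs = (System.nanoTime() - t0) / 1_000_000;

        System.out.println("loop: " + loopMs + " ms, bulk: " + bulkMs + " ms");
    }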
A few tips that might come in handy while updating data in your database: