Bulk insert in Java using prepared statements batc

Posted 2019-01-07 09:24

Question:

I am trying to fill a ResultSet in Java with about 50,000 rows of 10 columns and then insert them into another table using the executeBatch method of PreparedStatement.

To make the process faster I did some research and found that while reading data into resultSet the fetchSize plays an important role.

A very low fetchSize can result in too many round trips to the server, and a very high fetchSize can tie up network resources, so I experimented a little and settled on a size that suits my infrastructure.

I am reading this resultSet and creating insert statements to insert into another table of a different database.

Something like this (just a sample, not real code):

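// Illustrative only: the same literal values are bound on every iteration.
for (int i = 0; i < 50000; i++) {
    statement.setString(1, "a@a.com");
    statement.setLong(2, 1);
    statement.addBatch();   // queue this parameter set
}
statement.executeBatch();   // send everything queued so far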
  • Will the executeBatch method try to send all the data at once?
  • Is there a way to define the batch size?
  • Is there any better way to speed up the process of bulk insertion?

While updating in bulk (50,000 rows, 10 columns), is it better to use an updatable ResultSet or a PreparedStatement with batch execution?

Answer 1:

I'll address your questions in turn.

  • Will the executeBatch method try to send all the data at once?

This can vary with each JDBC driver, but the few I've studied will iterate over each batch entry and send the arguments together with the prepared statement handle each time to the database for execution. That is, in your example above, there would be 50,000 executions of the prepared statement with 50,000 pairs of arguments, but these 50,000 steps can be done in a lower-level "inner loop," which is where the time savings come in. As a rather stretched analogy, it's like dropping out of "user mode" down into "kernel mode" and running the entire execution loop there. You save the cost of diving in and out of that lower-level mode for each batch entry.

  • Is there a way to define the batch size?

You've defined it implicitly here by pushing 50,000 argument sets in before executing the batch via Statement#executeBatch(). A batch size of one is just as valid.

  • Is there any better way to speed up the process of bulk insertion?

Consider wrapping the batch insertion in a single explicit transaction and committing it afterward. Don't let either the database or the JDBC driver impose a transaction boundary around each insertion step in the batch. You can control the JDBC layer with the Connection#setAutoCommit(boolean) method. Take the connection out of auto-commit mode first, then populate your batch, execute it, and commit the transaction via Connection#commit(). (JDBC has no separate "begin" call; with auto-commit off, the transaction starts implicitly with the first statement.)

This advice assumes that your insertions won't be contending with concurrent writers, and assumes that these transaction boundaries will give you sufficiently consistent values read from your source tables for use in the insertions. If that's not the case, favor correctness over speed.
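
A minimal sketch of that sequence, under stated assumptions: the class name, the method name, and the table and column names (target_table, email, flag) are hypothetical and not from the question. Only standard java.sql calls are used. Whether it actually speeds things up depends on your driver and database.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class TransactionalBatchInsert {

    /**
     * Inserts all rows as one batch inside one transaction.
     * Table and column names are hypothetical placeholders.
     */
    public static int[] insertAll(Connection conn, List<String> emails) throws SQLException {
        String sql = "INSERT INTO target_table (email, flag) VALUES (?, ?)";
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false); // no implicit commit after each statement
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (String email : emails) {
                ps.setString(1, email);
                ps.setLong(2, 1L);
                ps.addBatch(); // queue this parameter set
            }
            int[] counts = ps.executeBatch();
            conn.commit(); // a single commit covers the whole batch
            return counts;
        } catch (SQLException e) {
            conn.rollback(); // discard whatever part of the batch was applied
            throw e;
        } finally {
            conn.setAutoCommit(previousAutoCommit); // restore the caller's setting
        }
    }
}
```

The rollback in the catch block is a design choice for the sketch: if any entry fails, nothing from the batch is kept.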

  • Is it better to use an updatable ResultSet or PreparedStatement with batch execution?

Nothing beats testing with your JDBC driver of choice, but I expect the latter, PreparedStatement with Statement#executeBatch(), to win out here. The statement handle may have an associated list or array of "batch arguments," with each entry being the argument set provided in between calls to Statement#executeBatch() and Statement#addBatch() (or Statement#clearBatch()). The list will grow with each call to addBatch(), and will not be flushed until you call executeBatch(). Hence, the Statement instance is really acting as an argument buffer; you're trading memory for convenience (using the Statement instance in lieu of your own external argument set buffer).

Again, you should consider these answers general and speculative so long as we're not discussing a specific JDBC driver. Each driver varies in sophistication, and each will vary in which optimizations it pursues.



Answer 2:

The batch will be done "all at once" - that's what you've asked it to do.

50,000 seems a bit large to be attempting in one call. I would break it up into smaller chunks of 1,000, like this:

final int BATCH_SIZE = 1000;
for (int i = 0; i < DATA_SIZE; i++) {
  statement.setString(1, "a@a.com");
  statement.setLong(2, 1);
  statement.addBatch();
  if (i % BATCH_SIZE == BATCH_SIZE - 1)
    statement.executeBatch();
}
if (DATA_SIZE % BATCH_SIZE != 0)
  statement.executeBatch();

50,000 rows shouldn't take more than a few seconds.
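
The same chunking idea as a self-contained helper, as a sketch only. The class and method names are made up for illustration, and the two bound values mirror the sample in the question. It counts pending rows instead of using the modulo test, so the final partial chunk is flushed without a separate size check.

```java
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class ChunkedBatchInsert {

    /**
     * Queues one parameter set per row and calls executeBatch() every
     * batchSize rows. Returns how many times executeBatch() was called.
     */
    public static int insertInChunks(PreparedStatement ps, List<String> emails, int batchSize)
            throws SQLException {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int flushes = 0;
        int pending = 0; // rows queued since the last executeBatch()
        for (String email : emails) {
            ps.setString(1, email);
            ps.setLong(2, 1L);
            ps.addBatch();
            if (++pending == batchSize) {
                ps.executeBatch(); // send a full chunk
                pending = 0;
                flushes++;
            }
        }
        if (pending > 0) {
            ps.executeBatch(); // send the final, partial chunk
            flushes++;
        }
        return flushes;
    }
}
```

With 2,500 rows and a batch size of 1,000, this calls executeBatch() three times: two full chunks and one of 500.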



Answer 3:

If the data simply comes from one or more tables in the DB and is inserted into this table with no intervention (no alterations to the result set), then call statement.executeUpdate(SQL) to perform an INSERT-SELECT statement. This is quicker since there is no overhead: no data leaves the DB, and the entire operation runs in the DB rather than in the application.
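
Roughly, that could look like the sketch below. The class name and the table and column names (source_table, target_table, email, flag) are hypothetical. It also assumes both tables are reachable through the same connection, which may not hold if the target really lives in a different database, as the question suggests.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class InsertSelectCopy {

    // Hypothetical table and column names; adjust to your schema.
    static final String COPY_SQL =
            "INSERT INTO target_table (email, flag) "
          + "SELECT email, flag FROM source_table";

    /** Copies the rows entirely inside the database and returns the affected row count. */
    public static int copyRows(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            return st.executeUpdate(COPY_SQL);
        }
    }
}
```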



Answer 4:

Bulk unlogged update will not give you the improved performance you want the way you are going about it. See this