Optimizing MySQL inserts to handle a data stream

Posted 2019-07-27 10:27

Question:

I am consuming a high-rate data stream and storing the data in a MySQL database. For each newly arriving item, I do the following:

  • (1) Parse the incoming item.
  • (2) Execute several "INSERT ... ON DUPLICATE KEY UPDATE" statements.

I have used INSERT ... ON DUPLICATE KEY UPDATE to eliminate one additional round-trip to the database.

While trying to improve the overall performance, I have considered doing bulk updates in the following way:

  • (1) Parse the incoming item.
  • (2) Generate an SQL statement with "INSERT ... ON DUPLICATE KEY UPDATE" and append it to a file.

Then, periodically, flush the SQL statements in the file to the database.

Two questions:

  • (1) Will this have a positive impact on the database load?
  • (2) How should I flush the statements to the database so that indices are only rebuilt after the complete flush? (Using transactions?)

UPDATE: I am using Perl DBI + MySQL MyISAM.

Thanks in advance for any comments.

Answer 1:

You don't say what kind of database access environment (Perl DBI? JDBC? ODBC?) you're running in, or what kind of table storage engine (MyISAM? InnoDB?) you're using.

First of all, you're right to pick INSERT ... ON DUPLICATE KEY UPDATE. Good move, unless you can guarantee unique keys, in which case a plain INSERT is enough.

Secondly, if your database access environment allows it, you should use prepared statements. You definitely won't get good performance if you write a bunch of statements into a file, and then make a database client read the file once again. Do the INSERT operations directly from the software package that consumes the incoming data stream.
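As a rough illustration of doing the inserts directly from the consuming program with one reusable parameterized statement, here is a minimal Python sketch. The table `items`, its columns `item_key` and `hits`, and the function `store_items` are hypothetical names, not from the question. The cursor is assumed to be any DB-API cursor that uses the `%s` placeholder style, as the common MySQL drivers do. Whether the statement is actually prepared on the server depends on the driver, so treat this as a sketch of the idea rather than a guarantee.

```python
# Hypothetical table and columns, chosen only for illustration.
UPSERT_SQL = (
    "INSERT INTO items (item_key, hits) VALUES (%s, %s) "
    "ON DUPLICATE KEY UPDATE hits = hits + VALUES(hits)"
)


def store_items(cursor, items):
    """Run the same parameterized upsert once per parsed item.

    `cursor` is assumed to be a DB-API cursor using %s placeholders.
    The SQL text never changes; only the bound values do, so nothing
    is written to an intermediate file and re-read.
    """
    count = 0
    for item_key, hits in items:
        cursor.execute(UPSERT_SQL, (item_key, hits))
        count += 1
    return count
```

The values are passed separately from the SQL text, so the consumer never builds statement strings by hand.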

Thirdly, pick the right kind of table storage engine. MyISAM inserts are generally faster than InnoDB inserts, so if you're logging data and retrieving it later, MyISAM will be a win. But InnoDB has better transactional integrity. If you're handling a really large volume of data and you don't need to read it very often, consider the ARCHIVE storage engine. Keep in mind that ARCHIVE does not support UPDATE, so it only fits if existing rows never need to be changed.

Finally, consider doing a START TRANSACTION at the beginning of a batch of INSERT ... commands, then doing a COMMIT and another START TRANSACTION after a fixed number of rows, like 100 or so. If you're using InnoDB, this will speed things up a lot. If you're using MyISAM or ARCHIVE, it won't matter, because those engines are not transactional.
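The commit-every-N-rows pattern can be sketched roughly as follows in Python. The function name `insert_in_batches` and its parameters are hypothetical. The connection is assumed to be a DB-API connection with autocommit off, so a transaction is opened implicitly on the first statement and `commit()` plays the role of COMMIT followed by a new START TRANSACTION.

```python
def insert_in_batches(conn, sql, rows, batch_size=100):
    """Execute `sql` once per row, committing every `batch_size` rows.

    Assumes `conn` is a DB-API connection with autocommit disabled.
    Returns the number of commits issued.
    """
    cursor = conn.cursor()
    pending = 0   # rows executed since the last commit
    commits = 0
    for row in rows:
        cursor.execute(sql, row)
        pending += 1
        if pending >= batch_size:
            conn.commit()
            commits += 1
            pending = 0
    if pending:
        # Commit the final, partially filled batch.
        conn.commit()
        commits += 1
    return commits
```

The batch size of 100 follows the figure suggested above; it is a starting point to tune, not a measured optimum.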

Your big wins will come from the prepared statement stuff and the best choice of storage engine.



Answer 2:

If your data does not need to go into the database immediately, you can cache your insert data somewhere, then issue one larger insert statement, e.g.

insert into table_name (x, y, z) values (x1, y1, z1), (x2, y2, z2), ... (xN, yN, zN) on duplicate key update ...;

To be clear, I would maintain a list of pending inserts, in this case a list of (x, y, z) triplets. Then, once your list exceeds some threshold (N), you generate the insert statement and issue it.

I have no accurate timing figures for you, but this increased performance roughly 10 times when compared to inserting each row individually.

I also haven't played with the value of N, but I found 1000 to work nicely. I expect the optimal value is affected by hardware and database settings.
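A minimal Python sketch of the pending-list idea described above might look like this. The class `PendingInserts` and the `execute` callback are hypothetical names. The update clause `y = VALUES(y), z = VALUES(z)` is only an assumed example, since the clause in the statement above is left open and depends on your schema.

```python
class PendingInserts:
    """Buffer (x, y, z) rows and flush them as one multi-row INSERT."""

    def __init__(self, execute, threshold=1000):
        # `execute` is any callable(sql, params), e.g. a DB-API
        # cursor.execute that uses %s placeholders (an assumption).
        self.execute = execute
        self.threshold = threshold
        self.rows = []

    def add(self, x, y, z):
        self.rows.append((x, y, z))
        if len(self.rows) >= self.threshold:
            self.flush()

    def flush(self):
        if not self.rows:
            return
        # One "(%s, %s, %s)" group per buffered row.
        placeholders = ", ".join(["(%s, %s, %s)"] * len(self.rows))
        sql = (
            "INSERT INTO table_name (x, y, z) VALUES " + placeholders +
            # Assumed update clause; adapt it to your own columns.
            " ON DUPLICATE KEY UPDATE y = VALUES(y), z = VALUES(z)"
        )
        # Flatten the rows into one parameter list matching the placeholders.
        params = [value for row in self.rows for value in row]
        self.execute(sql, params)
        self.rows = []
```

Call `flush()` once more when the stream ends or on a timer, so that a partially filled list does not sit in memory indefinitely.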

Hope this helps (I am also using MyISAM).