I'm currently working on a Java project which i need to prepare a big(to me) mysql database. I have to do web scraping using Jsoup and store the results into my database as well. As i estimated, i will have roughly 1,500,000 to 2,000,000 records to be inserted. In my first trial, i just use a loop to insert these records and it takes me one week to insert about 1/3 of my required records, which is too slow i think. Is it possible to make this process multi-threaded, so that i can have my records split into 3 sets, say 500,000 records per set, and then insert them into one database( one table specifically)?
问题:
回答1:
Multi-threading isn't going to help you here. You'll just move the contention bottleneck from your app server to the database.
Instead, try using batch-inserts instead, they generally make this sort of thing orders of magnitude faster. See "3.4 Making Batch Updates" in the JDBC tutorial.
Edit: As @Jon commented, you need to decouple the fetching of the web pages from their insertion into the database, otherwise the whole process will go at the speed of the slowest operation. You could have multiple threads fetching web pages, which add the data to a queue data structure, and then have a single thread draining the queue into the database using a batch insert.
回答2:
Just make sure two (or more) threads doesn't use the same connection at the same time, using a connection pool resolves that. c3po and apache dbcp comes to mind ...
回答3:
You can insert these records in different threads provided they do use different primary key values.
You should also look at Spring Batch which I believe will be useful in your case.
回答4:
You can chunk your record set into batches and do this, but perhaps you should think about other factors as well.
Are you doing a network round trip for each INSERT? If yes, latency could be the real enemy. Try batching those requests to cut down on network traffic.
Do you have transactions turned on? If yes, the size of the rollback log could be the problem.
I'd recommend profiling the app server and the database server to see where the time is being spent. You can waste a lot of time guessing about the root cause.
回答5:
I think multi thread approch usefull for your issue but you have to using a connection pool
such as C3P0
or Tomca 7 Connetcion pool
for more performance.
Another solution is using a batch-operation provider such as Spring-batch
, exist anothers utility for batch operation also.
Another solution is using a PL/SQl Procedure
with a input structure
parameter.