I have a requirement to write a batch job that fetches rows from a database table and, based on certain conditions, writes to other tables or updates the row with a certain value. We are using Spring and JDBC to fetch the result set, iterate through it, and process the records in a standalone Java program that is scheduled to run weekly. I know this is not the right way to do it, but we had to do it as a temporary solution. As the records grow into the millions, we will end up with out-of-memory exceptions, so I know this is not the best approach.
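For concreteness, the current job looks roughly like the sketch below; the table, columns, and class names are made up, but the shape is the same. The whole result set is materialized as one list before any processing starts, which is why memory becomes the limit.

```java
import java.math.BigDecimal;
import java.util.List;
import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;

public class WeeklyRecordJob {

    private final JdbcTemplate jdbcTemplate;

    public WeeklyRecordJob(DataSource dataSource) {
        this.jdbcTemplate = new JdbcTemplate(dataSource);
    }

    public void run() {
        // Every row is loaded into this one list before processing starts --
        // this is what eventually causes the OutOfMemoryError.
        List<RecordRow> rows = jdbcTemplate.query(
                "SELECT id, status, amount FROM records",   // hypothetical table and columns
                (rs, rowNum) -> new RecordRow(
                        rs.getLong("id"),
                        rs.getString("status"),
                        rs.getBigDecimal("amount")));

        for (RecordRow row : rows) {
            process(row);   // conditional inserts/updates happen per row
        }
    }

    private void process(RecordRow row) {
        // business rules omitted
    }

    // simple holder for one fetched row
    record RecordRow(long id, String status, BigDecimal amount) {}
}
```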
Can anyone recommend the best way to deal with this situation?
Use threads, fetch 1,000 records per thread, and process them in parallel?
(OR)
Use some other batch mechanism (I know there is Spring Batch, but I have never used it)?
(OR)
Any other ideas?
"a batch job that fetches rows from a database table and, based on certain conditions, writes to other tables or updates the row with a certain value"
This sounds like the sort of thing you should do inside the database. For example, to fetch a particular row and update it based on certain conditions, SQL has the UPDATE ... WHERE ... statement; to write to another table, you can use an INSERT ... SELECT ... statement.
These may get fairly complicated, but I suggest doing everything in your power to do this inside the database, since pulling the data out to filter it is incredibly slow and defeats the purpose of having a relational database.
Note: Make sure to experiment with this on a non-production system first, and implement any limits you need so you don't lock up production tables at bad times.
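A rough sketch of what that could look like from the existing Spring/JDBC code; the table names, columns, and conditions below are invented, so substitute your real schema and rules:

```java
import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;

public class InDatabaseBatch {

    private final JdbcTemplate jdbcTemplate;

    public InDatabaseBatch(DataSource dataSource) {
        this.jdbcTemplate = new JdbcTemplate(dataSource);
    }

    public void runWeekly() {
        // Update rows that match a condition, entirely inside the database.
        int updated = jdbcTemplate.update(
                "UPDATE records SET status = 'PROCESSED' " +
                "WHERE status = 'PENDING' AND amount < 100");       // hypothetical condition

        // Copy matching rows to another table in a single set-based statement.
        int copied = jdbcTemplate.update(
                "INSERT INTO flagged_records (id, amount) " +
                "SELECT id, amount FROM records " +
                "WHERE status = 'PENDING' AND amount >= 100");      // hypothetical condition

        System.out.printf("updated=%d, copied=%d%n", updated, copied);
    }
}
```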
You already know that you can't bring a million rows into memory and operate on them.
You'll have to chunk them in some way.
Why bring them to the middle tier? I'd consider writing stored procedures and operating on the data on the database server. Bringing it to the middle tier doesn't seem like it's buying you anything. Have your batch job kick off the stored proc and do the calculations in place on the database server.
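If you go this route, the middle-tier side can shrink to a single call. A minimal sketch using Spring's SimpleJdbcCall, assuming a hypothetical stored procedure named process_weekly_records that holds all the conditional insert/update logic:

```java
import javax.sql.DataSource;
import org.springframework.jdbc.core.simple.SimpleJdbcCall;

public class WeeklyBatchTrigger {

    private final SimpleJdbcCall processRecordsCall;

    public WeeklyBatchTrigger(DataSource dataSource) {
        // "process_weekly_records" is a hypothetical procedure that contains
        // all of the conditional update/insert logic on the database server.
        this.processRecordsCall = new SimpleJdbcCall(dataSource)
                .withProcedureName("process_weekly_records");
    }

    public void run() {
        // The batch job just kicks off the procedure; no rows travel
        // to the middle tier.
        processRecordsCall.execute();
    }
}
```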
It really depends on what and how you process the records.
But generally speaking, you should not load them all into memory at once; process them in reasonably sized chunks instead.
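If the rows do have to come to the Java side, one way to keep memory flat is to stream the result set with a row callback and a JDBC fetch size, flushing work in fixed-size batches. A sketch is below; the query, chunk size, and fetch size are placeholders, and some drivers need extra hints before they will actually stream results:

```java
import java.util.ArrayList;
import java.util.List;
import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.RowCallbackHandler;

public class ChunkedProcessor {

    private static final int CHUNK_SIZE = 1000;   // arbitrary chunk size

    private final JdbcTemplate jdbcTemplate;

    public ChunkedProcessor(DataSource dataSource) {
        this.jdbcTemplate = new JdbcTemplate(dataSource);
        // Ask the driver to fetch rows in batches rather than buffering everything.
        this.jdbcTemplate.setFetchSize(CHUNK_SIZE);
    }

    public void processAll() {
        List<Long> chunk = new ArrayList<>(CHUNK_SIZE);

        // The handler is called once per row, so only the current chunk is in memory.
        RowCallbackHandler handler = rs -> {
            chunk.add(rs.getLong("id"));
            if (chunk.size() == CHUNK_SIZE) {
                processChunk(chunk);
                chunk.clear();
            }
        };

        jdbcTemplate.query("SELECT id FROM records WHERE status = 'PENDING'", handler);  // hypothetical query

        if (!chunk.isEmpty()) {
            processChunk(chunk);   // flush the final partial chunk
        }
    }

    private void processChunk(List<Long> ids) {
        // apply the conditional inserts/updates for this batch of ids
    }
}
```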
I agree with Brendan Long in general. However, I would probably still select only a subset of your "millions" of rows at a time in the stored proc; otherwise, you'll blow out the transaction log of your DB. Just make sure you still commit your inserts or updates at a regular interval.
If you don't want to do this in the stored proc, just have Spring Batch load the keys for the records you wish to manipulate at some fixed chunk size (use a cursor/paging reader), but get the stored proc to do the actual work. This way, you minimize the data passed to your middle tier while still getting the benefits of Spring Batch and your DB's performance at manipulating data.
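A sketch of that shape, assuming Spring Batch 4.x, a key column named id, and a hypothetical stored procedure process_record that does the per-key work; the query and chunk size are placeholders:

```java
import javax.sql.DataSource;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.simple.SimpleJdbcCall;

@Configuration
public class KeyChunkBatchConfig {

    // Stream only the primary keys to the middle tier.
    @Bean
    public JdbcCursorItemReader<Long> keyReader(DataSource dataSource) {
        return new JdbcCursorItemReaderBuilder<Long>()
                .name("keyReader")
                .dataSource(dataSource)
                .sql("SELECT id FROM records WHERE status = 'PENDING'")   // hypothetical query
                .rowMapper((rs, rowNum) -> rs.getLong("id"))
                .fetchSize(1000)
                .build();
    }

    // For each chunk of keys, hand the real work to the stored procedure.
    @Bean
    public ItemWriter<Long> storedProcWriter(DataSource dataSource) {
        SimpleJdbcCall processCall = new SimpleJdbcCall(dataSource)
                .withProcedureName("process_record");   // hypothetical proc taking one key
        return ids -> {
            for (Long id : ids) {
                processCall.execute(id);
            }
        };
    }

    @Bean
    public Step processStep(StepBuilderFactory steps,
                            JdbcCursorItemReader<Long> keyReader,
                            ItemWriter<Long> storedProcWriter) {
        return steps.get("processStep")
                .<Long, Long>chunk(1000)    // process 1,000 keys per transaction
                .reader(keyReader)
                .writer(storedProcWriter)
                .build();
    }
}
```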