I want to use concurrency in Java to make requests to an online API, download and parse the response documents, and load the resulting data into a database.
Is it standard to have one pool of threads in which each thread requests, parses, and loads? In other words, only one class implements Runnable. Or is it more efficient to have, say, three different pools of threads, with the first pool making the requests and pushing the responses to a queue, the second pool polling from the first queue, parsing, and pushing the parsed data to a second queue, and the third pool polling the data from the second queue and loading it into the database? In that case, I'd write three different classes that implement Runnable.
You have to consider which parts of the processing will benefit from parallelism. The online API communication will most likely be a candidate, since there will be sockets and network waits involved. Likewise with the DB interaction. Multithreaded parsing will probably only improve performance if there are multiple available CPU cores.
Splitting the entire process into 3 separate classes will definitely increase the cohesion, meaning each class will have fewer responsibilities, which is a good thing. On the other hand, making each of these classes a Runnable and having several queues will increase the complexity of the application, possibly unnecessarily.
I would suggest making 3 separate classes, but don't make them Runnable. Then make a single Runnable that contains and orchestrates the 3 classes, and run it in one single thread pool. If profiling shows that this isn't fast enough, try splitting the Runnable across 2 thread pools: one for download and parse, and one for DB access.
The point being, start simple and add complexity as needed.
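To make the single-pool suggestion concrete, here is a minimal sketch. All class and method names (Downloader, Parser, Loader, FetchParseLoadTask, SinglePoolSketch) are made up for illustration, and the download and load steps are stubbed with string operations and an in-memory list rather than a real HTTP client or database.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-ins: three plain classes, none of them Runnable.
class Downloader {
    String download(String url) {
        // A real implementation would perform the HTTP request here.
        return "<doc>" + url + "</doc>";
    }
}

class Parser {
    String parse(String document) {
        // Strip the fake wrapper that the stub Downloader adds.
        return document.replace("<doc>", "").replace("</doc>", "");
    }
}

class Loader {
    // Thread-safe in-memory list standing in for the database.
    private final List<String> rows = new CopyOnWriteArrayList<>();

    void load(String record) {
        rows.add(record);
    }

    List<String> rows() {
        return rows;
    }
}

// The only Runnable: it orchestrates the three classes for one URL.
class FetchParseLoadTask implements Runnable {
    private final String url;
    private final Downloader downloader;
    private final Parser parser;
    private final Loader loader;

    FetchParseLoadTask(String url, Downloader downloader, Parser parser, Loader loader) {
        this.url = url;
        this.downloader = downloader;
        this.parser = parser;
        this.loader = loader;
    }

    @Override
    public void run() {
        loader.load(parser.parse(downloader.download(url)));
    }
}

class SinglePoolSketch {
    static List<String> runAll(List<String> urls, int threads) {
        Downloader downloader = new Downloader();
        Parser parser = new Parser();
        Loader loader = new Loader();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String url : urls) {
            pool.execute(new FetchParseLoadTask(url, downloader, parser, loader));
        }
        pool.shutdown(); // accept no new tasks, let queued ones finish
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return loader.rows();
    }

    public static void main(String[] args) {
        // Row order is not guaranteed, since tasks run concurrently.
        System.out.println(runAll(Arrays.asList("a", "b", "c"), 4));
    }
}
```

Because the three classes stay independent of the threading code, moving to two pools later should only mean writing a second Runnable, not touching Downloader, Parser, or Loader.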
One important thing to consider: does the order of the processing matter? That is, is it important that the parsed result from the first download request gets loaded into the DB before the results from the second request?
If so, you really need queues (or similar), one per task. In effect, three single-threaded thread "pools" (or use an ExecutorService).
If not, @Brady makes good points. Unlike him, I'd probably make all three classes Runnable, but that doesn't mean you have to use three queues. You could still try a single pool and profile to see how it is working.
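For the case where order does matter, here is a rough sketch of three single-threaded stages connected by two queues. The names (OrderedPipelineSketch, runInOrder, the POISON sentinel) are invented for illustration, and the download, parse, and load steps are stubs. Since each stage has exactly one thread and the queues are FIFO, results should reach the "database" in the same order as the input URLs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class OrderedPipelineSketch {
    // End-of-stream marker, compared by identity so no real payload can match it.
    private static final String POISON = new String("POISON");

    static List<String> runInOrder(List<String> urls) {
        BlockingQueue<String> downloaded = new LinkedBlockingQueue<>();
        BlockingQueue<String> parsed = new LinkedBlockingQueue<>();
        // In-memory stand-in for the database.
        List<String> database = Collections.synchronizedList(new ArrayList<>());

        ExecutorService downloadStage = Executors.newSingleThreadExecutor();
        ExecutorService parseStage = Executors.newSingleThreadExecutor();
        ExecutorService loadStage = Executors.newSingleThreadExecutor();

        // Stage 1: "download" each URL in order (stubbed), then signal the end.
        downloadStage.execute(() -> {
            for (String url : urls) {
                downloaded.add("<doc>" + url + "</doc>");
            }
            downloaded.add(POISON);
        });

        // Stage 2: parse documents in arrival order.
        parseStage.execute(() -> {
            try {
                String doc;
                while ((doc = downloaded.take()) != POISON) {
                    parsed.add(doc.replace("<doc>", "").replace("</doc>", ""));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                parsed.add(POISON); // always let the load stage terminate
            }
        });

        // Stage 3: load parsed records in arrival order.
        loadStage.execute(() -> {
            try {
                String record;
                while ((record = parsed.take()) != POISON) {
                    database.add(record);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        List<ExecutorService> stages = Arrays.asList(downloadStage, parseStage, loadStage);
        for (ExecutorService stage : stages) {
            stage.shutdown();
        }
        try {
            for (ExecutorService stage : stages) {
                if (!stage.awaitTermination(30, TimeUnit.SECONDS)) {
                    stage.shutdownNow();
                }
            }
        } catch (InterruptedException e) {
            for (ExecutorService stage : stages) {
                stage.shutdownNow();
            }
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(database);
    }

    public static void main(String[] args) {
        System.out.println(runInOrder(Arrays.asList("first", "second", "third")));
    }
}
```

The trade-off is that each stage handles one item at a time, so throughput is bounded by the slowest stage. Widening any stage to more than one thread gives up the ordering guarantee unless you add sequence numbers and reorder before loading.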
I don't believe there is a standard approach; it depends on your requirements.
If you are writing something quick and dirty then you're best having one pool.
If you're looking for something more resilient, where recovery is required, then you may opt for several pools. For example, if you persist the responses and your app dies, then when it restarts you can just re-queue the persisted responses without having to fetch them again.
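One way the persist-and-re-queue idea could look, as a sketch only: raw responses are written to a spool directory before parsing and deleted once loaded, so whatever is still on disk after a crash can be re-queued. The class name ResponseSpoolSketch, the ".pending" suffix, and the one-file-per-response layout are all assumptions for illustration, not a recommendation over a proper durable queue.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ResponseSpoolSketch {
    private final Path dir;

    ResponseSpoolSketch(Path dir) {
        this.dir = dir;
    }

    // Persist a raw response before handing it to the parse stage.
    Path save(String id, String body) {
        try {
            Files.createDirectories(dir);
            return Files.write(dir.resolve(id + ".pending"),
                    body.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Call once the parsed data has been committed to the database.
    void markDone(Path file) {
        try {
            Files.deleteIfExists(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // On restart: anything still here was fetched but never loaded.
    List<Path> pending() {
        if (!Files.isDirectory(dir)) {
            return new ArrayList<>();
        }
        try (Stream<Path> files = Files.list(dir)) {
            return files
                    .filter(p -> p.getFileName().toString().endsWith(".pending"))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        ResponseSpoolSketch spool = new ResponseSpoolSketch(Files.createTempDirectory("spool"));
        Path first = spool.save("req-1", "<doc>one</doc>");
        spool.save("req-2", "<doc>two</doc>");
        spool.markDone(first); // req-1 made it into the database
        System.out.println("To re-queue after restart: " + spool.pending());
    }
}
```

On startup, the parse pool would be fed from pending() before any new requests are made. This only works cleanly if loading is idempotent, since a crash between the database commit and markDone would cause that response to be processed twice.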