This is my first post here, so apologies if this isn't structured well.
We have been tasked with designing a tool that will:
- Read a CSV file of account IDs
- Download each account's data file from the web (REST API)
- Pass the file to a converter that produces a report (financial predictions etc.) [~20ms]
- If the prediction threshold is within limits, run a parser to analyse the data [400ms]
- Generate a report from that analysis [80ms]
- Upload all generated files to the web (REST API)
Each of those individual steps is relatively easy on its own. What I'm interested in is how best to architect the whole thing so it runs quickly and efficiently on our hardware.
We have to process roughly 2 million accounts. The square brackets give an idea of how long each step takes on average. I'd like to use the maximum resources available on the machine (24-core Xeon processors). It's not a memory-intensive process.
Would using the TPL and creating each of these steps as a task be a good idea? The steps for a single account have to happen sequentially, but many accounts can be processed at once. Unfortunately the parser is not multi-threading aware and we don't have its source (it's essentially a black box to us).
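To make the parser concern concrete, this is the kind of gate I was picturing around that one step (just a sketch; `RunParser` and the limit of 4 concurrent calls are placeholders I made up, not anything the vendor provides, and we might have to drop the limit to 1):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Sketch only: RunParser stands in for our black-box parser call, and the
// limit of 4 concurrent calls is a guess we'd have to tune.
static class ParserGate
{
    private static readonly SemaphoreSlim Gate = new SemaphoreSlim(4, 4);

    public static async Task<string> ParseAsync(string convertedFile)
    {
        await Gate.WaitAsync();                 // wait for a free parser "slot"
        try
        {
            return RunParser(convertedFile);    // the existing ~400ms black-box call
        }
        finally
        {
            Gate.Release();
        }
    }

    private static string RunParser(string convertedFile)
    {
        throw new NotImplementedException();    // stands in for the real parser
    }
}
```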
My thoughts were something like this, assuming we use the TPL (there's a code sketch after the list):
- Load the account data (essentially a CSV import or SQL SELECT)
- For each account ID:
  - Download the data file for that account
  - ContinueWith: using the data file, send it to the converter
  - ContinueWith: check the threshold, send it to the parser
  - ContinueWith: generate the report
  - ContinueWith: upload the outputs
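Roughly what I mean in code, as a sketch only (all the method names and types below are placeholders for our existing pieces, and I've left error handling out for now):

```csharp
using System;
using System.Threading.Tasks;

// Sketch of the per-account chain; DownloadDataFile, RunConverter, RunParser,
// GenerateReport and UploadOutputs are placeholders for our existing code.
class AccountPipeline
{
    public Task ProcessAccount(int accountId)
    {
        return Task.Run(() => DownloadDataFile(accountId))                          // REST download
            .ContinueWith(t => RunConverter(t.Result))                              // ~20ms converter
            .ContinueWith(t => t.Result.WithinThreshold
                                   ? RunParser(t.Result.OutputFile)                 // 400ms parser
                                   : null)                                          // skip if out of range
            .ContinueWith(t => t.Result == null ? null : GenerateReport(t.Result))  // 80ms report
            .ContinueWith(t => { if (t.Result != null) UploadOutputs(t.Result); }); // REST upload
    }

    // Placeholders for the real implementations
    string DownloadDataFile(int accountId) { throw new NotImplementedException(); }
    ConversionResult RunConverter(string dataFile) { throw new NotImplementedException(); }
    string RunParser(string convertedFile) { throw new NotImplementedException(); }
    string GenerateReport(string parserOutput) { throw new NotImplementedException(); }
    void UploadOutputs(string reportFile) { throw new NotImplementedException(); }
}

class ConversionResult
{
    public bool WithinThreshold;
    public string OutputFile;
}
```

At the top level I imagine kicking off `ProcessAccount` for a batch of IDs and waiting on them with `Task.WaitAll`, probably with some cap on how many accounts are in flight at once, but I'm not sure that's the right way to feed 2 million accounts through.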
Does that sound feasible, or am I not understanding it correctly? Would it be better to break the steps down a different way?
I'm also a bit unsure how to handle the parser throwing exceptions (it's very picky), or failures when uploading.
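For example, is something along these lines the right idea, i.e. catch per-account failures, record the ID, and carry on rather than let one bad account kill the whole run? (Again just a sketch; the names are placeholders, and `processAccount` would be the chain from the earlier snippet.)

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Sketch only: failed account IDs get collected for a retry pass (or manual
// review) at the end of the run instead of stopping everything.
static class FailureHandling
{
    public static readonly ConcurrentBag<int> FailedAccountIds = new ConcurrentBag<int>();

    public static async Task ProcessAccountSafe(Func<int, Task> processAccount, int accountId)
    {
        try
        {
            await processAccount(accountId);     // download -> convert -> parse -> report -> upload
        }
        catch (Exception ex)                     // picky parser, failed upload, etc.
        {
            Console.Error.WriteLine("Account {0} failed: {1}", accountId, ex.Message);
            FailedAccountIds.Add(accountId);     // retry these in a second pass or flag for review
        }
    }
}
```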
All of this is going to be in a scheduled job that runs after hours as a console application.