Managing the TPL Queue

I've got a service that runs scans of various servers. The networks in question can be huge (hundreds of thousands of network nodes).

The current version of the software is using a queueing/threading architecture designed by us which works but isn't as efficient as it could be (not least of which because jobs can spawn children which isn't handled well)

V2 is coming up and I'm considering using the TPL. It seems like it should be ideally suited.

I've seen this question, the answer to which implies there's no limit to the tasks TPL can handle. In my simple tests (Spin up 100,000 tasks and give them to TPL), TPL barfed fairly early on with an Out-Of-Memory exception (fair enough - especially on my dev box).

The Scans take a variable length of time but 5 mins/task is a good average.

As you can imagine, scans for huge networks can take a considerable length of time, even on beefy servers.

I've already got a framework in place which allows the scan jobs (stored in a Db) to be split between multiple scan servers, but the question is how exactly I should pass work to the TPL on a specific server.

Can I monitor the size of TPL's queue and (say) top it up if it falls below a couple of hundred entries? Is there a downside to doing this?

I also need to handle the situation where a scan needs to be paused. This is seems easier to do by not giving the work to TPL than by cancelling/resetting tasks which may already be partially processed.

All of the initial tasks can be run in any order. Children must be run after the parent has started executing but since the parent spawns them, this shouldn't ever be a problem. Children can be run in any order. Because of this, I'm currently envisioning that child tasks be written back to the Db not spawned directly into TPL. This would allow other servers to "work steal" if required.

Has anyone had any experience with using the TPL in this way? Are there any considerations I need to be aware of?

TPL is about starting small units of work and running them in parallel. It is not about monitoring, pausing, or throttling this work.

You should see TPL as a low-level tool to start "work" and to synchronize threads.

Key point: TPL tasks != logical tasks. Logical tasks are in your case scan-tasks ("scan an ip-range from x to y"). Such a task should not correspond to a physical task "System.Threading.Task" because the two are different concepts.

You need to schedule, orchestrate, monitor and pause the logical tasks yourself because TPL does not understand them and cannot be made to.

Now the more practical concerns:

TPL can certainly start 100k tasks without OOM. The OOM happened because your tasks' code exhausted memory.
Scanning networks sounds like a great case for asynchronous code because while you are scanning you are likely to wait on results while having a great degree of parallelism. You probably don't want to have 500 threads in your process all waiting for a network packet to arrive. Asynchronous tasks fit well with the TPL because every task you run becomes purely CPU-bound and small. That is the sweet spot for TPL.