I have a difficult problem.
I am iterating through a set of URLs parameterized by date and fetching them. For example, here is one:
somewebservice.com?start=01-01-2012&end=01-10-2012
Sometimes, the content returned from the URL is truncated (random results are missing, with a 'truncated' error message attached) because I've defined too large a range, so I have to split the query into two URLs:
somewebservice.com?start=01-01-2012&end=01-05-2012
somewebservice.com?start=01-06-2012&end=01-10-2012
I do this recursively until the results are no longer truncated, and then I write each result to a blob, which allows concurrent writes.
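To make the splitting concrete, here is a minimal sketch of the recursive halving logic in plain Java (outside of task queues). The real fetch and truncation check are stubbed out: `MAX_DAYS` stands in for whatever limit causes the service to truncate, and the range strings stand in for the blob writes.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

// Sketch of the recursive split: try a date range, and if the (simulated)
// fetch would be truncated, halve the range and recurse on each half.
public class RangeSplitter {
    static final int MAX_DAYS = 3; // assumed service limit, for illustration only

    // Collects the sub-ranges that were "fetched" without truncation.
    static List<String> fetchRange(LocalDate start, LocalDate end, List<String> done) {
        long days = ChronoUnit.DAYS.between(start, end);
        if (days <= MAX_DAYS) {            // fetch succeeded, not truncated
            done.add(start + ".." + end);  // in the real app: write to the blob
            return done;
        }
        LocalDate mid = start.plusDays(days / 2);
        fetchRange(start, mid, done);            // first half
        fetchRange(mid.plusDays(1), end, done);  // second half
        return done;
    }

    public static void main(String[] args) {
        List<String> done = fetchRange(LocalDate.of(2012, 1, 1),
                                       LocalDate.of(2012, 1, 10),
                                       new ArrayList<>());
        System.out.println(done);
        // → [2012-01-01..2012-01-03, 2012-01-04..2012-01-05,
        //    2012-01-06..2012-01-08, 2012-01-09..2012-01-10]
    }
}
```

The completion problem in the question is exactly that this recursion fans out across separate task queue tasks, so no single place sees the whole tree finish.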
Each of these URL fetch calls/blob writes is handled in a separate task queue task.
The problem is, I can't for the life of me devise a scheme to know when all the tasks have completed. I've tried using sharded counters, but the recursion makes it difficult. Someone suggested I use the Pipeline API, so I watched the Slatkin talk three times. It doesn't appear to work with recursion (though I admit I still don't fully understand the library).
Is there any way to know when a set of task queue tasks (and the children they spawn recursively) has completed, so I can finalize my blob and do whatever with it?
Thanks, John
All right, so here's what I did. I had to modify Mitch's solution just a bit, but he definitely got me in the right direction with the advice to return the future value instead of an immediate one.
I had to create an intermediate DummyJob that takes the output of the recursion.
Then, I submit the output of the DummyJob to the Blob Finalizer in a waitFor.
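In case it helps others, here is a rough sketch of that wiring, assuming the Java Pipeline API (`Job1`/`Job2`, `futureCall`, `immediate`, `waitFor`); `FetchJob` stands in for the recursive job, and `BlobFinalizerJob` and the blob name are hypothetical:

```java
import com.google.appengine.tools.pipeline.FutureValue;
import com.google.appengine.tools.pipeline.Job1;
import com.google.appengine.tools.pipeline.Job2;
import com.google.appengine.tools.pipeline.Value;

// Hypothetical root job: kicks off the recursive fetch, funnels its future
// result through DummyJob, and only then runs the blob finalizer.
public class RootJob extends Job2<Void, String, String> {
  @Override
  public Value<Void> run(String start, String end) {
    // FetchJob is the recursive job; its FutureValue resolves only once
    // the entire recursion tree has completed.
    FutureValue<Integer> fetched =
        futureCall(new FetchJob(), immediate(start), immediate(end));
    // Intermediate DummyJob that just passes the recursion's output through.
    FutureValue<Integer> done = futureCall(new DummyJob(), fetched);
    // The finalizer waits on DummyJob's output before closing the blob.
    return futureCall(new BlobFinalizerJob(), immediate("my-blob"), waitFor(done));
  }
}

// Pass-through job: its only purpose is to expose the recursion's result
// as a single FutureValue the finalizer can wait on.
public class DummyJob extends Job1<Integer, Integer> {
  @Override
  public Value<Integer> run(Integer count) {
    return immediate(count);
  }
}
```

The key point is returning the recursion's FutureValue rather than an immediate value, so the waitFor fires only when the whole tree is done.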
Thank you Mitch and Nick!!
Have you read the Pipelines Getting Started docs? Pipelines can create other pipelines and wait on them, so doing what you want is fairly straightforward:
Where RecursiveCombiningPipeline simply acts as a receiver for the values of the two sub-pipelines. Here is an example using Java Pipeline:
package com.example;
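(The original example appears to have been truncated here.) A minimal sketch of what a recursive job of this shape might look like with the Java Pipeline library; the helpers `fetch`, `isTruncated`, `writeToBlob`, `midpoint`, and `nextDay` are hypothetical stand-ins for the asker's own logic:

```java
import com.google.appengine.tools.pipeline.FutureValue;
import com.google.appengine.tools.pipeline.Job2;
import com.google.appengine.tools.pipeline.Value;

// Fetches one date range; on truncation, splits the range and recurses.
// Returns the number of leaf fetches so the caller can tell when all are done.
public class FetchJob extends Job2<Integer, String, String> {
  @Override
  public Value<Integer> run(String start, String end) {
    String body = fetch(start, end);           // hypothetical URL fetch
    if (!isTruncated(body)) {
      writeToBlob(body);                       // hypothetical blob write
      return immediate(1);
    }
    String mid = midpoint(start, end);         // hypothetical date math
    FutureValue<Integer> left =
        futureCall(new FetchJob(), immediate(start), immediate(mid));
    FutureValue<Integer> right =
        futureCall(new FetchJob(), immediate(nextDay(mid)), immediate(end));
    // The combining job runs only after both halves have completed, so the
    // root job's FutureValue covers the entire recursion tree.
    return futureCall(new RecursiveCombiningJob(), left, right);
  }
}

// Receiver for the values of the two sub-pipelines, as described above.
public class RecursiveCombiningJob extends Job2<Integer, Integer, Integer> {
  @Override
  public Value<Integer> run(Integer a, Integer b) {
    return immediate(a + b);
  }
}
```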