I am considering Google Dataflow as an option for running a pipeline that involves steps like:
- Downloading images from the web;
- Processing images.
I like that Dataflow manages the lifetime of the VMs required to complete the job, so I don't need to start or stop them myself, but all the examples I have come across use it for data-mining-style tasks. I wonder if it is a viable option for other batch tasks like image processing and crawling.
This use case is a possible application for Dataflow/Beam.
If you want to do this in a streaming fashion, you could have a crawler generate URLs and add them to a PubSub or Kafka queue, and write a Beam pipeline that does the following (a minimal sketch follows the list):
- Read from PubSub
- Download the website content in a ParDo
- Parse image URLs from the website in another ParDo*
- Download each image and process it, again with a ParDo
- Store the results in GCS, BigQuery, or another sink, depending on what information you want from the images.
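Here is a minimal sketch of that streaming pipeline in the Beam Python SDK. The project, topic, and table names are placeholders, and `fetch_page`, `extract_image_urls`, and `process_image` are stand-ins for whatever crawling and image-processing logic you actually need (the "processing" here just records the image size):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def fetch_page(url):
    """Download the raw HTML for one crawled URL."""
    import urllib.request  # imported inside the function so it is available on the workers
    with urllib.request.urlopen(url, timeout=30) as resp:
        yield (url, resp.read().decode("utf-8", errors="replace"))


def extract_image_urls(page):
    """Yield the image URLs found on a page (very naive regex-based parse)."""
    import re
    _, html = page
    for match in re.finditer(r'<img[^>]+src="([^"]+)"', html):
        yield match.group(1)


def process_image(image_url):
    """Download one image and compute whatever you need from it."""
    import urllib.request
    with urllib.request.urlopen(image_url, timeout=30) as resp:
        data = resp.read()
    # Placeholder "processing": record the URL and the image size in bytes.
    yield {"image_url": image_url, "size_bytes": len(data)}


options = PipelineOptions(streaming=True)  # runner/project flags come from the command line

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadUrls" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/crawl-urls")
        | "DecodeUrls" >> beam.Map(lambda msg: msg.decode("utf-8"))
        | "FetchPage" >> beam.FlatMap(fetch_page)
        | "ExtractImageUrls" >> beam.FlatMap(extract_image_urls)
        | "Reshuffle" >> beam.Reshuffle()  # break fusion so downloads spread across workers
        | "ProcessImage" >> beam.FlatMap(process_image)
        | "WriteResults" >> beam.io.WriteToBigQuery(
            "my-project:crawl.image_stats",
            schema="image_url:STRING,size_bytes:INTEGER",
        )
    )
```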
You can do the same as a batch job; just change the source you read the URLs from (see the batch variant sketched below).
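For example, the batch variant could read seed URLs from a text file in GCS (the bucket path here is a placeholder), with the rest of the pipeline unchanged:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    urls = p | "ReadUrls" >> beam.io.ReadFromText("gs://my-bucket/seed-urls.txt")
    # ...then the FetchPage -> ExtractImageUrls -> Reshuffle -> ProcessImage ->
    # WriteResults steps from the streaming sketch apply unchanged.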
*After parsing those image URLs, you may also want to reshuffle your data so that the image downloads are spread across workers rather than staying fused to the parse step; that is where the extra parallelism comes from.
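For completeness, a tiny standalone illustration of where `beam.Reshuffle()` fits, runnable on the DirectRunner (the URLs and the fake download step are placeholders); in the sketch above it sits between the parse and download steps:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create(["https://example.com/a.png", "https://example.com/b.png"])
        | "Reshuffle" >> beam.Reshuffle()  # redistribute URLs before the expensive per-image work
        | "FakeDownload" >> beam.Map(lambda url: ("downloaded", url))
    )
```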