I am building a Python cloud video pipeline that will read video from a bucket, perform some computer vision analysis, and return frames back to a bucket. As far as I can tell, there is no Beam read method for passing GCS paths to OpenCV, similar to TextIO.read(). My options seem to be: download the files locally (they are large), use gcsfuse to mount the bucket on a worker (is that possible?), or write a custom source. Does anyone have experience with which makes the most sense?
My main source of confusion was this question:
Can google cloud dataflow (apache beam) use ffmpeg to process video or image data
How would ffmpeg get access to the path? It's not just a question of uploading the binary, is it? There needs to be a Beam method to pass the item to it, correct?
I think that you will need to download the files first and then pass them through.
However, instead of saving the files locally, is it possible to pass bytes straight through to OpenCV? Does it accept any sort of byte stream or input stream?
You could have one ParDo that downloads the files using the GCS API and then passes them to OpenCV through a stream, byte buffer, stdin pipe, etc., as sketched below.
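For individual frames or still images, OpenCV can decode an in-memory byte buffer with cv2.imdecode, so one option is a DoFn that reads the object from GCS with Beam's FileSystems API and never touches local disk. A minimal sketch, assuming image inputs and a placeholder bucket path (for full video containers, cv2.VideoCapture generally needs a filename, which is the case covered next):

```python
import apache_beam as beam
import cv2
import numpy as np
from apache_beam.io.filesystems import FileSystems


class DecodeImageFromGcs(beam.DoFn):
    """Reads an encoded image from GCS and decodes it in memory with OpenCV."""

    def process(self, gcs_path):
        # FileSystems understands gs:// paths, so no local download is needed.
        with FileSystems.open(gcs_path) as f:
            data = f.read()
        # cv2.imdecode accepts an encoded byte buffer wrapped in a numpy array.
        frame = cv2.imdecode(np.frombuffer(data, dtype=np.uint8), cv2.IMREAD_COLOR)
        if frame is not None:
            yield gcs_path, frame


# Hypothetical usage: elements are gs:// paths produced upstream.
with beam.Pipeline() as p:
    (p
     | beam.Create(['gs://my-bucket/images/frame_0001.jpg'])  # placeholder path
     | beam.ParDo(DecodeImageFromGcs())
     | beam.Map(lambda kv: print(kv[0], kv[1].shape)))
```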
If that is not possible, you will need to save the files to local disk and pass OpenCV the filename. This can be tricky because you may run out of disk space, so make sure to delete each file from local disk as soon as OpenCV has finished processing it; see the sketch below.
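Since cv2.VideoCapture only takes a filename, a DoFn can copy each object to a temporary file on the worker, iterate over the frames, and delete the file in a finally block so disk usage stays bounded. A rough sketch, again assuming gs:// input paths; the per-frame analysis is a placeholder:

```python
import os
import tempfile

import apache_beam as beam
import cv2
from apache_beam.io.filesystems import FileSystems


class ProcessVideoFromGcs(beam.DoFn):
    """Downloads a video to a worker-local temp file, reads frames, then cleans up."""

    def process(self, gcs_path):
        # Copy the object to local disk because cv2.VideoCapture needs a filename.
        suffix = os.path.splitext(gcs_path)[1] or '.mp4'
        with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
            local_path = tmp.name
            with FileSystems.open(gcs_path) as src:
                # Stream in chunks so large files are not held fully in memory.
                for chunk in iter(lambda: src.read(64 * 1024 * 1024), b''):
                    tmp.write(chunk)
        try:
            cap = cv2.VideoCapture(local_path)
            frame_index = 0
            while True:
                ok, frame = cap.read()
                if not ok:
                    break
                # Placeholder for the actual computer vision analysis.
                yield gcs_path, frame_index, frame.shape
                frame_index += 1
            cap.release()
        finally:
            # Delete the temp file so the worker's disk does not fill up.
            os.remove(local_path)
```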
I'm not sure, but you may also need to select a particular worker machine type or disk size to ensure you have enough disk space, depending on the size of your files; an example of the relevant pipeline options is below.
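For reference, worker machine type and per-worker disk size can be set through Dataflow pipeline options. A hedged sketch; the project, region, bucket, and sizes are all placeholders to adjust:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# All values here are placeholders; size them for your project and files.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/tmp',
    machine_type='n1-standard-4',   # worker VM type
    disk_size_gb=200,               # per-worker persistent disk
)
```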