Reading video during cloud dataflow, using GCSfuse

Posted 2019-08-22 11:45

I am building a Python cloud video pipeline that will read video from a bucket, perform some computer vision analysis, and write frames back to a bucket. As far as I can tell, there is no Beam read method that passes GCS paths to OpenCV, similar to TextIO.read(). My options seem to be: download the files locally (they are large), use GCS FUSE to mount the bucket on a local worker (is that possible?), or write a custom source. Does anyone have experience with which makes the most sense?

My main confusion comes from this question:

Can google cloud dataflow (apache beam) use ffmpeg to process video or image data

How would ffmpeg have access to the path? It's not just a question of uploading the binary, is it? There needs to be a Beam method to pass the item, correct?

1 answer
小情绪 Triste *
#2 · 2019-08-22 12:17

I think you will need to download the files first and then pass them through.

However, instead of saving the files locally, it may be possible to pass the bytes straight to OpenCV. Does it accept any sort of byte stream or input stream?

You could have one ParDo that downloads the files using the GCS API and then passes them to OpenCV through a stream, a ByteChannel, a stdin pipe, etc.
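For the stdin-pipe route, a minimal sketch might look like the following. It assumes the consumer is a command-line tool that can read its input from standard input (ffmpeg, for instance, accepts `pipe:0` as an input). `pipe_through` is a hypothetical helper name, and `src` stands for whatever readable binary stream the GCS download gives you. I have not checked whether OpenCV itself can read video from a stream, and some container formats need seeking and will not work over a pipe.

```python
import shutil
import subprocess
import threading


def pipe_through(src, cmd, chunk_size=1 << 20):
    """Feed a readable binary stream to a subprocess's stdin and return its stdout.

    `src` is any object with a binary read() method (hypothetically, the
    stream returned by a GCS download). `cmd` is the argument list of the
    consumer process. Nothing is written to local disk.
    """
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)

    def feed():
        # Write on a separate thread so that reading stdout below cannot
        # deadlock when the child produces output before consuming all input.
        try:
            shutil.copyfileobj(src, proc.stdin, chunk_size)
        except BrokenPipeError:
            pass  # the child exited early; the return code is checked below
        finally:
            try:
                proc.stdin.close()
            except BrokenPipeError:
                pass

    feeder = threading.Thread(target=feed, daemon=True)
    feeder.start()
    output = proc.stdout.read()
    feeder.join()
    return_code = proc.wait()
    if return_code != 0:
        raise subprocess.CalledProcessError(return_code, cmd)
    return output
```

Inside a DoFn you would call this once per element, with `cmd` set to the tool's real command line. Note that the tool's whole output is held in memory here, so this only suits outputs that are small compared with the input video.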

If that is not available, you will need to save the files to local disk and then pass OpenCV the filename. This could be tricky because you may end up using too much disk space, so make sure to clean up properly and delete each file from local disk after OpenCV has processed it.
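One way to make the cleanup hard to forget is a context manager that owns the temporary file. This is only a sketch: `local_copy` is a hypothetical helper, the `.mp4` suffix is an assumption, and `src` is again whatever readable binary stream the download produces (in the Beam Python SDK, `apache_beam.io.filesystems.FileSystems.open(gcs_path)` should give you one).

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def local_copy(src, suffix=".mp4", chunk_size=1 << 20):
    """Copy a readable binary stream to a temp file and yield the file's path.

    The file is removed when the `with` block exits, even if the
    processing inside the block raises.
    """
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        with tmp:
            # Copy in chunks so a large video is never fully held in memory.
            shutil.copyfileobj(src, tmp, chunk_size)
        yield tmp.name
    finally:
        with contextlib.suppress(FileNotFoundError):
            os.remove(tmp.name)
```

Inside the ParDo, the body of the `with local_copy(src) as path:` block is where you would hand `path` to OpenCV, for example `cv2.VideoCapture(path)`. Only one video per in-flight element sits on the worker's disk at a time.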

I'm not sure, but you may also need to select a particular VM machine type to ensure you have enough disk space, depending on the size of your files.
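As far as I know, Dataflow exposes the worker disk size as its own pipeline option (`--disk_size_gb`), alongside `--worker_machine_type`, so the disk can be sized without changing the machine type. A rough sketch of deriving those flags from your largest file follows. `worker_flags`, the 2x headroom factor and the `n1-standard-4` default are all illustrative assumptions, not recommendations.

```python
import math


def worker_flags(largest_file_gb, headroom=2.0, machine_type="n1-standard-4"):
    """Build Dataflow worker flags sized for the largest input video.

    `headroom` is an assumed multiplier leaving room for extracted frames
    and for more than one file being on disk at once.
    """
    disk_gb = math.ceil(largest_file_gb * headroom)
    return [
        f"--worker_machine_type={machine_type}",
        f"--disk_size_gb={disk_gb}",
    ]
```

The resulting list would be appended to the other pipeline arguments before they are passed to `PipelineOptions`.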
