I am trying to run a Dataflow pipeline remotely, and it needs to read a pickle file. Locally, I can use the code below to load the file:
import pickle

with open(known_args.file_path, 'rb') as fp:
    file = pickle.load(fp)
However, it does not work when the path points to Cloud Storage (gs://...):
IOError: [Errno 2] No such file or directory: 'gs://.../.pkl'
I roughly understand why it is not working, but I cannot find the right way to do it.
open() is the standard Python library function, and it does not understand Google Cloud Storage paths. You need to use the Beam FileSystems API instead, which is aware of GCS and of the other filesystems supported by Beam. If you have pickle files in your GCS bucket, you can load them as BLOBs and process them further just like in your code (using pickle.load()):
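For example, a minimal sketch using FileSystems.open() (assuming the apache_beam SDK is installed and known_args.file_path holds your gs:// path, as in your snippet):

import pickle
from apache_beam.io.filesystems import FileSystems

# FileSystems.open() picks the right filesystem from the path scheme
# (gs://, local paths, etc.) and returns a readable file-like object,
# so pickle.load() can consume it directly.
with FileSystems.open(known_args.file_path) as fp:
    data = pickle.load(fp)

The same call works for local paths too, so you do not need separate code paths for local testing and for running on Dataflow.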