Cloud ML Unable to find the file on Google Cloud S

Posted 2019-08-08 12:10

Question:

I am reading my data file using the following commands:

data_dir = arguments['data_dir']
data = pd.read_csv(data_dir + "/train.csv")

I am using this data to train my model on Google Cloud ML. The job is scheduled successfully, but I get the following IOError when the file is fetched:

IOError: File gs://cloud-bucket/data/train.csv does not exist

The file path is correct: I uploaded the file to that bucket through the console. Cloud ML also runs in the same region and is configured with the same project as my bucket.

Answer 1:

GCS is not a POSIX file system, so you typically cannot use "regular" file libraries to manipulate files on GCS. That includes convenience functions like pd.read_csv.
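To illustrate the point, here is a minimal sketch using only Python built-ins. It assumes nothing GCS-aware is installed or mounted, and it reuses the path from the error message in the question. Local file APIs treat the gs:// URL as an ordinary relative path on the local disk, so the lookup fails even though the object exists in the bucket.

```python
import os

# Path taken from the error message in the question
gcs_path = "gs://cloud-bucket/data/train.csv"

# os.path knows nothing about GCS; it looks for a local directory named "gs:"
exists_locally = os.path.exists(gcs_path)
print(exists_locally)

# The built-in open() fails for the same reason
error_name = None
try:
    open(gcs_path)
except IOError as err:  # IOError is an alias of OSError on Python 3
    error_name = type(err).__name__
print(error_name)
```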

pandas also accepts an open file handle, so I recommend TensorFlow's FileIO wrapper, which can read from both GCS and standard POSIX file systems. That lets you run the same code locally and on the cloud:

import pandas as pd
from tensorflow.python.lib.io import file_io

data_dir = arguments['data_dir']
# FileIO opens gs:// URLs as well as local paths
with file_io.FileIO(data_dir + "/train.csv", mode='r') as f:
    data = pd.read_csv(f)

It might also be helpful to test your code by running it locally and passing in GCS filenames before submitting a cloud job.
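One way to make that kind of local test easy is to keep the file-opening step swappable. The sketch below is only an illustration: load_rows and its open_fn parameter are hypothetical names, and it uses the standard csv module instead of pandas so that it runs without extra dependencies. With the default opener it reads a local directory. For a gs:// directory you would presumably pass a GCS-aware opener such as file_io.FileIO.

```python
import csv
import os
import tempfile


def load_rows(data_dir, open_fn=open):
    """Read train.csv from data_dir with the supplied opener.

    open_fn is a hypothetical injection point: the built-in open handles
    local paths, while a GCS-aware opener (for example TensorFlow's
    file_io.FileIO) would be passed in for gs:// directories.
    """
    path = data_dir.rstrip("/") + "/train.csv"
    with open_fn(path, "r") as f:
        return list(csv.DictReader(f))


# Local smoke test against a throwaway directory with made-up data
tmp_dir = tempfile.mkdtemp()
with open(os.path.join(tmp_dir, "train.csv"), "w") as f:
    f.write("x,y\n1,2\n3,4\n")

rows = load_rows(tmp_dir)
print(len(rows))
```

Running the same function against a GCS path from your own machine, before submitting the job, then only requires changing the data_dir argument and the opener.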