How to open and process a CSV file stored in Google Cloud Storage

Posted 2019-02-25 09:37

I am using the Google Cloud Storage Client Library.

I am trying to open and process a CSV file (that was already uploaded to a bucket) using code like:

filename = '/<my_bucket>/data.csv'
with gcs.open(filename, 'r') as gcs_file:
    csv_reader = csv.reader(gcs_file, delimiter=',', quotechar='"')

I get the error "argument 1 must be an iterator" for the first argument to csv.reader (i.e. gcs_file). Apparently gcs_file doesn't support the iterator's .next() method.

Any ideas on how to proceed? Do I need to wrap the gcs_file and create an iterator on it or is there an easier way?

2 answers
forever°为你锁心
#2 · 2019-02-25 09:58

I think you are better off writing your own wrapper/iterator designed for csv.reader. If gcs_file were to support the iterator protocol, it is not clear what next() should return to suit every consumer.

According to the csv.reader documentation:

Return a reader object which will iterate over lines in the given csvfile. csvfile can be any object which supports the iterator protocol and returns a string each time its next() method is called — file objects and list objects are both suitable. If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference.
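As a small illustration of the quoted passage, the sketch below feeds csv.reader a plain list of strings and then a generator. The sample data and the `lines` helper are made up for the example, and the syntax is Python 3 rather than the Python 2 used elsewhere in this thread.

```python
import csv

# Hypothetical sample data: each string is one line of CSV text.
sample_lines = ['name,qty\n', '"Widget, large",3\n', 'Gadget,7\n']

# A list supports the iterator protocol, so csv.reader accepts it directly.
rows_from_list = list(csv.reader(sample_lines, delimiter=',', quotechar='"'))


def lines():
    """Hypothetical generator that yields one line per iteration."""
    for line in sample_lines:
        yield line


# A generator works the same way, since it also yields one string at a time.
rows_from_generator = list(csv.reader(lines(), delimiter=',', quotechar='"'))

print(rows_from_list)
```

Anything that hands back one line per iteration should work the same way, which is why a thin adapter around a file-like object is enough.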

csv.reader treats each string returned by next() as one line of input, so the wrapper should return a line at a time rather than an arbitrary chunk of bytes. You can have a wrapper like this (not tested):

class CsvIterator(object):
    """Adapt a GCS file object to the iterator protocol for csv.reader."""

    def __init__(self, gcs_file):
        self.gcs_file = gcs_file

    def __iter__(self):
        return self

    def next(self):
        # readline() returns an empty string at end of file.
        line = self.gcs_file.readline()
        if not line:
            raise StopIteration()
        return line

    __next__ = next  # Python 3 spelling of the same method

The key is to read one line at a time rather than the whole file, so that a large file doesn't blow up memory or run into a urlfetch timeout.

Or, even simpler, use the built-in iter with a sentinel value:

csv.reader(iter(gcs_file.readline, ''))
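The one-liner relies on the two-argument form of iter, which keeps calling a function until it returns the sentinel. Here is a minimal sketch of that idea in Python 3. `FakeGcsFile` and `read_rows` are hypothetical names invented for the example; the class only mimics the behaviour described in the question (a readline method but no iteration support) and is not part of any GCS library.

```python
import csv
import io


class FakeGcsFile:
    """Hypothetical stand-in for a GCS file: has readline() but is not iterable."""

    def __init__(self, text):
        self._buffer = io.StringIO(text)

    def readline(self):
        # Returns '' once the end of the data is reached.
        return self._buffer.readline()


def read_rows(file_obj):
    """Parse CSV rows from any object that only offers readline()."""
    # iter(callable, sentinel) calls readline() until it returns ''.
    line_iter = iter(file_obj.readline, '')
    return list(csv.reader(line_iter, delimiter=',', quotechar='"'))


# Passing the object directly fails because it is not iterable.
try:
    csv.reader(FakeGcsFile('a,b\n'))
    direct_call_failed = False
except TypeError:
    direct_call_failed = True

rows = read_rows(FakeGcsFile('id,label\n1,"x, y"\n2,z\n'))
print(direct_call_failed, rows)
```

A blank line in the file comes back from readline as '\n' rather than '', so it does not end the iteration early.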
Animai°情兽
#3 · 2019-02-25 10:20

Try this:

import csv
from StringIO import StringIO  # Python 2; use io.StringIO on Python 3

filename = '/<my_bucket>/data.csv'
with gcs.open(filename, 'r') as gcs_file:
    csv_reader = csv.reader(StringIO(gcs_file.read()), delimiter=',',
                            quotechar='"')

This isn't ideal, though, because gcs_file.read() loads the entire file into memory before parsing. I've filed a feature request to have GCS files support iteration.
