Accessing Google Cloud Storage using the Hadoop FileSystem API

Published 2019-08-10 05:41

From my machine, I've configured the Hadoop core-site.xml to recognize the gs:// scheme and added gcs-connector-1.2.8.jar as a Hadoop lib. I can run hadoop fs -ls gs://mybucket/ and get the expected results. However, if I try to do the analogous listing from Java using:

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
FileStatus[] status = fs.listStatus(new Path("gs://mybucket/"));

I get the files under root in my local HDFS instead of the contents of gs://mybucket/, but with gs://mybucket prepended to their paths. If I modify the conf with conf.set("fs.default.name", "gs://mybucket"); before obtaining the fs, then I can see the files on GCS.
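
For completeness, here is a minimal sketch of that workaround (assuming only the bucket name mybucket, and that the connector jar and core-site.xml settings above are already in place):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
// Make gs://mybucket the default filesystem before obtaining the FileSystem.
// "fs.default.name" is the legacy key; newer Hadoop versions call it "fs.defaultFS".
conf.set("fs.default.name", "gs://mybucket");
FileSystem fs = FileSystem.get(conf);
for (FileStatus status : fs.listStatus(new Path("gs://mybucket/"))) {
    System.out.println(status.getPath());
}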

My questions are:
1. Is this expected behavior?
2. Is there a disadvantage to using this Hadoop FileSystem API as opposed to the Google Cloud Storage client API?

1 Answer

疯言疯语 · 2019-08-10 05:52

As to your first question, "expected" is questionable, but I think I can at least explain. When FileSystem.get(conf) is used, the default FileSystem is returned, and by default that is HDFS. My guess is that the HDFS client (DistributedFileSystem) has code to automatically prepend the scheme + authority to all files in the filesystem.

Instead of using FileSystem.get(conf), try

FileSystem gcsFs = new Path("gs://mybucket/").getFileSystem(conf);
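
Fleshed out a bit (just a sketch, assuming the gs:// scheme is wired up in core-site.xml as described in the question), that looks like:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
Path gcsPath = new Path("gs://mybucket/");
// Resolve the FileSystem that owns the gs:// scheme rather than the cluster default (HDFS).
FileSystem gcsFs = gcsPath.getFileSystem(conf);
for (FileStatus status : gcsFs.listStatus(gcsPath)) {
    System.out.println(status.getPath());
}

FileSystem.get(java.net.URI.create("gs://mybucket/"), conf) is equivalent, if you would rather resolve the FileSystem from an explicit URI than from a Path.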

On disadvantages, I could probably argue that if you end up needing to access the object store directly, you'll end up writing code against the storage API anyway (and there are things that do not translate very well to the Hadoop FS API, e.g., object composition, complex object-write preconditions other than simple overwrite protection, etc.).
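
For example, composing objects goes through the Cloud Storage client rather than the Hadoop FS API; a rough sketch with the google-cloud-storage Java client (bucket and object names here are made up for illustration) would be:

import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

Storage storage = StorageOptions.getDefaultInstance().getService();
// Server-side compose: concatenate two existing objects into a third without downloading them.
Storage.ComposeRequest request = Storage.ComposeRequest.newBuilder()
    .addSource("logs/part-00000")
    .addSource("logs/part-00001")
    .setTarget(BlobInfo.newBuilder("mybucket", "logs/merged").build())
    .build();
Blob merged = storage.compose(request);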

I am admittedly biased (I work on the team), but if you're intending to use GCS from Hadoop MapReduce, from Spark, etc., the GCS connector for Hadoop should be a fairly safe bet.
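
To illustrate (a hypothetical job setup, not connector-specific code): once the connector is configured, gs:// paths drop in wherever an HDFS path would go, e.g. in a MapReduce job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "job-reading-and-writing-gcs");
// Input and output live directly in the bucket; no copy to HDFS is needed.
FileInputFormat.addInputPath(job, new Path("gs://mybucket/input/"));
FileOutputFormat.setOutputPath(job, new Path("gs://mybucket/output/"));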
