Hadoop cannot connect to Google Cloud Storage

Posted 2019-04-10 00:26

Question:

I'm trying to connect Hadoop running on Google Cloud VM to Google Cloud Storage. I have:

  • Modified core-site.xml to include the fs.gs.impl and fs.AbstractFileSystem.gs.impl properties (see the sketch after this list)
  • Downloaded the gcs-connector-latest-hadoop2.jar and referenced it in a generated hadoop-env.sh
  • Authenticated via gcloud auth login using my personal account (instead of a service account)
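
For reference, the core-site.xml entries from the first bullet typically look like this; these are the standard class names shipped with the hadoop2 build of the gcs-connector, so double-check them against the jar version you actually downloaded:

<!-- core-site.xml: register the GCS connector for the gs:// scheme -->
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
</property>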

I'm able to run gsutil ls gs://mybucket/ without any issues, but when I execute

hadoop fs -ls gs://mybucket/

I get the output:

14/09/30 23:29:31 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.2.9-hadoop2 

ls: Error getting access token from metadata server at: http://metadata/computeMetadata/v1/instance/service-accounts/default/token

What steps am I missing to get Hadoop to see Google Cloud Storage?

Thanks!

Answer 1:

By default, the gcs-connector running on Google Compute Engine is optimized to use the built-in service-account mechanisms. To force it through the OAuth2 flow instead, a few extra configuration keys need to be set: you can borrow the same "client_id" and "client_secret" that gcloud auth uses, add them to your core-site.xml, and disable fs.gs.auth.service.account.enable:

<property>
  <name>fs.gs.auth.service.account.enable</name>
  <value>false</value>
</property>
<property>
  <name>fs.gs.auth.client.id</name>
  <value>32555940559.apps.googleusercontent.com</value>
</property>
<property>
  <name>fs.gs.auth.client.secret</name>
  <value>ZmssLNjJy2998hD4CTg2ejr2</value>
</property>

You can optionally also set fs.gs.auth.client.file to something other than its default of ~/.credentials/storage.json.
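
For example, a minimal override might look like the following (the path here is purely illustrative, not a required location):

<property>
  <name>fs.gs.auth.client.file</name>
  <value>/home/myuser/.credentials/storage.json</value>
</property>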

If you do this, then when you run hadoop fs -ls gs://mybucket you'll see a new prompt, similar to the "gcloud auth login" prompt, where you visit a URL in a browser and enter a verification code again. Unfortunately, the connector can't consume a "gcloud"-generated credential directly (even though it can possibly share a credential-store file), since it asks explicitly for the GCS scopes it needs; you'll notice that the new auth flow asks only for GCS scopes, as opposed to the big list of services that "gcloud auth login" requests.

Make sure you've also set fs.gs.project.id in your core-site.xml:

<property>
  <name>fs.gs.project.id</name>
  <value>your-project-id</value>
</property>

since the GCS connector likewise doesn't automatically infer a default project from gcloud auth.



Answer 2:

Thanks very much for both of your answers! They led me to the configuration described in Migrating 50TB data from local Hadoop cluster to Google Cloud Storage.

I was able to use fs.gs.auth.service.account.keyfile by generating a new service account and then supplying the service-account email address and P12 key.
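
For anyone following the same route, the keyfile-based configuration looks roughly like the sketch below; the email address and keyfile path are placeholders for your own service account's values:

<!-- core-site.xml: authenticate with a service account's P12 keyfile -->
<property>
  <name>fs.gs.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>fs.gs.auth.service.account.email</name>
  <value>my-service-account@your-project-id.iam.gserviceaccount.com</value>
</property>
<property>
  <name>fs.gs.auth.service.account.keyfile</name>
  <value>/path/to/my-service-account-key.p12</value>
</property>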



Answer 3:

It looks like the instance itself isn't configured to use the correct service account (but the gsutil command-line utility is). The Hadoop filesystem adapter doesn't appear to be picking up those credentials.

First, check whether the instance is configured with the correct service account. If not, you can set it up.

Hope this helps!