I am trying to run Oryx on top of Hadoop using Google's Cloud Storage Connector for Hadoop: https://cloud.google.com/hadoop/google-cloud-storage-connector
I prefer to use Hadoop 2.4.1 with Oryx, so I use the hadoop2_env.sh setup for the Hadoop cluster I create on Google Compute Engine, e.g.:
./bdutil -b <BUCKET_NAME> -n 2 --env_var_files hadoop2_env.sh \
--default_fs gs --prefix <PREFIX_NAME> deploy
I face two main problems when I try to run Oryx using Hadoop.
1) Despite confirming that my Hadoop conf directory matches what is expected for the Google installation on Compute Engine, e.g.:
$ echo $HADOOP_CONF_DIR
/home/hadoop/hadoop-install/etc/hadoop
something is still looking for a /conf directory, e.g.:
Caused by: java.lang.IllegalStateException: Not a directory: /etc/hadoop/conf
My understanding is that ../etc/hadoop should serve as the conf directory, i.e., it is where the Hadoop configuration files live.
While I shouldn't need to make any changes, the problem is only resolved when I copy the config files into a newly created directory, e.g.:
sudo mkdir /etc/hadoop/conf
sudo cp /home/hadoop/hadoop-install/etc/hadoop/* /etc/hadoop/conf
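(A symlink would presumably serve the same purpose as the copy and keep the two locations in sync; this is just my own workaround idea, not something from the docs, and it assumes /etc/hadoop/conf does not already exist:)
sudo ln -s /home/hadoop/hadoop-install/etc/hadoop /etc/hadoop/conf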
So why is this? Is it a result of using the Google Cloud Storage connector for Hadoop?
2) After "resolving" the issue above, I find additional errors which seem (to me) to be related to communication between the hadoop cluster and the google file system:
Wed Oct 01 20:18:30 UTC 2014 WARNING Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Wed Oct 01 20:18:30 UTC 2014 INFO Namespace prefix: hdfs://BUCKET_NAME
Wed Oct 01 20:18:30 UTC 2014 SEVERE Unexpected error in execution
java.lang.ExceptionInInitializerError
    at com.cloudera.oryx.common.servcomp.StoreUtils.listGenerationsForInstance(StoreUtils.java:50)
    at com.cloudera.oryx.computation.PeriodicRunner.run(PeriodicRunner.java:173)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: java.net.UnknownHostException: resistance-prediction
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:373)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:258)
    at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:153)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:602)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:547)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:139)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2625)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2607)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
    at com.cloudera.oryx.common.servcomp.Store.<init>(Store.java:76)
    at com.cloudera.oryx.common.servcomp.Store.<init>(Store.java:57)
    ... 9 more
Caused by: java.net.UnknownHostException: BUCKET_NAME
    ... 22 more
What seems relevant to me is that the namespace prefix is hdfs:// even though I set the default file system to gs://. Perhaps this is leading to the UnknownHostException?
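For reference, here is how I can check what the cluster itself reports as the default filesystem (standard Hadoop 2 commands, nothing Oryx- or connector-specific; the property may also appear under the deprecated name fs.default.name):
hdfs getconf -confKey fs.defaultFS
grep -A 1 defaultFS $HADOOP_CONF_DIR/core-site.xml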
Note that I have "confirmed" that the Hadoop cluster is connected to the Google file system: hadoop fs -ls yields the contents of my Google Cloud bucket, i.e., all the expected contents of the gs://BUCKET_NAME directory. However, I am not familiar with the Google manifestation of Hadoop via the connector, and the traditional way I check whether a Hadoop cluster is running, i.e., jps, only yields 6440 Jps rather than listing all the nodes. I am running this command from the master node of the Hadoop cluster (PREFIX_NAME-m), and I am not sure what output to expect when using the Google Cloud Storage connector for Hadoop.
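(One thing I realize is that jps only lists JVMs owned by the user running it; on a bdutil-deployed cluster the daemons presumably run as a different user, e.g. hadoop, so something like the following might be a fairer check, though I am not certain which user bdutil actually uses:)
sudo jps
ps aux | grep -E 'NameNode|DataNode|ResourceManager|NodeManager'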
So, how can I resolve these errors and have my Oryx job (via Hadoop) successfully access data in my gs://BUCKET_NAME directory?
Thanks in advance for any insights or suggestions.
UPDATE: Thanks for the very detailed response. As a workaround, I "hard-coded" gs:// into Oryx by changing:
prefix = "hdfs://" + host + ':' + port;
} else {
prefix = "hdfs://" + host;
to:
prefix = "gs://" + host + ':' + port;
} else {
prefix = "gs://" + host;
I now get the following errors:
Tue Oct 14 20:24:50 UTC 2014 SEVERE Unexpected error in execution
java.lang.ExceptionInInitializerError
    at com.cloudera.oryx.common.servcomp.StoreUtils.listGenerationsForInstance(StoreUtils.java:50)
    at com.cloudera.oryx.computation.PeriodicRunner.run(PeriodicRunner.java:173)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1905)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2573)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2586)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2625)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2607)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
    at com.cloudera.oryx.common.servcomp.Store.<init>(Store.java:76)
    at com.cloudera.oryx.common.servcomp.Store.<init>(Store.java:57)
As per the instructions here: https://cloud.google.com/hadoop/google-cloud-storage-connector#classpath, I believe I have added the connector jar to Hadoop's classpath; I added:
HADOOP_CLASSPATH=$HADOOP_CLASSPATH:https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.2.9-hadoop2.jar
to /home/rich/hadoop-env-setup.sh, and echo $HADOOP_CLASSPATH yields:
/contrib/capacity-scheduler/*.jar:/home/hadoop/hadoop-install/share/hadoop/common/lib/gcs-connector-1.2.9-hadoop2.jar:/contrib/capacity-scheduler/*.jar:/home/hadoop/hadoop-install/share/hadoop/common/lib/gcs-connector-1.2.9-hadoop2.jar
Do I need to add more to the class path?
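One sanity check that occurs to me (just a generic jar inspection, nothing Oryx- or connector-specific) is to confirm that the class named in the ClassNotFoundException is actually inside the jar that appears on the classpath, e.g.:
jar tf /home/hadoop/hadoop-install/share/hadoop/common/lib/gcs-connector-1.2.9-hadoop2.jar | grep GoogleHadoopFileSystem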
I also note (perhaps related) that I still get the /etc/hadoop/conf error even with the export commands. I have been using the sudo mkdir /etc/hadoop/conf workaround as a temporary measure, and I mention it here in case it is leading to additional issues.