I am trying to run Oryx on top of Hadoop using Google's Cloud Storage Connector for Hadoop: https://cloud.google.com/hadoop/google-cloud-storage-connector
I prefer to use Hadoop 2.4.1 with Oryx, so I use the hadoop2_env.sh setup for the Hadoop cluster I create on Google Compute Engine, e.g.:
./bdutil -b <BUCKET_NAME> -n 2 --env_var_files hadoop2_env.sh \
--default_fs gs --prefix <PREFIX_NAME> deploy
I face two main problems when I try to run Oryx using Hadoop.
1) Despite confirming that my Hadoop conf directory matches what is expected for the Google installation on Compute Engine, e.g.:
$ echo $HADOOP_CONF_DIR
/home/hadoop/hadoop-install/etc/hadoop
something is still looking for a /conf directory, e.g.:
Caused by: java.lang.IllegalStateException: Not a directory: /etc/hadoop/conf
My understanding is that ../etc/hadoop should serve as the conf directory, i.e., it is where the Hadoop configuration files live.
While I shouldn't need to make any changes, the problem is only resolved when I copy the config files into a newly created directory, e.g.:
sudo mkdir /etc/hadoop/conf
sudo cp /home/hadoop/hadoop-install/etc/hadoop/* /etc/hadoop/conf
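(A symlink would presumably serve the same purpose as the copy and keep the two locations in sync; this is just my own workaround idea, not something from the docs, and it assumes /etc/hadoop/conf does not already exist:)
sudo ln -s /home/hadoop/hadoop-install/etc/hadoop /etc/hadoop/conf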
So why is this? Is it a result of using the Google Cloud Storage connector for Hadoop?
2) After "resolving" the issue above, I find additional errors which seem (to me) to be related to communication between the hadoop cluster and the google file system:
Wed Oct 01 20:18:30 UTC 2014 WARNING Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Wed Oct 01 20:18:30 UTC 2014 INFO Namespace prefix: hdfs://BUCKET_NAME
Wed Oct 01 20:18:30 UTC 2014 SEVERE Unexpected error in execution
java.lang.ExceptionInInitializerError
    at com.cloudera.oryx.common.servcomp.StoreUtils.listGenerationsForInstance(StoreUtils.java:50)
    at com.cloudera.oryx.computation.PeriodicRunner.run(PeriodicRunner.java:173)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: java.net.UnknownHostException: resistance-prediction
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:373)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:258)
    at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:153)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:602)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:547)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:139)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2625)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2607)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
    at com.cloudera.oryx.common.servcomp.Store.<init>(Store.java:76)
    at com.cloudera.oryx.common.servcomp.Store.<init>(Store.java:57)
    ... 9 more
Caused by: java.net.UnknownHostException: BUCKET_NAME
    ... 22 more
What seems relevant to me is that the namespace prefix is hdfs:// even though I set the default file system to gs://. Perhaps this is leading to the UnknownHostException?
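For reference, here is how I can check what the cluster itself reports as the default filesystem (standard Hadoop 2 commands, nothing Oryx- or connector-specific; the property may also appear under the deprecated name fs.default.name):
hdfs getconf -confKey fs.defaultFS
grep -A 1 defaultFS $HADOOP_CONF_DIR/core-site.xml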
Note that I have "confirmed" that the Hadoop cluster is connected to the Google file system: hadoop fs -ls yields the contents of my Google Cloud bucket, i.e., all the expected contents of the gs://BUCKET_NAME directory. However, I am not familiar with the Google manifestation of Hadoop via the connector, and the traditional way I check whether a Hadoop cluster is running, i.e., jps, only yields 6440 Jps rather than listing all the nodes. I am running this command from the master node of the Hadoop cluster (PREFIX_NAME-m), and I am not sure what output to expect when using the Google Cloud Storage connector for Hadoop.
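(One thing I realize is that jps only lists JVMs owned by the user running it; on a bdutil-deployed cluster the daemons presumably run as a different user, e.g. hadoop, so something like the following might be a fairer check, though I am not certain which user bdutil actually uses:)
sudo jps
ps aux | grep -E 'NameNode|DataNode|ResourceManager|NodeManager'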
So, how can I resolve these errors and have my Oryx job (via Hadoop) successfully access data in my gs://BUCKET_NAME directory?
Thanks in advance for any insights or suggestions.
UPDATE: Thanks for the very detailed response. As a workaround, I "hard-coded" gs:// into Oryx by changing:
prefix = "hdfs://" + host + ':' + port;
} else {
prefix = "hdfs://" + host;
to:
prefix = "gs://" + host + ':' + port;
} else {
prefix = "gs://" + host;
I now get the following errors:
Tue Oct 14 20:24:50 UTC 2014 SEVERE Unexpected error in execution
java.lang.ExceptionInInitializerError
    at com.cloudera.oryx.common.servcomp.StoreUtils.listGenerationsForInstance(StoreUtils.java:50)
    at com.cloudera.oryx.computation.PeriodicRunner.run(PeriodicRunner.java:173)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1905)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2573)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2586)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2625)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2607)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
    at com.cloudera.oryx.common.servcomp.Store.<init>(Store.java:76)
    at com.cloudera.oryx.common.servcomp.Store.<init>(Store.java:57)
As per the instructions here: https://cloud.google.com/hadoop/google-cloud-storage-connector#classpath, I believe I have added the connector jar to Hadoop's classpath; I added:
HADOOP_CLASSPATH=$HADOOP_CLASSPATH:https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.2.9-hadoop2.jar
to /home/rich/hadoop-env-setup.sh, and echo $HADOOP_CLASSPATH yields:
/contrib/capacity-scheduler/*.jar:/home/hadoop/hadoop-install/share/hadoop/common/lib/gcs-connector-1.2.9-hadoop2.jar:/contrib/capacity-scheduler/*.jar:/home/hadoop/hadoop-install/share/hadoop/common/lib/gcs-connector-1.2.9-hadoop2.jar
Do I need to add more to the class path?
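One sanity check that occurs to me (just a generic jar inspection, nothing Oryx- or connector-specific) is to confirm that the class named in the ClassNotFoundException is actually inside the jar that appears on the classpath, e.g.:
jar tf /home/hadoop/hadoop-install/share/hadoop/common/lib/gcs-connector-1.2.9-hadoop2.jar | grep GoogleHadoopFileSystem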
I also note (perhaps related) that I still get the /etc/hadoop/conf error even with the export commands. I have been using the sudo mkdir /etc/hadoop/conf workaround as a temporary measure, and I mention it here in case it is leading to additional issues.