I'm relatively new to Giraph and I'm trying to get my Giraph edit-compile-deploy loop working for our code. I am able to run various examples inspired by http://blog.cloudera.com/blog/2014/02/how-to-write-and-run-giraph-jobs-on-hadoop/ , but I'm stuck with a ClassNotFoundException when running my modified version of the SimpleShortestPathsVertex Giraph example. I've tried various combinations of -libjars and HADOOP_CLASSPATH, but I'm out of ideas and I'd really appreciate your help. Details follow.
Versions
- Hadoop: Hadoop 2.0.0-cdh4.4.0
- Giraph: giraph-examples-1.0.0-for-hadoop-2.0.0-alpha-jar-with-dependencies.jar
The PageRankBenchmark runs OK
$ hadoop jar $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.0.0-for-hadoop-2.0.0-alpha-jar-with-dependencies.jar \
org.apache.giraph.benchmark.PageRankBenchmark \
-Dgiraph.zkList=<myhost>:2181 \
-e 1 -s 3 -v -V 50 -w 1
...
14/08/01 11:42:44 INFO mapred.JobClient: Job complete: job_201407291058_0015
...
(full output is below)
The GiraphRunner SimpleShortestPathsVertex also runs OK
$ hadoop jar $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.0.0-for-hadoop-2.0.0-alpha-jar-with-dependencies.jar \
org.apache.giraph.GiraphRunner \
-Dgiraph.zkList=<myhost>:2181 \
org.apache.giraph.examples.SimpleShortestPathsVertex \
-vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
-vip ginput/tiny_graph.txt \
-of org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
-op goutput/shortestpathsC2 \
-ca SimpleShortestPathsVertex.source=2 \
-w 1
...
14/08/01 11:47:46 INFO mapred.JobClient: Job complete: job_201407291058_0017
...
(full output is below)
Bonus: the results are correct:
$ hadoop fs -cat goutput/shortestpathsC2/p*
0 1.0
2 2.0
1 0.0
3 1.0
4 5.0
But my modified version of SimpleShortestPathsVertex gets ClassNotFoundException
The jar containing the modified vertex (KdlSimpleShortestPathsVertex, no package) is OK:
$ jar -tf ~/kdl_hadoop_play.jar
META-INF/MANIFEST.MF
KdlSimpleShortestPathsVertex.class
META-INF/
But my run pukes:
$ hadoop jar $GIRAPH_HOME/giraph-core/target/giraph-1.0.0-for-hadoop-2.0.0-alpha-jar-with-dependencies.jar \
org.apache.giraph.GiraphRunner \
-Dgiraph.zkList=<myhost>:2181 \
-libjars ~/kdl_hadoop_play.jar \
KdlSimpleShortestPathsVertex \
-vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
-vip /user/cornell/ginput/tiny_graph.txt \
-of org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
-op /user/cornell/goutput/shortestpathsC2 \
-ca KdlSimpleShortestPathsVertex.source=2 \
-w 1
Exception in thread "main" java.lang.ClassNotFoundException: KdlSimpleShortestPathsVertex
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:190)
at org.apache.giraph.utils.ConfigurationUtils.populateGiraphConfiguration(ConfigurationUtils.java:210)
at org.apache.giraph.utils.ConfigurationUtils.parseArgs(ConfigurationUtils.java:147)
at org.apache.giraph.GiraphRunner.run(GiraphRunner.java:74)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.giraph.GiraphRunner.main(GiraphRunner.java:124)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
My best guess ...
...after looking around is that maybe GiraphRunner is not processing the -libjars correctly, as hinted at by http://grepalex.com/2013/02/25/hadoop-libjars/ ("Make sure your code is using GenericOptionsParser"). Browsing the Giraph source, I do not see that class accessed. I tried setting HADOOP_CLASSPATH to my jar, but that didn't solve the problem.
Any help would be awesome!
PageRankBenchmark output
14/08/01 11:42:27 INFO job.GiraphJob: run: Since checkpointing is disabled (default), do not allow any task retries (setting mapred.map.max.attempts = 0, old value = 4)
14/08/01 11:42:28 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/08/01 11:42:28 WARN bsp.BspOutputFormat: checkOutputSpecs: ImmutableOutputCommiter will not check anything
14/08/01 11:42:29 INFO mapred.JobClient: Running job: job_201407291058_0015
14/08/01 11:42:30 INFO mapred.JobClient: map 0% reduce 0%
14/08/01 11:42:40 INFO mapred.JobClient: map 50% reduce 0%
14/08/01 11:42:41 INFO mapred.JobClient: map 100% reduce 0%
14/08/01 11:42:44 INFO mapred.JobClient: Job complete: job_201407291058_0015
14/08/01 11:42:44 INFO mapred.JobClient: Counters: 39
14/08/01 11:42:44 INFO mapred.JobClient: File System Counters
14/08/01 11:42:44 INFO mapred.JobClient: FILE: Number of bytes read=0
14/08/01 11:42:44 INFO mapred.JobClient: FILE: Number of bytes written=369846
14/08/01 11:42:44 INFO mapred.JobClient: FILE: Number of read operations=0
14/08/01 11:42:44 INFO mapred.JobClient: FILE: Number of large read operations=0
14/08/01 11:42:44 INFO mapred.JobClient: FILE: Number of write operations=0
14/08/01 11:42:44 INFO mapred.JobClient: HDFS: Number of bytes read=88
14/08/01 11:42:44 INFO mapred.JobClient: HDFS: Number of bytes written=0
14/08/01 11:42:44 INFO mapred.JobClient: HDFS: Number of read operations=2
14/08/01 11:42:44 INFO mapred.JobClient: HDFS: Number of large read operations=0
14/08/01 11:42:44 INFO mapred.JobClient: HDFS: Number of write operations=1
14/08/01 11:42:44 INFO mapred.JobClient: Job Counters
14/08/01 11:42:44 INFO mapred.JobClient: Launched map tasks=2
14/08/01 11:42:44 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=15772
14/08/01 11:42:44 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=0
14/08/01 11:42:44 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/08/01 11:42:44 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/08/01 11:42:44 INFO mapred.JobClient: Map-Reduce Framework
14/08/01 11:42:44 INFO mapred.JobClient: Map input records=2
14/08/01 11:42:44 INFO mapred.JobClient: Map output records=0
14/08/01 11:42:44 INFO mapred.JobClient: Input split bytes=88
14/08/01 11:42:44 INFO mapred.JobClient: Spilled Records=0
14/08/01 11:42:44 INFO mapred.JobClient: CPU time spent (ms)=2230
14/08/01 11:42:44 INFO mapred.JobClient: Physical memory (bytes) snapshot=411357184
14/08/01 11:42:44 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2428895232
14/08/01 11:42:44 INFO mapred.JobClient: Total committed heap usage (bytes)=806027264
14/08/01 11:42:44 INFO mapred.JobClient: Giraph Stats
14/08/01 11:42:44 INFO mapred.JobClient: Aggregate edges=50
14/08/01 11:42:44 INFO mapred.JobClient: Aggregate finished vertices=50
14/08/01 11:42:44 INFO mapred.JobClient: Aggregate vertices=50
14/08/01 11:42:44 INFO mapred.JobClient: Current master task partition=0
14/08/01 11:42:44 INFO mapred.JobClient: Current workers=1
14/08/01 11:42:44 INFO mapred.JobClient: Last checkpointed superstep=0
14/08/01 11:42:44 INFO mapred.JobClient: Sent messages=0
14/08/01 11:42:44 INFO mapred.JobClient: Superstep=4
14/08/01 11:42:44 INFO mapred.JobClient: Giraph Timers
14/08/01 11:42:44 INFO mapred.JobClient: Input superstep (milliseconds)=238
14/08/01 11:42:44 INFO mapred.JobClient: Setup (milliseconds)=2903
14/08/01 11:42:44 INFO mapred.JobClient: Shutdown (milliseconds)=68
14/08/01 11:42:44 INFO mapred.JobClient: Superstep 0 (milliseconds)=77
14/08/01 11:42:44 INFO mapred.JobClient: Superstep 1 (milliseconds)=64
14/08/01 11:42:44 INFO mapred.JobClient: Superstep 2 (milliseconds)=45
14/08/01 11:42:44 INFO mapred.JobClient: Superstep 3 (milliseconds)=43
14/08/01 11:42:44 INFO mapred.JobClient: Total (milliseconds)=3442
SimpleShortestPathsVertex output
14/08/01 11:47:37 INFO utils.ConfigurationUtils: No edge input format specified. Ensure your InputFormat does not require one.
14/08/01 11:47:37 INFO utils.ConfigurationUtils: Setting custom argument [SimpleShortestPathsVertex.source] to [2] in GiraphConfiguration
14/08/01 11:47:37 WARN job.GiraphConfigurationValidator: Output format vertex index type is not known
14/08/01 11:47:37 WARN job.GiraphConfigurationValidator: Output format vertex value type is not known
14/08/01 11:47:37 WARN job.GiraphConfigurationValidator: Output format edge value type is not known
14/08/01 11:47:37 INFO job.GiraphJob: run: Since checkpointing is disabled (default), do not allow any task retries (setting mapred.map.max.attempts = 0, old value = 4)
14/08/01 11:47:37 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/08/01 11:47:38 INFO mapred.JobClient: Running job: job_201407291058_0017
14/08/01 11:47:39 INFO mapred.JobClient: map 0% reduce 0%
14/08/01 11:47:44 INFO mapred.JobClient: map 50% reduce 0%
14/08/01 11:47:45 INFO mapred.JobClient: map 100% reduce 0%
14/08/01 11:47:46 INFO mapred.JobClient: Job complete: job_201407291058_0017
14/08/01 11:47:46 INFO mapred.JobClient: Counters: 39
14/08/01 11:47:46 INFO mapred.JobClient: File System Counters
14/08/01 11:47:46 INFO mapred.JobClient: FILE: Number of bytes read=0
14/08/01 11:47:46 INFO mapred.JobClient: FILE: Number of bytes written=367068
14/08/01 11:47:46 INFO mapred.JobClient: FILE: Number of read operations=0
14/08/01 11:47:46 INFO mapred.JobClient: FILE: Number of large read operations=0
14/08/01 11:47:46 INFO mapred.JobClient: FILE: Number of write operations=0
14/08/01 11:47:46 INFO mapred.JobClient: HDFS: Number of bytes read=200
14/08/01 11:47:46 INFO mapred.JobClient: HDFS: Number of bytes written=30
14/08/01 11:47:46 INFO mapred.JobClient: HDFS: Number of read operations=5
14/08/01 11:47:46 INFO mapred.JobClient: HDFS: Number of large read operations=0
14/08/01 11:47:46 INFO mapred.JobClient: HDFS: Number of write operations=2
14/08/01 11:47:46 INFO mapred.JobClient: Job Counters
14/08/01 11:47:46 INFO mapred.JobClient: Launched map tasks=2
14/08/01 11:47:46 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=8538
14/08/01 11:47:46 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=0
14/08/01 11:47:46 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/08/01 11:47:46 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/08/01 11:47:46 INFO mapred.JobClient: Map-Reduce Framework
14/08/01 11:47:46 INFO mapred.JobClient: Map input records=2
14/08/01 11:47:46 INFO mapred.JobClient: Map output records=0
14/08/01 11:47:46 INFO mapred.JobClient: Input split bytes=88
14/08/01 11:47:46 INFO mapred.JobClient: Spilled Records=0
14/08/01 11:47:46 INFO mapred.JobClient: CPU time spent (ms)=1590
14/08/01 11:47:46 INFO mapred.JobClient: Physical memory (bytes) snapshot=341344256
14/08/01 11:47:46 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2363527168
14/08/01 11:47:46 INFO mapred.JobClient: Total committed heap usage (bytes)=504758272
14/08/01 11:47:46 INFO mapred.JobClient: Giraph Stats
14/08/01 11:47:46 INFO mapred.JobClient: Aggregate edges=12
14/08/01 11:47:46 INFO mapred.JobClient: Aggregate finished vertices=5
14/08/01 11:47:46 INFO mapred.JobClient: Aggregate vertices=5
14/08/01 11:47:46 INFO mapred.JobClient: Current master task partition=0
14/08/01 11:47:46 INFO mapred.JobClient: Current workers=1
14/08/01 11:47:46 INFO mapred.JobClient: Last checkpointed superstep=0
14/08/01 11:47:46 INFO mapred.JobClient: Sent messages=0
14/08/01 11:47:46 INFO mapred.JobClient: Superstep=4
14/08/01 11:47:46 INFO mapred.JobClient: Giraph Timers
14/08/01 11:47:46 INFO mapred.JobClient: Input superstep (milliseconds)=181
14/08/01 11:47:46 INFO mapred.JobClient: Setup (milliseconds)=313
14/08/01 11:47:46 INFO mapred.JobClient: Shutdown (milliseconds)=128
14/08/01 11:47:46 INFO mapred.JobClient: Superstep 0 (milliseconds)=57
14/08/01 11:47:46 INFO mapred.JobClient: Superstep 1 (milliseconds)=54
14/08/01 11:47:46 INFO mapred.JobClient: Superstep 2 (milliseconds)=36
14/08/01 11:47:46 INFO mapred.JobClient: Superstep 3 (milliseconds)=35
14/08/01 11:47:46 INFO mapred.JobClient: Total (milliseconds)=805
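OK, after looking at the hadoop scripts along with the Hadoop and Giraph source, I think I figured it out. The big hint came from "Using the libjars option with Hadoop" (http://grepalex.com/2013/02/25/hadoop-libjars/, the post linked in the question) along with this warning from the job output: "WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same."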
The cause appears to be that GiraphRunner uses its own ConfigurationUtils.parseArgs() to get the org.apache.commons.cli.CommandLine instead of using the recommended org.apache.hadoop.util.GenericOptionsParser.getCommandLine(), which honors the -libjars option. This led me to fall back on Hadoop's generic classpath-handling tools: CLASSPATH and/or HADOOP_CLASSPATH. Here's what worked:
For example, on my machine:
Which gives the expected output and results.
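As a rough sketch of the HADOOP_CLASSPATH approach described above (not the exact commands that were run), something like the following could be tried. It assumes the jar path from the question (~/kdl_hadoop_play.jar), a hypothetical ZK_HOST variable standing in for <myhost>, and that -libjars is still passed so the jar is also shipped to the tasks. The jar is prepended to HADOOP_CLASSPATH so that the client-side JVM started by the hadoop script can resolve the vertex class, and the job is only launched if hadoop and the jar are actually present.

```shell
#!/bin/sh
# Sketch only: the paths and host below are assumptions, adjust them for your cluster.
USER_JAR="${USER_JAR:-$HOME/kdl_hadoop_play.jar}"   # jar from the question
ZK_HOST="${ZK_HOST:-localhost}"                     # hypothetical stand-in for <myhost>
GIRAPH_JAR="${GIRAPH_HOME:-.}/giraph-core/target/giraph-1.0.0-for-hadoop-2.0.0-alpha-jar-with-dependencies.jar"

# Put the jar on the client-side classpath, keeping any existing entries.
HADOOP_CLASSPATH="${USER_JAR}${HADOOP_CLASSPATH:+:$HADOOP_CLASSPATH}"
export HADOOP_CLASSPATH

if command -v hadoop >/dev/null 2>&1 && [ -f "$USER_JAR" ]; then
  # Same arguments as the failing run in the question.
  hadoop jar "$GIRAPH_JAR" \
    org.apache.giraph.GiraphRunner \
    -Dgiraph.zkList="$ZK_HOST:2181" \
    -libjars "$USER_JAR" \
    KdlSimpleShortestPathsVertex \
    -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
    -vip /user/cornell/ginput/tiny_graph.txt \
    -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
    -op /user/cornell/goutput/shortestpathsC2 \
    -ca KdlSimpleShortestPathsVertex.source=2 \
    -w 1
else
  echo "Not launching: hadoop or $USER_JAR not found. HADOOP_CLASSPATH=$HADOOP_CLASSPATH"
fi
```

Exporting the variable (rather than only setting it) matters here, because the hadoop launcher script reads it from the environment when it builds the classpath for the client JVM.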
More generally, it would be helpful if the Giraph team changed the code to use the (apparently) more standard parser.
Hope that helps!
I don't know why this isn't working, but there is a quick-and-dirty way to fix it. Try putting your code in the
giraph-examples/src/main/java/org/apache/giraph/examples/
directory (where SimpleShortestPathsVertex is located). Then build the giraph-examples jar by running
mvn -DskipTests --projects giraph-examples --also-make package
Then simply run your program as you did for SimpleShortestPathsVertex, replacing SimpleShortestPathsVertex with your class name. I hope that helps.
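The steps above could be scripted roughly as follows. This is a sketch under assumptions: the source file name KdlSimpleShortestPathsVertex.java and the SRC_FILE and GIRAPH_HOME defaults are hypothetical, and the class will most likely need a "package org.apache.giraph.examples;" declaration once it lives in that directory (the class in the question has no package). The Maven command is the one given above; the script skips the copy and build if the Giraph tree, the source file, or mvn is missing.

```shell
#!/bin/sh
# Sketch only: GIRAPH_HOME and SRC_FILE defaults are hypothetical placeholders.
GIRAPH_HOME="${GIRAPH_HOME:-$HOME/giraph}"
SRC_FILE="${SRC_FILE:-$HOME/src/KdlSimpleShortestPathsVertex.java}"
EXAMPLES_DIR="$GIRAPH_HOME/giraph-examples/src/main/java/org/apache/giraph/examples"

if [ -d "$EXAMPLES_DIR" ] && [ -f "$SRC_FILE" ] && command -v mvn >/dev/null 2>&1; then
  # Copy the vertex next to SimpleShortestPathsVertex.
  # The copied class is assumed to declare: package org.apache.giraph.examples;
  cp "$SRC_FILE" "$EXAMPLES_DIR/"
  # Rebuild only giraph-examples and the modules it depends on.
  (cd "$GIRAPH_HOME" && mvn -DskipTests --projects giraph-examples --also-make package)
else
  echo "Skipping build: need $EXAMPLES_DIR, $SRC_FILE and mvn"
fi

# After the build, the class is referenced by its fully qualified name.
VERTEX_CLASS="org.apache.giraph.examples.KdlSimpleShortestPathsVertex"
echo "Pass this vertex class to GiraphRunner: $VERTEX_CLASS"
```

With the class built into the giraph-examples jar-with-dependencies, neither -libjars nor HADOOP_CLASSPATH should be needed, since the job jar itself already contains the vertex.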