I'm trying to turn a csv file into sequence files so that I can train and run a classifier across the data. I have a job java file that I compile and then jar into the mahout job jar. And when I try to hadoop jar
my job in the mahout jar, I get a java.lang.ClassNotFoundException: org.apache.mahout.math.VectorWritable
. I'm not sure why this is because if I look in the mahout jar, that class is indeed present.
Here are the steps I'm doing
#get new copy of mahout jar
rm iris.jar
cp /home/stephen/home/libs/mahout-distribution-0.7/core/target/mahout-core-0.7-job.jar iris.jar
javac -cp :/home/stephen/home/libs/hadoop-1.0.4/hadoop-core-1.0.4.jar:/home/stephen/home/libs/mahout-distribution-0.7/core/target/mahout-core-0.7-job.jar -d bin/ src/edu/iris/seq/CsvToSequenceFile.java
jar ufv iris.jar -C bin .
hadoop jar iris.jar edu.iris.seq.CsvToSequenceFile iris-data iris-seq
and this is what my java file looks like
public class CsvToSequenceFile {
public static void main(String[] args) throws IOException,
InterruptedException, ClassNotFoundException {
String inputPath = args[0];
String outputPath = args[1];
Configuration conf = new Configuration();
Job job = new Job(conf);
job.setJobName("Csv to SequenceFile");
job.setJarByClass(Mapper.class);
job.setMapperClass(Mapper.class);
job.setReducerClass(Reducer.class);
job.setNumReduceTasks(0);
job.setOutputKeyClass(LongWritable.class);
job.setOutputValueClass(VectorWritable.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setInputFormatClass(TextInputFormat.class);
TextInputFormat.addInputPath(job, new Path(inputPath));
SequenceFileOutputFormat.setOutputPath(job, new Path(outputPath));
// submit and wait for completion
job.waitForCompletion(true);
}
}
Here is the error in the command line
2/10/30 10:43:32 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/10/30 10:43:33 INFO input.FileInputFormat: Total input paths to process : 1
12/10/30 10:43:33 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/10/30 10:43:33 WARN snappy.LoadSnappy: Snappy native library not loaded
12/10/30 10:43:34 INFO mapred.JobClient: Running job: job_201210300947_0005
12/10/30 10:43:35 INFO mapred.JobClient: map 0% reduce 0%
12/10/30 10:43:50 INFO mapred.JobClient: Task Id : attempt_201210300947_0005_m_000000_0, Status : FAILED
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.mahout.math.VectorWritable
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:899)
at org.apache.hadoop.mapred.JobConf.getOutputValueClass(JobConf.java:929)
at org.apache.hadoop.mapreduce.JobContext.getOutputValueClass(JobContext.java:145)
at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.getRecordWriter(SequenceFileOutputFormat.java:61)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:628)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:753)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.mahout.math.VectorWritable
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:867)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:891)
... 11 more
Any ideas how to fix this or am I even trying to do this process correctly? I'm new to hadoop and mahout, so if I'm doing something the hard way, let me know. Thanks!
Try specifying the classpath explicitly, so instead of
hadoop jar iris.jar edu.iris.seq.CsvToSequenceFile iris-data iris-seq
try something likejava -cp ...
Create jar with dependencies, when you are creating the jar (map/reduce) .
With ref. to maven,you may add the below code in pom.xml and compile the code << mvn clean package assembly:single >> . This will create the jar with depencendcies in target folder and the created jar may look like <>-1.0-SNAPSHOT-jar-with-dependencies.jar
Hopefully this answers your doubt.
This is a very common problem, and almost certainly an issue with the way you are specifying your classpath in the hadoop command.
The way hadoop works, after you give the "hadoop" command, it ships your job to a tasktracker to execute. So, it's important to keep in mind that your job is executing on a separate JVM, with its own classpath, etc. Part of what you are doing with the "hadoop" command, is specifying the classpath that should be used, etc.
If you are using maven as a build system, I strongly recommend building a "fat jar", using the shade plugin. This will build a jar that contains all your necessary dependencies, and you won't have to worry about classpath issues when you add dependencies to your hadoop job, because you are shipping out a single jar.
If you don't want to go this route, have a look at this article, which describes your problem and some potential solutions. in particular, this should work for you: