ClassNotFoundException org.apache.mahout.math.Vect

I'm trying to turn a CSV file into sequence files so that I can train and run a classifier on the data. I have a Java job class that I compile and then add to the Mahout job jar. When I run my job from that jar with hadoop jar, I get a java.lang.ClassNotFoundException: org.apache.mahout.math.VectorWritable. I'm not sure why, because when I look inside the Mahout jar, that class is indeed present.

Here are the steps I'm following:

#get new copy of mahout jar
rm iris.jar
cp /home/stephen/home/libs/mahout-distribution-0.7/core/target/mahout-core-0.7-job.jar iris.jar    
javac -cp :/home/stephen/home/libs/hadoop-1.0.4/hadoop-core-1.0.4.jar:/home/stephen/home/libs/mahout-distribution-0.7/core/target/mahout-core-0.7-job.jar -d bin/ src/edu/iris/seq/CsvToSequenceFile.java    
jar ufv iris.jar -C bin .    
hadoop jar iris.jar edu.iris.seq.CsvToSequenceFile iris-data iris-seq

And this is what my Java file looks like:

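package edu.iris.seq;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.mahout.math.VectorWritable;

public class CsvToSequenceFile {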

public static void main(String[] args) throws IOException,
        InterruptedException, ClassNotFoundException {

    String inputPath = args[0];
    String outputPath = args[1];

    Configuration conf = new Configuration();
    Job job = new Job(conf);
    job.setJobName("Csv to SequenceFile");
    job.setJarByClass(Mapper.class);

    job.setMapperClass(Mapper.class);
    job.setReducerClass(Reducer.class);

    job.setNumReduceTasks(0);

    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(VectorWritable.class);

    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setInputFormatClass(TextInputFormat.class);

    TextInputFormat.addInputPath(job, new Path(inputPath));
    SequenceFileOutputFormat.setOutputPath(job, new Path(outputPath));

    // submit and wait for completion
    job.waitForCompletion(true);
}

}

Here is the error on the command line:

12/10/30 10:43:32 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/10/30 10:43:33 INFO input.FileInputFormat: Total input paths to process : 1
12/10/30 10:43:33 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/10/30 10:43:33 WARN snappy.LoadSnappy: Snappy native library not loaded
12/10/30 10:43:34 INFO mapred.JobClient: Running job: job_201210300947_0005
12/10/30 10:43:35 INFO mapred.JobClient:  map 0% reduce 0%
12/10/30 10:43:50 INFO mapred.JobClient: Task Id : attempt_201210300947_0005_m_000000_0, Status : FAILED
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.mahout.math.VectorWritable
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:899)
    at org.apache.hadoop.mapred.JobConf.getOutputValueClass(JobConf.java:929)
    at org.apache.hadoop.mapreduce.JobContext.getOutputValueClass(JobContext.java:145)
    at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.getRecordWriter(SequenceFileOutputFormat.java:61)
    at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:628)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:753)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.mahout.math.VectorWritable
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:867)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:891)
    ... 11 more

Any ideas on how to fix this, or on whether I'm even going about this the right way? I'm new to Hadoop and Mahout, so if I'm doing something the hard way, let me know. Thanks!

Answer 1:

This is a very common problem, and almost certainly an issue with the way you are specifying your classpath in the hadoop command.

The way Hadoop works, after you run the hadoop command, your job is shipped to a TaskTracker to execute. So it's important to keep in mind that your tasks run in a separate JVM with its own classpath. Part of what you are doing with the hadoop command is specifying the classpath that this JVM should use.
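To see what a given JVM can actually load, a small diagnostic like the one below may help. It is only a sketch: ClassLocator and locate are hypothetical names, not part of Hadoop or Mahout. It reports which jar a class was loaded from, or that it could not be found. As far as I understand, job.setJarByClass(...) ships the jar that contains the class you pass in, so it may also be worth checking that this class resolves to iris.jar rather than to the hadoop-core jar.

```java
import java.security.CodeSource;

// Hypothetical helper, not part of Hadoop or Mahout.
final class ClassLocator {

    /** Returns the location a class was loaded from, or "NOT FOUND" if this JVM cannot load it. */
    static String locate(String className) {
        try {
            // Load without running static initializers, using the loader that loaded this helper.
            Class<?> cls = Class.forName(className, false, ClassLocator.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            // Classes from the JDK's core runtime typically have no code source.
            return source == null ? "(JDK runtime)" : source.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "NOT FOUND";
        }
    }

    public static void main(String[] args) {
        // Default to the two classes relevant to the question when no arguments are given.
        String[] names = args.length > 0 ? args : new String[] {
            "org.apache.mahout.math.VectorWritable",
            "org.apache.hadoop.mapreduce.Mapper"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

Running this on the client only shows the client's classpath. To see the task's view, you could call ClassLocator.locate(...) from inside a mapper and print the result to the task log.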

If you are using Maven as a build system, I strongly recommend building a "fat jar" using the shade plugin. This builds a jar that contains all your necessary dependencies, so you won't have to worry about classpath issues when you add dependencies to your Hadoop job, because you are shipping a single jar.

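If you don't want to go this route, have a look at this article, which describes your problem and some potential solutions. In particular, this should work for you:

Include the JAR in the "-libjars" command line option of the hadoop jar … command.

Note that -libjars is a generic option: it only takes effect if your driver parses its arguments with GenericOptionsParser (for example by implementing Tool and running through ToolRunner), which is what the first warning in your log is pointing at.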

Answer 2:

Try specifying the classpath explicitly. Instead of hadoop jar iris.jar edu.iris.seq.CsvToSequenceFile iris-data iris-seq, try something like java -cp ...

Answer 3:

Create a jar with dependencies when you build the map/reduce jar.

With Maven, you can add the configuration below to pom.xml and build with mvn clean package assembly:single. This creates the jar with dependencies in the target folder, and the resulting jar will be named something like <>-1.0-SNAPSHOT-jar-with-dependencies.jar

<build>
  <plugins>
    <plugin>
      <artifactId>maven-assembly-plugin</artifactId>
      <configuration>
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
      <!-- executions belongs directly under plugin, not inside configuration -->
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
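Once the jar is built, it can be worth confirming that the class Hadoop complained about really ended up inside it. The following is a minimal sketch using only the JDK's java.util.jar API; JarClassCheck and containsClass are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.util.jar.JarFile;

// Hypothetical helper for checking the contents of a built jar.
final class JarClassCheck {

    /** Returns true if the jar at jarPath has an entry for the given fully qualified class name. */
    static boolean containsClass(String jarPath, String className) throws IOException {
        // org.apache.mahout.math.VectorWritable -> org/apache/mahout/math/VectorWritable.class
        String entryName = className.replace('.', '/') + ".class";
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getEntry(entryName) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: JarClassCheck <jar-file> <class-name>");
            System.exit(2);
        }
        boolean found = containsClass(args[0], args[1]);
        System.out.println(args[1] + (found ? " is in " : " is NOT in ") + args[0]);
        System.exit(found ? 0 : 1);
    }
}
```

For example, you could pass it the jar-with-dependencies file from the target folder and org.apache.mahout.math.VectorWritable. Keep in mind that the class being present in the jar is necessary but not sufficient: that jar still has to be the one Hadoop ships to the tasks.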

Hopefully this answers your question.