I'm new to Spark and trying to read a sequence file and use it in a classification problem. Here is how I read the sequence file:
val tfidf = sc.sequenceFile("/user/hadoop/strainingtesting/tfidf-vectors", classOf[Text], classOf[VectorWritable])
I don't know how to split each line of the sequence file by tab, i.e. how to get the Text value. And how can I use it with the NaiveBayes classifier in MLlib?
sc.sequenceFile
returns an RDD of key/value pairs (Tuple2 objects in Scala). So if you just want an RDD of the text keys, you can map over the pairs and pick out the key from each record:
val text = tfidf.map(_._1.toString)
(The toString matters: Hadoop record readers reuse the same Text object across records, so copy the value out to a String before caching or collecting the RDD.)
NaiveBayes expects an RDD of labeled vectors as input, and since there is no trivial way to convert your VectorWritable
objects, you can use Mahout's vectordump utility to convert your sequence files into text files, and then read those into Spark:
mahout vectordump \
-i /user/hadoop/strainingtesting/tfidf-vectors \
-d /user/hadoop/strainingtesting/dictionary.file-\* \
-dt sequencefile -c csv -p true \
-o /user/hadoop/strainingtesting/tf-vectors.txt
Now read the resulting text file into Spark using sc.textFile
and perform the necessary transformations to build labeled points for training.
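The transformation step above can be sketched roughly as follows. This is a minimal sketch, not a drop-in solution: it assumes each dumped line has the form key<tab>v1,v2,... (which depends on your vectordump options and dictionary), and that the key can be mapped to a numeric class label — both assumptions you will need to adapt to your actual output.

```scala
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Assumed line format: "<key>\t<v1>,<v2>,..." — check a few lines of
// tf-vectors.txt first and adjust the parsing accordingly.
val lines = sc.textFile("/user/hadoop/strainingtesting/tf-vectors.txt")

val data = lines.map { line =>
  val Array(key, values) = line.split('\t')
  val features = Vectors.dense(values.split(',').map(_.toDouble))
  // Hypothetical: assumes the key already encodes a numeric label;
  // otherwise map document keys to class ids yourself.
  LabeledPoint(key.toDouble, features)
}

val model = NaiveBayes.train(data, lambda = 1.0)
```

NaiveBayes.train requires non-negative feature values, which TF-IDF vectors satisfy, so no extra scaling should be needed here.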