How to split a sequence file in Spark


Question:

I'm new to Spark and am trying to read a sequence file and use it in a classification problem. Here is how I read the sequence file:

  import org.apache.hadoop.io.Text
  import org.apache.mahout.math.VectorWritable

  val tfidf = sc.sequenceFile("/user/hadoop/strainingtesting/tfidf-vectors", classOf[Text], classOf[VectorWritable])

How do I split each line of the sequence file by tab? i.e., how do I get the Text value out of each record?

And how can I use it with the NaiveBayes classifier in MLlib?

Answer 1:

sc.sequenceFile returns an RDD of key/value tuples (Tuple2 objects in Scala), so if you just want an RDD of the text keys, you can map over it and pick out the key from each record.

val text = tfidf.map(_._1)
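
Note that Hadoop's record reader reuses the same Writable instance for every record, so if you plan to cache, sort, or collect the keys, copy them out to plain Strings first. A minimal sketch:

// Convert each reused Text key to an immutable String before caching/collecting
val text = tfidf.map { case (key, _) => key.toString }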

Naive Bayes expects an RDD of labeled vectors as input, and since there is no trivial way to convert the Mahout VectorWritable objects inside Spark, one option is to use Mahout's vectordump utility to convert your sequence files into text files, and then read those into Spark.

mahout vectordump \
-i /user/hadoop/strainingtesting/tfidf-vectors \
-d /user/hadoop/strainingtesting/dictionary.file-\* \
-dt sequencefile -c csv -p true \
-o /user/hadoop/strainingtesting/tf-vectors.txt

Now read the text file into Spark using sc.textFile and perform the necessary transformations.
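
For instance, here is a minimal sketch of that last step. It assumes each dumped line has the form key<TAB>v1,v2,...,vn (vectordump with -p printing the key and -c emitting comma-separated values), and that the Mahout keys follow the common /label/docId convention; both the parsing and the labeling scheme are assumptions you should adapt to your actual dump:

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val lines = sc.textFile("/user/hadoop/strainingtesting/tf-vectors.txt")

// Assumed format: "key<TAB>v1,v2,...,vn"; adjust the split to your dump.
val parsed = lines.map { line =>
  val Array(key, csv) = line.split("\t", 2)
  (key, csv.split(",").map(_.toDouble))
}

// Hypothetical labeling: keys like "/label/docId" mapped to numeric class indices.
val labelIndex = parsed
  .map { case (key, _) => key.split("/")(1) }
  .distinct().collect().zipWithIndex.toMap

val data = parsed.map { case (key, values) =>
  LabeledPoint(labelIndex(key.split("/")(1)).toDouble, Vectors.dense(values))
}

val model = NaiveBayes.train(data, lambda = 1.0)

The trained model can then classify new vectors with model.predict.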