Vectorization in Apache Mahout

Posted 2019-02-10 16:56

Question:

I am new to Mahout. I need to convert a text file to vectors for classification at a later stage.

Could anybody shed some light on the questions below?

  1. How do I convert a text file to vectors in Mahout? Each line of the file has the format "username|comment about item|rating".
  2. The data will be a few TB. Which classification algorithm implementation can I use with the vectors I am going to create?

Thanks, Arun

Answer 1:

You can check these two examples, which also show (to some extent) how to use the Sequence File API: here and here

You should also definitely read this intro to text analysis.
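
To make the vectorization step from the question a bit more concrete, here is a minimal sketch in plain Java. It is not Mahout's API: the class `HashedVectorizer`, its methods, and the choice of a hashed term-frequency encoding are all assumptions made for illustration. It parses one "username|comment about item|rating" line and turns the comment into a sparse vector by hashing each token into a fixed number of buckets.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: plain Java, NOT Mahout's API. All names are hypothetical.
class HashedVectorizer {

    // Holder for one parsed "username|comment about item|rating" line.
    static final class Review {
        final String username;
        final String comment;
        final double rating;

        Review(String username, String comment, double rating) {
            this.username = username;
            this.comment = comment;
            this.rating = rating;
        }
    }

    // Split on the first and last '|' so a '|' inside the comment does not break parsing.
    static Review parseLine(String line) {
        int first = line.indexOf('|');
        int last = line.lastIndexOf('|');
        if (first < 0 || first == last) {
            throw new IllegalArgumentException("expected 3 '|'-separated fields: " + line);
        }
        String username = line.substring(0, first).trim();
        String comment = line.substring(first + 1, last).trim();
        double rating = Double.parseDouble(line.substring(last + 1).trim());
        return new Review(username, comment, rating);
    }

    // Hashed term-frequency vector: index -> count, with indices in [0, numFeatures).
    // Different tokens may collide in the same bucket; that is inherent to hashing.
    static Map<Integer, Double> vectorize(String text, int numFeatures) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token if text starts with punctuation
            }
            int index = Math.floorMod(token.hashCode(), numFeatures);
            vector.merge(index, 1.0, Double::sum);
        }
        return vector;
    }

    public static void main(String[] args) {
        Review review = parseLine("arun|great phone, works well|4.5");
        Map<Integer, Double> vector = vectorize(review.comment, 1 << 16);
        System.out.println(review.username + " rated " + review.rating + " -> " + vector);
    }
}
```

The rating (or a label derived from it) would serve as the classification target, and the sparse map is what you would copy into whatever vector type your classifier expects. Hashing keeps memory bounded regardless of vocabulary size, which matters at the data volume mentioned in the question, at the cost of occasional bucket collisions.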