I'm very new to machine learning algorithms and Spark. I'm following the Twitter Streaming Language Classifier found here:
Specifically this code:
The difference is that I'm trying to run it in batch mode on tweets pulled out of Cassandra, in this case 200 tweets in total.
As the example shows, I am using this object for "vectorizing" a set of tweets:
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector

object Utils {
  val numFeatures = 1000
  val tf = new HashingTF(numFeatures)

  /**
   * Create feature vectors by turning each tweet into bigrams of
   * characters (an n-gram model) and then hashing those to a
   * length-1000 feature vector that we can pass to MLlib.
   * This is a common way to decrease the number of features in a
   * model while still getting excellent accuracy (otherwise every
   * pair of Unicode characters would potentially be a feature).
   */
  def featurize(s: String): Vector = {
    tf.transform(s.sliding(2).toSeq)
  }
}
Here is my code, which is modified from ExamineAndTrain.scala:
val noSets = rawTweets.map(set => set.mkString("\n"))
val vectors = noSets.map(Utils.featurize).cache()
vectors.count()
val numClusters = 5
val numIterations = 30
val model = KMeans.train(vectors, numClusters, numIterations)
for (i <- 0 until numClusters) {
  println(s"\nCLUSTER $i")
  noSets.foreach { t =>
    if (model.predict(Utils.featurize(t)) == 1) {
      println(t)
    }
  }
}
This code runs and prints each header ("CLUSTER 0", "CLUSTER 1", etc.) with nothing beneath it. If I change
model.predict(Utils.featurize(t)) == 1
to
model.predict(Utils.featurize(t)) == 0
the same thing happens except every tweet is printed beneath every cluster.
Here is what I intuitively think is happening (please correct my thinking if it's wrong): this code turns each tweet into a vector, randomly picks some initial clusters, then runs k-means to group the tweets (at a really high level, I assume the clusters would be common "topics"). As such, when it checks each tweet to see if model.predict == 1, different sets of tweets should appear under each cluster (and because it's checking the training set against itself, every tweet should be in a cluster). Why isn't it doing this? Either my understanding of what k-means does is wrong, my training set is too small, or I'm missing a step.
Any help is greatly appreciated.
Well, first of all, KMeans is a clustering algorithm and as such is unsupervised, so there is no "checking of the training set against itself" (well, okay, you can do it manually ;).
Your understanding is actually quite good; the point you are missing is that model.predict(Utils.featurize(t)) returns the index of the cluster that KMeans assigned t to. Your loop compares that index against the constant 1 on every iteration, so the output is the same under every header. I think you want to check
model.predict(Utils.featurize(t)) == i
in your code, since i iterates through all cluster labels.
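To make the fix concrete, here is a minimal sketch of the corrected loop as a standalone batch job. It assumes Spark with MLlib on the classpath and a local master; the object name ClusterTweetsSketch and the four sample strings are made up for illustration and stand in for the tweets read from Cassandra. It also predicts once per tweet and collects the results to the driver before printing, which is a design choice of this sketch rather than something the original code does.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector

// Hypothetical object name; not part of the original project.
object ClusterTweetsSketch {
  // Same featurization as Utils.featurize in the question.
  val tf = new HashingTF(1000)
  def featurize(s: String): Vector = tf.transform(s.sliding(2).toSeq)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ClusterTweetsSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data standing in for the tweets from Cassandra.
      val tweets = sc.parallelize(Seq(
        "good morning everyone", "good night everyone",
        "buenos dias a todos", "buenas noches a todos"))

      val vectors = tweets.map(featurize).cache()
      val numClusters = 2
      val numIterations = 30
      val model = KMeans.train(vectors, numClusters, numIterations)

      // Predict once per tweet, then bring the (cluster, tweet) pairs to the driver.
      val assigned = tweets.map(t => (model.predict(featurize(t)), t)).collect()

      for (i <- 0 until numClusters) {
        println(s"\nCLUSTER $i")
        // Compare against the loop variable i, not a fixed label.
        assigned.filter(_._1 == i).foreach { case (_, t) => println(t) }
      }
    } finally {
      sc.stop()
    }
  }
}
```

Which tweets land in which cluster depends on the initialization, so the grouping can differ between runs, and a cluster can come out empty on very small inputs.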
Also, a small remark: the feature vector is built from a 2-gram model of the characters of the tweets. This intermediate step is important ;)
2-gram (for words) means: "A bear shouts at a bear" => {(A, bear), (bear, shouts), (shouts, at), (at, a), (a, bear)}, i.e. "a bear" is counted twice (ignoring case). For characters, the 2-grams would be (A, [space]), ([space], b), (b, e) and so on.
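The 2-gram idea can be tried out in plain Scala without Spark. The sketch below is only an illustration: BigramDemo, charBigrams, wordBigrams and hashedCounts are hypothetical names, and hashedCounts is a simplified stand-in for feature hashing (hashCode modulo the number of buckets), not necessarily the exact hash function that HashingTF uses internally.

```scala
// Hypothetical helper object for illustration only.
object BigramDemo {
  // Character 2-grams, as produced by s.sliding(2) in Utils.featurize.
  def charBigrams(s: String): List[String] = s.sliding(2).toList

  // Word 2-grams: each word paired with the word that follows it.
  def wordBigrams(s: String): List[(String, String)] = {
    val words = s.split("\\s+").toList
    words.zip(words.drop(1))
  }

  // Simplified stand-in for feature hashing: map each term to a bucket
  // index in [0, numFeatures) and count how many terms fall into each bucket.
  def hashedCounts(terms: Seq[String], numFeatures: Int): Map[Int, Int] =
    terms
      .groupBy(t => java.lang.Math.floorMod(t.hashCode, numFeatures))
      .map { case (bucket, ts) => bucket -> ts.size }

  def main(args: Array[String]): Unit = {
    // List(A ,  b, be, ea, ar): five 2-grams for a six-character string.
    println(charBigrams("A bear"))

    // Lower-casing first makes the repeated pair ("a", "bear") show up twice.
    val pairs = wordBigrams("A bear shouts at a bear".toLowerCase)
    println(pairs)

    // Sparse view of the hashed feature vector: bucket index -> count.
    println(hashedCounts(charBigrams("A bear shouts at a bear"), 1000))
  }
}
```

Two different 2-grams can hash to the same bucket, which is the price paid for a fixed-length vector.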