Based on "Unbalanced factor of KMeans?", I am trying to compute the Unbalanced Factor, but I fail.
Every element of the RDD r2_10 is a pair, where the key is the cluster and the value is a tuple of points. All of these are IDs. Below I present what happens:
In [1]: r2_10.collect()
Out[1]:
[(0, ('438728517', '28138008')),
(13824, ('4647699097', '6553505321')),
(9216, ('2575712582', '1776542427')),
(1, ('8133836578', '4073591194')),
(9217, ('3112663913', '59443972', '8715330944', '56063461')),
(4609, ('6812455719',)),
(13825, ('5245073744', '3361024394')),
(4610, ('324470279',)),
(2, ('2412402108',)),
(3, ('4766885931', '3800674818', '4673186647', '350804823', '73118846'))]
In [2]: pdd = r2_10.map(lambda x: (x[0], 1)).reduceByKey(lambda a, b: a + b)
In [3]: pdd.collect()
Out[3]:
[(13824, 1),
(9216, 1),
(0, 1),
(13825, 1),
(1, 1),
(4609, 1),
(9217, 1),
(2, 1),
(4610, 1),
(3, 1)]
In [4]: n = pdd.count()
In [5]: n
Out[5]: 10
In [6]: total = pdd.map(lambda x: x[1]).sum()
In [7]: total
Out[7]: 10
Here total is supposed to hold the total number of points. However, it is 10, while it should be 22!
What am I missing here?
The problem is that you did not count the number of points grouped in each cluster: mapping every pair to (x[0], 1) adds one per pair rather than one per point, so you have to change how pdd is created, for example by emitting len(x[1]) instead of 1. However, you could obtain the same result in a single pass (without computing pdd) by mapping the values of the RDD to their lengths and then reducing with sum.
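As a rough illustration of both suggestions, here is a minimal sketch that reproduces the arithmetic in plain Python on the sample data from the question, so it runs without a Spark cluster. The name r2_10_local is made up for this example, and the RDD expressions in the comments are how the same steps would presumably look on r2_10; they have not been run against the asker's data.

```python
from collections import Counter

# Sample data copied from the question: (cluster_id, tuple_of_point_ids).
# r2_10_local is a hypothetical local stand-in for the RDD r2_10.
r2_10_local = [
    (0, ('438728517', '28138008')),
    (13824, ('4647699097', '6553505321')),
    (9216, ('2575712582', '1776542427')),
    (1, ('8133836578', '4073591194')),
    (9217, ('3112663913', '59443972', '8715330944', '56063461')),
    (4609, ('6812455719',)),
    (13825, ('5245073744', '3361024394')),
    (4610, ('324470279',)),
    (2, ('2412402108',)),
    (3, ('4766885931', '3800674818', '4673186647', '350804823', '73118846')),
]

# Fix 1: emit the number of points per pair instead of the constant 1.
# Presumed RDD equivalent:
#   pdd = r2_10.map(lambda x: (x[0], len(x[1]))).reduceByKey(lambda a, b: a + b)
sizes = Counter()
for cluster, points in r2_10_local:
    sizes[cluster] += len(points)

n = len(sizes)               # number of clusters, like pdd.count()
total = sum(sizes.values())  # number of points, like pdd.map(lambda x: x[1]).sum()

# Fix 2: single pass, without building the per-cluster counts.
# Presumed RDD equivalent:
#   total = r2_10.map(lambda x: len(x[1])).sum()
total_single_pass = sum(len(points) for _, points in r2_10_local)

print(n, total, total_single_pass)  # prints: 10 22 22
```

The per-cluster counts are still needed if the Unbalanced Factor uses the size of each cluster; the single-pass variant only helps when the total alone is required.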