I have built a KMeansModel. My results are stored in a PySpark DataFrame called transformed.

(a) How do I interpret the contents of transformed?

(b) How do I create one or more Pandas DataFrames from transformed that would show summary statistics for each of the 13 features for each of the 14 clusters?
from pyspark.ml.clustering import KMeans
# Trains a k-means model.
kmeans = KMeans().setK(14).setSeed(1)
model = kmeans.fit(X_spark_scaled) # Fits a model to the input dataset with optional parameters.
transformed = model.transform(X_spark_scaled).select("features", "prediction") # X_spark_scaled is my PySpark DataFrame consisting of 13 features
transformed.show(5, truncate = False)
+------------------------------------------------------------------------------------------------------------------------------------+----------+
|features |prediction|
+------------------------------------------------------------------------------------------------------------------------------------+----------+
|(14,[4,5,7,8,9,13],[1.0,1.0,485014.0,0.25,2.0,1.0]) |12 |
|(14,[2,7,8,9,12,13],[1.0,2401233.0,1.0,1.0,1.0,1.0]) |2 |
|(14,[2,4,5,7,8,9,13],[0.3333333333333333,0.6666666666666666,0.6666666666666666,2429111.0,0.9166666666666666,1.3333333333333333,3.0])|2 |
|(14,[4,5,7,8,9,12,13],[1.0,1.0,2054748.0,0.15384615384615385,11.0,1.0,1.0]) |11 |
|(14,[2,7,8,9,13],[1.0,43921.0,1.0,1.0,1.0]) |1 |
+------------------------------------------------------------------------------------------------------------------------------------+----------+
only showing top 5 rows
As an aside, I found from another SO post that I can map the features to their names like below. It would be nice to have summary statistics (mean, median, std, min, max) for each feature of each cluster in one or more Pandas dataframes.
from itertools import chain

attr_list = [attr for attr in chain(*transformed.schema['features'].metadata['ml_attr']['attrs'].values())]
attr_list
Per request in the comments, here is a snapshot consisting of 2 records of the data (don't want to provide too many records -- proprietary information here)
+---------------------+------------------------+-----------------------+----------------------+----------------------+------------------------------+---------------------------------+------------+-------------------+--------------------+------------------------------------+--------------------------+-------------------------------+-----------------+--------------------+--------------------+
|device_type_robot_pct|device_type_smart_tv_pct|device_type_desktop_pct|device_type_tablet_pct|device_type_mobile_pct|device_type_mobile_persist_pct|visitors_seen_with_anonymiser_pct|ip_time_span| ip_weight|mean_ips_per_visitor|visitors_seen_with_multi_country_pct|international_visitors_pct|visitors_seen_with_multi_ua_pct|count_tuids_on_ip| features| scaledFeatures|
+---------------------+------------------------+-----------------------+----------------------+----------------------+------------------------------+---------------------------------+------------+-------------------+--------------------+------------------------------------+--------------------------+-------------------------------+-----------------+--------------------+--------------------+
| 0.0| 0.0| 0.0| 0.0| 1.0| 1.0| 0.0| 485014.0| 0.25| 2.0| 0.0| 0.0| 0.0| 1.0|(14,[4,5,7,8,9,13...|(14,[4,5,7,8,9,13...|
| 0.0| 0.0| 1.0| 0.0| 0.0| 0.0| 0.0| 2401233.0| 1.0| 1.0| 0.0| 0.0| 1.0| 1.0|(14,[2,7,8,9,12,1...|(14,[2,7,8,9,12,1...|
As Anony-Mousse has commented, (Py)Spark ML is indeed much more limited than scikit-learn or other similar packages, and such functionality is not trivial; nevertheless, here is a way to get what you want (cluster statistics):
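To keep things concrete, the snippets below assume a toy setup along these lines - 5 records of 5-dimensional features (a mix of sparse and dense vectors) plus one unrelated target column; the values are only illustrative, so the exact cluster assignments depend on the data and the seed:

from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

# assumes an active SparkSession named `spark`
# toy data: 5 records, 5-dimensional features (sparse and dense), plus a 'target' column
df = spark.createDataFrame(
    [(Vectors.sparse(5, [(0, 164.0), (1, 520.0)]), 1.0),
     (Vectors.dense([519.0, 2723.0, 0.0, 3.0, 4.0]), 1.0),
     (Vectors.sparse(5, [(0, 2868.0), (1, 928.0)]), 1.0),
     (Vectors.sparse(5, [(0, 57.0), (1, 2715.0)]), 0.0),
     (Vectors.dense([1241.0, 2104.0, 0.0, 0.0, 0.0]), 1.0)],
    ["features", "target"])

kmeans = KMeans().setK(3).setSeed(1)
model = kmeans.fit(df.select("features"))
transformed = model.transform(df).select("features", "prediction")
transformed.show(truncate=False)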
Up to here, and regarding your first question:

The features column is just a replication of the same column in your original data. The prediction column is the cluster to which the respective data record belongs; in my example, with 5 data records and k=3 clusters, I end up with 1 record in cluster #0, 1 record in cluster #1, and 3 records in cluster #2.

Regarding your second question:
(Note: seems you have 14 features and not 13...)
This is a good example of a seemingly simple task for which, unfortunately, PySpark does not provide ready functionality - not least because all features are grouped in a single vector features; to do that, we must first "disassemble" features, effectively coming up with the inverse operation of VectorAssembler.

The only way I can presently think of is to revert temporarily to an RDD and perform a map operation [EDIT: this is not really necessary - see UPDATE below]; here is an example with my cluster #2 above, which contains both dense and sparse vectors:
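As a sketch of the idea (written for the 5-dimensional toy data above; dimensionality and the placeholder column names should be adapted to your own 14 features):

dimensionality = 5  # 14 in your case
col_names = ['x' + str(i) for i in range(dimensionality)]  # placeholder feature names

# keep only the records assigned to cluster #2
cluster_2 = transformed.filter(transformed.prediction == 2)
cluster_2.show(truncate=False)

# "disassemble" the features vector via an RDD map: one float per dimension
# (indexing with [i] works for both DenseVector and SparseVector)
cluster_2_df = (cluster_2.rdd
                .map(lambda row: tuple(float(row.features[i]) for i in range(dimensionality)))
                .toDF(col_names))
cluster_2_df.show(truncate=False)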
(If you have your initial data in a Spark dataframe initial_data, you can change the last part to toDF(schema=initial_data.columns), in order to keep the original feature names.)

From this point, you could either convert the cluster_2 dataframe to a pandas one (if it fits in your memory), or use the describe() function of Spark dataframes to get your summary statistics:
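For example, continuing with the cluster_2_df sketched above:

# option 1: pull the disassembled cluster into pandas (only if it fits in memory)
cluster_2_pandas = cluster_2_df.toPandas()
print(cluster_2_pandas.describe())   # count, mean, std, min, quartiles (incl. median), max

# option 2: stay in Spark
cluster_2_df.describe().show()       # count, mean, stddev, min, max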
Using the above code with dimensionality=14 in your case should do the job...

Annoyed with all these (arguably useless) significant digits in mean and stddev? As a bonus, here is a small utility function I came up with some time ago for a pretty summary:
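Something along these lines (the exact implementation here is only a sketch; the function name pretty_summary and the rounding to 2 decimals are illustrative choices):

import pandas as pd

def pretty_summary(spark_df, digits=2):
    """describe() of a Spark dataframe, returned as a pandas dataframe with rounded numerics."""
    summary = spark_df.describe().toPandas()
    # describe() returns everything as strings; convert every column except 'summary' and round it
    for col in summary.columns[1:]:
        summary[col] = pd.to_numeric(summary[col], errors='coerce').round(digits)
    return summary

print(pretty_summary(cluster_2_df))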
UPDATE: Thinking about it again, and seeing your sample data, I came up with a more straightforward solution, without the need to invoke an intermediate RDD (an operation that one would arguably prefer to avoid, if possible)...

The key observation is the complete contents of transformed, i.e. without the select statement; keeping the same toy dataset as above, we get:
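That is (every existing column is kept, and a prediction column is appended):

# transform the full dataframe this time, without selecting any columns afterwards:
# every existing column ('features' and 'target' here) is carried through,
# and a new 'prediction' column is appended
model.transform(df).show(truncate=False)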
As you can see, whatever other columns are present in the dataframe df to be transformed (just one in my case - target) just "pass through" the transformation procedure and end up being present in the final outcome...

Hopefully you start getting the idea: if df contains your initial 14 features, each one in a separate column, plus a 15th column named features (roughly as shown in your sample data, but without the last column), then the following code will leave you with a Spark dataframe transformed containing 15 columns, i.e. your initial 14 features plus a prediction column with the corresponding cluster number.
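A sketch of that code, assuming df is the dataframe holding your 14 named feature columns plus the assembled features vector:

kmeans = KMeans().setK(14).setSeed(1)
model = kmeans.fit(df.select('features'))           # cluster on the assembled vector only
transformed = model.transform(df).drop('features')  # 14 named feature columns + 'prediction'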
From this point, you can proceed as I have shown above to filter specific clusters from transformed and get your summary statistics, but you'll have avoided the (costly...) conversion to intermediate temporary RDDs, thus keeping all your operations in the more efficient context of Spark dataframes...
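For instance (with plain describe(), or the pretty_summary helper sketched above):

# summary statistics for, say, cluster #3, computed directly on the named feature columns
cluster_3 = transformed.filter(transformed.prediction == 3).drop('prediction')
cluster_3.describe().show()   # or: print(pretty_summary(cluster_3))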