I am trying to run the FPGrowth algorithm in PySpark on my dataset:
from pyspark.ml.fpm import FPGrowth
fpGrowth = FPGrowth(itemsCol="name", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df)
I am getting the following error:
An error occurred while calling o2139.fit.
: java.lang.IllegalArgumentException: requirement failed: The input column must be ArrayType, but got StringType.
at scala.Predef$.require(Predef.scala:224)
My DataFrame df looks like this:
df.show(2)
+---+---------+--------------------+
| id|     name|               actor|
+---+---------+--------------------+
|  0|['ab,df']|                 tom|
|  1|['rs,ce']|                brad|
+---+---------+--------------------+
only showing top 2 rows
The FP algorithm works if my data in column "name" is in the form:
name
[ab,df]
[rs,ce]
How do I get the column into this form, that is, convert it from StringType to ArrayType?
I created the DataFrame from my RDD as follows:
from pyspark.sql import Row
rd2 = rd.map(lambda x: (x[1], x[0][0], [x[0][1]]))
rd3 = rd2.map(lambda p: Row(id=int(p[0]), name=str(p[2]), actor=str(p[1])))
df = spark.createDataFrame(rd3)
Output of rd2.take(2):
[(0, 'tom', ['ab,df']), (1, 'brad', ['rs,ce'])]
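A possible cause, judging from the snippets above: str(p[2]) turns the Python list into its text representation, which would explain why the column is inferred as StringType. The plain-Python sketch below illustrates the difference using the two tuples from rd2.take(2). It assumes each string holds comma-separated items; split_items is a hypothetical helper introduced only for illustration, not part of PySpark.

```python
# Sample tuples copied from the rd2.take(2) output above.
rows = [(0, 'tom', ['ab,df']), (1, 'brad', ['rs,ce'])]


def split_items(p):
    """Hypothetical helper: turn ['ab,df'] into ['ab', 'df'].

    Assumes every element of p[2] is a comma-separated string of items.
    """
    return [item.strip() for joined in p[2] for item in joined.split(",")]


# str() on a list yields one string per row, e.g. "['ab,df']".
as_string = [str(p[2]) for p in rows]

# Keeping a real Python list per row preserves the individual items.
as_lists = [split_items(p) for p in rows]

print(as_string)
print(as_lists)
```

Under that assumption, building the Row with name=split_items(p) instead of name=str(p[2]) should let Spark infer an array of strings for the column. This has not been checked against the full dataset.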