I am trying to run the FPGrowth algorithm in PySpark on my dataset.
from pyspark.ml.fpm import FPGrowth
fpGrowth = FPGrowth(itemsCol="name", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df)
I am getting the following error:
An error occurred while calling o2139.fit.
: java.lang.IllegalArgumentException: requirement failed: The input
column must be ArrayType, but got StringType.
at scala.Predef$.require(Predef.scala:224)
My Dataframe df is in the form:
df.show(2)
+---+---------+--------------------+
| id| name| actor|
+---+---------+--------------------+
| 0|['ab,df']| tom|
| 1|['rs,ce']| brad|
+---+---------+--------------------+
only showing top 2 rows
The FP algorithm works if my data in column "name" is in the form:
name
[ab,df]
[rs,ce]
How do I get it into this form, i.e. how do I convert the column from StringType to ArrayType?
I formed the Dataframe from my RDD:
rd2 = rd.map(lambda x: (x[1], x[0][0], [x[0][1]]))
rd3 = rd2.map(lambda p: Row(id=int(p[0]), name=str(p[2]), actor=str(p[1])))
df = spark.createDataFrame(rd3)
rd2.take(2):
[(0, 'tom', ['ab,df']), (1, 'brad', ['rs,ce'])]
Split by comma for each row in the name column of your dataframe, e.g. as in the sketch below. Or better, don't defer this: set name directly to the list when you build the RDD.
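For the first approach, a minimal sketch using pyspark.sql.functions.split (assuming the name column holds a plain comma-separated string such as ab,df; the bracketed text shown by df.show() above would need to be cleaned up first):

from pyspark.sql.functions import split

# Replace the string column with an array column by splitting on ","
df = df.withColumn("name", split(df["name"], ","))
df.printSchema()  # name should now be array<string>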
Based on your previous question, it seems as though you are building rd2 incorrectly. Try this:
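A sketch of the corrected construction (assuming each element of rd looks like (('tom', 'ab,df'), 0), as implied by your mapping and the rd2.take(2) output above):

from pyspark.sql import Row

# Split the comma-separated string into a real Python list
# instead of wrapping the whole string in a one-element list
rd2 = rd.map(lambda x: (x[1], x[0][0], x[0][1].split(",")))

# Keep name as the list itself (no str()) so Spark infers ArrayType
rd3 = rd2.map(lambda p: Row(id=int(p[0]), name=p[2], actor=str(p[1])))
df = spark.createDataFrame(rd3)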
The change is that we call str.split(",") on x[0][1], so that a string like 'a,b' is converted to a list ['a', 'b']. The name field is also passed to Row as that list rather than through str(), which is why the resulting column becomes ArrayType instead of StringType.
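With name as an ArrayType column, the original call should now go through, for example:

fpGrowth = FPGrowth(itemsCol="name", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df)
model.freqItemsets.show()  # frequent itemsets mined from the array column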