Using: http://spark.apache.org/docs/1.6.1/mllib-frequent-pattern-mining.html
Python Code:
from pyspark.mllib.fpm import FPGrowth
# train(data, minSupport, numPartitions) -- there is no minConfidence parameter
model = FPGrowth.train(dataframe, 0.01, 10)
Scala:
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD
val data = sc.textFile("data/mllib/sample_fpgrowth.txt")
val transactions: RDD[Array[String]] = data.map(s => s.trim.split(' '))
val fpg = new FPGrowth()
  .setMinSupport(0.2)
  .setNumPartitions(10)
val model = fpg.run(transactions)
model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}
val minConfidence = 0.8
model.generateAssociationRules(minConfidence).collect().foreach { rule =>
  println(
    rule.antecedent.mkString("[", ",", "]")
      + " => " + rule.consequent.mkString("[", ",", "]")
      + ", " + rule.confidence)
}
From the Scala code behind the Python API (trainFPGrowthModel), we can see that it does not take a minimum confidence:
def trainFPGrowthModel(
    data: JavaRDD[java.lang.Iterable[Any]],
    minSupport: Double,
    numPartitions: Int): FPGrowthModel[Any] = {
  val fpg = new FPGrowth()
    .setMinSupport(minSupport)
    .setNumPartitions(numPartitions)
  val model = fpg.run(data.rdd.map(_.asScala.toArray))
  new FPGrowthModelWrapper(model)
}
How can I add minConfidence to generate association rules in the case of PySpark? We can see that Scala has an example, but Python does not.
You can generate and extract association rules in PySpark on Spark < 2.2 with a little bit of Py4J code:
Spark >= 2.2
There is a DataFrame-based ML API which provides associationRules:
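For example, a minimal sketch assuming an input DataFrame df with an array column named items:
from pyspark.ml.fpm import FPGrowth
# df is assumed to be a DataFrame with an array<string> column called "items"
fp = FPGrowth(itemsCol="items", minSupport=0.2, minConfidence=0.8)
model = fp.fit(df)
# associationRules is a DataFrame with antecedent, consequent and confidence columns
model.associationRules.show()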
Spark < 2.2
As for now PySpark doesn't support extracting association rules (the DataFrame-based FPGrowth API with Python support is a work in progress, SPARK-1450), but we can easily address that.
First you'll have to install SBT (just go to the downloads page) and follow the instructions for your operating system.
Next you'll have to create a simple Scala project with only two files: build.sbt and AssociationRulesExtractor.scala.
You can adjust it later to follow the established directory structure.
Next add the following to build.sbt (adjust the Scala and Spark versions to match the ones you use):
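A minimal sketch — the project name and the exact version numbers here are assumptions, pick whatever matches your cluster:

name := "association-rules-extractor"

version := "0.1"

scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.6.1" % "provided"
)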
and the following to AssociationRulesExtractor.scala:
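A sketch of what such an extractor could look like — the package name com.example and the exact conversion are assumptions; the only goal is to turn Rule objects into plain values that PySpark's serialization can ship back to Python:
package com.example

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.fpm.AssociationRules.Rule

// Flattens association rules into arrays of simple values so that the result
// can be converted back into a Python RDD on the PySpark side.
object AssociationRulesExtractor {
  def apply(rules: RDD[Rule[Any]]): RDD[Array[Any]] =
    rules.map(rule => Array(
      rule.antecedent.mkString(","),
      rule.consequent.mkString(","),
      rule.confidence
    ))
}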
Open a terminal emulator of your choice, go to the root directory of the project, and call:
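A plain sbt package from the project root is assumed here:
sbt package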
It will generate a jar file in the target directory. For example, with Scala 2.10 it will be something like:
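Assuming the project name and version from the build.sbt sketch above:
target/scala-2.10/association-rules-extractor_2.10-0.1.jar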
Start the PySpark shell or use spark-submit and pass the path to the generated jar file as --driver-class-path:
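For example (the jar path is the one assumed above):
pyspark --driver-class-path target/scala-2.10/association-rules-extractor_2.10-0.1.jar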
In non-local mode:
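In that case you'd typically also ship the jar to the executors; the --jars flag below is an assumption about your deployment, adjust to taste:
pyspark --driver-class-path association-rules-extractor_2.10-0.1.jar --jars association-rules-extractor_2.10-0.1.jar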
In cluster mode the jar should be present on all nodes.
Add some convenience wrappers:
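A sketch of such wrappers — this leans on PySpark internals (model._sc, model._java_model, _java2py) and on the hypothetical com.example.AssociationRulesExtractor from the jar above; the function name is my own choice:
from collections import namedtuple
from pyspark.mllib.common import _java2py
from pyspark.mllib.fpm import FPGrowthModel

Rule = namedtuple("Rule", ["antecedent", "consequent", "confidence"])

def generate_association_rules(model, min_confidence):
    # The JVM wrapper behind the Python model extends FPGrowthModel,
    # so it inherits generateAssociationRules and we can call it via Py4J.
    sc = model._sc
    java_rules = model._java_model.generateAssociationRules(min_confidence)
    # The helper object from the jar flattens Rule objects into plain values
    # that _java2py can convert back into a Python RDD.
    extractor = sc._jvm.com.example.AssociationRulesExtractor
    return _java2py(sc, extractor.apply(java_rules)).map(lambda x: Rule(*x))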
Finally you can use these helpers as a function:
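For example (model is the FPGrowthModel trained earlier; the threshold is arbitrary):
rules = generate_association_rules(model, 0.9)
rules.take(3)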
or as a method:
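by monkey-patching the model class (again just a sketch):
FPGrowthModel.generateAssociationRules = generate_association_rules
rules = model.generateAssociationRules(0.9)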
This solution depends on internal PySpark methods, so it is not guaranteed to be portable between versions.