I am trying to run a random forest classification and evaluate the model using cross-validation. I am working in PySpark. The input CSV file is loaded as a Spark dataframe. But I am facing a problem while building the model.
Below is the code.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.evaluation import BinaryClassificationMetrics
sc = SparkContext()
sqlContext = SQLContext(sc)
trainingData = (sqlContext.read
                .format("com.databricks.spark.csv")
                .option("header", "true")
                .option("inferSchema", "true")
                .load("/PATH/CSVFile"))
numFolds = 10
rf = RandomForestClassifier(numTrees=100, maxDepth=5, maxBins=5, labelCol="V5409",featuresCol="features",seed=42)
evaluator = MulticlassClassificationEvaluator().setLabelCol("V5409").setPredictionCol("prediction").setMetricName("accuracy")
paramGrid = ParamGridBuilder().build()
pipeline = Pipeline(stages=[rf])
crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=numFolds)
model = crossval.fit(trainingData)
print accuracy
I am getting the following error:
Traceback (most recent call last):
  File "SparkDF.py", line 41, in <module>
    model = crossval.fit(trainingData)
  File "/usr/local/spark-2.1.1/python/pyspark/ml/base.py", line 64, in fit
    return self._fit(dataset)
  File "/usr/local/spark-2.1.1/python/pyspark/ml/tuning.py", line 236, in _fit
    model = est.fit(train, epm[j])
  File "/usr/local/spark-2.1.1/python/pyspark/ml/base.py", line 64, in fit
    return self._fit(dataset)
  File "/usr/local/spark-2.1.1/python/pyspark/ml/pipeline.py", line 108, in _fit
    model = stage.fit(dataset)
  File "/usr/local/spark-2.1.1/python/pyspark/ml/base.py", line 64, in fit
    return self._fit(dataset)
  File "/usr/local/spark-2.1.1/python/pyspark/ml/wrapper.py", line 236, in _fit
    java_model = self._fit_java(dataset)
  File "/usr/local/spark-2.1.1/python/pyspark/ml/wrapper.py", line 233, in _fit_java
    return self._java_obj.fit(dataset._jdf)
  File "/home/hadoopuser/anaconda2/lib/python2.7/site-packages/py4j/java_gateway.py", line 1160, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/spark-2.1.1/python/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u'Field "features" does not exist.'
Please help me solve this problem in PySpark. Thank you.
I am showing the details of the dataset here. No, I do not have a features column in particular. Below is the output of trainingData.take(5), which shows the first 5 rows of the dataset:
[Row(V4366=0.0, V4460=0.232, V4916=-0.017, V1495=-0.104, V1639=0.005, V1967=-0.008, V3049=0.177, V3746=-0.675, V3869=-3.451, V524=0.004, V5409=0), Row(V4366=0.0, V4460=0.111, V4916=-0.003, V1495=-0.137, V1639=0.001, V1967=-0.01, V3049=0.01, V3746=-0.867, V3869=-2.759, V524=0.0, V5409=0), Row(V4366=0.0, V4460=-0.391, V4916=-0.003, V1495=-0.155, V1639=-0.006, V1967=-0.019, V3049=-0.706, V3746=0.166, V3869=0.189, V524=0.001, V5409=0), Row(V4366=0.0, V4460=0.098, V4916=-0.012, V1495=-0.108, V1639=0.005, V1967=-0.002, V3049=0.033, V3746=-0.787, V3869=-0.926, V524=0.002, V5409=0), Row(V4366=0.0, V4460=0.026, V4916=-0.004, V1495=-0.139, V1639=0.003, V1967=-0.006, V3049=-0.045, V3746=-0.208, V3869=-0.782, V524=0.001, V5409=0)]
Here V4366 to V524 are the features, and V5409 is the class label.
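For completeness, a quick way to verify this on the loaded dataframe (a minimal inspection sketch):
trainingData.printSchema()      # lists V4366 ... V5409 only; no "features" field
print(trainingData.columns)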
Answer 1:
Spark dataframes are not used like that in Spark ML; all your features need to be vectors in a single column, usually named features. Here is how you can do it, using the 5 rows you have provided as an example:
spark.version
# u'2.2.0'
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
# your sample data:
temp_df = spark.createDataFrame([Row(V4366=0.0, V4460=0.232, V4916=-0.017, V1495=-0.104, V1639=0.005, V1967=-0.008, V3049=0.177, V3746=-0.675, V3869=-3.451, V524=0.004, V5409=0), Row(V4366=0.0, V4460=0.111, V4916=-0.003, V1495=-0.137, V1639=0.001, V1967=-0.01, V3049=0.01, V3746=-0.867, V3869=-2.759, V524=0.0, V5409=0), Row(V4366=0.0, V4460=-0.391, V4916=-0.003, V1495=-0.155, V1639=-0.006, V1967=-0.019, V3049=-0.706, V3746=0.166, V3869=0.189, V524=0.001, V5409=0), Row(V4366=0.0, V4460=0.098, V4916=-0.012, V1495=-0.108, V1639=0.005, V1967=-0.002, V3049=0.033, V3746=-0.787, V3869=-0.926, V524=0.002, V5409=0), Row(V4366=0.0, V4460=0.026, V4916=-0.004, V1495=-0.139, V1639=0.003, V1967=-0.006, V3049=-0.045, V3746=-0.208, V3869=-0.782, V524=0.001, V5409=0)])
trainingData=temp_df.rdd.map(lambda x:(Vectors.dense(x[0:-1]), x[-1])).toDF(["features", "label"])
trainingData.show()
# +--------------------+-----+
# | features|label|
# +--------------------+-----+
# |[-0.104,0.005,-0....| 0|
# |[-0.137,0.001,-0....| 0|
# |[-0.155,-0.006,-0...| 0|
# |[-0.108,0.005,-0....| 0|
# |[-0.139,0.003,-0....| 0|
# +--------------------+-----+
After this, your pipeline should run fine (I am assuming that you indeed have multi-class classification, since your sample contains only 0's as labels), changing only the label column in your rf and evaluator as follows:
rf = RandomForestClassifier(numTrees=100, maxDepth=5, maxBins=5, labelCol="label",featuresCol="features",seed=42)
evaluator = MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("prediction").setMetricName("accuracy")
Finally, print accuracy will not work; you need model.avgMetrics instead.
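For example, a minimal sketch of how that looks with the crossval and trainingData defined above:
model = crossval.fit(trainingData)
# avgMetrics holds one averaged metric (here: accuracy) per parameter
# combination; with the empty ParamGridBuilder().build() grid there is
# exactly one entry
print(model.avgMetrics[0])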
Answer 2:
I would like to add my 5 cents to desertnaut's answer: as of now (Spark 2.2.0), there is a quite handy VectorAssembler class which handles the transformation of multiple columns into one vector column. The code then looks like this:
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
# your sample data:
temp_df = spark.createDataFrame([Row(V4366=0.0, V4460=0.232, V4916=-0.017, V1495=-0.104, V1639=0.005, V1967=-0.008, V3049=0.177, V3746=-0.675, V3869=-3.451, V524=0.004, V5409=0), Row(V4366=0.0, V4460=0.111, V4916=-0.003, V1495=-0.137, V1639=0.001, V1967=-0.01, V3049=0.01, V3746=-0.867, V3869=-2.759, V524=0.0, V5409=0), Row(V4366=0.0, V4460=-0.391, V4916=-0.003, V1495=-0.155, V1639=-0.006, V1967=-0.019, V3049=-0.706, V3746=0.166, V3869=0.189, V524=0.001, V5409=0), Row(V4366=0.0, V4460=0.098, V4916=-0.012, V1495=-0.108, V1639=0.005, V1967=-0.002, V3049=0.033, V3746=-0.787, V3869=-0.926, V524=0.002, V5409=0), Row(V4366=0.0, V4460=0.026, V4916=-0.004, V1495=-0.139, V1639=0.003, V1967=-0.006, V3049=-0.045, V3746=-0.208, V3869=-0.782, V524=0.001, V5409=0)])
assembler = VectorAssembler(
    inputCols=['V4366', 'V4460', 'V4916', 'V1495', 'V1639', 'V1967', 'V3049', 'V3746', 'V3869', 'V524'],
    outputCol='features')
trainingData = assembler.transform(temp_df)
trainingData.show()
# +------+------+------+------+------+------+-----+------+------+-----+-----+--------------------+
# | V1495| V1639| V1967| V3049| V3746| V3869|V4366| V4460| V4916| V524|V5409| features|
# +------+------+------+------+------+------+-----+------+------+-----+-----+--------------------+
# |-0.104| 0.005|-0.008| 0.177|-0.675|-3.451| 0.0| 0.232|-0.017|0.004| 0|[0.0,0.232,-0.017...|
# |-0.137| 0.001| -0.01| 0.01|-0.867|-2.759| 0.0| 0.111|-0.003| 0.0| 0|[0.0,0.111,-0.003...|
# |-0.155|-0.006|-0.019|-0.706| 0.166| 0.189| 0.0|-0.391|-0.003|0.001| 0|[0.0,-0.391,-0.00...|
# |-0.108| 0.005|-0.002| 0.033|-0.787|-0.926| 0.0| 0.098|-0.012|0.002| 0|[0.0,0.098,-0.012...|
# |-0.139| 0.003|-0.006|-0.045|-0.208|-0.782| 0.0| 0.026|-0.004|0.001| 0|[0.0,0.026,-0.004...|
# +------+------+------+------+------+------+-----+------+------+-----+-----+--------------------+
This way it can easily be integrated as a processing step in a pipeline. Another important difference here is that the new features column is appended to the dataframe.
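For example, a minimal sketch of that integration, reusing the assembler and temp_df from above together with the cross-validation settings from the question (this assumes the full dataset rather than the 5-row sample, since 10-fold cross-validation needs enough rows per fold):
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# the assembler runs first, so the "features" column exists by the time
# the classifier sees the data; the label stays in its original column
rf = RandomForestClassifier(numTrees=100, maxDepth=5, maxBins=5,
                            labelCol="V5409", featuresCol="features", seed=42)
pipeline = Pipeline(stages=[assembler, rf])
evaluator = MulticlassClassificationEvaluator(labelCol="V5409",
                                              predictionCol="prediction",
                                              metricName="accuracy")
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=ParamGridBuilder().build(),
                          evaluator=evaluator,
                          numFolds=10)
model = crossval.fit(temp_df)
print(model.avgMetrics[0])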