How to use string variables in VectorAssembler in

I want to run the Random Forests algorithm in PySpark. The PySpark documentation mentions that VectorAssembler accepts only numeric or boolean data types. So, if my data contains StringType variables, say names of cities, should I one-hot encode them before proceeding with Random Forests classification/regression?

Here is the code I have been trying; the input file is here:

from pyspark.sql.types import DoubleType

train = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('filename')
drop_list = ["Country", "Carrier", "TrafficType", "Device", "Browser", "OS", "Fraud", "ConversionPayOut"]
# Only this column is actually a double; the rest are strings.
train = train.withColumn("ConversionPayOut", train["ConversionPayOut"].cast("double"))
junk = train.select([column for column in train.columns if column in drop_list])
transformed = assembler.transform(junk)

I keep getting the error IllegalArgumentException: u'Data type StringType is not supported.'

P.S.: Apologies for asking a basic question. I come from an R background. In R, when we run Random Forests, there is no need to convert categorical variables into numeric variables.

Tags: pyspark random-forest

2 Answers

Emotional °昔

#2 · 2020-03-27 03:56

The following is an example in Scala.
Schema:
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: double (nullable = true)
 |-- education: string (nullable = true)
 |-- education-num: double (nullable = true)
 |-- marital-status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital-gain: double (nullable = true)
 |-- capital-loss: double (nullable = true)
 |-- hours-per-week: double (nullable = true)
 |-- native-country: string (nullable = true)
 |-- income: string (nullable = true)

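        import org.apache.spark.ml.classification.LogisticRegression
        import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

        // Deal with categorical columns:
        // map each string column to a numeric category index with StringIndexer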
        val workclassIndexer = new StringIndexer().setInputCol("workclass").setOutputCol("workclassIndex")
        val educationIndexer = new StringIndexer().setInputCol("education").setOutputCol("educationIndex")
        val maritalStatusIndexer = new StringIndexer().setInputCol("marital-status").setOutputCol("maritalStatusIndex")
        val occupationIndexer = new StringIndexer().setInputCol("occupation").setOutputCol("occupationIndex")
        val relationshipIndexer = new StringIndexer().setInputCol("relationship").setOutputCol("relationshipIndex")
        val raceIndexer = new StringIndexer().setInputCol("race").setOutputCol("raceIndex")
        val sexIndexer = new StringIndexer().setInputCol("sex").setOutputCol("sexIndex")
        val nativeCountryIndexer = new StringIndexer().setInputCol("native-country").setOutputCol("nativeCountryIndex")
        val incomeIndexer = new StringIndexer().setInputCol("income").setOutputCol("incomeIndex")

        // One-hot encode each indexed column into a vector column
        val workclassEncoder = new OneHotEncoder().setInputCol("workclassIndex").setOutputCol("workclassVec")
        val educationEncoder = new OneHotEncoder().setInputCol("educationIndex").setOutputCol("educationVec")
        val maritalStatusEncoder = new OneHotEncoder().setInputCol("maritalStatusIndex").setOutputCol("maritalVec")
        val occupationEncoder = new OneHotEncoder().setInputCol("occupationIndex").setOutputCol("occupationVec")
        val relationshipEncoder = new OneHotEncoder().setInputCol("relationshipIndex").setOutputCol("relationshipVec")
        val raceEncoder = new OneHotEncoder().setInputCol("raceIndex").setOutputCol("raceVec")
        val sexEncoder = new OneHotEncoder().setInputCol("sexIndex").setOutputCol("sexVec")
        val nativeCountryEncoder = new OneHotEncoder().setInputCol("nativeCountryIndex").setOutputCol("nativeCountryVec")
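        // Despite its name, this is a second StringIndexer: it re-indexes incomeIndex into the "label" column
        val incomeEncoder = new StringIndexer().setInputCol("incomeIndex").setOutputCol("label")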

        // Assemble the feature columns into one "features" vector, giving the ("label", "features") format
        val assembler = (new VectorAssembler()
          .setInputCols(Array("workclassVec", "fnlwgt", "educationVec", "education-num", "maritalVec", "occupationVec", "relationshipVec", "raceVec", "sexVec", "capital-gain", "capital-loss", "hours-per-week", "nativeCountryVec"))
          .setOutputCol("features"))

    // Set up the pipeline
    import org.apache.spark.ml.Pipeline

    val lr = new LogisticRegression()

    val pipeline = new Pipeline().setStages(Array(workclassIndexer, educationIndexer, maritalStatusIndexer, occupationIndexer, relationshipIndexer, raceIndexer, sexIndexer, nativeCountryIndexer, incomeIndexer, workclassEncoder, educationEncoder, maritalStatusEncoder, occupationEncoder, relationshipEncoder, raceEncoder, sexEncoder, nativeCountryEncoder, incomeEncoder, assembler, lr))

    // Fit the pipeline to the training DataFrame.
    val model = pipeline.fit(training)
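
To see why the indexers and encoders have to run before the assembler, here is a minimal plain-Python sketch (no Spark required) of what the assembling step does conceptually: it concatenates numeric values and already-encoded vectors into one flat feature vector, so it has nowhere to put a raw string. The function name `assemble` and the sample row are hypothetical, and this is not how Spark implements VectorAssembler internally.

```python
from numbers import Number


def assemble(row, input_cols):
    """Concatenate numeric scalars and numeric vectors into one flat list.

    Hypothetical stand-in for the idea behind VectorAssembler, not Spark code.
    """
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, Number):  # ints, floats, and booleans
            features.append(float(value))
        elif isinstance(value, (list, tuple)):  # an already-encoded vector
            features.extend(float(v) for v in value)
        else:  # e.g. a raw string such as a city name
            raise TypeError(
                f"Data type {type(value).__name__} of column {col!r} is not supported"
            )
    return features


# Hypothetical row: one numeric column, two one-hot vectors, one raw string.
row = {"fnlwgt": 77516.0, "sexVec": [1.0], "raceVec": [0.0, 1.0, 0.0], "sex": "Male"}
print(assemble(row, ["fnlwgt", "sexVec", "raceVec"]))
```

Passing the raw "sex" column instead of "sexVec" raises a TypeError in this sketch, which mirrors the "Data type StringType is not supported" error from the question.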


爷的心禁止访问

#3 · 2020-03-27 03:57

Yes, you should use StringIndexer, possibly together with OneHotEncoder. You can find more information on both in the linked documentation.
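
As a rough illustration of what these two steps do to a string column, the plain-Python sketch below (no Spark required) maps each category to an index ordered by descending frequency, which is StringIndexer's default ordering, and then expands each index into a 0/1 vector that omits the last category, as OneHotEncoder does by default. The helper names `fit_string_index` and `one_hot` and the city values are hypothetical, the alphabetical tie-breaking is an assumption, and Spark produces sparse vectors rather than Python lists.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an index; the most frequent label gets 0."""
    counts = Counter(values)
    # Sort by descending count; ties are broken alphabetically (an assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Expand a category index into a 0/1 vector.

    With drop_last=True the last category is encoded as all zeros.
    """
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


# Hypothetical city column.
cities = ["Paris", "Tokyo", "Paris", "Lima", "Tokyo", "Paris"]
mapping = fit_string_index(cities)
encoded = [one_hot(mapping[c], len(mapping)) for c in cities]
print(mapping)
print(encoded)
```

For tree-based models such as Random Forests, the index column produced by StringIndexer can usually be passed to VectorAssembler directly, without the one-hot step, because Spark's tree learners can treat indexed columns as categorical.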
