PySpark DataFrame: convert multiple columns to float

Published 2020-08-11 04:39

Question:

I am trying to convert multiple columns of a DataFrame from string to float like this:

df_temp = sc.parallelize([("1", "2", "3.4555"), ("5.6", "6.7", "7.8")]).toDF(("x", "y", "z"))
df_temp.select(*(float(col(c)).alias(c) for c in df_temp.columns)).show()

but I am getting the error:

select() argument after * must be a sequence, not generator

I cannot understand why this error is being thrown.

Answer 1:

float() is not a Spark function; you need the cast() method of Column:

from pyspark.sql.functions import col
df_temp.select(*(col(c).cast("float").alias(c) for c in df_temp.columns))
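
Applied to the df_temp from the question, this gives (a minimal sketch; it assumes an active Spark session, and the commented schema is the expected output):

df_converted = df_temp.select(*(col(c).cast("float").alias(c) for c in df_temp.columns))
df_converted.printSchema()
# root
#  |-- x: float (nullable = true)
#  |-- y: float (nullable = true)
#  |-- z: float (nullable = true)

If your Python version rejects a generator after *, wrap the expressions in a list comprehension instead: df_temp.select(*[col(c).cast("float").alias(c) for c in df_temp.columns]).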


Answer 2:

If you want to cast only some columns without changing the whole DataFrame, you can do that with the withColumn function:

from pyspark.sql.functions import col

for col_name in cols:   # cols: the list of column names to cast
    df = df.withColumn(col_name, col(col_name).cast('float'))

This will cast the columns in the cols list and leave the other columns unchanged.
Note:
withColumn replaces or creates a column based on the column name:
if the column name already exists it is replaced, otherwise a new column is created.
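
A short sketch of this loop on the question's df_temp, casting only x and y (the choice of columns here is just an illustration):

cols = ["x", "y"]   # cast these two columns; leave z untouched
for col_name in cols:
    df_temp = df_temp.withColumn(col_name, col(col_name).cast('float'))
df_temp.printSchema()   # x and y are now float; z is still string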



Answer 3:

Here is another approach: build the conversion expressions as strings and evaluate them row by row over the underlying RDD:

cv = []   # list of column names you want to convert to float
cf = []   # list of column names you want to keep unchanged

# build an expression such as '[float(x.a),float(x.b)]' for the columns to convert
l = ['float(x.' + c + ')' for c in cv]
cst = '[' + ','.join(l) + ']'

# build an expression such as '[x.c,x.d]' for the columns to keep
l2 = ['x.' + c for c in cf]
cst2 = '[' + ','.join(l2) + ']'

# DataFrames have no map() in recent PySpark, so map over the underlying RDD;
# building lists instead of parenthesised tuples also stays correct when a list names a single column
df2rdd = df.rdd.map(lambda x: eval(cst2) + eval(cst))

# the unchanged columns come first in each row, so pass the names in the same order
df_output = sqlContext.createDataFrame(df2rdd, cf + cv)

df_output is the required DataFrame.
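
Filled in for the question's df_temp (the cv/cf split below is an assumption for illustration; note that Python's float() yields double-typed columns when the schema is inferred):

cv = ["x", "y"]   # convert these two to float
cf = ["z"]        # keep this one as a string
# after running the steps above with df = df_temp:
# df_output.printSchema()  ->  z: string, x: double, y: double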