I am trying to convert multiple columns of a DataFrame from string to float like this:
df_temp = sc.parallelize([("1", "2", "3.4555"), ("5.6", "6.7", "7.8")]).toDF(("x", "y", "z"))
df_temp.select(*(float(col(c)).alias(c) for c in df_temp.columns)).show()
but I am getting the error:
select() argument after * must be a sequence, not generator
I cannot understand why this error is being thrown.
float() is not a Spark function; you need the Column method cast():
from pyspark.sql.functions import col
df_temp.select(*(col(c).cast("float").alias(c) for c in df_temp.columns))
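One thing worth knowing about cast("float") is that Spark returns null instead of raising when a string cannot be parsed. A plain-Python sketch of that per-value behavior (the helper name is mine, not a Spark API):

```python
def cast_to_float(value):
    """Mimic Spark's cast('float'): return None when parsing fails."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

print(cast_to_float("3.4555"))  # 3.4555
print(cast_to_float("abc"))     # None, like Spark's null
```

So a column containing non-numeric strings will silently become null rather than fail the job.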
If you want to cast only some columns without changing the whole DataFrame, you can do that with the withColumn function:
from pyspark.sql.functions import col

for col_name in cols:  # cols: list of column names to cast
    df = df.withColumn(col_name, col(col_name).cast("float"))
This will cast the columns in the cols list and keep the other columns as they are.
Note:
withColumn replaces or creates a column based on its name: if a column with that name already exists it is replaced, otherwise it is created.
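The replace-or-create rule can be pictured with an ordinary dict mapping column names to values (the helper below is mine, purely illustrative, not a Spark API):

```python
def with_column(columns, name, values):
    """Replace the column `name` if it exists, otherwise create it."""
    out = dict(columns)
    out[name] = values
    return out

df_like = {"x": ["1", "5.6"]}
df_like = with_column(df_like, "x", [1.0, 5.6])    # "x" exists -> replaced
df_like = with_column(df_like, "x2", [2.0, 11.2])  # "x2" is new -> created
print(sorted(df_like))  # ['x', 'x2']
```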
Here is another approach:
cv = []  # list of columns you want to convert to float
cf = []  # list of columns you don't want to change
l = ['float(x.' + c + ')' for c in cv]
cst = '(' + ','.join(l) + ')'
l2 = ['x.' + c for c in cf]
cst2 = '(' + ','.join(l2) + ')'
# DataFrames have no map() in Spark 2+, so map over the underlying RDD;
# each output tuple holds the unchanged columns first, then the casted ones,
# so the schema must be cf + cv rather than df.columns
df2rdd = df.rdd.map(lambda x: eval(cst2) + eval(cst))
df_output = sqlContext.createDataFrame(df2rdd, cf + cv)
df_output is the required DataFrame. Note that if either list holds a single column, the parenthesized string evaluates to a scalar rather than a tuple, so the concatenation will fail; this approach assumes at least two columns in each list.
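To see what those eval strings actually look like, here is the string construction alone for a small cv/cf (plain Python, no Spark needed; the example column names are mine):

```python
cv = ["x", "y"]  # columns to convert to float
cf = ["z", "w"]  # columns to keep as-is

# build "(float(x.x),float(x.y))" for the converted columns
l = ['float(x.' + c + ')' for c in cv]
cst = '(' + ','.join(l) + ')'

# build "(x.z,x.w)" for the untouched columns
l2 = ['x.' + c for c in cf]
cst2 = '(' + ','.join(l2) + ')'

print(cst)   # (float(x.x),float(x.y))
print(cst2)  # (x.z,x.w)
```

Each row x is then turned into the tuple eval(cst2) + eval(cst), i.e. the unchanged values followed by the floats.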