I have seen this question earlier here and I have took lessons from that. However I am not sure why I am getting an error when I feel it should work.
I want to create a new column in existing Spark DataFrame
by some rules. Here is what I wrote. iris_spark is the data frame with a categorical variable iris_spark with three distinct categories.
from pyspark.sql import functions as F
iris_spark_df = iris_spark.withColumn(
\"Class\",
F.when(iris_spark.iris_class == \'Iris-setosa\', 0, F.when(iris_spark.iris_class == \'Iris-versicolor\',1)).otherwise(2))
Throws the following error.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-157-21818c7dc060> in <module>()
----> 1 iris_spark_df=iris_spark.withColumn(\"Class\",F.when(iris_spark.iris_class==\'Iris-setosa\',0,F.when(iris_spark.iris_class==\'Iris-versicolor\',1)))
TypeError: when() takes exactly 2 arguments (3 given)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-157-21818c7dc060> in <module>()
----> 1 iris_spark_df=iris_spark.withColumn(\"Class\",F.when(iris_spark.iris_class==\'Iris-setosa\',0,F.when(iris_spark.iris_class==\'Iris-versicolor\',1)))
TypeError: when() takes exactly 2 arguments (3 given)
Any idea why?
Correct structure is either:
(when(col(\"iris_class\") == \'Iris-setosa\', 0)
.when(col(\"iris_class\") == \'Iris-versicolor\', 1)
.otherwise(2))
which is equivalent to
CASE
WHEN (iris_class = \'Iris-setosa\') THEN 0
WHEN (iris_class = \'Iris-versicolor\') THEN 1
ELSE 2
END
or:
(when(col(\"iris_class\") == \'Iris-setosa\', 0)
.otherwise(when(col(\"iris_class\") == \'Iris-versicolor\', 1)
.otherwise(2)))
which is equivalent to:
CASE WHEN (iris_class = \'Iris-setosa\') THEN 0
ELSE CASE WHEN (iris_class = \'Iris-versicolor\') THEN 1
ELSE 2
END
END
with general syntax:
when(condition, value).when(...)
or
when(condition, value).otherwise(...)
You probably mixed up things with Hive IF
conditional:
IF(condition, if-true, if-false)
which can be used only in raw SQL with Hive support.
There are different ways you can achieve if-then-else.
Using when function in DataFrame API.
You can specify the list of conditions in when and also can specify otherwise what value you need. You can use this expression in nested form as well.
expr function.
Using \"expr\" function you can pass SQL expression in expr. PFB example. Here we are creating new column \"quarter\" based on month column.
cond = \"\"\"case when month > 9 then \'Q4\'
else case when month > 6 then \'Q3\'
else case when month > 3 then \'Q2\'
else case when month > 0 then \'Q1\'
end
end
end
end as quarter\"\"\"
newdf = df.withColumn(\"quarter\", expr(cond))
- selectExpr function.
We can also use the variant of select function which can take SQL expression. PFB example.
cond = \"\"\"case when month > 9 then \'Q4\'
else case when month > 6 then \'Q3\'
else case when month > 3 then \'Q2\'
else case when month > 0 then \'Q1\'
end
end
end
end as quarter\"\"\"
newdf = df.selectExpr(\"*\", cond)
Hope this helps.
Regards,
Neeraj