I am working with Spark and PySpark. I am trying to achieve the result equivalent to the following pseudocode:
df = df.withColumn('new_column',
IF fruit1 == fruit2 THEN 1, ELSE 0. IF fruit1 IS NULL OR fruit2 IS NULL 3.)
I am trying to do this in PySpark but I'm not sure about the syntax. Any pointers? I looked into expr() but couldn't get it to work.
Note that df is a pyspark.sql.dataframe.DataFrame.
There are a few efficient ways to implement this. Let's start with the required imports:
You can use the Hive IF function inside expr:
Alternatively, chain when + otherwise:
Finally, you could use the following trick:
With example data:
you can use this as follows:
and the result is:
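For five hypothetical sample rows covering a mismatch, a missing value on each side, a match, and two missing values, show() would print something like the following. Whether missing values render as null or NULL depends on the Spark version:

```text
+------+------+----------+
|fruit1|fruit2|new_column|
+------+------+----------+
|orange| apple|         0|
|  kiwi|  null|         3|
|  null|banana|         3|
| mango| mango|         1|
|  null|  null|         3|
+------+------+----------+
```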
The withColumn function in PySpark lets you create a new column from a condition; add the when and otherwise functions and you have a properly working if-then-else structure. For all of this you need to import the pyspark.sql functions, because the code will not work without the col() function. First, we declare the new column, 'new column', and give the condition inside the when function (i.e. fruit1 == fruit2), returning 1 if the condition is true. If it is not true, which includes the case where the comparison is NULL because a value is missing, control passes to otherwise, which handles the second condition (fruit1 or fruit2 is null) with the isNull() function: if that is true, 3 is returned, and if it is false, the inner otherwise gives 0 as the answer.
You'll want to use a UDF, as below: