scala- Outer join on 2 dataframe columns doesnt sh

2019-08-20 10:48发布

问题:

Im joining 2 dataframes like so: val joinCols = Array("first_name", "last_name") val df_subset_joined = df1_subset.as("a").join(df2_subset.as("b"), joinCols, "full_outer") df_subset_joined.show()

This is the result of the above code:

Dataframe of differences between 2 dataframes
+----------+---------+-------------+-------------+
|first_name|last_name|loyalty_score|loyalty_score|
+----------+---------+-------------+-------------+
|     will |    smith|           67|           67|
|   george |  clooney|           67|           67|
|   george |  clooney|           67|           88|
|    blake |   lively|           66|         null|
|    celena|    gomez|         null|            2|
|       eva|    green|           44|           56|
|      null|     null|             |         null|
|     jason|    momoa|           34|           34|
|        ed|  sheeran|           88|         null|
|    lionel|    messi|           88|           88|
|      kyle|   jenner|         null|           56|
|      tom |   cruise|           66|           34|
|      tom |   cruise|           66|           99|
|      brad|     pitt|           99|           78|
|      ryan| reynolds|           45|         null|
+----------+---------+-------------+-------------+

As you can see there are columns with null values.

I run the following code next:

val filter_str = s"a.$col"+" != "+s"b.$col"
val df_subset_filtered = df_subset_joined.filter(filter_str)
df_subset_filtered.show()

I get the foll dataframe:

Below is the dataframe of differences between DF1 and DF1 based on the comparison between:
a.loyalty_score != b.loyalty_score
+----------+---------+-------------+-------------+
|first_name|last_name|loyalty_score|loyalty_score|
+----------+---------+-------------+-------------+
|      tom |   cruise|           66|           99|
|      tom |   cruise|           66|           34|
|       eva|    green|           44|           56|
|      brad|     pitt|           99|           78|
|   george |  clooney|           67|           88|
+----------+---------+-------------+-------------+

Why dont I see the rows where there are null values in 1 column and a actual value in another. Shouldnt this satisfy value != null

How can I make my filter statement make the null values appear in the final dataframe

回答1:

The reason you don't get any rows where there is null in one column and non-null in the other is that the comparison returns FALSE.

To avoid this, use the null-safe comparison operator <=>, in conjunction with not.

val filter_str = "not(" + s"a.$col"+" <=> "+s"b.$col)" 
val df_subset_filtered = df_subset_joined.filter(filter_str)
df_subset_filtered.show()

From the documentation,

expr1 <=> expr2 - Returns same result as the EQUAL(=) operator for non-null operands, but returns true if both are null, false if one of the them is null.

Arguments:

expr1, expr2 - the two expressions must be same type or can be casted to a common type, and must be a type that can be used in equality comparison. Map type is not supported. For complex types such array/struct, the data types of fields must be orderable. Examples:

SELECT 2 <=> 2; true

SELECT 1 <=> '1'; true

SELECT true <=> NULL; false

SELECT NULL <=> NULL; true