I am trying to obtain all rows in a dataframe where two flags are set to '1', and subsequently all rows where only one of the two flags is set to '1' and the other is NOT EQUAL to '1'.
With the following schema (three columns),
df = sqlContext.createDataFrame(
    [('a',1,'null'),('b',1,1),('c',1,'null'),('d','null',1),('e',1,1)],  # ,('f',1,'NaN'),('g','bla',1)
    schema=('id', 'foo', 'bar')
)
I obtain the following dataframe:
+---+----+----+
| id| foo| bar|
+---+----+----+
| a| 1|null|
| b| 1| 1|
| c| 1|null|
| d|null| 1|
| e| 1| 1|
+---+----+----+
When I apply the desired filters, the first filter (foo=1 AND bar=1) works, but the second (foo=1 AND NOT bar=1) does not:
foobar_df = df.filter( (df.foo==1) & (df.bar==1) )
yields:
+---+---+---+
| id|foo|bar|
+---+---+---+
| b| 1| 1|
| e| 1| 1|
+---+---+---+
Here is the filter that does not behave as expected:
foo_df = df.filter( (df.foo==1) & (df.bar!=1) )
foo_df.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
+---+---+---+
Why is it not filtering? How can I get the rows where only foo is equal to '1'?
Because it is SQL, and NULL indicates missing values. Because of that, any comparison to NULL other than IS NULL and IS NOT NULL is undefined, so a row whose bar is NULL satisfies neither bar = 1 nor bar != 1 and is dropped by both filters. You need either an explicit null check combined with the inequality, or a null-safe comparison.
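The first option can be sketched like this, reusing df from the question and the Column.isNull check (an illustrative sketch, not the only way to write it):

# keep rows where foo is 1 and bar is either NULL or different from 1
foo_df = df.filter( (df.foo == 1) & (df.bar.isNull() | (df.bar != 1)) )
foo_df.show()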
The second option, if you want null-safe comparisons in PySpark, is the null-safe equality Column.eqNullSafe (PySpark >= 2.3).
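A rough sketch of that variant; negating the null-safe equality keeps the NULL bars as well:

# eqNullSafe(1) is False for NULL, so its negation keeps NULL as well as values other than 1
foo_df = df.filter( (df.foo == 1) & ~df.bar.eqNullSafe(1) )
foo_df.show()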
Also, 'null' (the string) is not a valid way to introduce a NULL literal; you should use None to indicate missing objects (see the sketch at the end of this answer). To filter on null values try:
foo_df = df.filter( (df.foo==1) & (df.bar.isNull()) )
https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.Column.isNull
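For completeness, a sketch of building the same frame with real NULLs instead of the string 'null' (same sqlContext and column names as in the question):

df = sqlContext.createDataFrame(
    [('a', 1, None), ('b', 1, 1), ('c', 1, None), ('d', None, 1), ('e', 1, 1)],
    schema=('id', 'foo', 'bar')
)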