I am trying to obtain all rows in a dataframe where two flags are set to '1', and subsequently all rows where only one of the two flags is set to '1' and the other is NOT EQUAL to '1'.
With the following schema (three columns),
df = sqlContext.createDataFrame(
    [('a',1,'null'),('b',1,1),('c',1,'null'),('d','null',1),('e',1,1)],  # ,('f',1,'NaN'),('g','bla',1)
    schema=('id', 'foo', 'bar')
)
I obtain the following dataframe:
+---+----+----+
| id| foo| bar|
+---+----+----+
| a| 1|null|
| b| 1| 1|
| c| 1|null|
| d|null| 1|
| e| 1| 1|
+---+----+----+
When I apply the desired filters, the first filter (foo=1 AND bar=1) works, but the second (foo=1 AND NOT bar=1) does not:
foobar_df = df.filter( (df.foo==1) & (df.bar==1) )
yields:
+---+---+---+
| id|foo|bar|
+---+---+---+
| b| 1| 1|
| e| 1| 1|
+---+---+---+
Here is the filter that does not behave as expected:
foo_df = df.filter( (df.foo==1) & (df.bar!=1) )
foo_df.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
+---+---+---+
Why is it not filtering? How can I get the rows where only foo is equal to '1'?
Because it is SQL, and NULL indicates missing values. Because of that, any comparison to NULL other than IS NULL and IS NOT NULL is undefined, so a row whose bar is NULL satisfies neither bar = 1 nor bar != 1 and is dropped by both filters. You need either an explicit null check combined with the inequality, or a null-safe comparison.
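The first option can be sketched like this, reusing df from the question and the Column.isNull check (an illustrative sketch, not the only way to write it):

# keep rows where foo is 1 and bar is either NULL or different from 1
foo_df = df.filter( (df.foo == 1) & (df.bar.isNull() | (df.bar != 1)) )
foo_df.show()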
The second option, if you want null-safe comparisons in PySpark, is the null-safe equality Column.eqNullSafe (PySpark >= 2.3).
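A rough sketch of that variant; negating the null-safe equality keeps the NULL bars as well:

# eqNullSafe(1) is False for NULL, so its negation keeps NULL as well as values other than 1
foo_df = df.filter( (df.foo == 1) & ~df.bar.eqNullSafe(1) )
foo_df.show()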
Also, 'null' (the string) is not a valid way to introduce a NULL literal; you should use None to indicate missing objects (see the sketch at the end of this answer). To filter on null values try:
foo_df = df.filter( (df.foo==1) & (df.bar.isNull()) )
https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.Column.isNull
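For completeness, a sketch of building the same frame with real NULLs instead of the string 'null' (same sqlContext and column names as in the question):

df = sqlContext.createDataFrame(
    [('a', 1, None), ('b', 1, 1), ('c', 1, None), ('d', None, 1), ('e', 1, 1)],
    schema=('id', 'foo', 'bar')
)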