How to drop rows with nulls in one column in PySpark

Posted 2020-02-25 22:52

I have a DataFrame and I would like to drop all rows with a NULL value in one of the columns (a string column). I can easily count them:

df.filter(df.col_X.isNull()).count()

I have tried dropping them with the following command. It executes, but the count above still comes back positive:

df.filter(df.col_X.isNull()).drop()

Other attempts return an 'object is not callable' error.

4 Answers
Anthone · 2020-02-25 23:10

Use either na.drop with a subset:

df.na.drop(subset=["col_X"])

or filter with isNotNull():

df.filter(df.col_X.isNotNull())
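
A minimal sketch showing that both approaches give the same result, assuming an existing SparkSession named spark and some illustrative sample data:

# toy DataFrame with one null in col_X
df = spark.createDataFrame([("a",), (None,), ("b",)], ["col_X"])

df.na.drop(subset=["col_X"]).count()      # 2 -- the null row is gone
df.filter(df.col_X.isNotNull()).count()   # 2 -- same result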
虎瘦雄心在 · 2020-02-25 23:17

DataFrames are immutable, so applying a filter that keeps only the non-null values creates a new DataFrame without the records that contain nulls. Assign the result back to the variable:

df = df.filter(df.col_X.isNotNull())
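
A short sketch of that point, with assumed sample data, showing that the original DataFrame is untouched until the name is rebound:

df = spark.createDataFrame([(1, "x"), (2, None)], ["id", "col_X"])

filtered = df.filter(df.col_X.isNotNull())
df.count()        # 2 -- the original DataFrame still has the null row
filtered.count()  # 1 -- the filter produced a new DataFrame

df = filtered     # rebind the name, as in the line above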
Explosion°爆炸 · 2020-02-25 23:21

Sometimes you also want to drop empty strings. Note that the two conditions must be combined with & (both must hold for a row to be kept), and the comparison needs its own parentheses because Python's | and & bind more tightly than !=:

df = df.filter(df.col_X.isNotNull() & (df.col_X != ""))
Bombasti · 2020-02-25 23:28

Another variation uses where, an alias for filter, together with the col function:

from pyspark.sql.functions import col

df = df.where(col("columnName").isNotNull())
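
One reason to prefer col over df.col_X is that it refers to the column by name, so the filter can sit inside a method chain where no DataFrame variable is in scope. A sketch, where the parquet path and column name are placeholders:

from pyspark.sql.functions import col

cleaned = (
    spark.read.parquet("/path/to/data")        # hypothetical input
         .where(col("columnName").isNotNull())
)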