Pyspark dataframe operator “IS NOT IN”

Posted 2020-02-03 13:53

Question:

I would like to rewrite this from R to PySpark. Any nice-looking suggestions?

array <- c(1,2,3)
dataset <- filter(dataset, !(column %in% array))

In PySpark you can do it like this:

array = [1, 2, 3]
dataframe.filter(dataframe.column.isin(*array) == False)

Or using the unary NOT operator ~, which negates the boolean column:

dataframe.filter(~dataframe.column.isin(*array))

Use the ~ operator, which negates the condition:

df_filtered = df.filter(~df["column_name"].isin([1, 2, 3]))

df_result = df[df.column_name.isin([1, 2, 3]) == False]

Slightly different syntax, applied to a set of dates:

toGetDates = {'2017-11-09', '2017-11-11', '2017-11-12'}
df = df.filter(df['DATE'].isin(toGetDates) == False)

You can also loop over the array and filter out one value at a time:

array = [1, 2, 3]
for i in array:
    df = df.filter(df["column"] != i)

Tags: pyspark
