How to use the isin function with values from a text file

Posted 2019-08-02 03:05

Question:

I'd like to filter a dataframe using an external file.

This is how I use the filter now:

val Insert = Append_Ot.filter(
  col("Name2").equalTo("brazil") ||
  col("Name2").equalTo("france") ||
  col("Name2").equalTo("algeria") ||
  col("Name2").equalTo("tunisia") ||
  col("Name2").equalTo("egypte"))

Instead of using hardcoded string literals, I'd like to create an external file with the values to filter by.

So I created a file with the values and read it like this:

val filter_numfile = sc.textFile("/user/zh/worskspace/filter_nmb.txt")
  .map(_.split(" ")(1))
  .collect

This gives me:

filter_numfile: Array[String] = Array(brazil, france, algeria, tunisia, egypte)

Then I use the isin function on the Name2 column:

val Insert = Append_Ot.where($"Name2".isin(filter_numfile: _*))

But this gives me an empty dataframe. Why?

Answer 1:

I am just adding some information to philantrovert's answer in "filter dataframe from external file".

His answer is correct, but the values may differ in letter case, so you also need to check for a case mismatch.


tl;dr Make sure both sides of the comparison use the same letter case, i.e. all upper or all lower case. The standard upper and lower functions do this.


Let's say your input file contains:

1 Algeria
2 tunisia
3 brazil
4 Egypt

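You read the text file and convert all the countries to lowercase:

// Take the second space-separated field of each line and trim it
val countries = sc.textFile("path to input file").map(_.split(" ")(1).trim)
  .collect.toSeq
// Lowercase every value so the comparison is case-insensitive
val array = countries.map(_.toLowerCase).toArray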
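
The parsing step can also be written as a small plain-Scala function, which makes it easy to check outside Spark. This is only a sketch: `parseFilterValues` is a hypothetical helper name, and it assumes each useful line has an index followed by a value, separated by any amount of whitespace.

```scala
// Hypothetical helper (not part of Spark): pull the second field out of
// each line, lowercase it, and skip lines that have fewer than two fields.
def parseFilterValues(lines: Seq[String]): Seq[String] =
  lines
    .map(_.trim.split("\\s+"))   // split on any run of spaces or tabs
    .collect { case parts if parts.length >= 2 => parts(1).toLowerCase }
    .distinct

// Sample lines, including a blank line, a tab, and extra spaces
val sample = Seq("1 Algeria", "2  tunisia ", "", "3\tbrazil", "4 Egypt")
val values = parseFilterValues(sample)
// values: Seq(algeria, tunisia, brazil, egypt)
```

With Spark, the same function could presumably be fed the collected lines, e.g. `parseFilterValues(sc.textFile("path to input file").collect.toSeq)`, assuming the file is small enough to collect on the driver.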

Then, given your dataframe:

val Append_Ot = sc.parallelize(Seq(("brazil"),("tunisia"),("algeria"),("name"))).toDF("Name2")

you apply the following condition:

import org.apache.spark.sql.functions._
val Insert = Append_Ot.where(lower($"Name2").isin(array : _* ))

You should get this output:

+-------+
|Name2  |
+-------+
|brazil |
|tunisia|
|algeria|
+-------+

The empty dataframe might also be caused by a spelling mismatch between the file and the column (for example, egypte versus Egypt).
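
To find out which values are responsible, one option is to compare the two lists directly after normalising case and whitespace. The sketch below is plain Scala with made-up sample data; `unmatched` is a hypothetical helper, not a Spark function.

```scala
// Hypothetical helper: return the filter values that never occur in the
// column once both sides are trimmed and lowercased.
def unmatched(filterValues: Seq[String], columnValues: Seq[String]): Seq[String] = {
  def norm(s: String): String = s.trim.toLowerCase
  val present = columnValues.map(norm).toSet
  filterValues.filterNot(v => present.contains(norm(v)))
}

// Illustrative data only: values read from the file vs. values in Name2
val fromFile   = Seq("brazil", "france", "algeria", "tunisia", "egypte")
val fromColumn = Seq("Brazil ", "tunisia", "Algeria", "Egypt")
val missing    = unmatched(fromFile, fromColumn)
// missing: Seq(france, egypte)
```

Here `egypte` is reported because the column spells it `Egypt`, which case normalisation alone cannot fix. On a real dataframe, the column values could be obtained with something like `Append_Ot.select("Name2").distinct.collect.map(_.getString(0))`, assuming the column contains no nulls and has few enough distinct values to collect.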