How can I replace numbers by nulls in a DataFrame?

Posted 2019-07-21 06:30

Question:

This might sound strange, but I was wondering how to replace certain numbers in a DataFrame column with null using Scala.

Suppose I have a nullable DoubleType column named col. I want to replace every number outside the range 1.0 to 10.0 with null.

I tried the following code, without success.

val xf = df.na.replace("col", Map(0.0 -> null.asInstanceOf[Double]).toMap)

However, in Scala, casting null to a Double yields 0.0, which is not what I want. Besides, I can't see any way to do this for a range of values. Is there a way to achieve this?
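
The 0.0 result can be reproduced in plain Scala, without Spark. The following is a minimal sketch; the object name NullDoubleDemo and its field names are made up for illustration. It shows that casting null to the primitive Double unboxes to the default value, while a boxed java.lang.Double or an Option[Double] can represent a missing number.

```scala
object NullDoubleDemo {
  // Casting null to the primitive Double unboxes it to the default value 0.0.
  val unboxed: Double = null.asInstanceOf[Double]

  // java.lang.Double is a reference type, so it can actually hold null.
  val boxed: java.lang.Double = null

  // Option[Double] is the idiomatic Scala way to model a missing number.
  val missing: Option[Double] = None

  def main(args: Array[String]): Unit = {
    println(unboxed)         // 0.0
    println(boxed == null)   // true
    println(missing.isEmpty) // true
  }
}
```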

Answer 1:

How about using a when clause instead?

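import org.apache.spark.sql.functions.when

// Assumes a spark-shell session, where sc and the implicits
// needed for toDF and $"..." are already in scope.
val df = sc.parallelize(
  (1L, 0.0) :: (2L, 3.6) :: (3L, 12.0) :: (4L, 5.0) :: Nil
).toDF("id", "val")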

df.withColumn("val", when($"val".between(1.0, 10.0), $"val")).show

// +---+----+
// | id| val|
// +---+----+
// |  1|null|
// |  2| 3.6|
// |  3|null|
// |  4| 5.0|
// +---+----+

Any value that doesn't satisfy the predicate (here val BETWEEN 1.0 AND 10.0, bounds inclusive) is replaced with NULL, because when without an otherwise branch yields NULL for every row it doesn't match.
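
If you prefer to spell out the NULL branch, the same idea can be wrapped in a small helper. This is only a sketch under a few assumptions: it uses the SparkSession entry point (Spark 2.0 or later) rather than sc, and the object name NullOutsideRange and the helper nullOutside are hypothetical names, not part of the Spark API.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

object NullOutsideRange {
  // Hypothetical helper: keep values of column `name` inside [lo, hi]
  // (inclusive) and replace everything else with a typed NULL.
  def nullOutside(df: DataFrame, name: String, lo: Double, hi: Double): DataFrame =
    df.withColumn(
      name,
      when(col(name).between(lo, hi), col(name))
        .otherwise(lit(null).cast("double")) // explicit NULL branch
    )

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is acceptable for this demo.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("null-outside-range")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1L, 0.0), (2L, 3.6), (3L, 12.0), (4L, 5.0)).toDF("id", "val")
    nullOutside(df, "val", 1.0, 10.0).show()

    spark.stop()
  }
}
```

The explicit otherwise is not required for correctness here; it mainly documents the intent and keeps the column type unambiguous.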

See also Create new Dataframe with empty/null field values