How can I print nulls when converting a DataFrame to JSON?

Posted 2020-02-26 12:26

Question:

I have a dataframe that I read from a csv.

CSV:
name,age,pets
Alice,23,dog
Bob,30,dog
Charlie,35,

Reading this into a DataFrame called myData:
+-------+---+----+
|   name|age|pets|
+-------+---+----+
|  Alice| 23| dog|
|    Bob| 30| dog|
|Charlie| 35|null|
+-------+---+----+
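
For completeness, a minimal read that produces this DataFrame (a sketch; the path pets.csv is hypothetical, and since no schema is given every column is read as a string, which is why age serializes as "23" below):

val myData = spark.read
  .option("header", "true")
  .csv("pets.csv")  // hypothetical path

myData.show()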

Now I want to convert each row of this DataFrame to JSON using myData.toJSON. What I get are the following JSON strings:

{"name":"Alice","age":"23","pets":"dog"}
{"name":"Bob","age":"30","pets":"dog"}
{"name":"Charlie","age":"35"}

I would like the third row's JSON to include the null value, e.g.

{"name":"Charlie","age":"35", "pets":null}

However, this doesn't seem to be possible. I debugged through the code and saw that Spark's org.apache.spark.sql.catalyst.json.JacksonGenerator class has the following implementation:

  private def writeFields(
      row: InternalRow,
      schema: StructType,
      fieldWriters: Seq[ValueWriter]): Unit = {
    var i = 0
    while (i < row.numFields) {
      val field = schema(i)
      if (!row.isNullAt(i)) {
        gen.writeFieldName(field.name)
        fieldWriters(i).apply(row, i)
      }
      i += 1
    }
  }

This skips a field entirely when it is null. I am not sure why this is the default behavior, but is there a way to print null values in JSON using Spark's toJSON?

I am using Spark 2.1.0.

Answer 1:

To print the null values in JSON using Spark's toJSON method, you can use the following code:

myData.na.fill("null").toJSON

It will give you the expected result:

+-------------------------------------------+
|value                                      |
+-------------------------------------------+
|{"name":"Alice","age":"23","pets":"dog"}   |
|{"name":"Bob","age":"30","pets":"dog"}     |
|{"name":"Charlie","age":"35","pets":"null"}|
+-------------------------------------------+

I hope it helps!
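
One caveat, visible in the output above: pets becomes the string "null", not a JSON null, which matters if a downstream consumer distinguishes the two. Also, na.fill with a string value only fills string-typed columns; it covers every column here only because the CSV was read without a schema. A quick round-trip check (a sketch):

// Proof that no real null survives: every pets value is now a string.
val roundTrip = spark.read.json(myData.na.fill("null").toJSON.rdd)
roundTrip.filter(roundTrip("pets").isNull).count()  // returns 0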



Answer 2:

I modified the JacksonGenerator.writeFields function and included the class in my project. Below are the steps:

1) Create the package 'org.apache.spark.sql.catalyst.json' under 'src/main/scala/'

2) Copy the JacksonGenerator class from the Spark source

3) Create a JacksonGenerator.scala file in the 'org.apache.spark.sql.catalyst.json' package and paste the copied code

4) Modify the writeFields function:

private def writeFields(
    row: InternalRow,
    schema: StructType,
    fieldWriters: Seq[ValueWriter]): Unit = {
  var i = 0
  while (i < row.numFields) {
    val field = schema(i)
    if (!row.isNullAt(i)) {
      gen.writeFieldName(field.name)
      fieldWriters(i).apply(row, i)
    } else {
      // Changed: write an explicit JSON null instead of skipping the field.
      gen.writeNullField(field.name)
    }
    i += 1
  }
}
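
Because the copied file sits in the same package path inside your own sources, it is normally picked up ahead of the class bundled in the Spark jars, so the stock API should then keep null fields (a sketch; whether the shadowing wins depends on classpath order):

// With the patched JacksonGenerator compiled into the application,
// toJSON should now emit explicit nulls:
myData.toJSON.collect().foreach(println)
// expected: {"name":"Charlie","age":"35","pets":null}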


Answer 3:

Another option is to build the JSON from each Row's value map. JSONObject's default formatter throws on null values, so a null-aware formatter is needed to keep them:

import org.apache.spark.sql.Row
import scala.util.parsing.json.{JSONFormat, JSONObject}

// Emit a real JSON null; the default formatter calls toString and
// would throw on null values.
def nullSafeFormatter(x: Any): String = x match {
  case null      => "null"
  case s: String => "\"" + JSONFormat.quoteString(s) + "\""
  case other     => other.toString
}

def convertRowToJSON(row: Row): String = {
  val m = row.getValuesMap[Any](row.schema.fieldNames)  // keep null entries
  JSONObject(m).toString(nullSafeFormatter)
}
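
A minimal usage sketch, assuming myData from the question (the .rdd hop avoids needing an Encoder on Spark 2.1; note that scala.util.parsing.json is deprecated in recent Scala versions):

myData.rdd.map(convertRowToJSON).collect().foreach(println)
// e.g. {"name" : "Charlie", "age" : "35", "pets" : null}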