I'm reading a Hive table which has two columns, id
and jsonString
. I can easily transform the jsonString
into a Spark Data Structure calling the spark.read.json
function, but I have to add the column id
as well.
val jsonStr1 = """{"fruits":[{"fruit":"banana"},{"fruid":"apple"},{"fruit":"pera"}],"bar":{"foo":"[\"daniel\",\"pedro\",\"thing\"]"},"daniel":"daniel data random","cars":["montana","bagulho"]}"""
val jsonStr2 = """{"fruits":[{"dt":"banana"},{"fruid":"apple"},{"fruit":"pera"}],"bar":{"foo":"[\"daniel\",\"pedro\",\"thing\"]"},"daniel":"daniel data random","cars":["montana","bagulho"]}"""
val jsonStr3 = """{"fruits":[{"a":"banana"},{"fruid":"apple"},{"fruit":"pera"}],"bar":{"foo":"[\"daniel\",\"pedro\",\"thing\"]"},"daniel":"daniel data random","cars":["montana","bagulho"]}"""
case class Foo(id: Integer, json: String)
val ds = Seq(new Foo(1,jsonStr1), new Foo(2,jsonStr2), new Foo(3,jsonStr3)).toDS
val jsonDF = spark.read.json(ds.select($"json").rdd.map(r => r.getAs[String](0)).toDS)
jsonDF.show()
jsonDF.show
+--------------------+------------------+------------------+--------------------+
| bar| cars| daniel| fruits|
+--------------------+------------------+------------------+--------------------+
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[,,, banana], [,...|
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[, banana,,], [,...|
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[banana,,,], [,,...|
+--------------------+------------------+------------------+--------------------+
I would like to add the column id
from the Hive table, like this:
+--------------------+------------------+------------------+--------------------+---------------
| bar| cars| daniel| fruits| id
+--------------------+------------------+------------------+--------------------+--------------
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[,,, banana], [,...|1
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[, banana,,], [,...|2
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[banana,,,], [,,...|3
+--------------------+------------------+------------------+--------------------+
I will not use regular expressions
I created a udf which take this two fields as argument and using a proper JSON library include the desired field(id)
and return a new JSON string. It works like a charm but I hope Spark API offers a better way to do it. I'm using Apache Spark 2.3.0.