I'm trying to import the following json in hive
[{"time":1521115600,"latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":21.70905,"pm25":16.5,"pm10":14.60085,"gas1":0,"gas2":0.12,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":1521115659,"latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":24.34045,"pm25":18.5,"pm10":16.37065,"gas1":0,"gas2":0.08,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":1521115720,"latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":23.6826,"pm25":18,"pm10":15.9282,"gas1":0,"gas2":0,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":1521115779,"latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":25.65615,"pm25":19.5,"pm10":17.25555,"gas1":0,"gas2":0.04,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0}]
CREATE TABLE json_serde (
s array<struct<time: timestamp, latitude: string, longitude: string, pm1: string>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'mapping.value' = 'value'
)
STORED AS TEXTFILE
location '/user/hduser';
the import works but if i try
Select * from json_serde;
it will return from every document that is on hadoop/user/hduser only the first element per file.
there is a good documentation on working with json array??
If I may suggest you another approach to just load the whole JSON
string into a column as String
datatype into an external table. The only restriction is to define LINES TERMINATED BY
properly. e.g. If you may have each json in one line, then you can create table as below:
e.g.
CREATE EXTERNAL TABLE json_data_table (
json_data String
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001' LINES TERMINATED BY '\n' STORED AS TEXTFILE
LOCATION '/path/to/json';
Use Hive get_json_object
to extract individual columns. this commands support basic xPath
like query to json string E.g.
If json_data column has below JSON string
{"store":
{"fruit":\[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],
"bicycle":{"price":19.95,"color":"red"}
},
"email":"amy@only_for_json_udf_test.net",
"owner":"amy"
}
The below query fetches
SELECT get_json_object(json_data, '$.owner') FROM json_data_table;
returns amy
In this way you could extract each json
element as column from the table.
You have an array of structs. What you pasted is only one line.
If you want to see all the elements, you need to use inline
SELECT inline(s) FROM json_table;
Alternatively, you need to rewrite your files such that each object within that array is a single JSON object on its own line of the file
Also, I don't see a value field in your data, so I'm not sure what you're mapping in the serde properties
The JSON that you provided is not correct. A JSON always starts with an opening curly brace "{"
and ends with an ending curly brace "}"
.
So, the first thing to look out here is that your JSON is wrong.
Your JSON should have been like the one below:
{"key":[{"key1":"value1","key2":"value2"},{"key1":"value1","key2":"value2""},{"key1":"value1","key2":"value2"}]}
And, the second thing is you have declared the data-type of the "time" field as timestamp. But the data (1521115600) is in milliseconds. The timestamp data-type need data in the format YYYY-MM-DD HH:MM:SS[.fffffffff].
So, your data should ideally be in the below format:
{"myjson":[{"time":"1970-01-18
20:01:55","latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":21.70905,"pm25":16.5,"pm10":14.60085,"gas1":0,"gas2":0.12,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":"1970-01-18
20:01:55","latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":24.34045,"pm25":18.5,"pm10":16.37065,"gas1":0,"gas2":0.08,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":"1970-01-18
20:01:55","latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":23.6826,"pm25":18,"pm10":15.9282,"gas1":0,"gas2":0,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":"1970-01-18
20:01:55","latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":25.65615,"pm25":19.5,"pm10":17.25555,"gas1":0,"gas2":0.04,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0}]}
Now, you can use query to select the records from the table.
hive> select * from json_serde;
OK
[{"time":"1970-01-18 20:01:55","latitude":"44.3959","longitude":"26.1025","pm1":"21.70905"},{"time":"1970-01-18 20:01:55","latitude":"44.3959","longitude":"26.1025","pm1":"24.34045"},{"time":"1970-01-18 20:01:55","latitude":"44.3959","longitude":"26.1025","pm1":"23.6826"},{"time":"1970-01-18 20:01:55","latitude":"44.3959","longitude":"26.1025","pm1":"25.65615"}]
Time taken: 0.069 seconds, Fetched: 1 row(s)
hive>
If you want each value separately displayed in tabular format, you can use the below query.
select b.* from json_serde a lateral view outer inline (a.myjson) b;
The result of the above query would be like this:
+------------------------+-------------+--------------+-----------+--+
| b.time | b.latitude | b.longitude | b.pm1 |
+------------------------+-------------+--------------+-----------+--+
| 1970-01-18 20:01:55.0 | 44.3959 | 26.1025 | 21.70905 |
| 1970-01-18 20:01:55.0 | 44.3959 | 26.1025 | 24.34045 |
| 1970-01-18 20:01:55.0 | 44.3959 | 26.1025 | 23.6826 |
| 1970-01-18 20:01:55.0 | 44.3959 | 26.1025 | 25.65615 |
+------------------------+-------------+--------------+-----------+--+
Beautiful. Is n't it?
Happy Learning.
If you can not use update your input file format you can directly import in spark and use it, once data is finalized write back to Hive table.
scala> val myjs = spark.read.format("json").option("path","file:///root/tmp/test5").load()
myjs: org.apache.spark.sql.DataFrame = [altitude: bigint, gas1: bigint ... 13 more fields]
scala> myjs.show()
+--------+----+----+----+----+--------+--------+---------+-----+--------+--------+----+--------+-----------+----------+
|altitude|gas1|gas2|gas3|gas4|humidity|latitude|longitude|noise| pm1| pm10|pm25|pressure|temperature| time|
+--------+----+----+----+----+--------+--------+---------+-----+--------+--------+----+--------+-----------+----------+
| 53| 0|0.12| 0| 0| 0| 44.3959| 26.1025| 0|21.70905|14.60085|16.5| 0| null|1521115600|
| 53| 0|0.08| 0| 0| 0| 44.3959| 26.1025| 0|24.34045|16.37065|18.5| 0| null|1521115659|
| 53| 0| 0.0| 0| 0| 0| 44.3959| 26.1025| 0| 23.6826| 15.9282|18.0| 0| null|1521115720|
| 53| 0|0.04| 0| 0| 0| 44.3959| 26.1025| 0|25.65615|17.25555|19.5| 0| null|1521115779|
+--------+----+----+----+----+--------+--------+---------+-----+--------+--------+----+--------+-----------+----------+
scala> myjs.write.json("file:///root/tmp/test_output")
Alternatively you can directly hive table
scala> myjs.createOrReplaceTempView("myjs")
scala> spark.sql("select * from myjs").show()
scala> spark.sql("create table tax.myjs_hive as select * from myjs")