Parse json arrays using HIVE

2019-01-14 08:23发布

问题:

I have many json arrays stored in a table (jt) that looks like this:

[{"ts":1403781896,"id":14,"log":"show"},{"ts":1403781896,"id":14,"log":"start"}]
[{"ts":1403781911,"id":14,"log":"press"},{"ts":1403781911,"id":14,"log":"press"}]

Each array is a record.

I would like to parse this table in order to get a new table (logs) with 3 fields: ts, id, log. I tried to use the get_json_object method, but it seems that method is not compatible with json arrays because I only get null values.

This is the code I have tested:

CREATE TABLE logs AS 
SELECT get_json_object(jt.value, '$.ts') AS ts, 
get_json_object(jt.value, '$.id') AS id,
get_json_object(jt.value, '$.log') AS log
FROM jt;

I tried to use other functions but they seem really complicated. Thank you! :)

Update! I solved my issue by performing a regexp:

CREATE TABLE jt_reg AS
select regexp_replace(regexp_replace(value,'\\}\\,\\{','\\}\\\n\\{'),'\\[|\\]','') as valuereg  from jt;


CREATE TABLE logs AS 
SELECT get_json_object(jt_reg.valuereg, '$.ts') AS ts, 
get_json_object(jt_reg.valuereg, '$.id') AS id,
get_json_object(jt_reg.valuereg, '$.log') AS log
FROM ams_json_reg;

回答1:

Use explode() function

 hive (default)> CREATE TABLE logs AS
                  >   SELECT get_json_object(single_json_table.single_json, '$.ts') AS ts,
                  >   get_json_object(single_json_table.single_json, '$.id') AS id,
                  >   get_json_object(single_json_table.single_json, '$.log') AS log
                  >   FROM
                  >     (SELECT explode(json_array_col) as single_json FROM jt) single_json_table ;

Automatically selecting local only mode for query
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator

hive (default)> select * from logs;
OK
ts      id      log
1403781896      14      show
1403781896      14      start
1403781911      14      press
1403781911      14      press
Time taken: 0.118 seconds, Fetched: 4 row(s)
hive (default)>

where json_array_col is column in jt which holds your array of jsons.

hive (default)> select json_array_col from jt;
json_array_col
["{"ts":1403781896,"id":14,"log":"show"}","{"ts":1403781896,"id":14,"log":"start"}"]
["{"ts":1403781911,"id":14,"log":"press"}","{"ts":1403781911,"id":14,"log":"press"}"]


回答2:

I just ran into this problem, with the JSON array stored as a string in the hive table.

The solution is a bit hacky and ugly, but it works and doesn't require serdes or external UDFs

SELECT 
       get_json_object(single_json_table.single_json, '$.ts') AS ts,
       get_json_object(single_json_table.single_json, '$.id') AS id,
       get_json_object(single_json_table.single_json, '$.log') AS log
FROM ( SELECT explode (
  split(regexp_replace(substr(json_array_col, 2, length(json_array_col)-2),
            '"}","', '"}",,,,"'), ',,,,')
      ) FROM src_table) single_json_table;

I broke the lines up so that it would be a little easier to read. I'm using substr() to strip the first and last characters, removing [ and ] . I'm then using regex_replace to match the separator between records in the json array and adding or changing the separator to be something unique that can then be used easily with split() to turn the string into a hive array of json objects which can then be used with explode() as described in the previous solution.

Note, the separator regex used here ( "}"," ) wouldn't work with the original data set...the regex would have to be ( "},\{" ) and the replacement would then need to be "},,,,{" eg..

  split(regexp_replace(substr(json_array_col, 2, length(json_array_col)-2),
            '"},\\{"', '"},,,,{"'), ',,,,')


回答3:

because get_json_object doesn't support json array string, so you can concat to a json object, like this:

SELECT 
    get_json_object(concat(concat('{"root":', jt.value), '}'), '$.root')
FROM jt;


标签: arrays json hive