Hive for complex nested Json

Published 2020-07-22 10:40

Question:

I have an original input json snippet ('/home/user/testsample.json') -

{"key": "somehashvalue","columns": [["Event:2014-03-26 00\\:29\\:13+0200:json","{\"user\":{\"credType\":\"ADDRESS\",\"credValue\":\"01:AA:A4:G1:HH:UU\",\"cAgent\":null,\"cType\":\"ACE\"},\"timestamp\":1395786553,\"sessionId\":1395785353,\"className\":\"Event\",\"subtype\":\"CURRENTLYACTIVE\",\"vType\":\"TEST\",\"vId\":1235080,\"eType\":\"CURRENTLYACTIVE\",\"eData\":\"1\"}",1395786553381001],["Event:2014-03-26 00\\:29\\:13+0200:","",1395786553381001]]}

I tried to use JSON SerDes to parse the above JSON into my Hive columns. However, the value 1395786553381001 is not in a format a SerDe can map to a Hive column: it appears without a key, and Hive only understands JSON values that follow a key (i.e. after a :).

So instead I took the Array type approach and created a table -

CREATE TABLE mytesttable (
  key string, 
  columns array < array< string > >
  )
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

LOAD DATA LOCAL INPATH '/home/user/testsample.json'
OVERWRITE INTO TABLE mytesttable;

select columns[0][1] from mytesttable; gives -

{"user":{"credType":"ADDRESS","credValue":"01:AA:A4:G1:HH:UU","cAgent":null,"cType":"ACE"},"timestamp":1395786553,"sessionId":1395785353,"className":"Event","subtype":"CURRENTLYACTIVE","vType":"TEST","vId":1235080,"eType":"CURRENTLYACTIVE","eData":"1"}

The above appears clean, but I also need columns[*][2], i.e. in a JSON Hive column, for further transformations.
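For reference, with the array table above both the nested JSON string and the keyless trailing value are directly addressable. A minimal sketch using Hive's built-in get_json_object (the aliases cred_type, event_ts, and bare_value are illustrative, not from the original post):

```sql
-- Sketch: extract fields from the JSON string held in columns[0][1],
-- and read the bare trailing value via columns[0][2].
SELECT
  get_json_object(columns[0][1], '$.user.credType') AS cred_type,
  get_json_object(columns[0][1], '$.timestamp')     AS event_ts,
  columns[0][2]                                     AS bare_value
FROM mytesttable;
```

Since columns is declared as array<array<string>>, the trailing number arrives as a string and can be cast as needed; no regex cleansing of the input file is required for this access pattern.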

I wrote a regex Hive query to cleanse the original JSON in '/home/user/testsample.json' (assume it is present in a table tablewithinputjson):

SELECT
REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(ij.columna, '["][{]', '{'),'[}]["]', '}'), '\\\\', '') AS columna
FROM tablewithinputjson ij;

The above query returns -

{"key": "somehashvalue","columns": [["Event:2014-03-26 00:29:13+0200:json",{"user":{"credType":"ADDRESS","credValue":"01:AA:A4:G1:HH:UU","cAgent":null,"cType":"ACE"},"timestamp":1395786553,"sessionId":1395785353,"className":"Event","subtype":"CURRENTLYACTIVE","vType":"TEST","vId":1235080,"eType":"CURRENTLYACTIVE","eData":"1"},1395786553381001],["Event:2014-03-26 00:29:13+0200:","",1395786553381001]]}

But here again, 1395786553381001 cannot be mapped to a Hive column, since it appears after a , and not after a :, or more specifically, the value is present without a key. (I could add "test": before 1395786553381001, but I do not want to customize the input data, since a) I am not comfortable with too much customization, b) it does not seem like a good solution, and c) it would unnecessarily waste my Hadoop cluster's space and time.)

To sum up without confusing things further: I have not been able to come up with a Hive table format that fully parses and maps all the fields in the original JSON snippet. Any suggestions are welcome. Please let me know if this seems too confusing.

Answer 1:

Posting an end-to-end solution: a step-by-step procedure to convert JSON into a Hive table.

step 1) install Maven if it is not there already

>$ sudo apt-get install maven

step 2) clone the Hive-JSON-Serde repository (install git first if it is not there already)

>sudo git clone https://github.com/rcongiu/Hive-JSON-Serde.git

step 3) go into the $HOME/Hive-JSON-Serde folder

step 4) build the serde package

>sudo mvn -Pcdh5 clean package

step 5) The serde file will be in $HOME/Hive-JSON-Serde/json-serde/target/json-serde-1.3.7-SNAPSHOT-jar-with-dependencies.jar

step 6) add the SerDe as a dependency jar in Hive

 hive> ADD JAR $HOME/Hive-JSON-Serde/json-serde/target/json-serde-1.3.7-SNAPSHOT-jar-with-dependencies.jar;

step 7) create a JSON file in $HOME/books.json (example)

{"value": [{"id": "1","bookname": "A","properties": {"subscription": "1year","unit": "3"}},{"id": "2","bookname":"B","properties":{"subscription": "2years","unit": "5"}}]}

step 8) create tmp1 table in hive

 hive>CREATE TABLE tmp1 (
      value ARRAY<struct<id:string,bookname:string,properties:struct<subscription:string,unit:string>>>   
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 
    'mapping.value' = 'value'   
) 
STORED AS TEXTFILE;

step 9) load the data from the JSON file into the tmp1 table

>LOAD DATA LOCAL INPATH '$HOME/books.json' INTO TABLE tmp1;

step 10) create a tmp2 table to do the explode operation from tmp1. This intermediate step breaks the multi-level JSON structure into multiple rows. Note: if your JSON structure is simple and single-level, skip this step

hive>create table tmp2 as 
 SELECT *
 FROM tmp1
 LATERAL VIEW explode(value) itemTable AS items;
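After the explode, each element of the value array becomes its own row in tmp2, exposed through the items column. A quick sketch to verify this before building the final table:

```sql
-- Each row of tmp2 carries one struct from the original array in its items column,
-- so nested fields are reachable with plain dot notation.
SELECT items.id, items.bookname, items.properties.subscription
FROM tmp2;
```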

step 11) create the final Hive table and load the values from the tmp2 table

hive>create table books as 
select items.id as id, items.bookname as name, items.properties.subscription as subscription, items.properties.unit as unit from tmp2;

step 12) drop the tmp tables

hive>drop table tmp1;
hive>drop table tmp2;

step 13) test the Hive table

hive>select * from books;

output:

id  name  subscription  unit
1   A     1year         3
2   B     2years        5



Tags: json hadoop hive