I have an original input json snippet ('/home/user/testsample.json') -
{"key": "somehashvalue","columns": [["Event:2014-03-26 00\\:29\\:13+0200:json","{\"user\":{\"credType\":\"ADDRESS\",\"credValue\":\"01:AA:A4:G1:HH:UU\",\"cAgent\":null,\"cType\":\"ACE\"},\"timestamp\":1395786553,\"sessionId\":1395785353,\"className\":\"Event\",\"subtype\":\"CURRENTLYACTIVE\",\"vType\":\"TEST\",\"vId\":1235080,\"eType\":\"CURRENTLYACTIVE\",\"eData\":\"1\"}",1395786553381001],["Event:2014-03-26 00\\:29\\:13+0200:","",1395786553381001]]}
I tried to use Json serde s to parse the above json to my hive columns. However, 1395786553381001 above is not present in a format which SerDe can map to a Hive column i.e this value is present without a Key (since Hive understands Json columns/values present after :)
So instead I took the Array type approach and created a table -
CREATE TABLE mytesttable (
key string,
columns array < array< string > >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
LOAD DATA LOCAL INPATH '/home/user/testsample.json'
OVERWRITE INTO TABLE mytesttable;
select columns[0][1] from mytesttable; gives -
{"user":{"credType":"ADDRESS","credValue":"01:AA:A4:G1:HH:UU","cAgent":null,"cType":"ACE"},"timestamp":1395786553,"sessionId":1395785353,"className":"Event","subtype":"CURRENTLYACTIVE","vType":"TEST","vId":1235080,"eType":"CURRENTLYACTIVE","eData":"1"}
The above appears clean, but then I also need columns[*][2] i.e in a Json hive column for further transformations.
I wrote a regex hive query to cleanse the original Json present in '/home/user/testsample.json'
(assume it is present in a table tablewithinputjson)
SELECT
REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(ij.columna, '["][{]', '{'),'[}]["]', '}'), '\\\\', '') AS columna
FROM tablewithinputjson ij;
The above query returns -
{"key": "somehashvalue","columns": [["Event:2014-03-26 00:29:13+0200:json",{"user":{"credType":"ADDRESS","credValue":"01:AA:A4:G1:HH:UU","cAgent":null,"cType":"ACE"},"timestamp":1395786553,"sessionId":1395785353,"className":"Event","subtype":"CURRENTLYACTIVE","vType":"TEST","vId":1235080,"eType":"CURRENTLYACTIVE","eData":"1"},1395786553381001],["Event:2014-03-26 00:29:13+0200:","",1395786553381001]]}
But here again, 1395786553381001 cannot be mapped to a hive column since it appears after , and not after : or more specifically this value is present without a key. (I could add "test": before 1395786553381001 , but I do not want to customize the input data - since a) Too much customization is something I am not comfortable with b) does not seem to be a good solution c) it would unnecessary waste my hadoop cluster space and time)
Not to confuse any further, I am not able to come up with a Hive table format that fully parses and maps all the fields in the original Json snippet. Any suggestions are welcome. Please let me know If it seems too confusing.