I want to insert JSON data from one table to other tables based on the key fields on the data.
my data looks like this
{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"ABC":{"XYZ":"123.dfer","founder":"3.0","GHT":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}
{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"GAP":{"XVY":"123.dfer","FAH":"3.0","GHT":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}
{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"BOX":{"VOG":"123.dfer","FAH":"3.0","FAX":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}
Here based on Version, whther it is "BOX" or "GAP" or "ABC" i want to populate the fields on particular JSON rows to another table
for ex: if version is "GAP" then populate particular rows in one table if it is "BOX" then populate to another tables... I mean all rows for BOX...
how can I achieve this using HIVE. Please help.
NOTE: My JSON data is in one table as a column with type string
Demo
create table src (myjson string);
insert into src values
('{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"ABC":{"XYZ":"123.dfer","founder":"3.0","GHT":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}')
,('{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"GAP":{"XVY":"123.dfer","FAH":"3.0","GHT":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}')
,('{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"BOX":{"VOG":"123.dfer","FAH":"3.0","FAX":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}')
;
create table trg_abc (myjson string);
create table trg_gap (myjson string);
create table trg_box (myjson string);
from src
insert into trg_abc select myjson where get_json_object(myjson,'$.Rtype.MOD.Version[0].ABC') is not null
insert into trg_gap select myjson where get_json_object(myjson,'$.Rtype.MOD.Version[0].GAP') is not null
insert into trg_box select myjson where get_json_object(myjson,'$.Rtype.MOD.Version[0].BOX') is not null
;
First of all you will need to store your data as json in hive tables:
I assume your hive table is EXTERNAL (usually are - check it with SHOW CREATE TABLE your_table
).
If so the entire dataset lies in some hdfs/s3 path, for example s3a://your_bucket/your_jsons_location/
Download json-udf-1.3.7-jar-with-dependencies.jar and run ADD JARS s3a://your_bucket/lib/json-udf-1.3.7-jar-with-dependencies.jar;
Then you must create a dedicated json table for each json schema you have:
CREATE EXTERNAL TABLE boxes
(Rtype struct<ver:string,os:string,type:string,vehicle:string,MOD:struct<Version:array<struct<BOX:struct<VOG:string,FAH:string,FAX:string,fashion:string,cdc:string,dof:string,ts:string>>>>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' location 's3a://your_bucket/your_jsons_location/';
CREATE EXTERNAL TABLE gaps
(Rtype struct<ver:string,os:string,type:string,vehicle:string,MOD:struct<Version:array<struct<GAP:struct<XVY:string,FAH:string,GHT:string,fashion:string,cdc:string,dof:string,ts:string>>>>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' location 's3a://your_bucket/your_jsons_location/';
CREATE EXTERNAL TABLE abcs
(Rtype struct<ver:string,os:string,type:string,vehicle:string,MOD:struct<Version:array<struct<ABC:struct<XYZ:string,founder:string,GHT:string,fashion:string,cdc:string,dof:string,ts:string>>>>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' location 's3a://your_jsons_location/';
Now if you'll run:
SELECT * FROM boxes;
SELECT * FROM gaps;
SELECT * FROM abcs;
you will see that each table parsed correctly only the matching jsons (according to the schema that was specified in the create statment).
The unmatching ones in each table are NULL.
To filter out the unrelevant records run:
SELECT * FROM abcs WHERE Rtype.mod.version[0].abc IS NOT NULL;
Note: that this whole explanation assumes that your jsons are stored externally to the hive table (specifically I used S3 but it can also be HDFS)