How can I parse a JSON column of a Hive table using a SerDe?

Posted 2019-09-19 12:37

Question:

I am trying to load deserialized JSON events into different tables, based on the name of the event.

Right now I have all the events in the same table. The table has only two columns, EventName and Payload (the payload stores the JSON representation of the event):

CREATE TABLE event( EventName STRING, Payload STRING)

So basically what I want is to load the data into the following table:

CREATE TABLE TempEvent ( Column1 STRING, Column2 STRING, Column3 STRING )
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;

And load the events with something like:

INSERT INTO TempEvent select Payload from event where EventName='TempEvent';

But Hive throws an exception saying that the destination table has 3 columns while the select statement produces only 1.

Is there another way to accomplish this, or am I doing something wrong?

Answer 1:

The JSON SerDe requires one JSON document per line in the table's underlying files, so it won't work with your input table, because the line

TempEvent, {"Column1":"value1","Column2":"value2","Column3":"value3"}

is not valid JSON (the SerDe would need to see just the payload, without the leading event name). So first you need to move the payloads into a new intermediate table that contains only valid JSON, and then populate the JSON SerDe table from there using LOAD DATA:

create table event_json (Payload string)
stored as textfile;

insert into table event_json 
select Payload from event
where EventName='TempEvent';

create table TempEvent (Column1 string, Column2 string, Column3 string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

load data inpath '/user/hive/warehouse/event_json' overwrite into table TempEvent;
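
Note that LOAD DATA INPATH moves the files from event_json's warehouse directory into TempEvent's location rather than copying them, so the intermediate table is left empty afterwards; re-run the INSERT if you need to reload.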

Then you can extract the columns like this:

select Column1, Column2, Column3
from TempEvent;
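
Alternatively, if you only need a few fields, Hive's built-in json_tuple UDTF can extract them straight from the original event table, without the SerDe or the intermediate table; a minimal sketch of that variant:

select t.Column1, t.Column2, t.Column3
from event
lateral view json_tuple(Payload, 'Column1', 'Column2', 'Column3') t as Column1, Column2, Column3
where EventName = 'TempEvent';

get_json_object(Payload, '$.Column1') would work here as well, but json_tuple parses each JSON document once per row instead of once per extracted field.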

Of course, all of this processing would be unnecessary if your source data were valid JSON in the first place: you could just create TempEvent as an external table and pull the data directly from it.
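
For illustration, a minimal sketch of that external-table variant, assuming the raw JSON files (one document per line) sit under a hypothetical HDFS path /data/events/tempevent:

CREATE EXTERNAL TABLE TempEvent (Column1 STRING, Column2 STRING, Column3 STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/data/events/tempevent';  -- hypothetical location, not from the original answer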



Tags: json hadoop hive