Loading JSON file with serde in Cloudera

2019-03-06 09:27发布

问题:

I am trying to work with a JSON file with this bag structure :

{
   "user_id": "kim95",
   "type": "Book",
   "title": "Modern Database Systems: The Object Model, Interoperability, and Beyond.",
   "year": "1995",
   "publisher": "ACM Press and Addison-Wesley",
   "authors": [
      {
         "name": "null"
      }
   ],
   "source": "DBLP"
}
{
   "user_id": "marshallo79",
   "type": "Book",
   "title": "Inequalities: Theory of Majorization and Its Application.",
   "year": "1979",
   "publisher": "Academic Press",
   "authors": [
      {
         "name": "Albert W. Marshall" 
      },
      {
         "name": "Ingram Olkin"
      }
   ],
   "source": "DBLP"
}

I tried to use serde to load JSON data for Hive. I followed both ways that I saw here : http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/

With this code :

CREATE EXTERNAL TABLE IF NOT EXISTS serd (
           user_id:string, 
           type:string, 
           title:string,
           year:string,
           publisher:string,
           authors:array<struct<name:string>>,
           source:string)       
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION '/user/hdfs/data/book-seded_workings-reduced.json';

I got this error:

error while compiling statement: failed: parseexception line 2:17 cannot recognize input near ':' 'string' ',' in column type

I alson tried this version : https://github.com/rcongiu/Hive-JSON-Serde

which gave a different error :

Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.openx.data.jsonserde.JsonSerde

Any idea?

I also want to know what are alternatives to work with a JSON like this to make queries on 'name' field in 'authors'. Whether it's Pig or Hive?

I have already converted it in to a "tsv" file. But, since my authors column is a tuple, I don't know how make requests on 'name' with Hive, If I build a table from this file. Should I change my script for "tsv" conversion or keep it? Or are there any alternatives with Hive or Pig?

回答1:

Hive does not have built in support for JSON. So for using JSON with Hive we need to use third part jars like: https://github.com/rcongiu/Hive-JSON-Serde

You have couple of issues with the create table statement. It should look like this:

CREATE EXTERNAL TABLE IF NOT EXISTS serd ( 
user_id string,type string,title string,year string,publisher string,authors array<string>,source:string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION...

The JSON records your are using keep each record in a single line like this:

{"user_id": "kim95", "type": "Book", "title": "Modern Database Systems: The Object Model, Interoperability, and Beyond.", "year": "1995", "publisher": "ACM Press and Addison-Wesley", "authors": [{"name":"null"}], "source": "DBLP"} 
{"user_id": "marshallo79", "type": "Book", "title": "Inequalities: Theory of Majorization and Its Application.", "year": "1979", "publisher": "Academic Press","authors": [{"name":"Albert W. Marshall"},{"name":"Ingram Olkin"}], "source": "DBLP"}

After downloading the project from GIT you need to compile the the project which will create a jar you need to add this jar in the Hive session before running the create table statement.

Hope it helps...!!!



回答2:

add jar only add to session which won't be available and finally it is getting error. Get the JAR loaded on all the nodes at Hive and Map Reduce path like the below location so that HIVE and Map Reduce component will pick this whenever it’s been called.

  1. /hadoop/CDH_5.2.0_Linux_parcel/parcels/CDH-5.2.0- 1.cdh5.2.0.p0.36/lib/hive/lib/json-serde-1.3.6-jar-with-dependencies.jar

  2. /hadoop/CDH_5.2.0_Linux_parcel/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-mapreduce/lib/json-serde-1.3.6-jar-with-dependencies.jar

Note: This path varies to cluster.