I want to create a Hive table out of some JSON data (nested) and run queries on it? Is this even possible?
I've gotten as far as uploading the JSON file to S3 and launching an EMR instance but I don't know what to type in the hive console to get the JSON file to be a Hive table?
Does anyone have some example command to get me started, I can't find anything useful with Google ...
You'll need to use a JSON serde in order for Hive to map your JSON to the columns in your table.
A really good example showing you how is here:
http://aws.amazon.com/articles/2855
Unfortunately the JSON serde supplied doesn't handle nested JSON very well so you might need to flatten your JSON in order to use it.
Here's an example of the correct syntax from the article:
It's actually not necessary to use the JSON SerDe. There is a great blog post here (I'm not affiliated with the author in any way):
http://pkghosh.wordpress.com/2012/05/06/hive-plays-well-with-json/
Which outlines a strategy using the builtin-function json_tuple to parse the json at time of query (NOT at the time of table definition):
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-json_tuple
So basically, your table schema is simply to load each line as a single 'string' column and then extract the relevant json fields as needed on a per query basis. e.g. this query from that blog post:
In my humble experience, this has proven more reliable (I encountered various cryptic issues dealing with the JSON serdes, especially with nested objects).
Hive 0.12 and later in hcatalog-core has JsonSerDe which will serialize and deserialize your JSON data. So, all you need to do is create an external table like the following example:
The corresponsing json data file should look like the following example:
I just had to solve the same problem, and none of the as of yet linked to JSON SerDes seemed good enough. Amazon's might be good, but I can't find the source for it anywhere (does anyone have a link?).
HCatalog's built in JsonSerDe is working for me, even though I'm not actually using HCatalog anywhere else.
https://github.com/apache/hcatalog/blob/branch-0.5/core/src/main/java/org/apache/hcatalog/data/JsonSerDe.java
To use HCatalog's JsonSerDe, add the hcatalog-core .jar to Hive's auxpath and create your hive table:
I wrote a post here with more details
http://ottomata.org/tech/too-many-hive-json-serdes/
Generating SerDe schema from .json file
If your .json file is big, it might be tedious to write the the schema by hand. If so, you can use this handy tool to generate it automatically.
https://github.com/strelec/hive-serde-schema-gen