I have a table that contains JSON objects. Each JSON object contains an array in square brackets, separated by commas.
How can I access any element in the square bracket array, for example "Matt", using SQL?
{"str":
[
1,
134,
61,
"Matt",
{"action.type":"registered","application":491,"value":423,"value2":12344},
["application"],
[],
"49:0"
]
}
I am using 'Hive' ontop of Hadoop. If you know how to do this in SQL, that is fine :)
You can do this in Hive as follows:
First you need a JSON SerDe (Serializer / Deserializer). The most functional one I have seen is https://github.com/rcongiu/Hive-JSON-Serde/. The SerDe from Peter Sankauskas can't handle JSON this complex it seems. As of this writing you will need to compile the SerDe with Maven and place the JAR where your Hive session can reach it.
Next you are going to need to change your JSON format. The reason is Hive takes a strongly-typed view of arrays, so mixing integers and other things won't work. Consider switching to a struct like this:
{"str": {
n1 : 1,
n2 : 134,
n3 : 61,
s1: "Matt",
st1: {"type":"registered","app":491,"value":423,"value2":12344},
ar1: ["application"],
ar2: [],
s2: "49:0"
} }
Next you will need to put the JSON all one one line. I'm not sure if this is a quirk of Hive or the SerDe but you need to do it.
Then copy the data into HDFS.
Now you're ready to define a table and query away:
ADD JAR /path/to/jar/json-serde-1.1.2-jar-with-dependencies.jar;
CREATE EXTERNAL TABLE json (
str struct<
n1 : int, n2 : int, n3 : int,
s1 : string,
st1 : struct < type : string, app : int, value : int, value2 : int>,
ar1 : array<string>,
ar2 : array<string>,
s2 : string
>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/hdfs/path/to/file';
With this in place you can run interesting nested queries like:
select str.st1.type from json;
Last but not least since this is so specific to Hive it would be worthwhile to update the question and tags.
You can not unless you use something very specific to your database engine, and you haven't specified what database engine you are using.
The reason for "you can not" is that SQL/RDBMS are not designed for that type of storage. I recommend either using proper normalization or NoSQL solution depending on your needs.