I am creating a Hive table on Amazon's Elastic MapReduce by using a gzipped JSON encoded file. I am using this JSON SerDe: http://code.google.com/p/hive-json-serde/
The unencoded file looks like this:
{"id":"101", "items":"A:231,234,119,12"}
{"id":"102", "items":"B:13,89,121"}
...
I'd like to create an array of the "items" column for user with a table generating function. The array I want would be the "exploded" CSV of ints ignoring the ":" and the letter before it. I want to be able to GROUP BY the item ints (231,234,etc.) or JOIN them against other tables. Changing around the format of this file would be difficult and all my old data is stored in this format so I'd like to solve this in Hive without making changes to how to store my data.
Is there a way for me to do this?
Edit:
Thanks @mark-grover, this worked!
I did something like this:
SELECT id, item FROM my_table LATERAL VIEW explode(split(substr(items, 3, length(items) - 2), ',')) t1 AS item
to get:
101 231
101 234
101 119
101 12
102 13
102 89
102 121
In doing this I found out I'd like to do a little more. I'd like to also select the position that each item was in in the CSV. So I'd like the output to look like:
101 231 0
101 234 1
101 119 2
101 12 3
102 13 1
102 89 2
102 121 3
Any ideas here?
Edit again:
I came up with something for my followup question doing something like this:
SELECT id, item, find_in_set(item, substr(items, 3, length(items) - 2)) AS position
FROM my_table LATERAL VIEW explode(split(substr(items, 3, length(items) - 2), ',')) t1 AS item