We are collecting JSON data (public social media posts in particular) via REST API invocations, which we plan to dump into HDFS, then abstract a Hive table on top it using SerDe. I wonder though what would be the appropriate delimiter per JSON entry in a file? Is it new line ("\n")? So it would look like this:
{ id: entry1 ... post: }
{ id: entry2 ... post: }
...
{ id: entryn ... post: }
How about if we encounter a new line character within the JSON data itself, for example in post
?
The best way would be one record per line, separated by "\n" exactly as you guessed.
This also means that you should be careful to escape "\n" that may be inside the JSON elements.
Indented JSON won't work well with hadoop/hive, since to distribute processing, hadoop must be able to tell when a records ends, so it can split processing of a file with N bytes with W workers in W chunks of size roughly N/W.
The splitting is done by the particular InputFormat that's been used, in case of text, TextInputFormat.
TextInputFormat will basically split the file at the first instance of "\n" found after byte i*N/W (for i from 1 to W-1).
For this reason, having other "\n" around would confuse Hadoop and it will give you incomplete records.
As an alternative, I wouldn't recommend it, but if you really wanted you could use a character other than "\n" by configuring the property "textinputformat.record.delimiter" when reading the file through hadoop/hive, using a character that won't be in JSON (for instance, \001 or CTRL-A is commonly used by Hive as a field delimiter) but that can be tricky since it has to also be supported by the SerDe.
Also, if you change the record delimiter, anybody who copies/uses the file on HDFS must be aware of the delimiter, or they won't be able to parse it correctly, and will need special code to do it, while keeping "\n" as a delimiter, the files will still be normal text files and can be used by other tools.
As for the SerDe, I'd recommend this one, with the disclaimer that I wrote it :)
https://github.com/rcongiu/Hive-JSON-Serde