I understand that most JSON SerDe formats expect .json files to be stored with one record per line.
I have an S3 bucket with multi-line, indented .json files (I don't control the source) that I'd like to query using Amazon Athena (though I suppose this applies just as well to Hive generally).
- Is there a SerDe format out there that is able to parse multi-line, indented .json files?
- If there isn't a SerDe format to do this:
  - Is there a best practice for dealing with files like this?
  - Should I plan on flattening these records out using a different tool like Python?
  - Is there a standard way of writing custom SerDe formats, so I can write one myself?
Example file body:
    [
      {
        "id": 1,
        "name": "ryan",
        "stuff": {
          "x": true,
          "y": [
            123,
            456
          ]
        }
      },
      ...
    ]
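For illustration, the flattening option mentioned above might look roughly like the following sketch. It assumes each file holds a single top-level JSON array that is small enough to load into memory, and it rewrites that array as one compact record per line. The function name `to_json_lines` is hypothetical, and reading from and writing back to S3 is left out.

```python
import json


def to_json_lines(text: str) -> str:
    """Convert a JSON document holding a top-level array into JSON Lines.

    Assumption: the whole document fits in memory and its top-level
    value is an array of records.
    """
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # Compact separators avoid any whitespace between tokens, so every
    # record ends up on exactly one line, followed by a newline.
    return "".join(
        json.dumps(record, separators=(",", ":")) + "\n" for record in records
    )


if __name__ == "__main__":
    # Hypothetical sample shaped like the example file body above.
    sample = """
    [
      {
        "id": 1,
        "name": "ryan",
        "stuff": {
          "x": true,
          "y": [123, 456]
        }
      }
    ]
    """
    print(to_json_lines(sample), end="")
```

Nested objects and arrays stay nested; only the whitespace between records changes, which is the shape that one-record-per-line SerDes expect.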