Multi-line JSON file querying in hive

2019-08-04 06:54发布

问题:

I understand that the majority of JSON SerDe formats expect .json files to be stored with one record per line.

I have an S3 bucket with multi-line indented .json files (don't control the source) that I'd like to query using Amazon Athena (though I suppose this applies just as well to Hive generally).

  1. Is there a SerDe format out there that is able to parse multi-line indented .json files?
  2. If there isn't a SerDe format to do this:
    • Is there a best practice for dealing with files like this?
      • Should I plan on flattening these records out using a different tool like python?
    • Is there a standard way of writing custom SerDe formats, so I can write one myself?

Example file body:

[
  {
    "id": 1,
    "name": "ryan",
    "stuff: {
      "x": true,
      "y": [
        123,
        456
      ]
    },
  },
  ...
]

回答1:

There is unfortunately no serde that supports multiline JSON content. There is the specialized CloudTrail serde that supports a format similar to yours, but it's hard-coded only for the CloudTrail JSON format – but at least it shows that it's at least theoretically possible. Currently there is no way to write your own serdes to use with Athena, though.

You won't be able to consume these files with Athena, you will have to use EMR, Glue, or some other tool to reformat them into JSON stream files first.