Multi-line JSON file querying in Hive

Posted 2019-08-04 06:07

I understand that the majority of JSON SerDe formats expect .json files to be stored with one record per line.

I have an S3 bucket with multi-line indented .json files (don't control the source) that I'd like to query using Amazon Athena (though I suppose this applies just as well to Hive generally).

Is there a SerDe format out there that is able to parse multi-line indented .json files?
If there isn't a SerDe format to do this:
- Is there a best practice for dealing with files like this?
  - Should I plan on flattening these records out using a different tool like Python?
- Is there a standard way of writing custom SerDe formats, so I can write one myself?

Example file body:

[
  {
    "id": 1,
    "name": "ryan",
    "stuff": {
      "x": true,
      "y": [
        123,
        456
      ]
    }
  },
  ...
]

Tags: json hive amazon-athena

1 answer

Animai°情兽

#2 · 2019-08-04 06:53

Unfortunately, there is no SerDe that supports multi-line JSON content. The specialized CloudTrail SerDe handles a format similar to yours, but it is hard-coded for the CloudTrail JSON format; it does at least show that this is possible in theory. Currently there is no way to write your own SerDe for use with Athena, though.

You won't be able to consume these files with Athena directly. You will have to use EMR, Glue, or some other tool to reformat them into JSON stream files (one record per line) first.
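
As a rough illustration of that reformatting step, here is a minimal Python sketch that turns a file holding one indented top-level JSON array into one-record-per-line output. The function name `json_array_to_lines` and the file names in the usage note are made up for this example, and it assumes each file is small enough to load into memory in one go; very large files would need a streaming parser instead.

```python
import json
from typing import TextIO


def json_array_to_lines(src: TextIO, dst: TextIO) -> int:
    """Write each element of a top-level JSON array as one compact line.

    Hypothetical helper for illustration; returns the number of records written.
    """
    # Assumption: the whole file fits in memory, so json.load is acceptable.
    records = json.load(src)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")

    count = 0
    for record in records:
        # Compact separators keep every record on exactly one physical line.
        dst.write(json.dumps(record, separators=(",", ":")))
        dst.write("\n")
        count += 1
    return count
```

It could be called along the lines of `with open("input.json") as src, open("output.json", "w") as dst: json_array_to_lines(src, dst)`, with the output then uploaded to the location the table points at. Nested fields such as `stuff` are kept as nested JSON rather than flattened, on the assumption that the table definition declares them as struct and array columns.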
