Hadoop Streaming Job with binary input?

2019-08-01 22:21发布

I wish to convert a binary file in one format to a SequenceFile.

I have a Python script that takes that format on stdin and can output whatever I want.

The input format is not line-based. The individual records are binary themselves, hence the output format cannot be \t delimited or broken into lines with \n.

Can I use the Hadoop Streaming interface to consume a binary format? How do I produce a binary output format?

I assume the answer is "No" unless I hear otherwise.

标签： python hadoop hadoop-streaming

1条回答

一纸荒年 Trace。

2楼-- · 2019-08-01 23:16

You may consider using NullWritable as output, and generating the SequenceFile directly inside of your python script. You can look up the hadoop-python project in github to see candidate code: though it is admittedly bit large-ish/heavy it does handle the sequencefile generation.

0人赞添加讨论(0) 举报

Hadoop Streaming Job with binary input?

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间