hadoop converting \r\n to \n and breaking ARC form

I am trying to parse data from commoncrawl.org using hadoop streaming. I set up a local hadoop to test my code, and have a simple Ruby mapper which uses a streaming ARCfile reader. When I invoke my code myself like

cat 1262876244253_18.arc.gz | mapper.rb | reducer.rb

It works as expected.

It seems that hadoop automatically sees that the file has a .gz extension and decompresses it before handing it to a mapper - however while doing so it converts \r\n linebreaks in the stream to \n. Since ARC relies on a record length in the header line, the change breaks the parser (because the data length has changed).

To double check, I changed my mapper to expect uncompressed data, and did:

cat 1262876244253_18.arc.gz | zcat | mapper.rb | reducer.rb

And it works.

I don't mind hadoop automatically decompressing (although I can quite happily deal with streaming .gz files), but if it does I need it to decompress in 'binary' without doing any linebreak conversion or similar. I believe that the default behaviour is to feed decompressed files to one mapper per file, which is perfect.

How can I either ask it not to decompress .gz (renaming the files is not an option) or make it decompress properly? I would prefer not to use a special InputFormat class which I have to ship in a jar, if at all possible.

All of this will eventually run on AWS ElasticMapReduce.

标签： hadoop mapreduce elastic-map-reduce

1条回答

再贱就再见

2楼-- · 2019-02-24 07:16

Looks like the Hadoop PipeMapper.java is to blame (at least in 0.20.2):

PipeMapper.java (0.20.2)

Around line 106, the input from TextInputFormat is passed to this mapper (at which stage the \r\n has been stripped), and the PipeMapper is writing it out to stdout with just a \n.

A suggestion would be to amend the source for your PipeMapper.java, check this 'feature' still exists, and amend as required (maybe allow it to be set via a configuration property).

0人赞添加讨论(0) 举报

hadoop converting \r\n to \n and breaking ARC form

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间