How to limit the size of a Hadoop sequence file?

Published 2019-06-14 11:04

Question:

I am writing a Hadoop sequence file using a text file as input. I know how to write a sequence file from a text file.

But I want to limit each output sequence file to a specific size, say 256MB.

Is there any built-in method to do this?
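For context, this is roughly the kind of conversion I mean (a minimal sketch with hypothetical paths, using LongWritable byte offsets as keys and Text lines as values; my real key/value types may differ):

```java
import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class TextToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path(args[1]);                 // destination sequence file

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(Text.class));
             BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {

            long offset = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                // key = rough byte offset of the line, value = the line itself
                writer.append(new LongWritable(offset), new Text(line));
                offset += line.length() + 1;
            }
        }
    }
}
```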

Answer 1:

AFAIK you'll need to write your own custom output format to limit output file sizes - by default a FileOutputFormat creates a single output file per reducer.
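A rough sketch of what such a format could look like (the class name SizeLimitedSequenceFileOutputFormat and the 256MB constant are my own, compression handling is omitted, and the limit is approximate because SequenceFile.Writer.getLength() is only checked between records):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Hypothetical class; not part of Hadoop itself.
public class SizeLimitedSequenceFileOutputFormat<K, V> extends SequenceFileOutputFormat<K, V> {

    private static final long MAX_BYTES = 256L * 1024 * 1024;   // roll to a new file at ~256MB

    @Override
    public RecordWriter<K, V> getRecordWriter(final TaskAttemptContext context)
            throws IOException, InterruptedException {
        final Configuration conf = context.getConfiguration();
        // Base name for this task's output (e.g. part-m-00000); a running index is appended per rolled file.
        final Path base = getDefaultWorkFile(context, "");

        return new RecordWriter<K, V>() {
            private int fileIndex = 0;
            private SequenceFile.Writer writer;

            private void roll() throws IOException {
                if (writer != null) {
                    writer.close();
                }
                Path part = new Path(base.getParent(), base.getName() + "-" + fileIndex++);
                writer = SequenceFile.createWriter(conf,
                        SequenceFile.Writer.file(part),
                        SequenceFile.Writer.keyClass(context.getOutputKeyClass()),
                        SequenceFile.Writer.valueClass(context.getOutputValueClass()));
            }

            @Override
            public void write(K key, V value) throws IOException, InterruptedException {
                // getLength() reports the bytes written so far; start a new part once we pass the limit.
                if (writer == null || writer.getLength() >= MAX_BYTES) {
                    roll();
                }
                writer.append(key, value);
            }

            @Override
            public void close(TaskAttemptContext ctx) throws IOException, InterruptedException {
                if (writer != null) {
                    writer.close();
                }
            }
        };
    }
}
```

You'd then use it with job.setOutputFormatClass(SizeLimitedSequenceFileOutputFormat.class) in place of the stock SequenceFileOutputFormat.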

Another option is to create your sequence files as normal, then run a second map-only job with identity mappers, amending the minimum / maximum input split size to ensure that each mapper only processes ~256MB. This means an input file of 1GB would be processed by 4 mappers and create output files of ~256MB each. You will get smaller files where an input file is, say, 300MB (a 256MB mapper and a 44MB mapper will run). A driver sketch for this second job follows the property list below.

The properties you are looking for are:

  • mapred.min.split.size
  • mapred.max.split.size

They are both configured in bytes, so set them both to 268435456 (256MB).
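A driver for that second, map-only job might look something like this - a sketch assuming the existing sequence files hold Text keys and values and that input/output paths come from the command line (on newer Hadoop releases the equivalent properties are mapreduce.input.fileinputformat.split.minsize / split.maxsize):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ResplitSequenceFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        long splitSize = 256L * 1024 * 1024;          // 268435456 bytes = 256MB
        conf.setLong("mapred.min.split.size", splitSize);
        conf.setLong("mapred.max.split.size", splitSize);

        Job job = Job.getInstance(conf, "resplit sequence files");
        job.setJarByClass(ResplitSequenceFiles.class);

        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        // Identity mapper: the default Mapper passes each key/value pair straight through.
        job.setMapperClass(Mapper.class);
        job.setNumReduceTasks(0);                     // map-only: one output file per input split

        // Assumption: the sequence files contain Text keys and values; adjust to your actual types.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```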