How to limit the size of a Hadoop sequence file?

Published 2019-06-14 11:04

Question:

I am writing a Hadoop sequence file using a text file as input. I know how to write a sequence file from a text file.

But I want to limit each output sequence file to a specific size, say 256MB.

Is there any built-in method to do this?
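For context, this is roughly the kind of conversion I mean (a minimal sketch with hypothetical paths, using LongWritable byte offsets as keys and Text lines as values; my real key/value types may differ):

```java
import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class TextToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path(args[1]);                 // destination sequence file

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(Text.class));
             BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {

            long offset = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                // key = rough byte offset of the line, value = the line itself
                writer.append(new LongWritable(offset), new Text(line));
                offset += line.length() + 1;
            }
        }
    }
}
```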

Answer 1:

AFAIK you'll need to write your own custom output format to limit output file sizes - by default a FileOutputFormat creates a single output file per reducer.
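A rough sketch of what such a format could look like (the class name SizeLimitedSequenceFileOutputFormat and the 256MB constant are my own, compression handling is omitted, and the limit is approximate because SequenceFile.Writer.getLength() is only checked between records):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Hypothetical class; not part of Hadoop itself.
public class SizeLimitedSequenceFileOutputFormat<K, V> extends SequenceFileOutputFormat<K, V> {

    private static final long MAX_BYTES = 256L * 1024 * 1024;   // roll to a new file at ~256MB

    @Override
    public RecordWriter<K, V> getRecordWriter(final TaskAttemptContext context)
            throws IOException, InterruptedException {
        final Configuration conf = context.getConfiguration();
        // Base name for this task's output (e.g. part-m-00000); a running index is appended per rolled file.
        final Path base = getDefaultWorkFile(context, "");

        return new RecordWriter<K, V>() {
            private int fileIndex = 0;
            private SequenceFile.Writer writer;

            private void roll() throws IOException {
                if (writer != null) {
                    writer.close();
                }
                Path part = new Path(base.getParent(), base.getName() + "-" + fileIndex++);
                writer = SequenceFile.createWriter(conf,
                        SequenceFile.Writer.file(part),
                        SequenceFile.Writer.keyClass(context.getOutputKeyClass()),
                        SequenceFile.Writer.valueClass(context.getOutputValueClass()));
            }

            @Override
            public void write(K key, V value) throws IOException, InterruptedException {
                // getLength() reports the bytes written so far; start a new part once we pass the limit.
                if (writer == null || writer.getLength() >= MAX_BYTES) {
                    roll();
                }
                writer.append(key, value);
            }

            @Override
            public void close(TaskAttemptContext ctx) throws IOException, InterruptedException {
                if (writer != null) {
                    writer.close();
                }
            }
        };
    }
}
```

You'd then use it with job.setOutputFormatClass(SizeLimitedSequenceFileOutputFormat.class) in place of the stock SequenceFileOutputFormat.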

Another option is to create your sequence files as normal, then run a second map-only job with identity mappers, amending the minimum / maximum input split size to ensure that each mapper only processes ~256MB. This means an input file of 1GB would be processed by 4 mappers and create output files of ~256MB each. You will get smaller files where an input file is, say, 300MB (a 256MB mapper and a 44MB mapper will run). A driver sketch for this second job follows the property list below.

The properties you are looking for are:

  • mapred.min.split.size
  • mapred.max.split.size

They are both configured in bytes, so set them both to 268435456 (256MB).
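A driver for that second, map-only job might look something like this - a sketch assuming the existing sequence files hold Text keys and values and that input/output paths come from the command line (on newer Hadoop releases the equivalent properties are mapreduce.input.fileinputformat.split.minsize / split.maxsize):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ResplitSequenceFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        long splitSize = 256L * 1024 * 1024;          // 268435456 bytes = 256MB
        conf.setLong("mapred.min.split.size", splitSize);
        conf.setLong("mapred.max.split.size", splitSize);

        Job job = Job.getInstance(conf, "resplit sequence files");
        job.setJarByClass(ResplitSequenceFiles.class);

        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        // Identity mapper: the default Mapper passes each key/value pair straight through.
        job.setMapperClass(Mapper.class);
        job.setNumReduceTasks(0);                     // map-only: one output file per input split

        // Assumption: the sequence files contain Text keys and values; adjust to your actual types.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```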