I am writing Hadoop sequence files using text files as input, and I know how to write a sequence file from a text file.
But I want to limit each output sequence file to a specific size, say 256MB.
Is there any built-in method to do this?
AFAIK you'll need to write your own custom output format to limit output file sizes - by default FileOutputFormats create a single output file per reducer.
Another option is to create your sequence files as normal, then run a second (map-only) job with identity mappers, amending the minimum / maximum input split size to ensure that each mapper processes only ~256MB. This means an input file of 1GB would be processed by 4 mappers and create output files of ~256MB each. You will get smaller files where an input file is, say, 300MB (a 256MB mapper and a 44MB mapper will run).
The properties you are looking for are:
mapred.min.split.size
mapred.max.split.size
They are both configured as byte sizes, so set them both to 268435456 (256 × 1024 × 1024 bytes = 256MB).
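A minimal sketch of the map-only re-splitting job described above, using the old `mapred` API (which matches those property names). Class names, paths, and the Text key/value types are assumptions - adjust them to whatever your original sequence files actually contain:

```java
// Sketch: map-only job that rewrites existing sequence files so that each
// mapper (and therefore each output file) covers roughly 256MB of input.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class ResplitSeqFiles {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ResplitSeqFiles.class);
        conf.setJobName("resplit-seq-files");

        // Both properties are byte values: 256 * 1024 * 1024 = 268435456
        conf.setLong("mapred.min.split.size", 268435456L);
        conf.setLong("mapred.max.split.size", 268435456L);

        conf.setNumReduceTasks(0);                 // map-only: one output file per mapper
        conf.setMapperClass(IdentityMapper.class); // pass every record through unchanged

        conf.setInputFormat(SequenceFileInputFormat.class);
        conf.setOutputFormat(SequenceFileOutputFormat.class);
        conf.setOutputKeyClass(Text.class);        // assumption: Text keys and values
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
```

Because reducers are disabled, the number of output files equals the number of map tasks, which the split-size settings pin to roughly one per 256MB of input. Note this is a Hadoop job driver, so it needs a cluster (or local Hadoop install) and the Hadoop jars on the classpath to run.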