I am developing a demo application in Hadoop, and my input is .mrc image files. I want to load them into Hadoop and do some image processing on them.
These are binary files that contain a large header with metadata, followed by the data of a set of images. The header also holds the information needed to read the images (e.g. number_of_images, number_of_pixels_x, number_of_pixels_y, bytes_per_pixel), so after the header bytes, the first number_of_pixels_x * number_of_pixels_y * bytes_per_pixel bytes are the first image, then the second, and so on.
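For concreteness, here is a minimal sketch of reading that layout in plain Java. The field order, integer width, byte order, and the fixed 1024-byte header length are all assumptions for illustration:

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class MrcHeaderSketch {
    public static void main(String[] args) throws IOException {
        final int HEADER_SIZE = 1024; // assumed header length
        try (DataInputStream in =
                 new DataInputStream(new FileInputStream(args[0]))) {
            // readInt() is big-endian; adjust if the format is little-endian.
            int numberOfImages = in.readInt();
            int pixelsX = in.readInt();
            int pixelsY = in.readInt();
            int bytesPerPixel = in.readInt();
            long imageSize = (long) pixelsX * pixelsY * bytesPerPixel;

            // Skip the rest of the header; image i then starts at
            // HEADER_SIZE + i * imageSize.
            in.skipBytes(HEADER_SIZE - 4 * Integer.BYTES);
            byte[] firstImage = new byte[(int) imageSize];
            in.readFully(firstImage);
        }
    }
}
```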
What would be a good InputFormat for these kinds of files? I can think of two possible solutions:
- Convert them to sequence files by placing the metadata in the sequence file header and having a <key, value> pair for each image (see the first sketch after this list). In this case, can I access the metadata from all mappers?
- Write a custom InputFormat and RecordReader, create a split for each image, and place the metadata in the distributed cache (see the second sketch after this list).
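A minimal sketch of the first option, assuming the header has already been parsed into plain int variables as above. It copies the header fields into the SequenceFile's own metadata block and appends one <IntWritable, BytesWritable> record per image:

```java
import java.io.DataInputStream;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class MrcToSequenceFile {
    public static void convert(Configuration conf, DataInputStream mrc, Path out,
                               int numberOfImages, int pixelsX, int pixelsY,
                               int bytesPerPixel) throws IOException {
        // Store the header fields in the SequenceFile's own metadata block,
        // so anything that opens the file can read them back.
        SequenceFile.Metadata meta = new SequenceFile.Metadata();
        meta.set(new Text("number_of_pixels_x"), new Text(Integer.toString(pixelsX)));
        meta.set(new Text("number_of_pixels_y"), new Text(Integer.toString(pixelsY)));
        meta.set(new Text("bytes_per_pixel"), new Text(Integer.toString(bytesPerPixel)));

        byte[] buf = new byte[pixelsX * pixelsY * bytesPerPixel];
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.metadata(meta))) {
            for (int i = 0; i < numberOfImages; i++) {
                mrc.readFully(buf);  // the next image's raw pixel data
                writer.append(new IntWritable(i), new BytesWritable(buf));
            }
        }
    }
}
```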
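And a skeleton of the second option. The header and image sizes are hard-coded assumptions here (in a real job they would come from the metadata shipped via the distributed cache), and error handling is omitted:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class MrcInputFormat extends FileInputFormat<IntWritable, BytesWritable> {
    static final int HEADER_SIZE = 1024;          // assumed
    static final int IMAGE_SIZE = 512 * 512 * 2;  // assumed: pixelsX * pixelsY * bytesPerPixel

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        List<InputSplit> splits = new ArrayList<>();
        for (FileStatus file : listStatus(job)) {
            // One split per image region; locality hints omitted for brevity.
            long offset = HEADER_SIZE;
            while (offset + IMAGE_SIZE <= file.getLen()) {
                splits.add(new FileSplit(file.getPath(), offset, IMAGE_SIZE, null));
                offset += IMAGE_SIZE;
            }
        }
        return splits;
    }

    @Override
    public RecordReader<IntWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new RecordReader<IntWritable, BytesWritable>() {
            private FileSplit fileSplit;
            private Configuration conf;
            private boolean done = false;
            private final IntWritable key = new IntWritable();
            private final BytesWritable value = new BytesWritable();

            @Override
            public void initialize(InputSplit s, TaskAttemptContext ctx) {
                fileSplit = (FileSplit) s;
                conf = ctx.getConfiguration();
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (done) return false;
                Path path = fileSplit.getPath();
                byte[] buf = new byte[(int) fileSplit.getLength()];
                try (FSDataInputStream in = path.getFileSystem(conf).open(path)) {
                    in.readFully(fileSplit.getStart(), buf);  // exactly one image
                }
                key.set((int) ((fileSplit.getStart() - HEADER_SIZE) / IMAGE_SIZE));
                value.set(buf, 0, buf.length);
                done = true;
                return true;
            }

            @Override public IntWritable getCurrentKey() { return key; }
            @Override public BytesWritable getCurrentValue() { return value; }
            @Override public float getProgress() { return done ? 1.0f : 0.0f; }
            @Override public void close() {}
        };
    }
}
```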
I am new to Hadoop, so I may be missing something. Which approach do you think is better? Is there any other way that I am missing?
Without knowing more about your file format, the first option seems to be the better one. With sequence files you can leverage a lot of SequenceFile-related tools to get better performance. However, there are two things that concern me with this approach.
But even with those concerns, I think representing your data as SequenceFiles is the best option.
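On your metadata question: metadata stored in the SequenceFile header travels with the file, so any task that opens the file can read it back. A minimal sketch of a mapper doing so in setup() (the class name and metadata keys are illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ImageMapper
        extends Mapper<IntWritable, BytesWritable, IntWritable, BytesWritable> {
    private int pixelsX;

    @Override
    protected void setup(Context context) throws IOException {
        // Each map task knows which SequenceFile its split came from,
        // so it can open that file and read the header metadata directly.
        Path path = ((FileSplit) context.getInputSplit()).getPath();
        try (SequenceFile.Reader reader = new SequenceFile.Reader(
                context.getConfiguration(), SequenceFile.Reader.file(path))) {
            SequenceFile.Metadata meta = reader.getMetadata();
            pixelsX = Integer.parseInt(
                meta.get(new Text("number_of_pixels_x")).toString());
        }
    }

    @Override
    protected void map(IntWritable key, BytesWritable image, Context context)
            throws IOException, InterruptedException {
        // ... image processing using pixelsX etc. ...
        context.write(key, image);
    }
}
```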