I have 10M+ photos saved on the local file system. Now I want to go through each of them and analyze the photo's binary data to see if it's a dog. I basically want to do the analysis on a clustered Hadoop environment. The problem is, how should I design the input for the map method? Let's say that, in the map method,
new FaceDetection(photoInputStream).isDog()
is all the underlying logic for the analysis.
Specifically:
Should I upload all of the photos to HDFS? Assuming yes, how can I use them in the map method?
Is it OK to make the input to the map a text file containing all of the photo paths (in HDFS), one per line, and in the map method load the binary like: photoInputStream = getImageFromHDFS(photopath); (Actually, what is the right method to load a file from HDFS during the execution of the map method?)
It seems I'm missing some knowledge about the basic principles of Hadoop, map/reduce, and HDFS, but can you please point me in the right direction on the above questions? Thanks!
I was on a project a while back (2008?) where we did something very similar with Hadoop. I believe we initially used HDFS to store the pics, then we created a text file that listed the files to process. The concept is that you use map/reduce to break the text file into pieces and spread them out across the cloud, letting each node process some of the files based on the portion of the list that it receives. Sorry I don't remember more explicit details, but this was the general approach.
Assuming it takes a second to put each file into the sequence file, it will take ~115 days to convert the individual files into a sequence file. Even with parallel processing on a single machine, I don't see much improvement, because disk read/write will be a bottleneck when reading the photo files and writing the sequence file. Check this Cloudera article on the small files problem. It also references a script that converts a tar file into a sequence file, along with how much time the conversion took.
Basically, the photos have to be processed in a distributed way to convert them into a sequence file. Back to Hadoop :)
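The ~115 day figure is just 10M files at one second each. A minimal sketch of that arithmetic follows; the 100-worker case is a made-up illustration of why the conversion wants to be distributed, and it ignores scheduling overhead and stragglers.

```java
import java.time.Duration;

class ConversionTimeEstimate {

    // Time to convert fileCount files one after another at secondsPerFile each.
    static Duration sequential(long fileCount, double secondsPerFile) {
        return Duration.ofMillis(Math.round(fileCount * secondsPerFile * 1000));
    }

    // Idealized time when the same work is split evenly across `workers` machines.
    // Assumption: perfect scaling, no overhead.
    static Duration distributed(long fileCount, double secondsPerFile, int workers) {
        return sequential(fileCount, secondsPerFile).dividedBy(workers);
    }

    public static void main(String[] args) {
        long files = 10_000_000L;
        System.out.println(sequential(files, 1.0).toDays() + " full days on one machine");
        // 100 workers is a hypothetical cluster size, not a number from this thread.
        System.out.println(distributed(files, 1.0, 100).toHours() + " full hours on 100 workers");
    }
}
```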
According to Hadoop: The Definitive Guide, as a rule of thumb each file, directory, and block takes about 150 bytes of memory on the NameNode. So directly loading 10M files, each occupying one block, will require around 3,000 MB of memory just for storing the namespace on the NameNode. Forget about streaming the photos across nodes during the execution of the job.
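For reference, the 3,000 MB estimate can be reproduced as below. This is a rough sketch that assumes the commonly quoted rule of thumb of about 150 bytes of NameNode memory per file, directory, or block object, and that every photo fits in a single block (one file object plus one block object per photo).

```java
class NameNodeMemoryEstimate {

    // Rule-of-thumb figure, not an exact measurement.
    static final long BYTES_PER_OBJECT = 150;

    // Each file costs one file object plus one object per block it occupies.
    static long estimateBytes(long fileCount, long blocksPerFile) {
        return fileCount * (1 + blocksPerFile) * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(10_000_000L, 1);
        System.out.println(bytes / 1_000_000 + " MB of NameNode heap for the namespace");
    }
}
```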
There should be a better way of solving this problem.
Another approach is to load the files as-is into HDFS and use CombineFileInputFormat, which combines the small files into an input split and considers data locality while calculating the input splits. The advantage of this approach is that the files can be loaded into HDFS as-is without any conversion, and there is also not much data shuffling across nodes.
The major problem is that each image is going to be in its own file. So if you have 10M files, you'll have 10M mappers, which doesn't sound terribly reasonable. You may want to consider pre-serializing the files into SequenceFiles (one image per key-value pair). This will make loading the data into the MapReduce job native, so you don't have to write any tricky code. Also, you'll be able to store all of your data in one SequenceFile, if you so desire. Hadoop handles splitting SequenceFiles quite well.
Basically, the way this works is that you have a separate Java process that takes several image files, reads the raw bytes into memory, then stores the data as a key-value pair in a SequenceFile. Keep going and keep writing into HDFS. This may take a while, but you'll only have to do it once.
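A minimal sketch of such a packing process is shown below. It assumes the Hadoop client libraries (2.x or later) are on the classpath. The class name ImagePacker and the choice of the file path as key and the raw bytes as value are illustrative assumptions, not something prescribed in this thread.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

class ImagePacker {

    // Appends each local image to one SequenceFile as (path, raw bytes).
    // The output Path is resolved against the file system configured in conf,
    // so an hdfs:// path (or fs.defaultFS) sends the data to HDFS.
    static void pack(List<String> localImages, Path output, Configuration conf)
            throws IOException {
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (String image : localImages) {
                // Read the whole image into memory; fine for photo-sized files.
                byte[] bytes = Files.readAllBytes(Paths.get(image));
                writer.append(new Text(image), new BytesWritable(bytes));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: ImagePacker <output.seq> <image>...");
            return;
        }
        List<String> images = Arrays.asList(args).subList(1, args.length);
        pack(images, new Path(args[0]), new Configuration());
    }
}
```

A job can then read the result with SequenceFileInputFormat, with each map call receiving one (Text, BytesWritable) pair, so the mapper can wrap the bytes in a ByteArrayInputStream for the detection code.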
Using a text file of photo paths as the map input is not OK if you have any sort of reasonable cluster (which you should if you are considering Hadoop for this) and you actually want to be using the power of Hadoop. Your MapReduce job will fire off and load the files, but the mappers will be running data-local to the text files, not the images! So, basically, you are going to be shuffling the image files everywhere, since the JobTracker is not placing tasks where the image files are. This will incur a significant amount of network overhead. If you have 1TB of images, you can expect that a lot of them will be streamed over the network if you have more than a few nodes. This may not be so bad depending on your situation and cluster size (less than a handful of nodes).
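To put a rough number on "a lot of them", here is a back-of-the-envelope sketch. It assumes the HDFS default replication factor of 3 and, as a simplification, that replicas and map tasks land on nodes uniformly at random (real placement is rack-aware, so treat this as an order-of-magnitude guide only). Under that assumption, a mapper finds a local copy of a given image with probability of about replication / nodes.

```java
class RemoteReadEstimate {

    // Approximate fraction of image reads that must cross the network when
    // tasks are scheduled without regard to where the image blocks live.
    static double remoteFraction(int nodes, int replication) {
        if (nodes <= 0 || replication <= 0) {
            throw new IllegalArgumentException("nodes and replication must be positive");
        }
        // With at least as many replicas as nodes, every node holds a copy.
        return Math.max(0.0, 1.0 - (double) replication / nodes);
    }

    public static void main(String[] args) {
        int replication = 3; // HDFS default
        for (int nodes : new int[] {3, 5, 20, 100}) {
            double remoteGb = remoteFraction(nodes, replication) * 1000; // of 1 TB
            System.out.printf("%d nodes: about %.0f GB of 1000 GB read remotely%n",
                    nodes, remoteGb);
        }
    }
}
```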
If you do want to read the images by path from inside the mapper, you can use the FileSystem API to open the files (you want the open method).
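A minimal sketch of a mapper that does this is below, assuming the newer org.apache.hadoop.mapreduce API and a TextInputFormat input with one photo path per line. The class name PhotoPathMapper and the isDog stub are hypothetical; the stub stands in for the question's new FaceDetection(photoInputStream).isDog().

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BooleanWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

class PhotoPathMapper extends Mapper<LongWritable, Text, Text, BooleanWritable> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String photoPath = line.toString().trim();
        if (photoPath.isEmpty()) {
            return; // skip blank lines in the path list
        }
        try (InputStream in = open(context.getConfiguration(), photoPath)) {
            context.write(new Text(photoPath), new BooleanWritable(isDog(in)));
        }
    }

    // Opens a file through the Hadoop FileSystem API. The FileSystem instance
    // is cached by Hadoop, so it is deliberately not closed here.
    static InputStream open(Configuration conf, String photoPath) throws IOException {
        Path path = new Path(photoPath);
        FileSystem fs = path.getFileSystem(conf);
        return fs.open(path);
    }

    // Hypothetical placeholder: replace the body with
    // new FaceDetection(photo).isDog() from the question.
    static boolean isDog(InputStream photo) throws IOException {
        return false;
    }
}
```

Note that this works functionally, but each open call may pull the image from another node, which is the network cost described above.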