We have particular algorithm that we want to integrate with HDFS. The algorithm requires us to access data locally (the work would be done exclusively in the Mapper
). However, we do want to take advantage of HDFS in terms of distributing the file (providing reliability and striping). After the calculation is performed, we'd use the Reducer
to simply send back the answer, rather than perform any additional work. Avoiding network use is an explicit goal. Is there a configuration setting that would allow us to restrict network data access, so that when a MapReduce job is started it will only access it's local DataNode?
UPDATE: Adding a bit of context
We're attempting to analyze this problem with string matching. Assume our cluster has N nodes and a file is stored with N GB of text. The file is stored into HDFS and distributed in even parts to the nodes (1 part per node). Can we create a MapReduce job that launches one process on each node to access the part of the file that's sitting on the same host? Or, would the MapReduce framework unevenly distribute the work? (e.g. 1 job accessing all N part of the data, or .5N nodes attempting to process the whole file?
If you set the number of reduce tasks to zero you can skip the shuffling and therefore the network cost of your algorithm.
While creating your job this can be done with the following line of code
job.setNumReduceTasks(0);
I don't know what you algorithm will do but say it is a pattern matching algorithm looking for the occurrence of a particular word, then the mappers would report the number of matches per split. If you want to add the counts you need network communication and a reducer.
First google match on a map-only example I found:
Map-Only MR jobs
Setting the reducers to zero would increase the data locality. This means the intermediate data that have been generated by the Mappers will be stored on HDFS. Of course, you will not have any control of choosing the which nodes will store the intermediate data and if its size is greater than the number of the mapper slots * block size, then the remote access would be attempt to avoid starvation. My advice is to use delay scheduler and set locality-delay-node-ms and locality-delay-rack-ms to a large value (i.e. the maximum expected running time for your mappers). This will make the delay scheduler waits as much as possible before requesting data remotely. However, this may lead to resource under-utilization and increasing the running time (e.g. any node that does not store any data block will be idle for a long time locality-delay-node-ms + locality-delay-rack-ms).