Expected consumption of open file descriptors in H

2019-07-23 22:28发布

问题:

Given Hadoop 0.21.0, what assumptions does the framework make regarding the number of open file descriptors relative to each individual map and reduce operation? Specifically, what suboperations cause Hadoop to open a new file descriptor during job execution or spill to disk?

(This is deliberately ignoring use of MultipleOutputs, as it very clearly screws with the guarantees provided by the system.)

My rationale here is simple: I'd like to ensure each job I write for Hadoop guarantees a finite number of required file descriptors for each mapper or reducer. Hadoop cheerfully abstracts this away from the programmer, which would normally be A Good Thing, if not for the other shoe dropping during server management.

I'd originally asked this question on Server Fault from the cluster management side of things. Since I'm also responsible for programming, this question is equally pertinent here.

回答1:

Here's a post that offers some insight into the problem:

This happens because more small files are created when you use MultipleOutputs class. Say you have 50 mappers then assuming that you don't have skewed data, Test1 will always generate exactly 50 files but Test2 will generate somewhere between 50 to 1000 files (50Mappers x 20TotalPartitionsPossible) and this causes a performance hit in I/O. In my benchmark, 199 output files were generated for Test1 and 4569 output files were generated for Test2.

This implies that, for normal behavior, the number of mappers is exactly equivalent to the number of open file descriptors. MultipleOutputs obviously skews this number by the number of mappers multiplied by the number of available partitions. Reducers then proceed as normal, generating one file (and thus, one file descriptor) per reduce operation.

The problem then becomes: during a spill operation, most of these files are being held open by each mapper as output is cheerfully martialled by split. Hence the available file descriptors problem.

Thus, the currently-assumed, maximum file descriptor limit should be:

Map phase: number of mappers * total partitions possible

Reduce phase: number of reduce operations * total partitions possible

And that, as we say, is that.