which files ignored as input by mapper?

2019-01-25 19:42发布

问题:

I'm chaining multiple MapReduce jobs and want to pass along/store some meta information (e.g. configuration or name of original input) with the results. At least the file "_SUCCESS" and also anything in the directory "_logs" seams to be ignored.

Are there any filename patterns which are by default ignored by the InputReader? Or is this just a fixed limited list?

回答1:

The FileInputFormat uses the following hiddenFileFilter by default:

  private static final PathFilter hiddenFileFilter = new PathFilter(){
      public boolean accept(Path p){
        String name = p.getName(); 
        return !name.startsWith("_") && !name.startsWith("."); 
      }
    }; 

So if you uses any FileInputFormat (such as TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat), the hidden files (the file name starts with "_" or ".") will be ignored.

You can use FileInputFormat.setInputPathFilter to set your custom PathFilter. Remember that the hiddenFileFilter is always active.