I have a folder in HDFS which has two subfolders, each of which has about 30 subfolders which, finally, each contain XML files. I want to list all the XML files, giving only the main folder's path. Locally I can do this with Apache commons-io's FileUtils.listFiles(). I have tried this:
FileStatus[] status = fs.listStatus(new Path(args[0]));
but it only lists the first two subfolders and doesn't go any further. Is there any way to do this in Hadoop?
Here is a code snippet that counts the number of files in a particular HDFS directory (I used this to determine how many reducers to use in a particular ETL job). You can easily modify it to suit your needs.
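A minimal sketch of such a counter (the exact snippet isn't reproduced here, so the FileCounter class, the getFileCount helper, and the top-level-only behavior are illustrative assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileCounter {

    // Counts the regular files directly under 'dir' (non-recursive sketch;
    // extend it to descend into subdirectories if you need that)
    public static int getFileCount(Configuration conf, Path dir) throws Exception {
        FileSystem fs = dir.getFileSystem(conf);
        int count = 0;
        for (FileStatus status : fs.listStatus(dir)) {
            if (status.isFile()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(getFileCount(new Configuration(), new Path(args[0])));
    }
}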
Have you tried this:
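FileSystem.listFiles(Path, boolean) does the recursion for you when the second argument is true (it requires a Hadoop version that has RemoteIterator, i.e. 0.23/2.x or later; the .xml name check below is only there to match the question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ListXmlFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path root = new Path(args[0]);
        FileSystem fs = root.getFileSystem(conf);

        // 'true' makes listFiles walk the whole subtree, not just the top level
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(root, true);
        while (it.hasNext()) {
            LocatedFileStatus status = it.next();
            if (status.getPath().getName().endsWith(".xml")) { // keep only the XML files
                System.out.println(status.getPath());
            }
        }
    }
}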
You'll need to use the FileSystem object and perform some logic on the resultant FileStatus objects to manually recurse into the subdirectories.
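A sketch of that manual recursion (listXmlFiles and XmlWalker are hypothetical names, not a Hadoop API):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class XmlWalker {

    // Recursively walks 'dir' and prints every file whose name ends in .xml
    static void listXmlFiles(FileSystem fs, Path dir) throws IOException {
        for (FileStatus status : fs.listStatus(dir)) {
            if (status.isDirectory()) {
                listXmlFiles(fs, status.getPath()); // descend into the subfolder
            } else if (status.getPath().getName().endsWith(".xml")) {
                System.out.println(status.getPath());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Path root = new Path(args[0]);
        listXmlFiles(root.getFileSystem(new Configuration()), root);
    }
}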
You can also apply a PathFilter to return only the XML files, using the listStatus(Path, PathFilter) method.
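For example (a fragment reusing the fs object from above; PathFilter is org.apache.hadoop.fs.PathFilter, and the anonymous filter accepting only names that end in .xml is an illustrative choice):

// Filter that accepts only paths whose names end in .xml
PathFilter xmlFilter = new PathFilter() {
    @Override
    public boolean accept(Path path) {
        return path.getName().endsWith(".xml");
    }
};
FileStatus[] xmlFiles = fs.listStatus(new Path(args[0]), xmlFilter);

Note that listStatus applies the filter to directory entries as well, so if you combine it with the recursion above, apply the filter only to files (or make accept return true for directories), otherwise you'll never descend into the subfolders.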
The Hadoop FsShell class has examples of this for the hadoop fs -lsr command, which is a recursive ls - see the source, around line 590 (the recursive step is triggered on line 635).