I have a folder in HDFS that has two subfolders; each of those has about 30 subfolders which, finally, each contain XML files. I want to list all the XML files, giving only the main folder's path. Locally I can do this with Apache commons-io's FileUtils.listFiles(). I have tried this:
FileStatus[] status = fs.listStatus( new Path( args[ 0 ] ) );
but it only lists the two first-level subfolders and doesn't go any deeper. Is there any way to do this in Hadoop?
Code snippet for both recursive and non-recursive approaches:
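A minimal sketch of both variants, assuming a default Configuration and filtering for the question's .xml files (the class and method names here are mine, not from the original snippet):

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsLister {

        // Recursive: call listStatus() again for every directory found.
        public static void listRecursive(FileSystem fs, Path dir, List<Path> out)
                throws Exception {
            for (FileStatus status : fs.listStatus(dir)) {
                if (status.isDirectory()) {
                    listRecursive(fs, status.getPath(), out);
                } else if (status.getPath().getName().endsWith(".xml")) {
                    out.add(status.getPath());
                }
            }
        }

        // Non-recursive: a single listStatus() call returns only the direct
        // children, which is why the code in the question stops at level one.
        public static void listOneLevel(FileSystem fs, Path dir) throws Exception {
            for (FileStatus status : fs.listStatus(dir)) {
                System.out.println(status.getPath());
            }
        }

        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            List<Path> xmlFiles = new ArrayList<>();
            listRecursive(fs, new Path(args[0]), xmlFiles);
            for (Path p : xmlFiles) {
                System.out.println(p);
            }
        }
    }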
If you are using the hadoop 2.* API, there are more elegant solutions:
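This most likely means FileSystem.listFiles(path, recursive), which does the directory walking for you and returns a RemoteIterator over files only. A sketch under that assumption:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocatedFileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;

    public class Hadoop2Lister {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // recursive = true: descends into all subfolders and
            // yields files only, never directories.
            RemoteIterator<LocatedFileStatus> it =
                    fs.listFiles(new Path(args[0]), true);

            while (it.hasNext()) {
                LocatedFileStatus status = it.next();
                if (status.getPath().getName().endsWith(".xml")) {
                    System.out.println(status.getPath());
                }
            }
        }
    }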
Now, one can use Spark to do the same, and it's way faster than other approaches (such as Hadoop MR). Here is the code snippet.
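The snippet below is a sketch against the Spark 2.x Java API rather than the original code (the app name and the one-task-per-top-level-folder strategy are mine): the driver collects the first-level subfolders, then each executor task walks one subtree.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SparkLister {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setAppName("hdfs-lister");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Driver side: collect the first-level subfolders.
            FileSystem fs = FileSystem.get(sc.hadoopConfiguration());
            List<String> topDirs = new ArrayList<>();
            for (FileStatus s : fs.listStatus(new Path(args[0]))) {
                if (s.isDirectory()) {
                    topDirs.add(s.getPath().toString());
                }
            }

            // Executor side: each task walks one subtree with a queue.
            List<String> files = sc.parallelize(topDirs, Math.max(1, topDirs.size()))
                .flatMap(dir -> {
                    FileSystem efs = FileSystem.get(new Configuration());
                    List<String> found = new ArrayList<>();
                    Deque<Path> pending = new ArrayDeque<>();
                    pending.add(new Path(dir));
                    while (!pending.isEmpty()) {
                        for (FileStatus st : efs.listStatus(pending.poll())) {
                            if (st.isDirectory()) {
                                pending.add(st.getPath());
                            } else {
                                found.add(st.getPath().toString());
                            }
                        }
                    }
                    return found.iterator();
                })
                .collect();

            files.forEach(System.out::println);
            sc.stop();
        }
    }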
Quick example: suppose you have the following file structure:
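(The tree below is made up to match the question's layout: a main folder, two subfolders, XML files in the leaves.)

    /main
        /sub1
            /dir01
                a.xml
            /dir02
                b.xml
        /sub2
            /dir01
                c.xml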
Using the code above, you get:
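With the made-up tree above, the else branch stores full paths, so this prints:

    /main/sub1/dir01/a.xml
    /main/sub1/dir02/b.xml
    /main/sub2/dir01/c.xml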
If you want only the leaf names (i.e. the file names), change what the else block adds, as shown below.
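In the sketch above, that hypothetical one-line change is:

    // in the else branch: store just the file name instead of the full path
    found.add(st.getPath().getName());

This will give:

    a.xml
    b.xml
    c.xml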
Thanks Radu Adrian Moldovan for the suggestion.
Here is an implementation using a queue:
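A sketch of such a breadth-first walk (the class and method names are mine):

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Queue;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class QueueLister {

        // Breadth-first: directories wait on a queue instead of the call
        // stack, so arbitrarily deep trees cannot overflow the stack.
        public static List<Path> listAll(FileSystem fs, Path root) throws Exception {
            List<Path> files = new ArrayList<>();
            Queue<Path> pending = new ArrayDeque<>();
            pending.add(root);
            while (!pending.isEmpty()) {
                for (FileStatus status : fs.listStatus(pending.poll())) {
                    if (status.isDirectory()) {
                        pending.add(status.getPath());
                    } else {
                        files.add(status.getPath());
                    }
                }
            }
            return files;
        }

        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            for (Path p : listAll(fs, new Path(args[0]))) {
                System.out.println(p);
            }
        }
    }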
Don't use a recursive approach (risk of blowing the stack on deep trees) :) use a queue instead.
That was easy, enjoy!