tl;dr: To be able to use wildcards (globs) in the listed paths, one simply has to use globStatus(...) instead of listStatus(...).
Context
Files on my HDFS cluster are organized in partitions, with the date being the "root" partition. A simplified example of the file structure would look like this:
/schemas_folder
├── date=20140101
│   ├── A-schema.avsc
│   └── B-schema.avsc
├── date=20140102
│   ├── A-schema.avsc
│   ├── B-schema.avsc
│   └── C-schema.avsc
└── date=20140103
    ├── B-schema.avsc
    └── C-schema.avsc
In my case, the directory stores Avro schemas for different types of data (A, B and C in this example) at different dates. The schema might start existing, evolve and stop existing... as time passes.
Goal
I need to be able to get all the schemas that exist for a given type, as quickly as possible. For example, to get all the schemas that exist for type A, I would like to do the equivalent of the following:
hdfs dfs -ls /schemas_folder/date=*/A-schema.avsc
That would give me
Found 1 items
-rw-r--r-- 3 user group 1234 2014-01-01 12:34 /schemas_folder/date=20140101/A-schema.avsc
Found 1 items
-rw-r--r-- 3 user group 2345 2014-01-02 23:45 /schemas_folder/date=20140102/A-schema.avsc
Problem
I don't want to be using the shell command, and cannot seem to find the equivalent to that command above in the Java APIs. When I try to implement the looping myself, I get terrible performance. I want at least the performance of the command line (around 3 seconds in my case)...
What I found so far
One can notice that it prints Found 1 items twice, once before each result. It does not print Found 2 items once at the beginning. That probably hints that wildcards are not implemented on the FileSystem side but somehow handled by the client. I can't seem to find the right source code to look at to see how that was implemented.
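To make the "handled by the client" idea concrete, here is a minimal sketch of what expanding a glob one path component at a time could look like. It is only an illustration under stated assumptions: the tree is a toy in-memory map mirroring the example layout above, not HDFS; the class name GlobExpansionSketch and the expand method are made up for this example; only the * wildcard is supported; and this is not claimed to be how Hadoop actually implements it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class GlobExpansionSketch {
    // Toy in-memory tree (NOT HDFS): directory path -> names of its direct children.
    static final Map<String, List<String>> TREE = new HashMap<>();
    static {
        TREE.put("/", Arrays.asList("schemas_folder"));
        TREE.put("/schemas_folder",
                Arrays.asList("date=20140101", "date=20140102", "date=20140103"));
        TREE.put("/schemas_folder/date=20140101",
                Arrays.asList("A-schema.avsc", "B-schema.avsc"));
        TREE.put("/schemas_folder/date=20140102",
                Arrays.asList("A-schema.avsc", "B-schema.avsc", "C-schema.avsc"));
        TREE.put("/schemas_folder/date=20140103",
                Arrays.asList("B-schema.avsc", "C-schema.avsc"));
    }

    // Expands an absolute pattern one component at a time.
    // Only '*' is treated as a wildcard; everything else is matched literally.
    static List<String> expand(String pattern) {
        List<String> candidates = new ArrayList<>();
        candidates.add(""); // the empty string stands for the root
        for (String component : pattern.substring(1).split("/")) {
            // Quote the component, then re-open the quoting around each '*'.
            Pattern regex = Pattern.compile(
                    Pattern.quote(component).replace("*", "\\E.*\\Q"));
            List<String> next = new ArrayList<>();
            for (String parent : candidates) {
                String dir = parent.isEmpty() ? "/" : parent;
                // Only directories that survived the previous level are listed.
                for (String child : TREE.getOrDefault(dir, Collections.emptyList())) {
                    if (regex.matcher(child).matches()) {
                        next.add(parent + "/" + child);
                    }
                }
            }
            candidates = next;
        }
        return candidates;
    }

    public static void main(String[] args) {
        for (String match : expand("/schemas_folder/date=*/A-schema.avsc")) {
            System.out.println(match);
        }
    }
}
```

With the toy tree above, the pattern /schemas_folder/date=*/A-schema.avsc expands to the two A-schema.avsc paths under date=20140101 and date=20140102. Each of those could then be listed separately, which would be consistent with a header being printed once per match.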
Below are my first shots, probably a bit too naïve...
Using listFiles(...)
Code:
RemoteIterator<LocatedFileStatus> files = filesystem.listFiles(new Path("/schemas_folder"), true);
Pattern pattern = Pattern.compile("^.*/date=[0-9]{8}/A-schema\\.avsc$");
while (files.hasNext()) {
    Path path = files.next().getPath();
    if (pattern.matcher(path.toString()).matches())
    {
        System.out.println(path);
    }
}
Result:
This prints exactly what I would expect, but since it first lists everything recursively and then filters, the performance is really poor. With my current dataset, it takes almost 25 seconds...
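As a side note, the regular expression from the listFiles(...) attempt above can be checked on its own, without a cluster. The sketch below applies the same expression to a few sample strings. The class name SchemaPathPatternCheck, the helper isTypeASchema and the hdfs://namenode:8020 prefix are made up for illustration; the prefix only stands in for whatever scheme and authority a fully qualified path might carry, which the leading .* absorbs.

```java
import java.util.regex.Pattern;

public class SchemaPathPatternCheck {
    // Same expression as in the listFiles(...) attempt.
    static final Pattern PATTERN =
            Pattern.compile("^.*/date=[0-9]{8}/A-schema\\.avsc$");

    // Returns true when the whole string looks like a type-A schema path.
    static boolean isTypeASchema(String path) {
        return PATTERN.matcher(path).matches();
    }

    public static void main(String[] args) {
        // "namenode:8020" is a hypothetical authority, used only as sample input.
        String[] samples = {
            "hdfs://namenode:8020/schemas_folder/date=20140101/A-schema.avsc",
            "/schemas_folder/date=20140102/A-schema.avsc",
            "/schemas_folder/date=20140102/B-schema.avsc",
            "/schemas_folder/date=2014010/A-schema.avsc" // only 7 digits
        };
        for (String sample : samples) {
            System.out.println(isTypeASchema(sample) + "  " + sample);
        }
    }
}
```

The first two samples match and the last two do not, so the filter itself is correct; the cost comes from the recursive listing that feeds it.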
Using listStatus(...)
Code:
FileStatus[] statuses = filesystem.listStatus(new Path("/schemas_folder"), new PathFilter()
{
    private final Pattern pattern = Pattern.compile("^date=[0-9]{8}$");

    @Override
    public boolean accept(Path path)
    {
        return pattern.matcher(path.getName()).matches();
    }
});
Path[] paths = new Path[statuses.length];
for (int i = 0; i < statuses.length; i++) { paths[i] = statuses[i].getPath(); }
statuses = filesystem.listStatus(paths, new PathFilter()
{
    @Override
    public boolean accept(Path path)
    {
        return "A-schema.avsc".equals(path.getName());
    }
});
for (FileStatus status : statuses)
{
    System.out.println(status.getPath());
}
Result:
Thanks to the PathFilters and the use of arrays, it seems to perform faster (around 12 seconds). The code is more complex, though, and more difficult to adapt to different situations. Most importantly, the performance is still 3 to 4 times slower than the command-line version!
Question
What am I missing here? What is the fastest way to get the results I want?
Updates
2014.07.09 - 13:38
The answer proposed by Mukesh S is apparently the best possible API approach.
In the example I gave above, the code ends up looking like this:
FileStatus[] statuses = filesystem.globStatus(new Path("/schemas_folder/date=*/A-schema.avsc"));
for (FileStatus status : statuses)
{
    System.out.println(status.getPath());
}
This is the best-looking and best-performing code I could come up with so far, but it is still not performing as well as the shell version.