As getLocalCacheFiles() is deprecated, I'm trying to find an alternative. getCacheFiles() seems to be one, but I doubt whether they are the same.
When you call addCacheFile(), the file in HDFS is downloaded to every node, and getLocalCacheFiles()
gives you the localized file path so you can read it from the local file system. However, what getCacheFiles()
returns is the URI of the file in HDFS. If I read the file through that URI, I suspect I would still be reading from HDFS rather than from the local file system.
That is my understanding, and I don't know whether it's correct. If it is, what is the alternative to getLocalCacheFiles()?
And why did Hadoop deprecate it in the first place?
It's open source, so you can always use git blame to find the commit that introduced the @Deprecated annotation:
commit 735b50e8bd23f7fbeff3a08cf8f3fff8cbff7449, which is for MAPREDUCE-4493. At the tail of the JIRA you'll find this discussion:
Omkar Vinit Joshi added a comment - 13/Jul/13 00:18
Robert Joseph Evans if we are deprecating getLocalCacheFiles and getCacheFiles in
jobContext() then how the user is going to get local cached files in
map task? YARN-916 is the related issue.. Thanks.
Robert Joseph Evans added a comment - 19/Jul/13 15:27
Omkar Vinit Joshi By opening the symbolic link in the current working directory. Prior to YARN the
default behavior was to not create symlinks in the current working
directory pointing to the items in the distributed cache. If you
wanted links you had to specifically turn that option on and provide
the name of the symlink you wanted. The only way to get to files
without symlinks was to call getLocalCacheFiles and getCacheFiles. In
YARN all files will have a symlink created. The name of the
file/directory will be the name of the symlink. However, it is
possible to have a name collision where I wanted hdfs://foo/bar.zip
and hdfs://bar/bar.zip. In 1.0 both of these would have been
downloaded and accessible through the deprecated APIs, but in YARN a
warning will be output and only one of them will be downloaded. Also
because of the way these APIs were written the mapper code may not
know that only one of them was downloaded and will not be able to find
the missing one and blow up. That is why I deprecated them in favor of
nudging people to always use the symlinks so the behavior is always
consistent.
Omkar Vinit Joshi added a comment - 19/Jul/13 16:56
Robert Joseph Evans sounds good.. however by this we will be putting
limitation based on file name ..but that sounds reasonable considering
the fact that this will stop potential bugs in map code and users can
definitely version them to avoid it... Thanks...
So you're supposed to just open the file by its name in the task's current working directory; the symlink will be there. There is no dedicated API.
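
A minimal sketch of what "just open the file" could look like, using only the JDK so it runs without Hadoop on the classpath. The class and method names (CacheSymlinks, linkName, firstLine) are hypothetical and not part of any Hadoop API. It assumes the symlink is named after the URI fragment when one was given (for example "...#lookup"), and after the last path component otherwise, which matches the naming described in the JIRA comment above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not a Hadoop class.
class CacheSymlinks {

    // Assumed symlink name for a cache URI: the fragment if present,
    // otherwise the file name (last component of the URI path).
    static String linkName(URI cacheUri) {
        String fragment = cacheUri.getFragment();
        if (fragment != null && !fragment.isEmpty()) {
            return fragment;
        }
        String path = cacheUri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // Reads the first line of the localized copy through a relative path,
    // i.e. through the symlink in the task's current working directory.
    static String firstLine(URI cacheUri) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get(linkName(cacheUri)), StandardCharsets.UTF_8)) {
            return reader.readLine();
        }
    }
}
```

In a mapper you would typically call something like firstLine(...) from setup(), passing one of the URIs returned by context.getCacheFiles(). Note that hdfs://foo/bar.zip and hdfs://bar/bar.zip both map to the name bar.zip here, which is exactly the collision Robert Joseph Evans describes; adding a distinct fragment to each URI is one way to keep them apart.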