I would like to create a file during my program. However, I don't want this file to be written to HDFS but to the local filesystem of the datanode where the map operation is executed.
I tried the following approach:
public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException {
  // do some hadoop stuff, like counting words
  String path = "newFile.txt";
  try {
    File f = new File(path);
    f.createNewFile();
  } catch (IOException e) {
    System.out.println("Message easy to look up in the logs.");
    System.err.println("Error easy to look up in the logs.");
    e.printStackTrace();
    throw e;
  }
}
With an absolute path, I get the file where it's supposed to be. With a relative path, however, this code doesn't produce any error, neither in the console from which I run the program nor in the job logs. Yet I can't manage to find the file that should have been created (right now, I'm working on a local cluster).
Any ideas where to find either the file or the error message? If there is indeed an error, how should I proceed to write files to the local filesystem of the datanodes?
newFile.txt is a relative path, so the file will show up relative to your map task process's working directory, which lands somewhere under the directories the NodeManager uses for containers. These are controlled by the configuration property yarn.nodemanager.local-dirs in yarn-site.xml, or the default inherited from yarn-default.xml, which is under /tmp:
<property>
  <description>List of directories to store localized files in. An
    application's localized file directory will be found in:
    ${yarn.nodemanager.local-dirs}/usercache/${user}/appcache/application_${appid}.
    Individual containers' work directories, called container_${contid}, will
    be subdirectories of this.
  </description>
  <name>yarn.nodemanager.local-dirs</name>
  <value>${hadoop.tmp.dir}/nm-local-dir</value>
</property>
Here is a concrete example of one such directory in my test environment:
/tmp/hadoop-cnauroth/nm-local-dir/usercache/cnauroth/appcache/application_1363932793646_0002/container_1363932793646_0002_01_000001
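If you want to confirm where that relative path resolves on your cluster, one quick diagnostic (my addition, not something from your original code) is to print the working directory from inside map(); the output will appear in the task's stdout log:
// Diagnostic only: shows the container's working directory, which is where
// a relative path like "newFile.txt" ends up.
System.out.println("map task working directory: " + System.getProperty("user.dir"));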
These directories are scratch space for container execution, so they aren't something that you can rely on for persistence. A background thread periodically deletes these files for completed containers. It is possible to delay the cleanup by setting the configuration property yarn.nodemanager.delete.debug-delay-sec
in yarn-site.xml:
<property>
  <description>
    Number of seconds after an application finishes before the nodemanager's
    DeletionService will delete the application's localized file directory
    and log directory.

    To diagnose Yarn application problems, set this property's value large
    enough (for example, to 600 = 10 minutes) to permit examination of these
    directories. After changing the property's value, you must restart the
    nodemanager in order for it to have an effect.

    The roots of Yarn applications' work directories is configurable with
    the yarn.nodemanager.local-dirs property (see below), and the roots
    of the Yarn applications' log directories is configurable with the
    yarn.nodemanager.log-dirs property (see also below).
  </description>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>0</value>
</property>
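For example (my own illustration, not part of the shipped defaults), overriding it in yarn-site.xml like this keeps container directories around for 10 minutes after the application finishes; remember to restart the NodeManager afterwards:
<property>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>600</value>
</property>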
However, please keep in mind that this configuration is intended only for troubleshooting issues so that you can see the directories more easily. It's not recommended as a permanent production configuration. If application logic depends on the delete delay, then that's likely to cause a race condition between the application logic attempting to access the directory and the NodeManager attempting to delete it. Leaving files lingering from old container executions also risks cluttering the local disk space.
The log messages would go to the stdout/stderr of the map task logs, but I suspect execution isn't reaching those log statements. Instead, I suspect that you're creating the file successfully, but either it's not easily findable (the directory structure contains somewhat unpredictable things like the application ID and container ID managed by YARN), or the file is getting cleaned up before you can get to it.
If you changed the code to use an absolute path pointing to some other directory, that would help you find the file. However, I don't expect this approach to work well in practice. Since Hadoop is distributed, you may have a hard time finding which node in a cluster of hundreds or thousands contains the local file that you want. Instead, you might be better off writing to HDFS and then pulling the files you need down to the node where you launched the job, along the lines of the sketch below.
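As a rough sketch of that alternative (my own illustration using the standard FileSystem API and a placeholder output path, not code from the question), each map task could write its file straight to HDFS, with the task attempt ID in the name so concurrent tasks don't collide:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;

public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException {
  // do some hadoop stuff, like counting words
  Configuration conf = context.getConfiguration();
  FileSystem fs = FileSystem.get(conf);
  // Placeholder path; the task attempt ID keeps parallel tasks from
  // overwriting each other's files.
  Path out = new Path("/tmp/sideoutput/newFile_" + context.getTaskAttemptID() + ".txt");
  try (FSDataOutputStream os = fs.create(out, false)) {
    os.writeUTF("contents to persist");
  }
}
After the job finishes, something like hdfs dfs -get /tmp/sideoutput ./sideoutput on the launching node pulls everything down in one step.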