I have two processes running. One writes files to HDFS and the other loads those files.
The first process (the one that writes the files) uses:
private void writeFileToHdfs(byte[] sourceStream, Path outFilePath) {
    FSDataOutputStream out = null;
    try {
        // create the file
        out = getFileSystem().create(outFilePath);
        out.write(sourceStream);
    } catch (Exception e) {
        LOG.error("Error while trying to write a file to hdfs", e);
    } finally {
        try {
            if (null != out)
                out.close();
        } catch (IOException e) {
            LOG.error("Could not close output stream to hdfs", e);
        }
    }
}
The second process reads those files for further processing.
When creating a file, it is first created and then populated with content. This process takes time (a few milliseconds, but still) and during this time the second process may pick up the file before it is fully written and closed.
Note that HDFS does not keep locking info in the namenode, so there is no daemon out there that can check whether a file is locked before accessing it.
I wonder what is the best way to resolve this issue.
Here are my thoughts:
- Copying the files to a new folder once they are fully written and closed; the second process then reads from this new folder.
- Renaming a file according to some naming convention once it is fully written and closed; the second process then reads only files that match this naming convention.
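The renaming idea could be sketched roughly as follows. This is only a local-filesystem stand-in written with java.nio, not HDFS code: the class name, the ".tmp" suffix and both method names are made up for illustration, and it assumes the move is atomic on the target file system. On HDFS the corresponding calls would be FileSystem.create followed by FileSystem.rename.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: local-filesystem stand-in for the HDFS case.
final class RenameWhenDone {
    // Assumed naming convention: files still being written end in ".tmp".
    static final String TMP_SUFFIX = ".tmp";

    // Producer side: write to "<name>.tmp", then rename to "<name>" once closed.
    static Path publish(Path dir, String name, byte[] content) throws IOException {
        Path tmp = dir.resolve(name + TMP_SUFFIX);
        Path done = dir.resolve(name);
        Files.write(tmp, content); // opens, writes and closes the file
        // Assumption: source and target are on the same file system,
        // so the move can be performed atomically.
        return Files.move(tmp, done, StandardCopyOption.ATOMIC_MOVE);
    }

    // Consumer side: only files without the temporary suffix count as complete.
    static boolean isReady(Path file) {
        return !file.getFileName().toString().endsWith(TMP_SUFFIX);
    }
}
```

With this convention the second process would list the folder and skip every entry for which isReady returns false.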
I have a feeling I'm trying to solve a well-known problem and I'm missing something. Is there a best practice for such a problem?
Apache Commons IO has something for that. Just touch the file, and an error will tell you if it is already locked.
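import java.io.File;
import java.io.IOException;
import org.apache.commons.io.FileUtils;

boolean fileAvail = false;
try {
    // FileUtils.touch takes a local java.io.File (not an HDFS Path)
    FileUtils.touch(new File(fileName)); // throws IOException if being used
    fileAvail = true;
} catch (IOException e) {
    fileAvail = false;
}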
(Also) try-with-resources
Since Java 7 you can use this on anything that implements AutoCloseable (which includes Closeable), such as files, sockets and database connections. The resource is closed automatically as soon as the try block ends:
try (FSDataOutputStream out = getFileSystem().create(outFilePath))
{
//use out in here
}
//No finally required - catch is optional
...saves all that extra code
Are you talking about two separate processes here or about two separate threads within the same (JVM) process?
Either way, this is a producer-consumer problem, and what you are missing is proper synchronization between the producer and the consumer. If you are running two threads within the same JVM process, you could use a BlockingQueue to pass some sort of "file finished" token, for example the file's name, from the producer to the consumer once a file is fully written and its stream is closed. When the consumer finds a file name in the queue, it can be certain that the file is fully written and closed, because the producer has confirmed it.
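A minimal sketch of that hand-off, assuming both sides are threads in one JVM. The class name, the DONE sentinel and the run method are hypothetical, and the actual file writing and reading are left as comments:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical demo class: the producer announces finished files via a queue.
final class FinishedFileQueue {
    // Assumed sentinel telling the consumer that no more files will arrive.
    static final String DONE = "__DONE__";

    // Starts one producer and one consumer thread and returns the file
    // names in the order the consumer received them.
    static List<String> run(List<String> fileNames) throws InterruptedException {
        BlockingQueue<String> finished = new LinkedBlockingQueue<>();
        List<String> consumed = new ArrayList<>();

        Thread producer = new Thread(() -> {
            for (String name : fileNames) {
                // ... write the file and close its stream here ...
                finished.add(name); // announce only after the stream is closed
            }
            finished.add(DONE);
        });

        Thread consumer = new Thread(() -> {
            try {
                String name;
                // take() blocks until the producer has announced a file
                while (!(name = finished.take()).equals(DONE)) {
                    consumed.add(name); // the file is safe to open at this point
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
        return consumed;
    }
}
```

The consumer never scans the directory itself; it only learns about a file after the producer has put its name in the queue.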
However, if you are using two different processes, the problem is a little harder to solve, depending on the other component's language and the networking setup. You would have to implement some sort of queue that both processes can use, for example by sending information over a local network port, so that each process knows about the other's work.
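For two processes on the same machine, one rough way to send such a token is a single line of text over a loopback TCP connection. The sketch below is an assumption-laden illustration, not a full protocol: the class and method names are invented, there is one message per connection, and retries and error handling are left out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: one "file ready" message per TCP connection.
final class FileReadyNotifier {
    // Consumer side: block until one announcement arrives, return the file name.
    static String awaitOne(ServerSocket server) throws IOException {
        try (Socket s = server.accept();
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(s.getInputStream(), StandardCharsets.UTF_8))) {
            return in.readLine();
        }
    }

    // Producer side: call only after the file has been written and closed.
    static void announce(int port, String fileName) throws IOException {
        try (Socket s = new Socket(InetAddress.getLoopbackAddress(), port);
             PrintWriter out = new PrintWriter(
                     new OutputStreamWriter(s.getOutputStream(), StandardCharsets.UTF_8), true)) {
            out.println(fileName);
        }
    }
}
```

The consumer process would open a ServerSocket on an agreed port and call awaitOne in a loop, while the producer calls announce with each finished file's path.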
No matter what, I would always avoid moving files around on the file system, since this is a rather expensive operation compared to sending simple tokens. Moving files around might also expose files that have not yet been completely moved, depending on the language you are using.
Do you really need two processes here? Why don't you create two threads and then join them?