Reading a file from HDFS only after it is fully written

Posted 2019-05-08 06:32

Question:

I have two processes running. One writes files to HDFS and the other loads those files.

The first process (the one that writes the files) uses:

private void writeFileToHdfs(byte[] sourceStream, Path outFilePath) {
    FSDataOutputStream out = null;
    try {
        // create the file
        out = getFileSystem().create(outFilePath);
        out.write(sourceStream);
    } catch (Exception e) {
        LOG.error("Error while trying to write a file to hdfs", e);
    } finally {
        try {
            if (null != out) {
                out.close();
            }
        } catch (IOException e) {
            LOG.error("Could not close output stream to hdfs", e);
        }
    }
}

The second process reads those files for further processing. A file is first created and then populated with content. This takes time (a few milliseconds, but still), and during that window the second process may pick up the file before it is fully written and closed.

Note that HDFS does not keep locking info in the namenode, so there is no daemon out there that can check whether the file is locked before accessing it.

What is the best way to resolve this issue?

Here are my thoughts:

  1. Copy the files to a new folder once they are fully written and closed; the second process then reads from this new folder.
  2. Rename each file according to some naming convention once it is fully written and closed; the second process then reads only files that match this naming convention.
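
Option 2 could be sketched roughly as follows. To keep the example runnable without a cluster, it uses the local file system through java.nio.file as a stand-in for HDFS; on HDFS the corresponding calls would presumably be FileSystem.create and FileSystem.rename. The class name, the method names, and the ".tmp" suffix are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class RenameWhenDone {
    // Hypothetical convention: files still being written carry this suffix.
    static final String TMP_SUFFIX = ".tmp";

    // Writer side: write under a temporary name, then rename to the final name.
    static Path publish(Path dir, String name, byte[] data) throws IOException {
        Path tmp = dir.resolve(name + TMP_SUFFIX);
        Path done = dir.resolve(name);
        Files.write(tmp, data); // opens, writes, and closes the file
        // The final name only appears after the content is complete.
        return Files.move(tmp, done, StandardCopyOption.ATOMIC_MOVE);
    }

    // Reader side: consider only files that do not carry the temporary suffix.
    static List<String> listReady(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files.map(p -> p.getFileName().toString())
                        .filter(n -> !n.endsWith(TMP_SUFFIX))
                        .sorted()
                        .collect(Collectors.toList());
        }
    }
}
```

With this convention the reader never has to guess whether a file is complete: anything returned by listReady was renamed only after its writer had finished.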

I have a feeling I'm trying to solve a well-known problem and I'm missing something. Is there a best practice for such a problem?

Answer 1:

Apache Commons IO has a utility for that. Just touch the file, and an error will tell you if it's already locked.

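import java.io.File;
import java.io.IOException;
import org.apache.commons.io.FileUtils;

boolean fileAvail = false;

try {
    // FileUtils.touch takes a java.io.File (a local file system path);
    // fileName is assumed here to be a String holding that path.
    FileUtils.touch(new File(fileName)); // throws IOException if being used
    fileAvail = true;
} catch (IOException e) {
    fileAvail = false;
}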

(Also) try-with-resources

Since Java 7 you can use this on anything that implements AutoCloseable (which includes Closeable), such as files, sockets, and database connections. The resource is closed automatically as soon as the try block ends:

try (FSDataOutputStream out = getFileSystem().create(outFilePath)) {
    // use out in here
}
// No finally required - catch is optional

...saves all that extra code



Answer 2:

Are you talking about two separate processes here or about two separate threads within the same (JVM) process?

Either way, this is a producer-consumer problem, and what you are missing is proper synchronization between the producer and the consumer. If you are running two threads within the same JVM process, you could use a BlockingQueue to pass a file-transfer-finished token from the producer to the consumer, for example the file's name, once a file is fully written and its stream closed. When the consumer finds a file name in the queue, it can be certain that the file was fully written and closed, because the producer has confirmed it.
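
A minimal sketch of that hand-off, assuming both sides run as threads in one JVM, might look like this. The class name, the END sentinel, and the placeholder comments standing in for the actual HDFS write and read are assumptions for illustration, not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class FinishedFileQueue {
    // Hypothetical sentinel telling the consumer that no more files will come.
    static final String END = "<end-of-stream>";

    static List<String> run(List<String> fileNames) throws InterruptedException {
        BlockingQueue<String> finished = new LinkedBlockingQueue<>();
        List<String> processed = new ArrayList<>();

        Thread producer = new Thread(() -> {
            for (String name : fileNames) {
                // ... write the file and close its stream here ...
                finished.add(name); // announce the name only after close() returned
            }
            finished.add(END);
        });

        Thread consumer = new Thread(() -> {
            try {
                // take() blocks until the producer has announced another file.
                for (String name = finished.take(); !name.equals(END); name = finished.take()) {
                    processed.add(name); // placeholder for reading the finished file
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
        return processed;
    }
}
```

The consumer never scans the directory; it only learns about a file through the queue, so it cannot see one that is still being written.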

However, if you are using two different processes, the problem is a little harder to solve, depending on the other component's language and the networking setup. You would have to implement some sort of queue that both processes can use, for example by sending information over a local network port, so that each process knows about the other's work.

No matter what, I would avoid moving files around on the file system, since this is a rather expensive operation compared to sending simple tokens. Moving files around might also expose files that have not yet been completely moved, depending on the language you are using.



Answer 3:

Do you really need two processes here? Why don't you create two threads and then join them?
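
As a rough illustration of the join idea, the sketch below starts a writer thread and reads the file only after join() has returned. It writes to the local file system via java.nio.file rather than HDFS so that it runs anywhere; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class JoinThenRead {
    static byte[] writeThenRead(Path file, byte[] data)
            throws IOException, InterruptedException {
        Thread writer = new Thread(() -> {
            try {
                Files.write(file, data); // opens, writes, and closes the file
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        writer.start();
        writer.join(); // returns only after the writer thread has finished
        return Files.readAllBytes(file);
    }
}
```

This only works when the writer and the reader live in the same JVM, which is the premise of this answer.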