Reading and writing huge files in Java

Posted 2019-08-20 06:41

My idea is to write a small program that reads a file (one that can't be opened "naturally", but which contains some images), converts its data to hex, looks for the PNG markers (the byte sequences at the beginning and end of a .png file), and saves the data between them to separate files (after converting it back from hex). I am doing this in Java, using code like this:

// out is where to show the result and file is the source
public static void hexDump(PrintStream out, File file) throws IOException {
    InputStream is = new FileInputStream(file);
    StringBuffer Buffer = new StringBuffer();

    while (is.available() > 0) {
        StringBuilder sb1 = new StringBuilder();

        for (int j = 0; j < 16; j++) {
            if (is.available() > 0) {
                int value = (int) is.read();
                // transform the current data into hex
                sb1.append(String.format("%02X ", value));
            }
        }

        Buffer.append(sb1);

        // Should I look for the PNG here? I'm not sure
    }
    is.close();
    // Print the result in out (that may be the console or a file)
    out.print(Buffer);

}

I'm sure there are other ways to do this that use fewer machine resources when opening huge files. If you have any ideas, please tell me. Thanks!

This is my first post, so if there are any errors, please help me correct them.

Tags: java file hex
3 answers
叛逆
#2 · 2019-08-20 07:21

As Erwin Bolwidt says in the comments, first thing is don't convert to hex. If for some reason you must convert to hex, quit appending the content to two buffers, and always use StringBuilder, not StringBuffer. StringBuilder can be as much as 3x faster than StringBuffer.

Also, buffer your file reads with BufferedInputStream (BufferedReader is meant for character data, not binary data). Reading one byte at a time with an unbuffered FileInputStream.read() is very slow.
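If a hex dump really is needed, a minimal sketch of the buffered approach might look like the following. The class name BufferedHexDump and the lookup-table formatting are illustrative choices, not part of the original question. It uses a single StringBuilder that is reused per line, and prints each line as it goes instead of collecting the whole dump in memory.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintStream;

// Hypothetical helper class, named here only for illustration.
class BufferedHexDump {

    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Streams a hex dump of `file` to `out`, 16 bytes per line,
    // without holding the whole dump in memory.
    static void hexDump(PrintStream out, File file) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
            StringBuilder line = new StringBuilder(16 * 3);
            int count = 0;
            int value;
            while ((value = in.read()) != -1) { // -1 signals end of stream
                line.append(HEX[value >>> 4]).append(HEX[value & 0x0F]).append(' ');
                if (++count % 16 == 0) {
                    out.println(line);
                    line.setLength(0); // reuse the same builder for the next line
                }
            }
            if (line.length() > 0) {
                out.println(line); // final, possibly shorter, line
            }
        }
    }
}
```

Calling BufferedHexDump.hexDump(System.out, new File(path)) would print the dump to the console. The read loop tests for -1 rather than calling available(), which is not a reliable end-of-file check.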

一纸荒年 Trace。
#3 · 2019-08-20 07:39

Reading the file a byte at a time would be taking substantial time here. You can improve that by orders of magnitude. You should be using a DataInputStream around a BufferedInputStream around the FileInputStream, and reading 16 bytes at a time with readFully.

Then process the bytes directly, without converting to and from hex (which is quite unnecessary here), and write them to the output(s) as you go via a BufferedOutputStream around the FileOutputStream, rather than accumulating the entire file in memory and writing it all out in one go. The current approach is slow because of how it is written, not because the task itself requires it.
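As a rough sketch of that streaming style (the class name BlockCopy and the copy method are hypothetical), the loop below reads raw bytes in small blocks through a BufferedInputStream and writes them straight to a BufferedOutputStream. It uses plain read(byte[]) and its return value instead of DataInputStream.readFully, because readFully throws EOFException when the last block is shorter than the buffer.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper class, named here only for illustration.
class BlockCopy {

    // Block size chosen to mirror the 16-byte reads discussed above.
    private static final int BLOCK_SIZE = 16;

    // Copies `source` to `target` block by block and returns the number
    // of bytes copied. Nothing is converted to hex along the way.
    static long copy(File source, File target) throws IOException {
        long total = 0;
        try (InputStream in = new BufferedInputStream(new FileInputStream(source));
             OutputStream out = new BufferedOutputStream(new FileOutputStream(target))) {
            byte[] block = new byte[BLOCK_SIZE];
            int n;
            while ((n = in.read(block)) != -1) {
                // block[0..n) holds the bytes just read; any per-block
                // processing (for example, marker detection) would go here.
                out.write(block, 0, n);
                total += n;
            }
        }
        return total;
    }
}
```

Memory use stays constant regardless of file size, since only one block plus the stream buffers is held at a time.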

时光不老,我们不散
#4 · 2019-08-20 07:45

A very simple way to do this, which is probably quite fast, is to read the entire file into memory (as binary data, not as a hex dump) and then search for the markers.

This has two limitations:

  • it only handles files up to 2 GiB in length (the maximum size of a Java array)
  • it requires a large amount of memory; this can be reduced by reading smaller chunks, but that makes the algorithm more complex

The basic code to do that is like this:

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class Png {

    static final String PNG_MARKER_HEX = "abcdef0123456789"; // TODO: replace with real marker
    static final byte[] PNG_MARKER = hexStringToByteArray(PNG_MARKER_HEX);

    public void splitPngChunks(File file) throws IOException {
        byte[] bytes = Files.readAllBytes(file.toPath());
        int offset = KMPMatch.indexOf(bytes, 0, PNG_MARKER);
        while (offset >= 0) {
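            // Search past the current marker; searching from 0 would find the same marker again and loop forever.
            int nextOffset = KMPMatch.indexOf(bytes, offset + PNG_MARKER.length, PNG_MARKER);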
            if (nextOffset < 0) {
                writePngChunk(bytes, offset, bytes.length - offset);
            } else {
                writePngChunk(bytes, offset, nextOffset - offset);
            }
            offset = nextOffset;
        }
    }

    public void writePngChunk(byte[] bytes, int offset, int length) {
        // TODO: implement - where do you want to write the chunks?
    }
}

I'm not sure exactly how these PNG chunk markers work. The code above assumes that a marker starts the section of data you're interested in, and that the next marker starts the next section.
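For reference, here is a hedged sketch of how the boundaries could be found using the layout from the PNG specification rather than the placeholder marker. A PNG starts with the 8-byte signature 89 50 4E 47 0D 0A 1A 0A, followed by chunks that each consist of a 4-byte big-endian data length, a 4-byte type, the data, and a 4-byte CRC; the last chunk has type IEND. The class PngLocator and its method names are hypothetical, the search is a simple naive scan rather than KMP, CRCs are not verified, and the sketch assumes the images are embedded intact and uncompressed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named here only for illustration.
class PngLocator {

    // The 8-byte signature that starts every PNG file.
    static final byte[] SIGNATURE = {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'};

    // Returns {start, endExclusive} pairs for every complete PNG found in data.
    static List<int[]> findPngs(byte[] data) {
        List<int[]> found = new ArrayList<>();
        int start = indexOf(data, SIGNATURE, 0);
        while (start >= 0) {
            int end = endOfPng(data, start);
            if (end > 0) {
                found.add(new int[] {start, end});
                start = indexOf(data, SIGNATURE, end);
            } else {
                // Truncated or false match: keep scanning after this position.
                start = indexOf(data, SIGNATURE, start + 1);
            }
        }
        return found;
    }

    // Walks the chunks after the signature until IEND and returns the
    // offset just past it, or -1 if the data ends first.
    static int endOfPng(byte[] data, int start) {
        long pos = (long) start + SIGNATURE.length;
        while (pos + 12 <= data.length) {
            int p = (int) pos;
            long length = ((data[p] & 0xFFL) << 24) | ((data[p + 1] & 0xFF) << 16)
                    | ((data[p + 2] & 0xFF) << 8) | (data[p + 3] & 0xFF);
            boolean isEnd = data[p + 4] == 'I' && data[p + 5] == 'E'
                    && data[p + 6] == 'N' && data[p + 7] == 'D';
            pos += 12 + length; // 4 (length) + 4 (type) + data + 4 (CRC)
            if (pos > data.length) {
                return -1;
            }
            if (isEnd) {
                return (int) pos;
            }
        }
        return -1;
    }

    // Naive byte-pattern search starting at `from`; returns -1 if not found.
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }
}
```

Each returned pair can then be written out with something like Arrays.copyOfRange(data, start, end). Walking the chunk lengths avoids mistaking bytes inside the image data for an end marker.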

There are two things missing in standard Java: code to convert a hex string to a byte array, and code to search for a byte array inside another byte array. Both can be found in various Apache Commons libraries, but I'll include the answers that people posted to earlier questions on StackOverflow. You can copy these verbatim into the Png class to make the above code work.

Convert a string representation of a hex dump to a byte array using Java?

public static byte[] hexStringToByteArray(String s) {
    int len = s.length();
    byte[] data = new byte[len / 2];
    for (int i = 0; i < len; i += 2) {
        data[i / 2] = (byte) ((Character.digit(s.charAt(i), 16) << 4) + Character.digit(s.charAt(i + 1), 16));
    }
    return data;
}

Searching for a sequence of Bytes in a Binary File with Java

/**
 * Knuth-Morris-Pratt Algorithm for Pattern Matching
 */
static class KMPMatch {
    /**
     * Finds the first occurrence of the pattern in the text.
     */
    public static int indexOf(byte[] data, int offset, byte[] pattern) {
        int[] failure = computeFailure(pattern);

        int j = 0;
        if (data.length - offset <= 0)
            return -1;

        for (int i = offset; i < data.length; i++) {
            while (j > 0 && pattern[j] != data[i]) {
                j = failure[j - 1];
            }
            if (pattern[j] == data[i]) {
                j++;
            }
            if (j == pattern.length) {
                return i - pattern.length + 1;
            }
        }
        return -1;
    }

    /**
     * Computes the failure function using a boot-strapping process, where the pattern is matched against itself.
     */
    private static int[] computeFailure(byte[] pattern) {
        int[] failure = new int[pattern.length];

        int j = 0;
        for (int i = 1; i < pattern.length; i++) {
            while (j > 0 && pattern[j] != pattern[i]) {
                j = failure[j - 1];
            }
            if (pattern[j] == pattern[i]) {
                j++;
            }
            failure[i] = j;
        }

        return failure;
    }
}

I modified this last piece of code to make it possible to start the search at an offset other than zero.
