I want to read the last n lines of a very big file without reading the whole file into any buffer/memory area using Java.
I looked around the JDK APIs and Apache Commons I/O and am not able to locate one which is suitable for this purpose.
I was thinking of the way tail or less do it in UNIX. I don't think they load the entire file before showing the last few lines. There should be a similar way to do this in Java too.
A RandomAccessFile allows for seeking (http://download.oracle.com/javase/1.4.2/docs/api/java/io/RandomAccessFile.html). The File.length method will return the size of the file. The problem is determining the number of lines. For this, you can seek to the end of the file and read backwards until you have hit the right number of lines.

If you use a RandomAccessFile, you can use length and seek to get to a specific point near the end of the file and then read forward from there. If you find there weren't enough lines, back up from that point and try again. Once you've figured out where the Nth-last line begins, you can seek there and just read and print.

An initial best-guess assumption can be made based on your data's properties. For example, if it's a text file, it's possible the line lengths won't exceed an average of 132, so, to get the last five lines, start 660 characters before the end. Then, if you were wrong, try again at 1320 (you can even use what you learned from the last 660 characters to adjust that; for example, if those 660 characters were just three lines, the next try could be 660 / 3 * 5, plus maybe a bit extra just in case).
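The answer describes the approach without code, so here is a minimal sketch of that guess-and-retry idea. The class and method names are mine, it assumes a single-byte encoding such as ISO-8859-1 (see the caveat about multi-byte encodings further down), and a real implementation would want more careful error handling:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TailSketch {

    /** Returns the last n lines, widening the window near the end of the file until it holds enough of them. */
    static String tail(RandomAccessFile file, int n) throws IOException {
        long fileLength = file.length();
        long guess = 132L * n;                        // initial guess: ~132 characters per line

        while (true) {
            long start = Math.max(0, fileLength - guess);
            file.seek(start);

            byte[] chunk = new byte[(int) (fileLength - start)];
            file.readFully(chunk);
            String text = new String(chunk, StandardCharsets.ISO_8859_1);

            // If we started mid-file, the first element may be a partial line, so we
            // need strictly more than n elements before we can trust the last n.
            String[] lines = text.split("\r\n|\r|\n");
            if (lines.length > n || start == 0) {
                int from = Math.max(0, lines.length - n);
                return String.join(System.lineSeparator(),
                        Arrays.asList(lines).subList(from, lines.length));
            }
            guess *= 2;                               // guessed too short: back up further and retry
        }
    }

    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(args[0], "r")) {
            System.out.println(tail(raf, 5));
        }
    }
}
```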
I had a similar problem, but I didn't understand the other solutions, so I used this instead. I hope the code is simple.
Here is the best way I've found to do it. It's simple, pretty fast, and memory efficient.
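The snippet this answer refers to isn't reproduced here; as a hedged guess at what a simple, memory-efficient version might look like, the sketch below scans backwards from the end of the file, counting '\n' bytes. The names are mine, it assumes a single-byte encoding and '\n' line endings, and a buffered backwards read would be faster than single-byte reads:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class BackwardsTail {

    /** Scans backwards one byte at a time, collecting bytes until n line breaks have been seen. */
    static String lastLines(String path, int n) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long end = file.length() - 1;
            long pos = end;
            int newlines = 0;
            StringBuilder reversed = new StringBuilder();

            while (pos >= 0) {
                file.seek(pos);
                int b = file.read();
                if (b == '\n') {
                    // Don't count a trailing newline at the very end of the file as a line break.
                    if (pos != end && ++newlines == n) {
                        break;
                    }
                }
                reversed.append((char) b);
                pos--;
            }
            // Bytes were collected back to front, so flip them before returning.
            return reversed.reverse().toString();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(lastLines(args[0], 5));
    }
}
```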
RandomAccessFile is a good place to start, as described by the other answers. There is one important caveat though.
If your file is not encoded with a one-byte-per-character encoding, the readLine() method is not going to work for you. And readUTF() won't work under any circumstances (it reads a string preceded by a length field). Instead, you will need to make sure that you look for end-of-line markers in a way that respects the encoding's character boundaries. For fixed-length encodings (e.g. flavors of UTF-16 or UTF-32), you need to extract characters starting from byte positions that are divisible by the character size in bytes. For variable-length encodings (e.g. UTF-8), you need to search for a byte that must be the first byte of a character.

In the case of UTF-8, the first byte of a character will be 0xxxxxxx or 110xxxxx or 1110xxxx or 11110xxx. Anything else is either a continuation byte or an illegal UTF-8 sequence; see The Unicode Standard, Version 5.2, Chapter 3.9, Table 3-7. This means, as the comment discussion points out, that any 0x0A and 0x0D bytes in a properly encoded UTF-8 stream represent an LF or CR character. Thus, simply counting the 0x0A and 0x0D bytes is a valid implementation strategy (for UTF-8), provided you can assume that the other kinds of Unicode line separator (0x2028, 0x2029 and 0x0085) are not used. If you can't assume that, the code gets more complicated.
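As an illustration of that strategy (my own sketch, not code from the answer), the first-byte test and the newline-byte count for UTF-8 could look like this:

```java
class Utf8LineScan {

    /** True if b can only be the first byte of a UTF-8 encoded character. */
    static boolean isUtf8CharStart(byte b) {
        return (b & 0x80) == 0x00   // 0xxxxxxx: single-byte (ASCII) character
            || (b & 0xE0) == 0xC0   // 110xxxxx: start of a 2-byte sequence
            || (b & 0xF0) == 0xE0   // 1110xxxx: start of a 3-byte sequence
            || (b & 0xF8) == 0xF0;  // 11110xxx: start of a 4-byte sequence
    }

    /**
     * Counts LF (0x0A) and CR (0x0D) bytes; safe for UTF-8 because they never appear
     * inside a multi-byte character. With CRLF line endings you would want to count
     * each CR+LF pair as a single break.
     */
    static int countLineBreakBytes(byte[] buffer, int offset, int length) {
        int count = 0;
        for (int i = offset; i < offset + length; i++) {
            if (buffer[i] == 0x0A || buffer[i] == 0x0D) {
                count++;
            }
        }
        return count;
    }
}
```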
Having identified a proper character boundary, you can then just call new String(...), passing the byte array, offset, count and encoding, and then repeatedly call String.lastIndexOf(...) to count end-of-lines.
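Putting those two calls together, a decode-then-count step might look roughly like this (assuming the byte range starts on a character boundary; the names are mine, and only '\n' is counted, so a CRLF pair counts once):

```java
import java.nio.charset.StandardCharsets;

class LineCounter {

    /** Decodes a chunk that starts on a character boundary and counts the line terminators it contains. */
    static int countLines(byte[] bytes, int offset, int count) {
        String text = new String(bytes, offset, count, StandardCharsets.UTF_8);

        int lines = 0;
        int pos = text.length();
        // Walk backwards through the decoded text, counting '\n' occurrences.
        while ((pos = text.lastIndexOf('\n', pos - 1)) >= 0) {
            lines++;
        }
        return lines;
    }
}
```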