Here is my problem: I have a set of big gz log files, and the very first piece of information on each line is a datetime, e.g. 2014-03-20 05:32:00.
I need to check which set of log files holds a specific piece of data. For the first line I simply do:
'-query-data-'
zgrep -m 1 '^20140320-04' 20140320-0{3,4}*gz
But how do I do the same for the last line without processing the whole file, as would be done with zcat (too heavy):
zcat foo.gz | tail -1
Additional info: these logs are created with the date and time of their initial record, so if I want to query logs at 14:00:00 I also have to search files created BEFORE 14:00:00, as a file could be created at 13:50:00 and closed at 14:10:00.
The easiest solution would be to alter your log rotation to create smaller files.
The second easiest solution would be to use a compression tool that supports random access.
Projects like dictzip, BGZF, and csio each add sync flush points at various intervals within gzip-compressed data, which let a program that is aware of that extra information seek within the file. While the mechanism exists in the standard, vanilla gzip does not add such markers, either by default or by option.
Files compressed by these random-access-friendly utilities are slightly larger (by perhaps 2-20%) due to the markers themselves, but fully support decompression with gzip or another utility that is unaware of these markers.
You can learn more at this question about random access in various compression formats.
There's also a "Blasted Bioinformatics" blog by Peter Cock with several posts on this topic, including:
Experiments with xz
xz (an LZMA compression format) actually has random access support on a per-block level, but you will only get a single block with the defaults.
File creation
xz can concatenate multiple archives together, in which case each archive would have its own block. GNU split can do this easily: tell split to break big.log into 50MB chunks (before compression) and run each one through xz -c, which outputs the compressed chunk to standard output, then collect that standard output into a single file named big.log.sp.xz.
To do this without GNU split, you'd need a loop that compresses each chunk in turn.
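The chunked compression described above can be sketched roughly as follows. This is a reconstruction from the description, not necessarily the exact commands used; the function names and the ".part." temporary prefix are illustrative, and the first function assumes a GNU split that supports --filter.

```shell
#!/bin/sh
# Sketch: store a log as many concatenated xz streams, one per chunk.
# Function names and the ".part." prefix are illustrative assumptions.

# GNU split variant: each chunk is piped through "xz -c" and everything
# the filters write to standard output is collected into one file.
# usage: xz_chunked_gnu INPUT CHUNK_SIZE OUTPUT
xz_chunked_gnu() {
    split --bytes="$2" --filter='xz -c' "$1" > "$3"
}

# Loop variant for a split without --filter: write the chunks to
# temporary files, compress them in order, then remove them.
# usage: xz_chunked_loop INPUT CHUNK_SIZE OUTPUT
xz_chunked_loop() {
    split -b "$2" "$1" "$3.part."
    for p in "$3".part.*; do
        xz -c "$p"
    done > "$3"
    rm -f "$3".part.*
}
```

With GNU split, something like xz_chunked_gnu big.log 50M big.log.sp.xz would correspond to the 50MB chunks mentioned above. The loop variant needs temporary disk space for the uncompressed chunks.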
Parsing
You can get the list of block offsets with xz --verbose --list FILE.xz. If you want the last block, you need its compressed size (column 5) plus 36 bytes for overhead (found by comparing the size to hd big.log.sp0.xz | grep 7zXZ). Fetch that block using tail -c and pipe that through xz. Since the above question wants the last line of the file, I then pipe that through tail -n1.
Side note
Version 5.1.1 introduced support for the --block-size flag. However, I have not been able to extract a specific block since it doesn't include full headers between blocks. I suspect this is nontrivial to do from the command line.
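For the concatenated-archive case under Parsing, here is a minimal sketch of pulling out only the final stream. Note that it swaps in a different lookup from the "column 5 plus 36 bytes" arithmetic: it reads the last stream's compressed start offset from xz's machine-readable listing (--robot --list --verbose) and reads from that offset to the end of the file. The function name is an assumption of mine.

```shell
#!/bin/sh
# Sketch: print the last line of a multi-stream .xz file while
# decompressing only its final stream. Function name is illustrative.
# usage: xz_last_line FILE.xz
xz_last_line() {
    # In --robot --list --verbose output, every "stream" row is
    # tab-separated and holds the compressed start offset in column 4.
    # Streams are listed in order, so keep the value from the last row.
    offset=$(xz --robot --list --verbose "$1" |
        awk -F '\t' '$1 == "stream" { off = $4 } END { print off }')
    [ -n "$offset" ] || return 1
    # tail -c +N starts output at byte N (1-based), hence the + 1.
    tail -c "+$((offset + 1))" "$1" | xz -dc | tail -n 1
}
```

Because the chunks were cut by byte count, the final stream may begin in the middle of a line; the last line is still complete as long as that stream contains at least one earlier newline.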
Experiments with gzip
gzip also supports concatenation. I (briefly) tried mimicking this process for gzip without any luck. gzip --verbose --list doesn't give enough information and it appears the headers are too variable to find. This would require adding sync flush points, and since their size varies with the size of the last buffer in the previous compression, that's too hard to do on the command line (use dictzip or another of the previously discussed tools).
I did apt-get install dictzip and played with dictzip, but just a little. It doesn't work without arguments, creating a (massive!) .dz archive that neither dictunzip nor gunzip could understand.
Experiments with bzip2
bzip2 has headers we can find. This is still a bit messy, but it works.
Creation
This is just like the xz procedure above. I should note that this is considerably slower than xz (48 min for bzip2 vs 17 min for xz vs 1 min for xz -0) as well as considerably larger (97M for bzip2 vs 25M for xz -0 vs 15M for xz), at least for my test log file.
Parsing
This is a little harder because we don't have the nice index. We have to guess at where to go, and we have to err on the side of scanning too much, but with a massive file, we'd still save I/O.
My guess for this test was 50000000 (out of the original 52428800, a pessimistic guess that isn't pessimistic enough for e.g. an H.264 movie.)
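A rough sketch of this guess-and-scan idea follows. It assumes the file was built from concatenated bzip2 streams (for example by piping each split chunk through bzip2 -c), that grep supports the GNU -a, -b and -o options, and that each stream starts with the bytes "BZh", a level digit, and the block magic "1AY&SY". The function name and the clamping of the guess to the file size are my own additions.

```shell
#!/bin/sh
# Sketch: print the last line of a multi-stream .bz2 file by scanning
# only a guessed number of trailing bytes for the last stream header.
# Function name is illustrative.
# usage: bz2_last_line FILE.bz2 GUESS_BYTES
bz2_last_line() {
    size=$(($(wc -c < "$1")))
    guess=$2
    # Never look at more bytes than the file holds, so the offset
    # arithmetic below stays valid.
    [ "$guess" -gt "$size" ] && guess=$size
    # Byte offset, inside the trailing window, of the last stream header.
    pos=$(tail -c "$guess" "$1" |
        LC_ALL=C grep -abo 'BZh[1-9]1AY&SY' | tail -n 1 | cut -d: -f1)
    [ -n "$pos" ] || return 1
    # The last stream occupies the final (guess - pos) bytes of the file.
    tail -c "$((guess - pos))" "$1" | bzip2 -dc | tail -n 1
}
```

For the test described here the call would look like bz2_last_line big.log.sp.bz2 50000000, where the file name is assumed by analogy with the xz example. If the guess is smaller than the last compressed stream, no header is found and the function returns a non-zero status.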
The procedure takes just the last 50 million bytes, finds the binary offset of the last BZIP2 header, subtracts that from the guess size, and pulls that many bytes off of the end of the file. Just that part is decompressed and thrown into tail.
Because this has to query the compressed file twice and has an extra scan (the grep call seeking the header, which examines the whole guessed space), this is a suboptimal solution. See also the section below on how slow bzip2 really is.
Perspective
Given how fast xz is, it's easily the best bet; using its fastest option (xz -0) is quite fast to compress or decompress and creates a smaller file than gzip or bzip2 on the log file I was testing with. Other tests (as well as various sources online) suggest that xz -0 is preferable to bzip2 in all scenarios.
Timing tests were not comprehensive; I did not average anything and disk caching was in use. Still, they look correct; there is a very small amount of overhead from split plus launching 145 compression instances rather than just one (this may even be a net gain if it allows an otherwise non-multithreaded utility to consume multiple threads).