My company uses a legacy file format for electromyography data, which is no longer in production. However, there is some interest in maintaining backward compatibility, so I am studying the possibility of writing a reader for that file format.
From analyzing the very convoluted legacy source code, written in Delphi, I can tell that the file reader/writer uses ZLIB. In a hex editor, the file appears to have a header in plain ASCII (with fields like "Player" and "Analyzer" readily readable), followed by a compressed string containing the raw data.
My question is: how should I proceed in order to identify:
- Whether it is a compressed stream;
- Where the compressed stream starts and where it ends?
From Wikipedia:
zlib compressed data is typically written with a gzip or a zlib wrapper. The wrapper encapsulates the raw DEFLATE data by adding a header and trailer. This provides stream identification and error detection.
Is this relevant?
I'll be glad to post more information, but I don't know what would be most relevant.
Thanks for any hint.
EDIT: I have the working application, and can use it to record actual data of any duration, getting files even smaller than 1 kB if necessary.
Some sample files:
A freshly created one, without datastream: https://dl.dropbox.com/u/4849855/Mio_File/HeltonEmpty.mio
The same as above after a very short (1 second?) datastream has been saved: https://dl.dropbox.com/u/4849855/Mio_File/HeltonFilled.mio
A different one, from a patient named "manco" instead of "Helton", with an even shorter stream (ideal for Hex viewing): https://dl.dropbox.com/u/4849855/Mio_File/manco_short.mio
Instructions: each file should be the file of a patient (a person). Inside these files, one or more exams are saved, each exam consisting of one or more time series. The provided files contain only one exam, with one data series.
To start, why not scan the files for all valid zlib streams? It's good enough for small files and for figuring out the format:
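A minimal sketch of such a brute-force scan, assuming the whole file fits in memory. The function name `find_zlib_streams` is made up for illustration. It tries to inflate a zlib stream at every offset and keeps only the attempts that reach a proper end of stream; a random false positive is possible in principle, so treat the hits as candidates.

```python
import zlib

def find_zlib_streams(data):
    """Return a list of (start, end, payload) for every complete zlib stream found."""
    found = []
    pos = 0
    while pos < len(data) - 1:
        d = zlib.decompressobj()
        try:
            payload = d.decompress(data[pos:])
        except zlib.error:
            # Not a valid zlib stream at this offset; move on by one byte.
            pos += 1
            continue
        if d.eof:
            # The stream ended cleanly; whatever is left was not part of it.
            end = len(data) - len(d.unused_data)
            found.append((pos, end, payload))
            pos = end
        else:
            # Header looked plausible but the stream never finished.
            pos += 1
    return found
```

Reading a file with `open(path, "rb").read()` and passing the bytes to this function gives the offsets to inspect in a hex editor.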
Looks like the data streams contain little-endian double precision floating point data:
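If that reading is right, a decompressed payload can be decoded with the standard `struct` module. A small sketch on made-up sample bytes (not taken from the .mio files); `decode_doubles` is a hypothetical helper name:

```python
import struct

def decode_doubles(payload):
    """Interpret payload as consecutive little-endian 8-byte IEEE 754 doubles."""
    count = len(payload) // 8          # ignore any trailing partial value
    return list(struct.unpack("<%dd" % count, payload[:count * 8]))

# Made-up sample: three doubles packed little-endian.
sample = struct.pack("<3d", 0.0, 1.5, -2.25)
print(decode_doubles(sample))  # [0.0, 1.5, -2.25]
```

If the decoded values look like noise rather than a plausible signal, the assumption about byte order or data type is probably wrong.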
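zlib is a thin wrapper around data compressed with the DEFLATE algorithm and is defined in RFC 1950: two header bytes (CMF and FLG), an optional four-byte dictionary ID (DICTID), the compressed data, and a four-byte ADLER32 checksum.
So it adds at least two, possibly six bytes before and 4 bytes with an ADLER32 checksum after the raw DEFLATE compressed data.
The first byte contains the CMF (Compression Method and Flags), which is split into CM (Compression Method, bits 0 to 3, the low nibble) and CINFO (Compression Info, bits 4 to 7, the high nibble). The second byte, FLG, holds a five-bit check value (FCHECK), a preset-dictionary flag (FDICT), and a two-bit compression level hint (FLEVEL).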
From this it's clear that, unfortunately, even the first two bytes of a zlib stream can vary a lot depending on the compression method and settings used.
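To make the header layout concrete, here is a sketch that splits two candidate bytes into the RFC 1950 fields and applies the checks the RFC allows: CM must be 8 for DEFLATE, CINFO must not exceed 7, and the two bytes read as a big-endian 16-bit number must be a multiple of 31. The function name `parse_zlib_header` is made up for illustration.

```python
def parse_zlib_header(two_bytes):
    """Split CMF and FLG into their fields and test the RFC 1950 constraints."""
    cmf, flg = two_bytes[0], two_bytes[1]
    cm = cmf & 0x0F        # compression method, bits 0-3 of CMF
    cinfo = cmf >> 4       # log2(window size) - 8, bits 4-7 of CMF
    return {
        "cm": cm,
        "cinfo": cinfo,
        "fcheck": flg & 0x1F,        # check bits, bits 0-4 of FLG
        "fdict": bool(flg & 0x20),   # preset dictionary flag, bit 5
        "flevel": flg >> 6,          # compression level hint, bits 6-7
        "valid": cm == 8 and cinfo <= 7 and (cmf * 256 + flg) % 31 == 0,
    }

info = parse_zlib_header(bytes.fromhex("789c"))
print(info["cm"], info["cinfo"], info["flevel"], info["fdict"], info["valid"])
# 8 7 2 False True
```

A two-byte pair that fails the `valid` test cannot start a zlib stream, which rules out most random offsets cheaply.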
Luckily, I stumbled upon a post by Mark Adler, the author of the ADLER32 algorithm, where he lists the most common and less common combinations of those two starting bytes.
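As a cross-check, candidate markers can also be derived from the RFC 1950 rules instead of being copied from a list. The sketch below assumes DEFLATE (CM = 8), no preset dictionary, and by default a 32K window (CINFO = 7); `candidate_markers` is a hypothetical helper, and its output is computed here, not quoted from that post.

```python
def candidate_markers(cinfo_values=(7,)):
    """Build every valid CMF/FLG pair for the given CINFO values, FDICT clear."""
    markers = []
    for cinfo in cinfo_values:
        cmf = (cinfo << 4) | 8                 # CM = 8 means DEFLATE
        for flevel in range(4):                # the four compression level hints
            flg = flevel << 6                  # FDICT bit left at 0
            # FCHECK makes the 16-bit value CMF*256 + FLG a multiple of 31.
            flg += (31 - (cmf * 256 + flg) % 31) % 31
            markers.append(bytes([cmf, flg]))
    return markers

print([m.hex() for m in candidate_markers()])
# ['7801', '785e', '789c', '78da']
```

Passing `cinfo_values=range(8)` yields all 32 pairs for smaller window sizes as well, at the cost of more false matches when scanning.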
With that out of the way, let's look at how we can use Python to examine zlib:
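A short sketch of such an examination, using an arbitrary sample message. It compresses the message with default settings, prints the two header bytes, and checks that the last four bytes are the big-endian ADLER32 of the uncompressed data, as RFC 1950 specifies.

```python
import zlib

msg = b"foo bar baz"                 # arbitrary sample text
compressed = zlib.compress(msg)      # default compression level

# First two bytes: CMF and FLG.
print(compressed[:2].hex())          # 789c

# Last four bytes: ADLER32 of the uncompressed data, most significant byte first.
trailer = int.from_bytes(compressed[-4:], "big")
print(trailer == zlib.adler32(msg))  # True
```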
So the zlib data created by Python's `zlib` module (using default options) starts with `78 9c`. We'll use that to create a script that writes a custom file format containing a preamble, some zlib compressed data and a footer.
We then write a second script that scans a file for that two byte pattern, starts decompressing everything that follows as a zlib stream and figures out where the stream ends and the footer starts.
create.py
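A minimal sketch of what such a script might look like. The header and footer contents, the message, and the output file name `custom.bin` are all made up for illustration.

```python
import zlib

HEADER = b"PREAMBLE" + b"\x00" * 8     # hypothetical 16-byte header
FOOTER = b"\xff" * 4 + b"FOOT"         # hypothetical 8-byte footer
msg = b"The quick brown fox jumps over the lazy dog. " * 4

def build_file(message):
    """Return header + zlib-compressed message + footer as one byte string."""
    return HEADER + zlib.compress(message) + FOOTER

def main(path="custom.bin"):
    # Write the assembled bytes to disk in binary mode.
    with open(path, "wb") as fh:
        fh.write(build_file(msg))

if __name__ == "__main__":
    main()
```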
Here we take `msg`, compress it with zlib, and surround it with a header and footer before we write it out to a file.
Header and footer are of fixed length in this example, but they could of course have arbitrary, unknown lengths.
Now for the script that tries to find a zlib stream in such a file. Because for this example we know exactly what marker to expect, I'm using only one, but obviously the list `ZLIB_MARKERS` could be filled with all the markers from the post mentioned above.
ident.py
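A minimal sketch of such a script, assuming the file is small enough to read in one go. The function names are made up for illustration. It stops at the first marker it sees, so a false match in the header would make `decompress()` raise `zlib.error`; a more careful version would catch that and keep searching.

```python
import sys
import zlib

ZLIB_MARKERS = [b"\x78\x9c"]   # extend with other CMF/FLG pairs as needed

def find_marker(data):
    """Slide a two-byte window over data; return the first matching offset or -1."""
    for start in range(len(data) - 1):
        if data[start:start + 2] in ZLIB_MARKERS:
            return start
    return -1

def split_stream(data):
    """Return (start, end, payload, footer), or None if no marker is found."""
    start = find_marker(data)
    if start < 0:
        return None
    d = zlib.decompressobj()
    payload = d.decompress(data[start:])   # inflates up to the end of the stream
    footer = d.unused_data                 # bytes that follow the zlib stream
    end = len(data) - len(footer)
    return start, end, payload, footer

if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1], "rb") as fh:
        result = split_stream(fh.read())
    if result is None:
        print("no zlib marker found")
    else:
        start, end, payload, footer = result
        print("stream: bytes %d to %d" % (start, end))
        print("payload:", payload)
        print("footer:", footer)
```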
The idea is this:
1. Start at the beginning of the file and create a two byte search window.
2. Move the search window forward in one-byte increments.
3. For every window check if it matches any of the two byte markers we defined.
4. If a match is found, record the starting position, stop searching and try to decompress everything that follows.
Now, finding the end of the stream isn't as trivial as looking for two marker bytes. zlib streams are neither terminated by a fixed byte sequence nor is their length indicated in any of the header fields. Instead, the DEFLATE data marks its own last block, and that block is followed by a four byte ADLER32 checksum that must match the data up to this point.
The way it works is that the internal C function `inflate()` keeps decompressing the stream as it reads it. Once it has consumed the final block and verified the checksum, it signals that to its caller, indicating that the rest of the data isn't part of the zlib stream anymore.
In Python this behavior is exposed when using decompression objects instead of simply calling `zlib.decompress()`. Calling `decompress(string)` on a `Decompress` object will decompress a zlib stream in `string` and return the decompressed data that was part of the stream. Everything that follows the stream will be stored in `unused_data` and can be retrieved afterwards.
This should produce the following output on a file created with the first script:
The example can easily be modified to write the uncompressed message to a file instead of printing it. Then you can further analyze the formerly zlib compressed data, and try to identify known fields in the metadata in the header and footer you separated out.
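One way that modification might look, as a sketch: given a file and the offset where the zlib stream starts, write the header, the inflated payload, and the footer to three separate files. The function name `dump_parts` and the `.header`, `.payload` and `.footer` suffixes are made up for illustration.

```python
import zlib

def dump_parts(path, start):
    """Split the file at path into header, inflated payload and footer files."""
    with open(path, "rb") as fh:
        data = fh.read()
    d = zlib.decompressobj()
    payload = d.decompress(data[start:])
    parts = {
        path + ".header": data[:start],      # everything before the stream
        path + ".payload": payload,          # the decompressed data
        path + ".footer": d.unused_data,     # everything after the stream
    }
    for name, content in parts.items():
        with open(name, "wb") as out:
            out.write(content)
    return sorted(parts)                     # names of the files written
```

The three output files can then be opened side by side in a hex editor.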