My company uses a legacy file format for electromyography data, which is no longer in production. However, there is some interest in maintaining backward compatibility, so I am looking into writing a reader for that file format.
From analyzing the very convoluted legacy source code, written in Delphi, I can tell that the file reader/writer uses ZLIB. In a hex editor, it looks like there is a file header in binary ASCII (with fields like "Player" and "Analyzer" readily readable), followed by a compressed string containing raw data.
My question is: how should I proceed in order to identify:
- Whether it is a compressed stream;
- Where the compressed stream starts and where it ends.
From Wikipedia:
zlib compressed data is typically written with a gzip or a zlib
wrapper. The wrapper encapsulates the raw DEFLATE data by adding a
header and trailer. This provides stream identification and error
detection
Is this relevant?
I'll be glad to post more information, but I don't know what would be most relevant.
Thanks for any hint.
EDIT: I have the working application, and can use it to record actual data of any time length, getting files even smaller than 1kB if necessary.
Some sample files:
A freshly created one, without datastream: https://dl.dropbox.com/u/4849855/Mio_File/HeltonEmpty.mio
The same as above after a very short (1 second?) datastream has been saved: https://dl.dropbox.com/u/4849855/Mio_File/HeltonFilled.mio
A different one, from a patient named "manco" instead of "Helton", with an even shorter stream (ideal for Hex viewing): https://dl.dropbox.com/u/4849855/Mio_File/manco_short.mio
Instructions: each file should be the file of a patient (a person). Inside these files, one or more exams are saved, each exam consisting of one or more time series. The provided files contain only one exam, with one data series.
To start, why not scan the files for all valid zlib streams (it's good enough for small files and to figure out the format):

    import zlib
    from glob import glob

    def zipstreams(filename):
        """Return all zlib streams and their positions in file."""
        with open(filename, 'rb') as fh:
            data = fh.read()
        i = 0
        while i < len(data):
            try:
                zo = zlib.decompressobj()
                yield i, zo.decompress(data[i:])
                # Skip past the bytes this stream consumed.
                i += len(data[i:]) - len(zo.unused_data)
            except zlib.error:
                # No valid stream starts here; try the next byte.
                i += 1

    for filename in glob('*.mio'):
        print(filename)
        for i, data in zipstreams(filename):
            print(i, len(data))
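One caveat with the scan above: `decompress()` raises no error when the input simply ends before the stream does, so a chance match near the end of a file can be reported as a stream. The variant below is a minimal sketch that only accepts streams whose trailing checksum was actually reached, using the `eof` attribute of the decompression object (Python 3.3+). The function name `complete_zlib_streams` is made up for this example, and it works on bytes already read into memory.

```python
import zlib

def complete_zlib_streams(data):
    """Yield (start, end, payload) for every complete zlib stream in data."""
    i = 0
    while i < len(data):
        zo = zlib.decompressobj()
        try:
            payload = zo.decompress(data[i:])
        except zlib.error:
            # Not a valid zlib stream at this offset.
            i += 1
            continue
        if not zo.eof:
            # Decoded without error, but the end of the stream (and its
            # checksum) was never reached: treat it as a false positive.
            i += 1
            continue
        end = len(data) - len(zo.unused_data)
        yield i, end, payload
        i = end  # continue scanning right after this stream
```

The `end` offset marks the first byte after the stream, which helps when mapping out what sits between and after the compressed blocks.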
Looks like the data streams contain little-endian double precision floating point data:

    import numpy
    from matplotlib import pyplot

    # Reuses glob and zipstreams() from the script above.
    for filename in glob('*.mio'):
        for i, data in zipstreams(filename):
            if data:
                # frombuffer replaces the deprecated binary mode of fromstring.
                a = numpy.frombuffer(data, '<f8')
                pyplot.plot(a[1:])
                pyplot.title(filename + ' - %i' % i)
                pyplot.show()
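If NumPy is not at hand, the same guess can be checked with the standard library alone. This is only a sketch under the assumption stated above (little-endian 64-bit floats); the helper name `decode_doubles` is hypothetical, and a payload whose length is not a multiple of 8 is taken as a sign that the guess is wrong.

```python
import struct

def decode_doubles(payload):
    """Interpret payload as little-endian float64 values.

    Returns None when the length is not a multiple of 8 bytes, which
    suggests the payload is not a plain array of doubles.
    """
    if len(payload) % 8:
        return None
    return [value for (value,) in struct.iter_unpack('<d', payload)]
```

Values that come out as NaN, infinity or wildly varying magnitudes would be a hint to try another type, such as `'<f'` or `'<h'`.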
zlib is a thin wrapper around data compressed with the DEFLATE algorithm and is
defined in RFC 1950:

    A zlib stream has the following structure:

         0   1
       +---+---+
       |CMF|FLG|   (more-->)
       +---+---+

    (if FLG.FDICT set)

         0   1   2   3
       +---+---+---+---+
       |     DICTID    |   (more-->)
       +---+---+---+---+

       +=====================+---+---+---+---+
       |...compressed data...|    ADLER32    |
       +=====================+---+---+---+---+
So it adds at least two, possibly six bytes before and 4 bytes with an
ADLER32 checksum after the raw DEFLATE compressed data.
The first byte contains the CMF (Compression Method and Flags), which is split
into CM (compression method, bits 0 to 3, i.e. the low nibble) and CINFO
(compression info, bits 4 to 7, i.e. the high nibble).
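To make the bit layout concrete, here is a small sketch that decodes the two header bytes according to RFC 1950: CM sits in the low four bits of CMF, CINFO in the high four bits, and the FLG byte carries FCHECK, FDICT and FLEVEL. The function name `parse_zlib_header` and the dictionary keys are made up for this example.

```python
def parse_zlib_header(header):
    """Decode the two-byte zlib header (CMF, FLG) described in RFC 1950."""
    cmf, flg = header[0], header[1]
    cm = cmf & 0x0F   # bits 0-3: compression method (8 means DEFLATE)
    cinfo = cmf >> 4  # bits 4-7: log2(window size) minus 8, at most 7
    return {
        'cm': cm,
        'cinfo': cinfo,
        'window_size': 1 << (cinfo + 8),
        'fdict': bool(flg & 0x20),  # bit 5: a preset dictionary ID follows
        'flevel': flg >> 6,         # bits 6-7: compression level hint
        # FCHECK (bits 0-4) is chosen so CMF*256 + FLG is a multiple of 31.
        'valid': cm == 8 and cinfo <= 7 and (cmf * 256 + flg) % 31 == 0,
    }
```

For the bytes `78 9c`, this reports DEFLATE with a 32 KiB window, no preset dictionary, and a valid check value.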
From this it's clear that, unfortunately, even the first two bytes of a zlib
stream can vary a lot depending on the compression method and settings that
were used.
Luckily, I stumbled upon a post by Mark Adler, the author of the ADLER32
algorithm, where he lists the most common and less common combinations of those
two starting bytes.
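If that post is not at hand, a candidate list can also be derived from the RFC 1950 rules directly. The sketch below is not the list from the post: it enumerates every two-byte header with CM = 8, CINFO of at most 7, no preset dictionary, and a check value that makes the pair a multiple of 31. It may therefore include headers that zlib itself never writes. The function name `candidate_zlib_markers` is hypothetical.

```python
def candidate_zlib_markers():
    """Return all two-byte zlib headers allowed by RFC 1950 (FDICT unset)."""
    markers = []
    for cinfo in range(8):            # window sizes from 256 bytes to 32 KiB
        cmf = (cinfo << 4) | 8        # CM = 8 (DEFLATE) in the low nibble
        for flg in range(256):
            if flg & 0x20:            # skip headers announcing a dictionary
                continue
            if (cmf * 256 + flg) % 31 == 0:
                markers.append(bytes([cmf, flg]))
    return markers
```

The more markers are searched for, the more false positives turn up in arbitrary binary data, so each hit still has to be confirmed by actually decompressing from that position.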
With that out of the way, let's look at how we can use Python to examine zlib:

    >>> import zlib
    >>> msg = b'foo'
    >>> [hex(b) for b in zlib.compress(msg)]
    ['0x78', '0x9c', '0x4b', '0xcb', '0xcf', '0x7', '0x0', '0x2', '0x82', '0x1', '0x45']
So the zlib data created by Python's zlib module (using default options) starts
with 78 9c. We'll use that to create a script that writes a custom file format
containing a preamble, some zlib compressed data and a footer.
We then write a second script that scans a file for that two byte pattern,
starts decompressing everything that follows as a zlib stream and figures out
where the stream ends and the footer starts.
create.py

    import zlib

    msg = b'foo'
    filename = 'foo.compressed'

    compressed_msg = zlib.compress(msg)
    # Surround the compressed bytes with a plain header and footer.
    data = b'HEADER' + compressed_msg + b'FOOTER'

    with open(filename, 'wb') as outfile:
        outfile.write(data)
Here we take msg, compress it with zlib, and surround it with a header and
footer before we write it out to a file.
Header and footer are of fixed length in this example, but they could of course
have arbitrary, unknown lengths.
Now for the script that tries to find a zlib stream in such a file. Because for
this example we know exactly what marker to expect, I'm using only one, but
obviously the list ZLIB_MARKERS could be filled with all the markers from the
post mentioned above.
ident.py

    import zlib

    ZLIB_MARKERS = [b'\x78\x9c']
    filename = 'foo.compressed'

    # Open in binary mode: the file mixes text with compressed bytes.
    with open(filename, 'rb') as infile:
        data = infile.read()

    pos = 0
    found = False

    while not found:
        window = data[pos:pos+2]
        for marker in ZLIB_MARKERS:
            if window == marker:
                found = True
                start = pos
                print("Start of zlib stream found at byte %s" % pos)
                break
        if pos == len(data):
            break
        pos += 1

    if found:
        header = data[:start]

        rest_of_data = data[start:]
        decomp_obj = zlib.decompressobj()
        uncompressed_msg = decomp_obj.decompress(rest_of_data)

        # Whatever follows the zlib stream ends up in unused_data.
        footer = decomp_obj.unused_data

        # latin-1 maps every byte to a character, so decoding never fails.
        print("Header: %s" % header.decode('latin-1'))
        print("Message: %s" % uncompressed_msg.decode('latin-1'))
        print("Footer: %s" % footer.decode('latin-1'))

    if not found:
        print("Sorry, no zlib streams starting with any of the markers found.")
The idea is this:
- Start at the beginning of the file and create a two-byte search window.
- Move the search window forward in one-byte increments.
- For every window, check if it matches any of the two-byte markers we defined.
- If a match is found, record the starting position, stop searching and try to decompress everything that follows.
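The search steps above can also be written more compactly with `bytes.find`, which does the sliding-window comparison internally. This is just a sketch; the function name `find_first_marker` is made up for this example.

```python
def find_first_marker(data, markers):
    """Return (position, marker) for the earliest marker in data, or None."""
    best = None
    for marker in markers:
        pos = data.find(marker)  # -1 when the marker does not occur
        if pos != -1 and (best is None or pos < best[0]):
            best = (pos, marker)
    return best
```

A hit only says that two bytes look like a zlib header. Whether a real stream starts there is decided by the decompression attempt that follows.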
Now, finding the end of the stream isn't as trivial as looking for two marker
bytes. zlib streams are neither terminated by a fixed byte sequence, nor is
their length indicated in any of the header fields. Instead, the DEFLATE data
marks its own last block, and that block is followed by a four-byte ADLER32
checksum that must match the uncompressed data.
The way it works is that the internal C function inflate() decompresses the
stream as it reads it. When it reaches the end of the last DEFLATE block and
the checksum that follows matches, it signals that to its caller, indicating
that the rest of the data isn't part of the zlib stream anymore.
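The role of the trailer can be checked by hand on the small example from earlier. The sketch below compares the last four bytes of the stream, which RFC 1950 stores most significant byte first, with the Adler-32 of the original message, and then decodes the bytes in between as raw DEFLATE by passing a negative `wbits` value.

```python
import zlib

payload = b'foo'
stream = zlib.compress(payload)

# Trailer: Adler-32 of the uncompressed data, big-endian.
stored = int.from_bytes(stream[-4:], 'big')
computed = zlib.adler32(payload)
print(hex(stored), hex(computed))  # 0x2820145 0x2820145

# Without the 2-byte header and 4-byte trailer, raw DEFLATE remains.
# This assumes no preset dictionary (FDICT unset), as is the case here.
raw = zlib.decompressobj(wbits=-zlib.MAX_WBITS)
print(raw.decompress(stream[2:-4]))  # b'foo'
```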
In Python this behavior is exposed when using decompression objects instead of
simply calling zlib.decompress(). Calling decompress(string) on a Decompress
object will decompress a zlib stream in string and return the decompressed data
that was part of the stream. Everything that follows the stream will be stored
in unused_data and can be retrieved afterwards.
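For files too large to read in one go, the same mechanism works chunk by chunk. The following is a minimal sketch, assuming the file object is already positioned at the start of the zlib stream; the function name `read_zlib_stream` and the chunk size are arbitrary choices for this example. It relies on the `eof` attribute of the decompression object (Python 3.3+), which becomes true once the end of the stream has been reached.

```python
import io
import zlib

def read_zlib_stream(fh, chunk_size=4096):
    """Decompress one zlib stream starting at fh's current position.

    Returns (payload, compressed_length).
    """
    zo = zlib.decompressobj()
    parts = []
    consumed = 0
    while not zo.eof:
        chunk = fh.read(chunk_size)
        if not chunk:
            raise ValueError("file ended before the zlib stream did")
        parts.append(zo.decompress(chunk))
        consumed += len(chunk)
    # Bytes read past the end of the stream are left in unused_data.
    return b''.join(parts), consumed - len(zo.unused_data)

# Usage with an in-memory file shaped like the example above.
compressed = zlib.compress(b'foo')
fh = io.BytesIO(b'HEADER' + compressed + b'FOOTER')
fh.seek(6)  # skip the known header
payload, length = read_zlib_stream(fh, chunk_size=4)
print(payload, length == len(compressed))
```

Adding the returned length to the start offset gives the position where the footer begins.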
This should produce the following output on a file created with the first
script:
Start of zlib stream found at byte 6
Header: HEADER
Message: foo
Footer: FOOTER
The example can easily be modified to write the uncompressed message to a file
instead of printing it. Then you can further analyze the formerly zlib
compressed data, and try to identify known fields in the metadata in the
header and footer you separated out.