Extract zlib compressed data from binary file in p

Posted 2020-02-10 07:45

Question:

My company uses a legacy file format for electromyography data, which is no longer in production. However, there is some interest in maintaining backward compatibility, so I am looking into writing a reader for that file format.

From analyzing the very convoluted original source code, written in Delphi, I can see that the file reader/writer uses ZLIB. In a hex editor it looks like there is a binary file header containing ASCII text (with fields like "Player" and "Analyzer" readily readable), followed by a compressed string containing the raw data.

My question is: how should I go about determining:

  • Whether it is a compressed stream;
  • Where the compressed stream starts and where it ends.

From Wikipedia:

zlib compressed data is typically written with a gzip or a zlib wrapper. The wrapper encapsulates the raw DEFLATE data by adding a header and trailer. This provides stream identification and error detection

Is this relevant?

I'll be glad to post more information, but I don't know what would be most relevant.

Thanks for any hint.

EDIT: I have the working application, and can use it to record actual data of any time length, getting files even smaller than 1kB if necessary.


Some sample files:

A freshly created one, without datastream: https://dl.dropbox.com/u/4849855/Mio_File/HeltonEmpty.mio

The same as above after a very short (1 second?) datastream has been saved: https://dl.dropbox.com/u/4849855/Mio_File/HeltonFilled.mio

A different one, from a patient named "manco" instead of "Helton", with an even shorter stream (ideal for Hex viewing): https://dl.dropbox.com/u/4849855/Mio_File/manco_short.mio

Instructions: each file should be the file of a patient (a person). Inside these files, one or more exams are saved, each exam consisting of one or more time series. The provided files contain only one exam, with one data series.

Answer 1:

To start, why not scan the files for all valid zlib streams (this is good enough for small files and for figuring out the format):

import zlib
from glob import glob

def zipstreams(filename):
    """Yield (offset, decompressed data) for every zlib stream in the file."""
    with open(filename, 'rb') as fh:
        data = fh.read()
    i = 0
    while i < len(data):
        try:
            zo = zlib.decompressobj()
            yield i, zo.decompress(data[i:])
            # Jump past the bytes this stream consumed.
            i += len(data[i:]) - len(zo.unused_data)
        except zlib.error:
            # No zlib stream starts here; try the next offset.
            i += 1

for filename in glob('*.mio'):
    print(filename)
    for i, data in zipstreams(filename):
        print(i, len(data))

Looks like the data streams contain little-endian double precision floating point data:

import numpy
from matplotlib import pyplot

for filename in glob('*.mio'):
    for i, data in zipstreams(filename):
        if data:
            a = numpy.frombuffer(data, '<f8')
            pyplot.plot(a[1:])
            pyplot.title(filename + ' - %i' % i)
            pyplot.show()
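
To sanity-check the "little-endian double" interpretation without NumPy, the standard library's struct module can be used. This is only a sketch: the sample values below are made up and stand in for one decompressed data series.

```python
import struct
import zlib

# Made-up samples standing in for one decompressed data series.
samples = [0.0, 1.5, -2.25]
stream = zlib.compress(struct.pack('<3d', *samples))

raw = zlib.decompress(stream)
count = len(raw) // 8                      # 8 bytes per double
# '<' selects little-endian, 'd' is a 64-bit float.
values = struct.unpack('<%dd' % count, raw[:count * 8])
print(values)
```

If the decoded values look like plausible signal amplitudes rather than huge or NaN values, the guess about the element type is probably right.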


Answer 2:

The zlib format is a thin wrapper around data compressed with the DEFLATE algorithm and is defined in RFC 1950:

  A zlib stream has the following structure:

       0   1
     +---+---+
     |CMF|FLG|   (more-->)
     +---+---+

  (if FLG.FDICT set)

       0   1   2   3
     +---+---+---+---+
     |     DICTID    |   (more-->)
     +---+---+---+---+

     +=====================+---+---+---+---+
     |...compressed data...|    ADLER32    |
     +=====================+---+---+---+---+

So it adds at least two, possibly six, bytes before the raw DEFLATE compressed data and four bytes containing an ADLER32 checksum after it.
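
To make those byte counts concrete, here is a minimal sketch that peels the wrapper off by hand. It assumes no preset dictionary (FDICT clear), so the header is exactly two bytes; the function name split_zlib_stream is made up for illustration.

```python
import struct
import zlib

def split_zlib_stream(stream):
    """Split a complete zlib stream into (header, raw_deflate, adler32).

    Assumes FDICT is not set, i.e. the header is exactly two bytes.
    """
    header = stream[:2]
    raw_deflate = stream[2:-4]
    # The trailer is the Adler-32 of the uncompressed data, big-endian.
    (adler,) = struct.unpack('>I', stream[-4:])
    return header, raw_deflate, adler

stream = zlib.compress(b'foo')
header, raw_deflate, adler = split_zlib_stream(stream)

# A negative wbits value tells zlib to expect raw DEFLATE data without a wrapper.
payload = zlib.decompress(raw_deflate, -15)

print(header.hex())                    # the two header bytes, CMF and FLG
print(payload)
print(adler == zlib.adler32(payload))  # trailer matches the payload checksum
```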

The first byte contains the CMF (Compression Method and Flags), which is split into CM (Compression Method, the low 4 bits) and CINFO (Compression Info, the high 4 bits).

This makes it clear that, unfortunately, even the first two bytes of a zlib stream can vary considerably depending on the compression method and settings used.
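
As a rough illustration of how those two bytes break down, the sketch below decodes them following the field layout in RFC 1950, which numbers bits from the least significant end (so CM is the low nibble of CMF). The function name parse_zlib_header and the dictionary keys are my own, not part of any library.

```python
def parse_zlib_header(cmf, flg):
    """Decode the two zlib header bytes according to RFC 1950."""
    return {
        'cm': cmf & 0x0F,              # compression method, 8 = DEFLATE
        'cinfo': cmf >> 4,             # for CM = 8: log2(window size) - 8
        'window_size': 1 << ((cmf >> 4) + 8),
        'fdict': bool(flg & 0x20),     # preset dictionary ID follows the header
        'flevel': flg >> 6,            # compression level hint (0-3)
        # CMF*256 + FLG must be a multiple of 31 for a valid header.
        'check_ok': (cmf * 256 + flg) % 31 == 0,
    }

info = parse_zlib_header(0x78, 0x9C)
print(info)
```

The multiple-of-31 rule is what keeps the number of plausible header pairs small, even though both bytes can vary.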

Luckily, I stumbled upon a post by Mark Adler, the author of the ADLER32 algorithm, where he lists the most common and less common combinations of those two starting bytes.

With that out of the way, let's look at how we can use Python to examine zlib:

>>> import zlib
>>> msg = b'foo'
>>> [hex(b) for b in zlib.compress(msg)]
['0x78', '0x9c', '0x4b', '0xcb', '0xcf', '0x7', '0x0', '0x2', '0x82', '0x1', '0x45']

So the zlib data created by Python's zlib module (using default options) starts with 78 9c. We'll use that to create a script that writes a custom file format containing a preamble, some zlib compressed data and a footer.

We then write a second script that scans a file for that two byte pattern, starts decompressing everything that follows as a zlib stream and figures out where the stream ends and the footer starts.

create.py

import zlib

msg = b'foo'
filename = 'foo.compressed'

# zlib works on bytes, so header and footer are bytes as well.
compressed_msg = zlib.compress(msg)
data = b'HEADER' + compressed_msg + b'FOOTER'

with open(filename, 'wb') as outfile:
    outfile.write(data)

Here we take msg, compress it with zlib, and surround it with a header and footer before we write it out to a file.

Header and footer are of fixed length in this example, but they could of course have arbitrary, unknown lengths.

Now for the script that tries to find a zlib stream in such a file. Because for this example we know exactly what marker to expect I'm using only one, but obviously the list ZLIB_MARKERS could be filled with all the markers from the post mentioned above.

ident.py

import zlib

ZLIB_MARKERS = [b'\x78\x9c']
filename = 'foo.compressed'

# Read in binary mode so the bytes are neither decoded nor altered.
with open(filename, 'rb') as infile:
    data = infile.read()

pos = 0
found = False

while not found:
    window = data[pos:pos+2]
    for marker in ZLIB_MARKERS:
        if window == marker:
            found = True
            start = pos
            print("Start of zlib stream found at byte %s" % pos)
            break
    if pos == len(data):
        break
    pos += 1

if found:
    header = data[:start]

    rest_of_data = data[start:]
    decomp_obj = zlib.decompressobj()
    uncompressed_msg = decomp_obj.decompress(rest_of_data)

    # Whatever follows the end of the zlib stream is left in unused_data.
    footer = decomp_obj.unused_data

    # The parts in this example are plain ASCII, so decode them for display.
    print("Header: %s" % header.decode('ascii'))
    print("Message: %s" % uncompressed_msg.decode('ascii'))
    print("Footer: %s" % footer.decode('ascii'))
else:
    print("Sorry, no zlib streams starting with any of the markers found.")

The idea is this:

  • Start at the beginning of the file and create a two byte search window.

  • Move the search window forward in one-byte increments.

  • For every window check if it matches any of the two byte markers we defined.

  • If a match is found, record the starting position, stop searching and try to decompress everything that follows.
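
The steps above can also be sketched as a reusable generator that keeps scanning after the first hit and rejects chance matches. This is a sketch under assumptions, not a definitive implementation: the marker list is derived from the RFC 1950 header rules (DEFLATE, no preset dictionary) rather than taken from the post mentioned earlier, and the names candidate_markers and find_zlib_streams are made up for illustration.

```python
import zlib

def candidate_markers():
    """All two-byte headers valid for DEFLATE with no preset dictionary."""
    markers = []
    for cinfo in range(8):                 # window sizes 256 B .. 32 KiB
        cmf = (cinfo << 4) | 8             # CM = 8 means DEFLATE
        for flevel in range(4):
            flg = flevel << 6              # FDICT bit left clear
            # FCHECK makes CMF*256 + FLG a multiple of 31.
            flg += (31 - (cmf * 256 + flg) % 31) % 31
            markers.append(bytes([cmf, flg]))
    return markers

def find_zlib_streams(data, markers):
    """Yield (start, payload, end) for each stream that decompresses completely."""
    pos = 0
    while pos < len(data) - 1:
        if data[pos:pos + 2] in markers:
            decomp = zlib.decompressobj()
            try:
                payload = decomp.decompress(data[pos:])
            except zlib.error:
                payload = None             # marker bytes occurred by chance
            # eof is only True if the stream ended and its checksum matched.
            if payload is not None and decomp.eof:
                end = len(data) - len(decomp.unused_data)
                yield pos, payload, end
                pos = end
                continue
        pos += 1

blob = b'HEADER' + zlib.compress(b'foo') + b'FOOTER'
for start, payload, end in find_zlib_streams(blob, candidate_markers()):
    print(start, end, payload)
```

Checking eof matters because a two-byte marker can appear by accident inside unrelated binary data; only a stream that runs to a verified end is reported.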

Now, finding the end of the stream isn't as trivial as looking for two marker bytes. zlib streams are not terminated by a fixed byte sequence, nor is their length indicated in any of the header fields. Instead, the DEFLATE data itself marks its last block, and that block is followed by a four-byte ADLER32 checksum that must match the uncompressed data.

The way it works is that the internal C function inflate() decompresses the stream as it reads it. When it reaches the end of the final DEFLATE block and has verified the checksum that follows, it signals this to its caller, indicating that the rest of the data is no longer part of the zlib stream.

In Python this behavior is exposed when using decompression objects instead of simply calling zlib.decompress(). Calling decompress(string) on a Decompress object will decompress a zlib stream in string and return the decompressed data that was part of the stream. Everything that follows the stream will be stored in unused_data and can be retrieved afterwards.
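
A small sketch of that behavior, using an in-memory stand-in for the file contents. It also shows the eof attribute of the decompression object, which can be used to tell a complete stream from a truncated one.

```python
import zlib

stream = zlib.compress(b'foo')
data = stream + b'FOOTER'

decomp = zlib.decompressobj()
message = decomp.decompress(data)

print(message)             # the decompressed payload
print(decomp.eof)          # True once the end of the stream was reached
print(decomp.unused_data)  # bytes that followed the stream

# Length of the compressed stream inside `data`:
stream_length = len(data) - len(decomp.unused_data)
print(stream_length == len(stream))

# A truncated stream raises no error here; eof simply stays False.
partial = zlib.decompressobj()
partial.decompress(stream[:-1])
print(partial.eof)
```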

This should produce the following output on a file created with the first script:

Start of zlib stream found at byte 6
Header: HEADER
Message: foo
Footer: FOOTER

The example can easily be modified to write the uncompressed message to a file instead of printing it. Then you can further analyze the formerly zlib compressed data, and try to identify known fields in the metadata in the header and footer you separated out.
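
One possible way to do that is sketched below. The helper split_container and the output file names are hypothetical, the stream offset is assumed to be known already, and a temporary directory is used so the sketch does not touch any real files.

```python
import os
import tempfile
import zlib

def split_container(data, start):
    """Split `data` into (header, payload, footer) around the zlib stream at `start`."""
    decomp = zlib.decompressobj()
    payload = decomp.decompress(data[start:])
    return data[:start], payload, decomp.unused_data

# Stand-in for the contents of a real file; the stream offset (6) is known here.
data = b'HEADER' + zlib.compress(b'foo') + b'FOOTER'
header, payload, footer = split_container(data, 6)

# Write each part to its own file for later inspection in a hex editor.
out_dir = tempfile.mkdtemp()
for name, part in [('header.bin', header),
                   ('payload.bin', payload),
                   ('footer.bin', footer)]:
    with open(os.path.join(out_dir, name), 'wb') as fh:
        fh.write(part)
print(sorted(os.listdir(out_dir)))
```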