I would like to read in a JPEG-Header and analyze it.
According to Wikipedia, the header consists of a sequence of markers. Each marker starts with FF xx, where xx is a specific marker ID.
So my idea was to simply read the image in binary mode and search for the corresponding byte combinations in the binary stream. This should enable me to split the header into the corresponding marker fields.
For instance, this is what I receive when I read in the first 20 bytes of an image:
binary_data = open('picture.jpg','rb').read(20)
print(binary_data)
b'\xff\xd8\xff\xe1-\xfcExif\x00\x00MM\x00*\x00\x00\x00\x08'
My questions are now:
1) Why does Python not return nice chunks of 2 bytes (in hex format)? I would expect something like this:
b'\xff \xd8 \xff \xe1 \x-' ... and so on
Some blocks delimited by '\x' are much longer than 2 bytes.
2) Why are there symbols like -, M, * in the returned string? Those are not characters I expect in the hex representation of a byte string (only 0-9 and a-f, I think).
Both observations hinder me in writing a simple parser. So ultimately my question comes down to: How do I properly read in and parse a JPEG header in Python?
You seem overly worried about how your binary data is represented on your console. Don't worry about that.
The default built-in string-based representation that print(..) applies to a bytes object is just "printable ASCII characters as such (with a few exceptions), all others as an escaped hex sequence". The exceptions are semi-special characters such as \, ", and ', which could mess up the string representation. But this alternative representation does not change the values in any way! The entire object is printed 'as if' it's a string, but its individual elements are still perfectly normal bytes.
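To make that concrete, here is a small sketch using the 20 bytes quoted in the question, written out as a literal (so no file is needed). It only illustrates that the printable characters in the display are ordinary byte values; the variable name data is just for this example.

```python
# The first 20 bytes from the question, written as a bytes literal.
data = b'\xff\xd8\xff\xe1-\xfcExif\x00\x00MM\x00*\x00\x00\x00\x08'

print(len(data))       # 20 bytes, however they happen to be displayed
print(data[0])         # indexing a bytes object yields an int: 255
print(data[4])         # the '-' is simply the byte value 45 (0x2d)
print(list(data[:6]))  # [255, 216, 255, 225, 45, 252]

# '-' and '\x2d' are two spellings of the same byte.
print(data[4:6] == b'\x2d\xfc')  # True
```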
If you have a byte array and you don't like the appearance of this default, then you can write your own. But – for clarity – this still doesn't have anything to do with parsing a file.
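One possible way to write your own display is sketched below. The helper name hex_dump and its group parameter are made up for this example; it relies only on the built-in bytes.hex() method (Python 3.5+).

```python
def hex_dump(data, group=2):
    """Return the bytes as hex text, split into groups of `group` bytes."""
    chunks = [data[i:i + group].hex() for i in range(0, len(data), group)]
    return ' '.join(chunks)

data = b'\xff\xd8\xff\xe1-\xfcExif\x00\x00MM\x00*\x00\x00\x00\x08'
print(hex_dump(data))
# ffd8 ffe1 2dfc 4578 6966 0000 4d4d 002a 0000 0008
```

This changes only what is shown on screen; data itself is untouched.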
As for your first question (why you don't get 2-byte chunks): because you don't ask it to. You are asking for a sequence of bytes, and that's what you get. If you want chunks of two bytes, transform it after reading. Printing only changes how the data is displayed; to create a new list that contains 2-byte words, loop over the data and convert each 2 bytes, or use struct.unpack (there are actually several ways).
In unpack, use the unsigned short code H with the big-endian specifier > rather than the little-endian <, because JPEG stores its 2-byte codes (markers and segment lengths) most significant byte first. Check the documentation of struct if you want to deviate from this.