Parsing Binary Data in C?

Posted 2019-01-30 19:53

Are there any libraries or guides for how to read and parse binary data in C?

I am working on functionality that will receive TCP packets on a network socket and then parse that binary data according to a specification, turning the information into a form that is more usable by the code.

Are there any libraries out there that do this, or even a primer on performing this type of thing?

9 answers
chillily
Reply #2 · 2019-01-30 20:30

Let me restate your question to see if I understood properly. You are looking for software that will take a formal description of a packet and then will produce a "decoder" to parse such packets?

If so, the reference in that field is PADS. A good article introducing it is PADS: A Domain-Specific Language for Processing Ad Hoc Data. PADS is very complete but unfortunately under a non-free licence.

There are possible alternatives (I did not mention non-C solutions). Apparently, none can be regarded as completely production-ready:

If you read French, I summarized these issues in Génération de décodeurs de formats binaires.

我只想做你的唯一
Reply #3 · 2019-01-30 20:31

I don't really understand what kind of library you are looking for. A generic library that takes any binary input and parses it into an unknown format? I'm not sure such a library could exist in any language. I think you need to elaborate on your question a little.

Edit:
OK, so after reading Jon's answer, it seems there is such a library, although it is more of a code generation tool. But as many have stated, simply casting the data to the appropriate data structure works, provided you take appropriate care, i.e. use packed structures and handle endianness. Using such a tool with C is overkill.

手持菜刀,她持情操
Reply #4 · 2019-01-30 20:31

Basically, the suggestions about casting to a struct work, but please be aware that numbers can be represented differently on different architectures.

Network byte order was introduced to deal with endianness issues: common practice is to convert numbers from host byte order to network byte order before sending the data, and to convert them back to host order on receipt. See the functions htonl, htons, ntohl and ntohs.
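
A minimal sketch of that round trip, assuming a POSIX system where htonl and ntohl are declared in <arpa/inet.h>. The helper names put_u32 and get_u32 are hypothetical and only serve the illustration.

```c
#include <arpa/inet.h>  /* htonl, ntohl (POSIX) */
#include <stdint.h>
#include <string.h>

/* Sender side: store a host-order value in the buffer in network byte order. */
void put_u32(unsigned char *buf, uint32_t host_value)
{
    uint32_t net = htonl(host_value);
    memcpy(buf, &net, sizeof net);
}

/* Receiver side: read the value back and convert it to host order. */
uint32_t get_u32(const unsigned char *buf)
{
    uint32_t net;
    memcpy(&net, buf, sizeof net);  /* memcpy rather than a cast: buf may be unaligned */
    return ntohl(net);
}
```

Whatever the host's endianness, the bytes in the buffer end up most significant first, so both sides agree on the value.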

And do consider kervin's advice: read UNP (Unix Network Programming). You won't regret it!

何必那么认真
Reply #5 · 2019-01-30 20:35

In my experience, the best way is to first write a set of primitives that read or write a single value of some type from or to a binary buffer. This gives you high visibility and a very simple way to handle any endianness issues: just make the functions do it right.

Then, you can for instance define structs for each of your protocol messages, and write pack/unpack (some people call them serialize/deserialize) functions for each.

As a base case, a primitive to extract a single 8-bit integer could look like this (assuming an 8-bit char on the host machine; you could add a layer of custom types to ensure that too, if needed):

const void * read_uint8(const void *buffer, unsigned char *value)
{
  /* A void pointer cannot be dereferenced, so read through a byte pointer. */
  const unsigned char *vptr = buffer;
  *value = *vptr++;
  return vptr;
}

Here, I chose to return the value by reference, and return an updated pointer. This is a matter of taste; you could of course return the value and update the pointer by reference instead. It is a crucial part of the design that the read function updates the pointer, which makes these calls chainable.
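
The alternative style could be sketched as follows, under the same 8-bit char assumption. The name take_uint8 is hypothetical; the function returns the value and advances the caller's cursor through a pointer-to-pointer.

```c
#include <stdint.h>

/* Alternative style: return the value and advance the caller's cursor. */
uint8_t take_uint8(const unsigned char **cursor)
{
    uint8_t value = **cursor;  /* read the byte at the current position */
    (*cursor)++;               /* move the caller's pointer past it */
    return value;
}
```

Calls still chain naturally: each call to take_uint8(&p) yields the next byte and leaves p pointing at the one after it.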

Now, we can write a similar function to read a 16-bit unsigned quantity:

const void * read_uint16(const void *buffer, unsigned short *value)
{
  unsigned char lo, hi;

  buffer = read_uint8(buffer, &hi);
  buffer = read_uint8(buffer, &lo);
  *value = (hi << 8) | lo;
  return buffer;
}

Here I assumed incoming data is big-endian, which is common in networking protocols (mainly for historical reasons). You could of course get clever, do some pointer arithmetic and remove the need for the temporaries, but I find this way clearer and easier to understand. Having maximal transparency in this kind of primitive can be a good thing when debugging.
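
Extending the same pattern to 32 bits might look like the sketch below. The names read_uint32 and write_uint32 are hypothetical, uint32_t from <stdint.h> (C99) is assumed to be available, and read_uint8 is repeated only so that the block compiles on its own.

```c
#include <stdint.h>

const void *read_uint8(const void *buffer, unsigned char *value)
{
    const unsigned char *vptr = buffer;
    *value = *vptr++;
    return vptr;
}

/* Read a big-endian 32-bit value, most significant byte first. */
const void *read_uint32(const void *buffer, uint32_t *value)
{
    unsigned char b[4];
    int i;

    for (i = 0; i < 4; i++)
        buffer = read_uint8(buffer, &b[i]);
    *value = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
             ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
    return buffer;
}

/* Matching write primitive: store value big-endian, return the advanced pointer. */
void *write_uint32(void *buffer, uint32_t value)
{
    unsigned char *p = buffer;

    *p++ = (unsigned char)(value >> 24);
    *p++ = (unsigned char)(value >> 16);
    *p++ = (unsigned char)(value >> 8);
    *p++ = (unsigned char)value;
    return p;
}
```

The casts to uint32_t before shifting matter: without them the bytes would be promoted to int, and shifting a set high bit into the sign position is undefined behaviour.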

The next step would be to start defining your protocol-specific messages, and write read/write primitives to match. At that level, think about code generation; if your protocol is described in some general, machine-readable format, you can generate the read/write functions from that, which saves a lot of grief. This is harder if the protocol format is clever enough, but often doable and highly recommended.
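
As a rough illustration of that step, here is an unpack function for an invented message. The struct example_header, its three fields and the 7-byte wire layout are all hypothetical, as are the short helpers rd8, rd16 and rd32, which follow the same return-the-advanced-pointer convention as the primitives above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message layout, big-endian on the wire:
 *   1 byte  version
 *   2 bytes sequence number
 *   4 bytes payload length
 */
struct example_header {
    uint8_t  version;
    uint16_t sequence;
    uint32_t payload_len;
};

#define EXAMPLE_HEADER_WIRE_SIZE 7

static const unsigned char *rd8(const unsigned char *p, uint8_t *v)
{
    *v = *p;
    return p + 1;
}

static const unsigned char *rd16(const unsigned char *p, uint16_t *v)
{
    *v = (uint16_t)(((unsigned)p[0] << 8) | p[1]);
    return p + 2;
}

static const unsigned char *rd32(const unsigned char *p, uint32_t *v)
{
    *v = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
         ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    return p + 4;
}

/* Unpack one header; returns the position after it, or NULL if buf is too short. */
const unsigned char *unpack_example_header(const unsigned char *buf, size_t len,
                                           struct example_header *out)
{
    if (len < EXAMPLE_HEADER_WIRE_SIZE)
        return NULL;
    buf = rd8(buf, &out->version);
    buf = rd16(buf, &out->sequence);
    buf = rd32(buf, &out->payload_len);
    return buf;
}
```

Because the unpack function is just a chain of primitive calls in field order, it is the kind of code that a generator could emit from a machine-readable protocol description.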

看我几分像从前
Reply #6 · 2019-01-30 20:39

You might be interested in Google Protocol Buffers, which is basically a serialization framework. It's primarily for C++/Java/Python (those are the languages supported by Google) but there are ongoing efforts to port it to other languages, including C. (I haven't used the C port at all, but I'm responsible for one of the C# ports.)

霸刀☆藐视天下
Reply #7 · 2019-01-30 20:39

You don't really need to parse binary data in C, just cast some pointer to whatever you think it should be.

struct SomeDataFormat
{
    ....
};

struct SomeDataFormat* pParsedData = (struct SomeDataFormat*) pBuffer;

Just be wary of endian issues, type sizes, reading off the end of buffers, etc etc
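
To make those caveats concrete, here is a sketch that avoids the cast for an invented 5-byte format. The struct record, WIRE_SIZE and parse_record are hypothetical names. The in-memory struct usually contains padding, so its size differs from the wire size and a direct cast would read the wrong bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wire format: 1-byte type followed by a 4-byte big-endian length. */
#define WIRE_SIZE 5

struct record {
    uint8_t  type;
    uint32_t length;  /* compilers typically insert padding before this field */
};

/* Returns 0 on success, -1 if the buffer is too short to hold one record. */
int parse_record(const unsigned char *buf, size_t buf_len, struct record *out)
{
    if (buf_len < WIRE_SIZE)
        return -1;  /* never read past the end of the buffer */
    out->type = buf[0];
    /* Assemble the value byte by byte: independent of host endianness and alignment. */
    out->length = ((uint32_t)buf[1] << 24) | ((uint32_t)buf[2] << 16) |
                  ((uint32_t)buf[3] << 8)  |  (uint32_t)buf[4];
    return 0;
}
```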
