This is kind of a branch off of my other question. Read it if you like, but it's not necessary.
Basically, I realized that in order to effectively use C#'s BeginReceive() on large messages, I need to either (a) read the packet length first and then read exactly that many bytes, or (b) use an end-of-packet delimiter. My question is, is either of these present in Protocol Buffers? I haven't used them yet, but going over the documentation, it doesn't seem like there is a length header or a delimiter.
If not, what should I do? Should I just build the message then prefix/suffix it with the length header/EOP delimiter?
You need to include the size or an end marker in your protocol. Nothing is built into stream-based sockets (TCP/IP) beyond support for an indefinite stream of octets arbitrarily broken up into separate packets (and packets can be split in transit as well).
A simple approach would be for each "message" to have a fixed-size header that includes a protocol version, the payload size, and any other fixed data, followed by the message content (payload).
Optionally, a fixed-size message footer could be added with a checksum or even a cryptographic signature (depending on your reliability/security requirements).
Knowing the payload size allows you to keep reading until you have the rest of the message (and if a read completes with fewer bytes, issuing another read for the remainder until the whole message has been received).
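For illustration, here is a minimal sketch of that read loop in C#, assuming a simple 4-byte big-endian length prefix (the exact header layout is up to your protocol):

    using System.IO;
    using System.Net.Sockets;

    static class Framing
    {
        // Read exactly 'count' bytes; a single Read may return fewer.
        static byte[] ReadExactly(NetworkStream stream, int count)
        {
            var buffer = new byte[count];
            int offset = 0;
            while (offset < count)
            {
                int read = stream.Read(buffer, offset, count - offset);
                if (read == 0)
                    throw new EndOfStreamException("Connection closed mid-message.");
                offset += read;
            }
            return buffer;
        }

        // One message = 4-byte big-endian payload length, then the payload itself.
        public static byte[] ReadMessage(NetworkStream stream)
        {
            byte[] header = ReadExactly(stream, 4);
            int length = (header[0] << 24) | (header[1] << 16) | (header[2] << 8) | header[3];
            return ReadExactly(stream, length);
        }
    }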
Having an end-of-message indicator also works, but you need to define how to handle a message body that contains that same octet sequence...
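If you do go the end-marker route, one common answer to that problem is byte stuffing. A rough HDLC-style sketch (the 0x7E/0x7D values are just an example convention, not anything mandated by your protocol):

    using System.Collections.Generic;

    static class ByteStuffing
    {
        const byte Frame = 0x7E;   // end-of-frame marker
        const byte Escape = 0x7D;  // escape byte

        // Escape any payload byte that collides with the marker or the escape
        // byte itself, then append the end-of-frame marker.
        public static byte[] Encode(byte[] payload)
        {
            var encoded = new List<byte>();
            foreach (byte b in payload)
            {
                if (b == Frame || b == Escape)
                {
                    encoded.Add(Escape);
                    encoded.Add((byte)(b ^ 0x20));
                }
                else
                {
                    encoded.Add(b);
                }
            }
            encoded.Add(Frame);
            return encoded.ToArray();
        }
    }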
Apologies for arriving late at the party. I am the author of protobuf-net, one of the C# implementations. For network usage, you should consider the "[De]SerializeWithLengthPrefix" methods - that way, it will automatically handle the lengths for you. There are examples in the source.
I won't go into huge detail on an old post, but if you want to know more, add a comment and I'll get back to you.
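A rough sketch of how those methods are used (the Person contract here is just a stand-in type, not something from the question):

    using System.Net.Sockets;
    using ProtoBuf;

    [ProtoContract]
    public class Person
    {
        [ProtoMember(1)] public int Id { get; set; }
        [ProtoMember(2)] public string Name { get; set; }
    }

    public static class LengthPrefixedMessaging
    {
        public static void Send(NetworkStream stream, Person person)
        {
            // Writes a varint length prefix, then the serialized message.
            Serializer.SerializeWithLengthPrefix(stream, person, PrefixStyle.Base128);
        }

        public static Person Receive(NetworkStream stream)
        {
            // Reads the prefix, then exactly that many bytes, and deserializes.
            return Serializer.DeserializeWithLengthPrefix<Person>(stream, PrefixStyle.Base128);
        }
    }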
I agree with Matt that a header is better than a footer for Protocol Buffers, for the primary reason that as PB is a binary protocol it's problematic to come up with a footer that would not also be a valid message sequence. A lot of footer-based protocols (typically EOL ones) work because the message content is in a defined range (typically 0x20 - 0x7F ASCII).
A useful approach is to have your lowest level code just read buffers off of the socket and present them up to a framing layer that assembles complete messages and remembers partial ones (I present an async approach to this (using the CCR) here, albeit for a line protocol).
For consistency, you could always define your message as a PB message with three fields: a fixed-size int as the length, an enum as the type, and a byte sequence that contains the actual data. This keeps your entire network protocol transparent.
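A sketch of that envelope idea expressed with protobuf-net attributes (the names and enum values are purely illustrative; the same thing could be declared in a .proto file):

    using ProtoBuf;

    public enum MessageType
    {
        Unknown = 0,
        Login = 1,
        Chat = 2
    }

    [ProtoContract]
    public class Envelope
    {
        // Fixed-width encoding so the length field always occupies the same space.
        [ProtoMember(1, DataFormat = DataFormat.FixedSize)]
        public int Length { get; set; }

        [ProtoMember(2)]
        public MessageType Type { get; set; }

        // The serialized inner message travels as an opaque byte sequence.
        [ProtoMember(3)]
        public byte[] Payload { get; set; }
    }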
TCP/IP and UDP packets do include some reference to their size. The IP header contains a 16-bit field that specifies the length of the IP header and data in bytes. The TCP header contains a 4-bit field that specifies the size of the TCP header in 32-bit words. The UDP header contains a 16-bit field that specifies the length of the UDP header and data in bytes.
Here's the thing.
Using the standard run-of-the-mill sockets in Windows, whether you're using the System.Net.Sockets namespace in C# or the native Winsock stuff in Win32, you never see the IP/TCP/UDP headers. These headers are stripped off so that what you get when you read the socket is the actual payload, i.e., the data that was sent.
The typical pattern from everything I've ever seen and done using sockets is that you define an application-level header that precedes the data you want to send. At a minimum, this header should include the size of the data to follow. This will allow you to read each "message" in its entirety without having to guess as to its size. You can get as fancy as you want with it, e.g., sync patterns, CRCs, version, type of message, etc., but the size of the "message" is all you really need.
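As an illustration, something along these lines (the field set here is just an example; only the payload length is strictly required):

    using System.IO;

    // Illustrative application-level header written ahead of each message.
    public class AppHeader
    {
        public const uint SyncPattern = 0xCAFEBABE; // helps re-sync a corrupted stream
        public ushort Version { get; set; }
        public ushort MessageType { get; set; }
        public int PayloadLength { get; set; }

        public void WriteTo(BinaryWriter writer)
        {
            writer.Write(SyncPattern);
            writer.Write(Version);
            writer.Write(MessageType);
            writer.Write(PayloadLength);
        }

        public static AppHeader ReadFrom(BinaryReader reader)
        {
            if (reader.ReadUInt32() != SyncPattern)
                throw new InvalidDataException("Lost sync with the stream.");
            return new AppHeader
            {
                Version = reader.ReadUInt16(),
                MessageType = reader.ReadUInt16(),
                PayloadLength = reader.ReadInt32()
            };
        }
    }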
And for what it's worth, I would suggest using a header instead of an end-of-packet delimiter. I'm not sure there is a significant disadvantage to the EOP delimiter, but the header is the approach used by most IP protocols I've seen. In addition, it just seems more intuitive to me to process a message from the beginning rather than wait for some pattern to appear in my stream to indicate that my message is complete.
EDIT: I have only just become aware of the Google Protocol Buffers project. From what I can tell, it is a binary serialization/de-serialization format with .NET implementations (such as protobuf-net) that can plug into WCF (I'm sure that's a gross oversimplification). If you are using WCF, you don't have to worry about the size of the messages being sent because the WCF plumbing takes care of this behind the scenes, which is probably why you haven't found anything related to message length in the Protocol Buffers documentation. However, in the case of sockets, knowing the size will help out tremendously as discussed above. My guess is that you will serialize your data using Protocol Buffers and then tack on whatever application header you come up with before sending it. On the receive side, you'll pull off the header and then de-serialize the remainder of the message.
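A rough sketch of that flow, using protobuf-net for the serialization step and a bare 4-byte length standing in for whatever application header you design:

    using System.IO;
    using System.Net.Sockets;
    using ProtoBuf;

    public static class ProtoOverSockets
    {
        // Sender: serialize the message, then tack the header (here just a
        // 4-byte big-endian payload length) onto the front.
        public static void Send<T>(NetworkStream stream, T message)
        {
            byte[] payload;
            using (var ms = new MemoryStream())
            {
                Serializer.Serialize(ms, message);
                payload = ms.ToArray();
            }

            byte[] header =
            {
                (byte)(payload.Length >> 24),
                (byte)(payload.Length >> 16),
                (byte)(payload.Length >> 8),
                (byte)payload.Length
            };

            stream.Write(header, 0, header.Length);
            stream.Write(payload, 0, payload.Length);
        }

        // Receiver: pull off the header, read exactly that many bytes, deserialize.
        public static T Receive<T>(NetworkStream stream)
        {
            byte[] header = ReadExactly(stream, 4);
            int length = (header[0] << 24) | (header[1] << 16) | (header[2] << 8) | header[3];
            using (var ms = new MemoryStream(ReadExactly(stream, length)))
            {
                return Serializer.Deserialize<T>(ms);
            }
        }

        static byte[] ReadExactly(NetworkStream stream, int count)
        {
            var buffer = new byte[count];
            int offset = 0;
            while (offset < count)
            {
                int read = stream.Read(buffer, offset, count - offset);
                if (read == 0)
                    throw new EndOfStreamException("Connection closed mid-message.");
                offset += read;
            }
            return buffer;
        }
    }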