I'm using Google Protocol Buffers to serialize equity market data (i.e. timestamp, bid, and ask fields).
I can store one message into a file and deserialize it without issue.
How can I store multiple messages in a single file? I'm not sure how to separate the messages. I need to be able to append new messages to the file on the fly.
I would recommend using the writeDelimitedTo(OutputStream) and parseDelimitedFrom(InputStream) methods on Message objects. writeDelimitedTo writes the length of the message before the message itself; parseDelimitedFrom then uses that length to read only one message and no farther. This allows multiple messages to be written to a single OutputStream to then be parsed separately. For more information, see https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/MessageLite#writeDelimitedTo(java.io.OutputStream)
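To make the framing concrete, here is a rough stdlib-only Java sketch of length-delimited records, with each length written as a base-128 varint (the encoding writeDelimitedTo uses for its size prefix). This is not the protobuf library code: the class and method names (DelimitedFraming, writeDelimited, readDelimited) are made up for illustration, and plain byte arrays stand in for serialized messages.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper, not part of protobuf: varint length prefix + payload.
final class DelimitedFraming {

    /** Writes the payload length as a base-128 varint, then the payload bytes. */
    static void writeDelimited(byte[] payload, OutputStream out) throws IOException {
        int value = payload.length;
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // low 7 bits, continuation bit set
            value >>>= 7;
        }
        out.write(value);                     // final byte, continuation bit clear
        out.write(payload);
    }

    /** Reads one framed payload, or returns null at a clean end of stream. */
    static byte[] readDelimited(InputStream in) throws IOException {
        int length = 0;
        int shift = 0;
        while (true) {
            int b = in.read();
            if (b == -1) {
                if (shift == 0) return null;  // no more records
                throw new EOFException("truncated length prefix");
            }
            if (shift > 28) throw new IOException("length prefix too long");
            length |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
        }
        if (length < 0) throw new IOException("length out of range");

        // Read exactly 'length' bytes so the next record starts at the right place.
        byte[] payload = new byte[length];
        int offset = 0;
        while (offset < length) {
            int read = in.read(payload, offset, length - offset);
            if (read == -1) throw new EOFException("truncated message");
            offset += read;
        }
        return payload;
    }
}
```

With the real library you would not need any of this; you would call writeDelimitedTo on each message and loop on parseDelimitedFrom until it returns null.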
From the docs:
http://code.google.com/apis/protocolbuffers/docs/techniques.html#streaming
Streaming Multiple Messages
If you want to write multiple messages to a single file or stream, it
is up to you to keep track of where one message ends and the next
begins. The Protocol Buffer wire format is not self-delimiting, so
protocol buffer parsers cannot determine where a message ends on their
own. The easiest way to solve this problem is to write the size of
each message before you write the message itself. When you read the
messages back in, you read the size, then read the bytes into a
separate buffer, then parse from that buffer. (If you want to avoid
copying bytes to a separate buffer, check out the CodedInputStream
class (in both C++ and Java) which can be told to limit reads to a
certain number of bytes.)
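The "write the size, then the message" recipe quoted above might look like the following when appending to a file on the fly. This is a minimal sketch using only the Java standard library and a fixed 4-byte size prefix (big-endian, because that is what DataOutputStream writes). The class name SizePrefixedLog and its methods are hypothetical; the byte arrays would come from toByteArray() on your message and be handed to your generated class's parseFrom(byte[]) after reading.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: each record is a 4-byte size followed by the message bytes.
final class SizePrefixedLog {

    /** Appends one record to the end of the file (opened in append mode). */
    static void append(File file, byte[] message) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file, true)))) {
            out.writeInt(message.length); // size first
            out.write(message);           // then the serialized message
        }
    }

    /** Reads every record in the file, in the order they were appended. */
    static List<byte[]> readAll(File file) throws IOException {
        List<byte[]> messages = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            while (true) {
                int size;
                try {
                    size = in.readInt();
                } catch (EOFException endOfFile) {
                    break; // no further size prefix; treated as end of log
                }
                if (size < 0) throw new IOException("corrupt size prefix: " + size);
                byte[] buffer = new byte[size];
                in.readFully(buffer); // throws EOFException if the record is cut short
                messages.add(buffer);
            }
        }
        return messages;
    }
}
```

Opening and closing the file on every append keeps the sketch short; a long-running writer would more likely keep one stream open and flush after each record.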
Protobuf does not include a terminator per outermost record, so you need to handle that yourself. The simplest approach is to prefix each record with its length. Personally, I tend to write a string-type field header (for an arbitrary field number), then the length as a "varint". This means the entire file is itself a valid protobuf and could be consumed as an object with a "repeated" element. However, a fixed-length marker (typically 32-bit little-endian) would do just as well. With either kind of storage, the file is appendable, as you require.
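A sketch of the header-plus-varint idea, in stdlib-only Java: each record is written as a field tag for a length-delimited field, then the varint length, then the payload. The field number 1 is an arbitrary choice, and the class and method names are hypothetical. Under these assumptions, the resulting file should also parse as a wrapper message declaring something like "repeated Quote quotes = 1;" (that wrapper is likewise made up for illustration).

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper: frames each record as one length-delimited protobuf field.
final class RepeatedFieldFraming {
    static final int FIELD_NUMBER = 1;               // arbitrary, must match the wrapper
    static final int WIRE_TYPE_LENGTH_DELIMITED = 2; // wire type for strings/bytes/messages
    static final int TAG = (FIELD_NUMBER << 3) | WIRE_TYPE_LENGTH_DELIMITED;

    static void writeVarint(int value, OutputStream out) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Returns the decoded varint, or -1 if the stream ended before any byte was read. */
    static int readVarint(InputStream in) throws IOException {
        int result = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            int b = in.read();
            if (b == -1) {
                if (shift == 0) return -1;
                throw new EOFException("truncated varint");
            }
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                if (result < 0) throw new IOException("varint out of int range");
                return result;
            }
        }
        throw new IOException("varint too long");
    }

    /** Writes tag, length, payload: the same layout as one element of a repeated field. */
    static void writeRecord(byte[] payload, OutputStream out) throws IOException {
        writeVarint(TAG, out);
        writeVarint(payload.length, out);
        out.write(payload);
    }

    /** Reads one record, or returns null at a clean end of stream. */
    static byte[] readRecord(InputStream in) throws IOException {
        int tag = readVarint(in);
        if (tag == -1) return null;
        if (tag != TAG) throw new IOException("unexpected tag: " + tag);
        int length = readVarint(in);
        if (length == -1) throw new EOFException("missing length");
        byte[] payload = new byte[length];
        int offset = 0;
        while (offset < length) {
            int read = in.read(payload, offset, length - offset);
            if (read == -1) throw new EOFException("truncated record");
            offset += read;
        }
        return payload;
    }
}
```

For field number 1 the tag byte is 0x0A, so a two-byte payload "ab" is stored as 0A 02 61 62. Appending another record just adds another tag, length, and payload to the end of the file.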
If you're looking for a C++ solution, Kenton Varda submitted a patch to protobuf around August 2015 that adds support for writeDelimitedTo() and readDelimitedFrom() calls. These serialize and deserialize a sequence of proto messages to and from a file in a way that's compatible with the Java version of these calls. Unfortunately, this patch hasn't been approved yet, so if you want the functionality you'll need to merge it yourself.
Another option: Google has open-sourced protobuf file reading/writing code through other projects. The or-tools library, for example, contains the classes RecordReader and RecordWriter, which serialize and deserialize a proto stream to and from a file.
If you would like stand-alone versions of these classes that have almost no external dependencies, I have a fork of or-tools that contains only these classes. See: https://github.com/moof2k/recordio
Reading and writing with these classes is straightforward:
File* file = File::Open("proto.log", "w");
RecordWriter writer(file);
writer.WriteProtocolMessage(msg1);
writer.WriteProtocolMessage(msg2);
...
writer.Close();
An easier way is to base64-encode each message and store one record per line.
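A minimal sketch of the one-record-per-line idea, using java.util.Base64 from the standard library. The class and method names are made up for the example, and the byte arrays stand in for serialized messages. The basic Base64 encoder emits no line breaks, so a newline can safely act as the record separator.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

// Hypothetical helper: one base64-encoded message per text line.
final class Base64Lines {

    /** Encodes the message and writes it as a single newline-terminated line. */
    static void appendLine(Writer out, byte[] message) throws IOException {
        out.write(Base64.getEncoder().encodeToString(message));
        out.write('\n');
    }

    /** Decodes every non-empty line back into the original message bytes. */
    static List<byte[]> readLines(Reader in) throws IOException {
        List<byte[]> messages = new ArrayList<>();
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) continue; // tolerate blank lines
            messages.add(Base64.getDecoder().decode(line));
        }
        return messages;
    }
}
```

The trade-off is size: base64 turns every 3 bytes into 4 characters, so the file grows by roughly a third compared with a binary length prefix. In return, the file can be inspected and split with ordinary text tools.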