Performant Entity Serialization: BSON vs MessagePack

Posted 2019-01-16 00:17

Question:

Recently I came across MessagePack, an alternative binary serialization format to Google's Protocol Buffers and JSON, which also outperforms both.

Also there's the BSON serialization format that is used by MongoDB for storing data.

Can somebody elaborate on the differences and the advantages/disadvantages of BSON vs MessagePack?


Just to complete the list of performant binary serialization formats: there are also Gobs, which are going to be the successor of Google's Protocol Buffers. However, in contrast to all the other formats mentioned, Gobs are not language-agnostic and rely on Go's built-in reflection; there are also Gobs libraries for at least one language other than Go.

Answer 1:

Please note that I'm the author of MessagePack. This answer may be biased.

Format design

  1. Compatibility with JSON

    In spite of its name, BSON's compatibility with JSON is not as good as MessagePack's.

    BSON has special types like "ObjectId", "Min key", "UUID" or "MD5" (I think these types are required by MongoDB). These types are not compatible with JSON, which means some type information is lost when you convert objects from BSON to JSON. This can be a disadvantage when using both JSON and BSON in a single service.

    MessagePack is designed to be transparently converted from/to JSON.

  2. MessagePack is smaller than BSON

    MessagePack's format is less verbose than BSON's. As a result, MessagePack can serialize objects into smaller payloads than BSON.

    For example, the simple map {"a":1, "b":2} is serialized in 7 bytes with MessagePack, while BSON uses 19 bytes (a byte-level sketch follows this list).

  3. BSON supports in-place updating

    With BSON, you can modify part of a stored object without re-serializing the whole object. Suppose the map {"a":1, "b":2} is stored in a file and you want to update the value of "a" from 1 to 2000.

    With MessagePack, 1 uses only 1 byte but 2000 uses 3 bytes. So "b" must be moved backward by 2 bytes, even though "b" itself is not modified.

    With BSON, both 1 and 2000 use 5 bytes (a type byte plus a fixed-width 4-byte int32). Because of this fixed width, you don't have to move "b".

  4. MessagePack has RPC

    MessagePack, Protocol Buffers, Thrift and Avro support RPC. But BSON doesn't.

These differences imply that MessagePack was originally designed for network communication, while BSON was designed for storage.
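
To make the size and in-place-update points concrete, here is a small sketch. It assumes the node msgpack module used in the benchmark of answer 3 below (which exposes pack()/unpack()); the byte layouts in the comments follow the MessagePack and BSON specifications:

// sketch (node): byte sizes of {"a":1, "b":2}
var msg = require('msgpack');

var buf = msg.pack({ a: 1, b: 2 });
console.log(buf.length);            // 7
// MessagePack layout:
//   82       fixmap with 2 entries
//   a1 61    fixstr "a"
//   01       positive fixint 1
//   a1 62    fixstr "b"
//   02       positive fixint 2
//
// The same document as BSON takes 19 bytes:
//   13 00 00 00             int32 total document length (19)
//   10 61 00 01 00 00 00    int32 element "a" = 1
//   10 62 00 02 00 00 00    int32 element "b" = 2
//   00                      terminator
// In BSON an int32 always costs 5 bytes (type byte + 4-byte value),
// which is the fixed width that makes in-place updates cheap.

console.log(msg.pack(2000).length); // 3 (cd 07 d0, uint16 format)

// Transparent JSON round trip: unpack(pack(x)) yields the same
// JSON-compatible object back.
console.log(msg.unpack(buf));       // { a: 1, b: 2 }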

Implementation and API design

  1. MessagePack has type-checking APIs (Java, C++ and D)

    MessagePack supports static typing.

    Dynamic typing, as used with JSON or BSON, is useful for dynamic languages like Ruby, Python or JavaScript, but troublesome for static languages: you must write boring type-checking code.

    MessagePack provides a type-checking API that converts dynamically-typed objects into statically-typed objects. Here is a simple example (C++):

    #include <msgpack.hpp>
    #include <string>
    #include <vector>

    class myclass {
    private:
        std::string str;
        std::vector<int> vec;
    public:
        myclass() {}
        myclass(const std::string& s, const std::vector<int>& v) : str(s), vec(v) {}
        // This macro enables this class to be serialized/deserialized
        MSGPACK_DEFINE(str, vec);
    };

    // note: this uses the classic msgpack-c 0.5-era API
    int main(void) {
        // serialize (the member values here are arbitrary)
        myclass m1("example", std::vector<int>(3, 1));

        msgpack::sbuffer buffer;
        msgpack::pack(&buffer, m1);

        // deserialize
        msgpack::unpacked result;
        msgpack::unpack(&result, buffer.data(), buffer.size());

        // you get a dynamically-typed object
        msgpack::object obj = result.get();

        // convert it to a statically-typed object
        myclass m2 = obj.as<myclass>();
        return 0;
    }
  2. MessagePack has IDL

    Related to the type-checking API, MessagePack also supports an IDL. (The specification is available from: http://wiki.msgpack.org/display/MSGPACK/Design+of+IDL)

    Protocol Buffers and Thrift require an IDL (they don't support dynamic typing) and provide more mature IDL implementations.

  3. MessagePack has streaming API (Ruby, Python, Java, C++, ...)

    MessagePack supports streaming deserializers. This feature is useful for network communication. Here is an example (Ruby):

    require 'msgpack'

    # write objects to stdout
    $stdout.write [1,2,3].to_msgpack
    $stdout.write [1,2,3].to_msgpack

    # read objects from stdin using streaming deserializer
    unpacker = MessagePack::Unpacker.new($stdin)
    # use iterator
    unpacker.each {|obj|
      p obj
    }


Answer 2:

I know that this question is a bit dated at this point... I think it's very important to mention that it depends on what your client/server environment looks like.

If you are passing bytes along multiple times without inspecting them, such as with a message queue system or when streaming log entries to disk, then you may well prefer a binary encoding to emphasize the compact size. Otherwise it's a case-by-case issue that depends on the environment.

Some environments can serialize and deserialize to/from msgpack/protobufs very quickly, others not so much. In general, the lower-level the language/environment, the better binary serialization will work. In higher-level languages (node.js, .Net, JVM) you will often see that JSON serialization is actually faster. The question then becomes: is your network overhead more or less constrained than your memory/CPU?

With regards to msgpack vs. BSON vs. Protocol Buffers... msgpack produces the fewest bytes of the group, with Protocol Buffers being about the same. BSON defines broader native types than the other two, and may be a better match to your object model, but this makes it more verbose. Protocol Buffers have the advantage of being designed to stream, which makes them a more natural format for binary transfer/storage.

Personally, I would lean towards the transparency that JSON offers directly, unless there is a clear need for lighter traffic. Over HTTP with gzipped data, the difference in network overhead between the formats is even less of an issue (a quick sketch follows).
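
A quick sketch of that gzip effect (zlib is part of node core; zlib.gzipSync needs a newer node version than the one benchmarked in the next answer):

// sketch: gzip narrows the wire-size gap between text and binary formats
var zlib = require('zlib');

// a redundant JSON payload; the values are arbitrary
var json = JSON.stringify({ a: 1, b: 2, text: new Array(200).join('hello ') });
var gz = zlib.gzipSync(json);

console.log(json.length + ' -> ' + gz.length); // repetitive text compresses heavily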



Answer 3:

A quick test shows that minified JSON is deserialized faster than binary MessagePack. In the tests, Article.json is 550 KB of minified JSON and Article.mpack is the 420 KB MessagePack version of it. This may be an implementation issue, of course.

MessagePack:

//test_mp.js
var msg = require('msgpack');
var fs = require('fs');

var article = fs.readFileSync('Article.mpack');

for (var i = 0; i < 10000; i++) {
    msg.unpack(article);    
}

JSON:

// test_json.js
var fs = require('fs');

var article = fs.readFileSync('Article.json', 'utf-8');

for (var i = 0; i < 10000; i++) {
    JSON.parse(article);
}

So the times are:

Anarki:Downloads oleksii$ time node test_mp.js 

real    2m45.042s
user    2m44.662s
sys     0m2.034s

Anarki:Downloads oleksii$ time node test_json.js 

real    2m15.497s
user    2m15.458s
sys     0m0.824s

So space is saved, but is it faster? No.
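
For what it's worth, "time" also measures node startup and module loading. A variant that times the loop in-process, sketched here, would isolate the parse cost itself:

// test_json_timed.js: hypothetical variant of the test above,
// timing only the parse loop instead of the whole process
var fs = require('fs');

var article = fs.readFileSync('Article.json', 'utf-8');

var start = Date.now();
for (var i = 0; i < 10000; i++) {
    JSON.parse(article);
}
console.log('JSON.parse x 10000: ' + (Date.now() - start) + ' ms');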

Tested versions:

Anarki:Downloads oleksii$ node --version
v0.8.12
Anarki:Downloads oleksii$ npm list msgpack
/Users/oleksii
└── msgpack@0.1.7