Efficient Go serialization of struct to disk

Published 2020-01-29 07:38

Question:

I've been tasked with porting C++ code to Go, and I'm quite new to the Go APIs. I am using gob to encode hundreds of key/value entries to disk pages, but the gob encoding has too much bloat that I don't need.

package main

import (
    "bytes"
    "encoding/gob"
    "fmt"
)
type Entry struct {
    Key string
    Val string
}

func main() {
    var buf bytes.Buffer
    enc := gob.NewEncoder(&buf)
    e := Entry{"k1", "v1"}
    enc.Encode(e)
    fmt.Println(buf.Bytes())
}

This produces a lot of bloat that I don't need:

[35 255 129 3 1 1 5 69 110 116 114 121 1 255 130 0 1 2 1 3 75 101 121 1 12 0 1 3 86 97 108 1 12 0 0 0 11 255 130 1 2 107 49 1 2 118 49 0] 

I want to serialize each string's length followed by its raw bytes, like:

[0 0 0 2 107 49 0 0 0 2 118 49]

I am saving millions of entries, so the additional bloat in the encoding increases the file size roughly tenfold.

How can I serialize it to the latter without manual coding?

Answer 1:

Use protobuf to efficiently encode your data.

https://github.com/golang/protobuf

Your main would look like this:

package main

import (
    "fmt"
    "log"

    "github.com/golang/protobuf/proto"
)

func main() {
    e := &Entry{
        Key: proto.String("k1"),
        Val: proto.String("v1"),
    }
    data, err := proto.Marshal(e)
    if err != nil {
        log.Fatal("marshaling error: ", err)
    }
    fmt.Println(data)
}

You create a file, example.proto, like this (proto2 syntax):

package main;

message Entry {
    required string Key = 1;
    required string Val = 2;
}

You generate the go code from the proto file by running:

$ protoc --go_out=. *.proto

You can examine the generated file, if you wish.

You can run it and see the resulting output:

$ go run *.go
[10 2 107 49 18 2 118 49]


Answer 2:

If you zip a file named a.txt containing the text "hello" (5 bytes), the resulting zip will be around 115 bytes. Does this mean the zip format is inefficient for compressing text files? Certainly not; there is an overhead. If the file contains "hello" a hundred times (500 bytes), zipping it results in a file of about 120 bytes. 1x"hello" => 115 bytes, 100x"hello" => 120 bytes: we added 495 bytes, yet the compressed size grew by only 5 bytes.

Something similar is happening with the encoding/gob package:

The implementation compiles a custom codec for each data type in the stream and is most efficient when a single Encoder is used to transmit a stream of values, amortizing the cost of compilation.

When you "first" serialize a value of a type, the definition of the type also has to be included / transmitted, so the decoder can properly interpret and decode the stream:

A stream of gobs is self-describing. Each data item in the stream is preceded by a specification of its type, expressed in terms of a small set of predefined types.

Let's return to your example:

var buf bytes.Buffer
enc := gob.NewEncoder(&buf)
e := Entry{"k1", "v1"}
enc.Encode(e)
fmt.Println(buf.Len())

It prints:

48

Now let's encode a few more of the same type:

enc.Encode(e)
fmt.Println(buf.Len())
enc.Encode(e)
fmt.Println(buf.Len())

Now the output is:

60
72

Try it on the Go Playground.

Analyzing the results:

Additional values of the same Entry type cost only 12 bytes each, while the first is 48 bytes because the type definition is also included (36 bytes), but that is a one-time overhead.

So basically you transmit two strings, "k1" and "v1", which are 4 bytes of data, and the lengths of the strings also have to be included; at 4 bytes each (the size of int on 32-bit architectures) that gives you the 12 bytes, which is the "minimum". (Yes, you could use a smaller type for the length, but that would have its limitations. A variable-length encoding would be a better choice for small numbers; see the encoding/binary package.)

All in all, encoding/gob does a pretty good job for your needs. Don't get fooled by initial impressions.

If this 12 bytes for one Entry is too "much" for you, you can always wrap the stream into a compress/flate or compress/gzip writer to further reduce the size (in exchange for slower encoding/decoding and slightly higher memory requirement for the process).

Demonstration:

Let's test the 3 solutions:

  • Using a "naked" output (no compression)
  • Using compress/flate to compress the output of encoding/gob
  • Using compress/gzip to compress the output of encoding/gob

We will write a thousand entries, changing the keys and values of each: "k000", "v000", "k001", "v001", etc. This means the uncompressed size of an Entry is 4 + 4 + 4 + 4 = 16 bytes (2x4 bytes of text, 2x4-byte lengths).

The code looks like this:

names := []string{"Naked", "flate", "gzip"}
for _, name := range names {
    buf := &bytes.Buffer{}

    var out io.Writer
    switch name {
    case "Naked":
        out = buf
    case "flate":
        out, _ = flate.NewWriter(buf, flate.DefaultCompression)
    case "gzip":
        out = gzip.NewWriter(buf)
    }

    enc := gob.NewEncoder(out)
    e := Entry{}
    for i := 0; i < 1000; i++ {
        e.Key = fmt.Sprintf("k%03d", i)
        e.Val = fmt.Sprintf("v%03d", i)
        enc.Encode(e)
    }

    if c, ok := out.(io.Closer); ok {
        c.Close()
    }
    fmt.Printf("[%5s] Length: %5d, average: %5.2f / Entry\n",
        name, buf.Len(), float64(buf.Len())/1000)
}

Output:

[Naked] Length: 16036, average: 16.04 / Entry
[flate] Length:  4123, average:  4.12 / Entry
[ gzip] Length:  4141, average:  4.14 / Entry

Try it on the Go Playground.

As you can see: the "naked" output is 16.04 bytes/Entry, just a little over the calculated size (due to the tiny one-time overhead discussed above).

When you use flate or gzip to compress the output, you can reduce it to about 4.13 bytes/Entry, which is only ~26% of the uncompressed size; I'm sure that satisfies you. (Note that with "real-life" data the compression ratio would probably be a lot higher, as the keys and values I used in the test are very similar and thus compress extremely well; still, the ratio should be around 50% with real-life data.)



Answer 3:

The "manual coding" you're so afraid of is trivially done in Go using the standard encoding/binary package.

You appear to store string lengths as 32-bit integers in big-endian byte order, so you can do exactly that in Go:

package main

import (
    "bytes"
    "encoding/binary"
    "fmt"
    "io"
)

// encode writes s to w as a 4-byte big-endian length
// followed by the raw bytes of the string, and reports
// the total number of bytes written.
func encode(w io.Writer, s string) (n int, err error) {
    var hdr [4]byte
    binary.BigEndian.PutUint32(hdr[:], uint32(len(s)))
    n, err = w.Write(hdr[:])
    if err != nil {
        return
    }
    n2, err := io.WriteString(w, s)
    n += n2
    return
}

func main() {
    var buf bytes.Buffer

    for _, s := range []string{
        "ab",
        "cd",
        "de",
    } {
        _, err := encode(&buf, s)
        if err != nil {
            panic(err)
        }
    }
    fmt.Printf("%v\n", buf.Bytes())
}

Playground link.

Note that in this example I'm writing to a byte buffer, but that's for demonstration purposes only; since encode() writes to an io.Writer, you can pass it an open file, a network socket, or anything else implementing that interface.