Why isn't the AvroCoder deterministic?

Posted 2019-08-06 14:08

AvroCoder.isDeterministic returns false.

Why isn't the AvroCoder deterministic? Wouldn't Avro records always be encoded into the same byte stream?

Since AvroCoder isn't deterministic, an Avro record can't be used as a key for a group-by operation. What's the best way to turn an Avro record into a key? Should we just use the JSON representation of the Avro record?

2 answers
甜甜的少女心
#2 · 2019-08-06 14:30

Based on the Avro specification, it looks like only arrays and maps have a non-deterministic binary encoding.

Maps look like they are non-deterministically encoded for two reasons:

  • The order of the elements isn't specified.
  • A block can be encoded in two different ways: by specifying the number of elements, or by specifying the number of elements together with the number of bytes in the block.

Arrays look like they are non-deterministically encoded for one reason:

  • A block can be encoded in two different ways: by specifying the number of elements, or by specifying the number of elements together with the number of bytes in the block.
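
To make the two block forms concrete, here is a minimal, self-contained sketch that hand-encodes an Avro array of longs both ways, following the block layout described in the Avro specification (a positive count followed by the items, or a negative count followed by the block's byte size and then the items). The class and method names are made up for illustration and this is not how the Avro library itself is structured; it only shows that the same logical array can yield two different byte streams.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Illustrative only: hand-rolled encoding, not the Avro library.
class AvroArrayBlockDemo {

    // Avro long: zigzag-encode, then write as a base-128 varint (low group first).
    static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
    }

    // Form 1: positive item count, the items, then a 0 count to end the array.
    static byte[] encodeWithCount(long[] items) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (items.length > 0) {
            writeLong(out, items.length);
            for (long item : items) {
                writeLong(out, item);
            }
        }
        writeLong(out, 0);
        return out.toByteArray();
    }

    // Form 2: negative item count, the block size in bytes, the items, then 0.
    static byte[] encodeWithByteSize(long[] items) {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        for (long item : items) {
            writeLong(body, item);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (items.length > 0) {
            writeLong(out, -items.length);
            writeLong(out, body.size());
            out.write(body.toByteArray(), 0, body.size());
        }
        writeLong(out, 0);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        long[] items = {1, 2, 3};
        // prints [6, 2, 4, 6, 0]
        System.out.println(Arrays.toString(encodeWithCount(items)));
        // prints [5, 6, 2, 4, 6, 0]
        System.out.println(Arrays.toString(encodeWithByteSize(items)));
    }
}
```

Both outputs decode to the same array, so a byte-wise comparison of encoded keys would treat equal values as different unless the writer always picks one form.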

So for any schema without an array or a map, I think the binary encoding should be deterministic. In that case we could create a deterministic coder just by subclassing AvroCoder and overriding AvroCoder.isDeterministic to return true.
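
The "no array or map anywhere in the schema" rule is easy to check recursively. The sketch below uses a tiny hypothetical `Node` type as a stand-in for a schema tree (it is not the Avro `Schema` class, and the `Kind` values are only a small subset of Avro's types), so treat it as an illustration of the check rather than something to plug into a coder directly.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: Node and Kind are hypothetical stand-ins for a schema tree.
class SchemaCheckDemo {

    enum Kind { LONG, STRING, RECORD, ARRAY, MAP }

    static final class Node {
        final Kind kind;
        final List<Node> children; // fields of a record, or element/value type

        Node(Kind kind, Node... children) {
            this.kind = kind;
            this.children = Arrays.asList(children);
        }
    }

    // Returns true if this node or any nested node is an array or a map.
    static boolean containsArrayOrMap(Node node) {
        if (node.kind == Kind.ARRAY || node.kind == Kind.MAP) {
            return true;
        }
        for (Node child : node.children) {
            if (containsArrayOrMap(child)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Node flat = new Node(Kind.RECORD, new Node(Kind.LONG), new Node(Kind.STRING));
        Node nested = new Node(Kind.RECORD,
                new Node(Kind.RECORD, new Node(Kind.ARRAY, new Node(Kind.LONG))));
        System.out.println(containsArrayOrMap(flat));   // prints false
        System.out.println(containsArrayOrMap(nested)); // prints true
    }
}
```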

AvroDeterministicCoder is my first attempt at creating such a coder.

不美不萌又怎样
#3 · 2019-08-06 14:36

AvroCoder can inspect the schema and type being coded and decide when it is deterministic. It was added in GitHub commit #a806df.

It includes support for deterministically encoding arrays and maps when the underlying collection is deterministically ordered.
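
Why the ordering of the underlying collection matters can be seen with a small hand-rolled encoder. The sketch below writes an Avro-style map of string keys to long values as a single counted block, in whatever order the Java map iterates. It is an illustration with made-up names, not AvroCoder's actual implementation: two equal maps built in different insertion orders produce different bytes, while copying them into a sorted `TreeMap` first makes the bytes match.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: hand-rolled encoding, not the Avro library or AvroCoder.
class AvroMapOrderDemo {

    // Avro long: zigzag-encode, then write as a base-128 varint (low group first).
    static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
    }

    // Avro string: byte length as a long, then the UTF-8 bytes.
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, bytes.length);
        out.write(bytes, 0, bytes.length);
    }

    // One counted block of key/value pairs in the map's iteration order, then 0.
    static byte[] encodeMap(Map<String, Long> map) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (!map.isEmpty()) {
            writeLong(out, map.size());
            for (Map.Entry<String, Long> e : map.entrySet()) {
                writeString(out, e.getKey());
                writeLong(out, e.getValue());
            }
        }
        writeLong(out, 0);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Map<String, Long> ab = new LinkedHashMap<>();
        ab.put("a", 1L);
        ab.put("b", 2L);
        Map<String, Long> ba = new LinkedHashMap<>();
        ba.put("b", 2L);
        ba.put("a", 1L);

        System.out.println(ab.equals(ba)); // prints true
        // Same logical map, different iteration order: prints false
        System.out.println(Arrays.equals(encodeMap(ab), encodeMap(ba)));
        // Sorted by key before encoding: prints true
        System.out.println(Arrays.equals(
                encodeMap(new TreeMap<>(ab)), encodeMap(new TreeMap<>(ba))));
    }
}
```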
