AvroCoder.isDeterministic returns false.
Why isn't the AvroCoder deterministic? Wouldn't Avro records always be encoded into the same byte stream?
Since AvroCoder isn't deterministic, an Avro record can't be used as a key for a group-by operation. What's the best way to turn an Avro record into a key? Should we just use the JSON representation of the Avro record?
Based on the Avro specification, it looks like only arrays and maps have a non-deterministic binary encoding.
Maps look like they are non-deterministically encoded for two reasons:
- The order of the elements isn't specified.
- A block can be encoded in two different ways: with just the number of elements, or with a negative element count followed by the number of bytes in the block.
Arrays look like they are non-deterministically encoded for one reason:
- A block can be encoded in two different ways: with just the number of elements, or with a negative element count followed by the number of bytes in the block.
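To make the block issue concrete, here is a minimal sketch that hand-encodes the same array of ints in both block forms described in the Avro specification. It does not use the Avro library; the class and method names (`AvroBlockEncodingSketch`, `encodeWithCount`, `encodeWithByteSize`) are made up for illustration, and only the zig-zag varint rule and the block layout come from the spec.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical illustration class, not part of Avro or the Dataflow SDK.
class AvroBlockEncodingSketch {

    // Avro writes int and long values as zig-zag varints.
    static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);            // zig-zag: small magnitudes get small codes
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80)); // 7 payload bits plus a continuation bit
            z >>>= 7;
        }
        out.write((int) z);
    }

    // Encodes the items only, without any block header.
    static byte[] encodeItems(int[] items) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int item : items) {
            writeLong(out, item);
        }
        return out.toByteArray();
    }

    // Form 1: positive item count, the items, then a zero count as end marker.
    static byte[] encodeWithCount(int[] items) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (items.length > 0) {
            byte[] body = encodeItems(items);
            writeLong(out, items.length);
            out.write(body, 0, body.length);
        }
        writeLong(out, 0);
        return out.toByteArray();
    }

    // Form 2: negative item count, the block size in bytes, the items, then the end marker.
    static byte[] encodeWithByteSize(int[] items) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (items.length > 0) {
            byte[] body = encodeItems(items);
            writeLong(out, -items.length);
            writeLong(out, body.length);
            out.write(body, 0, body.length);
        }
        writeLong(out, 0);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        int[] items = {1, 2, 3};
        System.out.println(Arrays.toString(encodeWithCount(items)));    // [6, 2, 4, 6, 0]
        System.out.println(Arrays.toString(encodeWithByteSize(items))); // [5, 6, 2, 4, 6, 0]
    }
}
```

Both byte sequences decode to the same array, so two writers that pick different forms (or split the items into blocks differently) would produce different keys for equal values.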
So for any schema without an array or a map, I think the binary encoding should be deterministic. If that's right, we could create a deterministic coder just by subclassing AvroCoder and overriding AvroCoder.isDeterministic to return true.
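Before trusting such a subclass, it would make sense to check that the schema really contains no array or map anywhere, including inside nested records and unions. The sketch below shows that walk on a tiny stand-in schema model; `Kind`, `Node`, and `containsArrayOrMap` are hypothetical names invented here so the example runs without the Avro library. With real Avro, I'd expect the same recursion to go over `org.apache.avro.Schema` using `getType()`, `getFields()`, and `getTypes()`, but treat that mapping as an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class with a stand-in schema model, not the Avro API.
class SchemaDeterminismSketch {

    enum Kind { PRIMITIVE, RECORD, UNION, ARRAY, MAP }

    // A schema node: records and unions list their field or branch schemas as children.
    static final class Node {
        final Kind kind;
        final List<Node> children;

        Node(Kind kind, Node... children) {
            this.kind = kind;
            this.children = Arrays.asList(children);
        }
    }

    // Returns true if an array or map appears anywhere in the schema tree.
    static boolean containsArrayOrMap(Node schema) {
        Set<Node> visited = Collections.newSetFromMap(new IdentityHashMap<Node, Boolean>());
        return containsArrayOrMap(schema, visited);
    }

    private static boolean containsArrayOrMap(Node node, Set<Node> visited) {
        if (!visited.add(node)) {
            return false; // sub-schema already checked; avoids repeated work on shared nodes
        }
        if (node.kind == Kind.ARRAY || node.kind == Kind.MAP) {
            return true;
        }
        for (Node child : node.children) {
            if (containsArrayOrMap(child, visited)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Node flat = new Node(Kind.RECORD, new Node(Kind.PRIMITIVE), new Node(Kind.PRIMITIVE));
        Node nested = new Node(Kind.RECORD,
                new Node(Kind.PRIMITIVE),
                new Node(Kind.UNION, new Node(Kind.PRIMITIVE), new Node(Kind.MAP, new Node(Kind.PRIMITIVE))));
        System.out.println(containsArrayOrMap(flat));   // false
        System.out.println(containsArrayOrMap(nested)); // true
    }
}
```

A subclass that returns true from isDeterministic could refuse to be constructed when this check finds an array or map, rather than silently producing unstable keys.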
AvroDeterministicCoder is my first attempt at creating such a coder.
AvroCoder can inspect the schema and type being coded and decide when it is deterministic. It was added in GitHub commit #a806df.
It includes support for deterministically encoding arrays and maps when the underlying collection is deterministically ordered.
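The ordering requirement is easy to see with plain Java collections. In the sketch below, `encodeInIterationOrder` is a hypothetical stand-in for an encoder that writes map entries in iteration order (it is not Avro's wire format and not how AvroCoder is implemented). Two equal maps built in a different insertion order produce different bytes, while sorted copies produce the same bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class, not part of Avro or the Dataflow SDK.
class MapOrderSketch {

    // Stand-in encoder: writes "key=value;" for each entry in iteration order.
    static byte[] encodeInIterationOrder(Map<String, Integer> map) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : map.entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append(';');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // True when both maps encode to identical bytes.
    static boolean sameBytes(Map<String, Integer> a, Map<String, Integer> b) {
        return Arrays.equals(encodeInIterationOrder(a), encodeInIterationOrder(b));
    }

    public static void main(String[] args) {
        Map<String, Integer> a = new LinkedHashMap<>();
        a.put("x", 1);
        a.put("y", 2);
        Map<String, Integer> b = new LinkedHashMap<>();
        b.put("y", 2);
        b.put("x", 1);

        System.out.println(a.equals(b));                               // true: same contents
        System.out.println(sameBytes(a, b));                           // false: iteration order differs
        System.out.println(sameBytes(new TreeMap<>(a), new TreeMap<>(b))); // true: sorted by key
    }
}
```

So if a record used as a key holds a map, keeping it in a sorted map such as `TreeMap` gives equal values a single iteration order to encode.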