I came across a memory management problem while using Spark's caching mechanism. I am currently utilizing Encoder
s with Kryo and was wondering if switching to beans would help me reduce the size of my cached dataset.
Basically, what are the pros and cons of using beans over Kryo serialization when working with Encoder
s? Are there any performance improvements? Is there a way to compress a cached Dataset
apart from caching with SER option?
For the record, I have found a similar topic that tackles the comparison between the two. However, it doesn't go into the details of this comparison.
Whenever you can. Unlike generic binary Encoders
, which use general purpose binary serialization and store whole objects as opaque blobs, Encoders.bean[T]
leverages the structure of an object, to provide class specific storage layout.
This difference becomes obvious when you compare the schemas created using Encoders.bean
and Encoders.kryo
.
Why does it matter?
- You get efficient field access using SQL API without any need for deserialization and full support for all
Dataset
transformations.
- With transparent field serialization you can fully utilize columnar storage, including built-in compression.
So when to use kryo
Encoder
? In general when nothing else works. Personally I would avoid it completely for data serialization. The only really useful application I can think of is serialization of aggregation buffer (check for example How to find mean of grouped Vector columns in Spark SQL?).