Why should a Writable datatype be Mutable? What are the advantages of using Text (vs String) as a datatype for Key/Value in Map, Combine, Shuffle or Reduce processes?
Thanks & Regards, Raja
Why should a Writable datatype be Mutable? What are the advantages of using Text (vs String) as a datatype for Key/Value in Map, Combine, Shuffle or Reduce processes?
Thanks & Regards, Raja
You can't choose, these datatypes must be mutable.
The reason is the serialization mechanism. Let's look at the code:
So we are reusing the same instance of key/value pairs all over again. Why? I don't know about the design decisions back then, but I assume it was to reduce the amount of garbage objects. Note that Hadoop is quite old and back then the garbage collectors were not as efficient as they are today, however even today it makes a big difference in runtime if you would map billions of objects and directly throw them away as garbage.
The real reason why you can't make the
Writable
type really immutable is that you can't declare fields asfinal
. Let's make a simple example with theIntWritable
:If you would make it immutable it would certainly not work with the serialization process anymore, because you would need to define
value
final. This can't work, because the keys and values are instantiated at runtime via reflection. This requires a default constructor and thus theInputFormat
can't guess the parameter that would be necessary to fill the final data fields. So the whole concept of reusing instances obviously contradicts the concept of immutability.However, you should ask yourself what kind of benefit an immutable key/value should have in Map/Reduce. In Joshua Bloch's- Effective Java, Item 15 he states that immutable classes are easier to design, implement and use. And he is right, because Hadoop's reducer is the worst possible example for mutability:
Every value in the iterable refers to the
same
shared object. Thus many people are confused if they buffer their values into a normal collection and ask themselves why it always retains the same values.In the end it boils down to the trade-off of performance (cpu and memory- imagine many billions of value objects for a single key must reside in RAM) vs. simplicity.
Simply put, the reason
Writable
cannot beImmutable
is thereadFields(DataInput)
method inWritable
. The way theHadoop
deserializes instances in to create an instance using the default (no argument) constructor and callingreadFields
to parse the values. Since the values are not being assigned in the construction, the object must be mutable.