Why should a Writable datatype be Mutable? What are the advantages of using Text (vs String) as a datatype for Key/Value in Map, Combine, Shuffle or Reduce processes?
Thanks & Regards, Raja
Why should a Writable datatype be Mutable? What are the advantages of using Text (vs String) as a datatype for Key/Value in Map, Combine, Shuffle or Reduce processes?
Thanks & Regards, Raja
You can't choose, these datatypes must be mutable.
The reason is the serialization mechanism. Let's look at the code:
// version 1.x MapRunner#run()
K1 key = input.createKey();
V1 value = input.createValue();
while (input.next(key, value)) {
// map pair to output
mapper.map(key, value, output, reporter);
...
So we are reusing the same instance of key/value pairs all over again. Why? I don't know about the design decisions back then, but I assume it was to reduce the amount of garbage objects. Note that Hadoop is quite old and back then the garbage collectors were not as efficient as they are today, however even today it makes a big difference in runtime if you would map billions of objects and directly throw them away as garbage.
The real reason why you can't make the Writable
type really immutable is that you can't declare fields as final
. Let's make a simple example with the IntWritable
:
public class IntWritable implements WritableComparable {
private int value;
public IntWritable() {}
public IntWritable(int value) { set(value); }
...
If you would make it immutable it would certainly not work with the serialization process anymore, because you would need to define value
final. This can't work, because the keys and values are instantiated at runtime via reflection. This requires a default constructor and thus the InputFormat
can't guess the parameter that would be necessary to fill the final data fields. So the whole concept of reusing instances obviously contradicts the concept of immutability.
However, you should ask yourself what kind of benefit an immutable key/value should have in Map/Reduce. In Joshua Bloch's- Effective Java, Item 15 he states that immutable classes are easier to design, implement and use. And he is right, because Hadoop's reducer is the worst possible example for mutability:
void reduce(IntWritable key, Iterable<Text> values, Context context) ...
Every value in the iterable refers to the same
shared object. Thus many people are confused if they buffer their values into a normal collection and ask themselves why it always retains the same values.
In the end it boils down to the trade-off of performance (cpu and memory- imagine many billions of value objects for a single key must reside in RAM) vs. simplicity.
Simply put, the reason Writable
cannot be Immutable
is the readFields(DataInput)
method in Writable
. The way the Hadoop
deserializes instances in to create an instance using the default (no argument) constructor and calling readFields
to parse the values. Since the values are not being assigned in the construction, the object must be mutable.