Why should a Writable datatype be Mutable?

Published 2019-06-27 17:38

Why should a Writable datatype be Mutable? What are the advantages of using Text (vs String) as a datatype for Key/Value in Map, Combine, Shuffle or Reduce processes?

Thanks & Regards, Raja

2 Answers

The star · 2019-06-27 17:53

You can't choose; these datatypes must be mutable.

The reason is the serialization mechanism. Let's look at the code:

// version 1.x MapRunner#run()
K1 key = input.createKey();
V1 value = input.createValue();

while (input.next(key, value)) {
   // map pair to output
   mapper.map(key, value, output, reporter);
   ...
}

So the same key/value instances are reused over and over again. Why? I don't know the design decisions back then, but I assume it was to reduce the amount of garbage objects. Note that Hadoop is quite old, and back then the garbage collectors were not as efficient as they are today; however, even today it makes a big difference in runtime if you map billions of objects and immediately discard them as garbage.
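As a standalone sketch of that reuse pattern (plain Java, no Hadoop classes; the Record holder below is a hypothetical stand-in for a Writable), one object is allocated up front and overwritten for every input record instead of allocating a fresh object per record:

```java
// Hypothetical mutable holder, playing the role of a reusable Writable.
class Record {
    private int payload;
    void set(int p) { payload = p; }
    int get() { return payload; }
}

public class ReusePattern {
    public static void main(String[] args) {
        Record record = new Record();   // single allocation, like createKey()/createValue()
        int sum = 0;
        for (int i = 1; i <= 5; i++) {
            record.set(i);              // "deserialize" into the same instance
            sum += record.get();        // process it, then let the loop overwrite it
        }
        System.out.println(sum);        // prints 15
    }
}
```

Five records are processed, but only one value object is ever allocated.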

The real reason why you can't make a Writable type truly immutable is that you can't declare its fields final. Let's look at a simple example with IntWritable:

public class IntWritable implements WritableComparable {
  private int value;

  public IntWritable() {}

  public IntWritable(int value) { set(value); }

  public void set(int value) { this.value = value; }
...

If you made it immutable, it would no longer work with the serialization process, because you would need to declare value final. That can't work: keys and values are instantiated at runtime via reflection, which requires a default constructor, so the InputFormat has no way to supply the arguments needed to fill final fields. The whole concept of reusing instances therefore contradicts the concept of immutability.
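A minimal sketch of why reflective instantiation forces a no-arg constructor and non-final fields (the Box class is a hypothetical stand-in, not Hadoop code; the reflective lookup mirrors what a utility like ReflectionUtils boils down to):

```java
import java.lang.reflect.Constructor;

// Stand-in for a Writable: the field must stay non-final so it can be
// assigned after the framework has constructed the empty instance.
class Box {
    private int value;          // could not be final under this scheme
    public Box() {}             // the no-arg constructor reflection needs
    public void set(int v) { value = v; }
    public int get() { return value; }
}

public class ReflectDemo {
    public static void main(String[] args) throws Exception {
        // Look up and invoke the default constructor reflectively.
        Constructor<Box> ctor = Box.class.getDeclaredConstructor();
        Box b = ctor.newInstance(); // framework has no values to pass here
        b.set(42);                  // values are filled in only afterwards
        System.out.println(b.get());
    }
}
```

With a final field there would be no legal place to perform that post-construction assignment.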

However, you should ask yourself what benefit an immutable key/value would actually have in Map/Reduce. In Joshua Bloch's Effective Java, Item 15, he states that immutable classes are easier to design, implement and use. He is right, and Hadoop's reducer is the worst possible example of what mutability can do to users:

void reduce(IntWritable key, Iterable<Text> values, Context context) ...

Every value in the iterable refers to the same shared object. This confuses many people who buffer the values into a normal collection and then wonder why it always contains the same value.
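A standalone demonstration of that pitfall, using a hypothetical MutableInt in place of a real Writable (no Hadoop classes involved):

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for a reused Writable value in a reducer's iterable.
class MutableInt {
    private int value;
    public void set(int v) { value = v; }
    public int get() { return value; }
}

public class ReuseDemo {
    public static void main(String[] args) {
        MutableInt shared = new MutableInt();  // the one instance the "framework" reuses
        int[] records = {1, 2, 3};

        List<MutableInt> buffered = new ArrayList<>();
        List<Integer> copied = new ArrayList<>();
        for (int r : records) {
            shared.set(r);            // framework overwrites the same object
            buffered.add(shared);     // pitfall: every element is the same reference
            copied.add(shared.get()); // fix: copy the value out before buffering
        }
        // All buffered entries now report the last record's value.
        System.out.println(buffered.get(0).get()); // prints 3
        System.out.println(copied);                // prints [1, 2, 3]
    }
}
```

In real Hadoop code the usual fix is the same idea: take a defensive copy of each value (e.g. via a copy constructor) before adding it to a collection.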

In the end it boils down to a trade-off between performance (CPU and memory: imagine many billions of value objects for a single key having to reside in RAM) and simplicity.

劫难 · 2019-06-27 18:09

Simply put, the reason Writable cannot be immutable is the readFields(DataInput) method. Hadoop deserializes an instance by creating it with the default (no-argument) constructor and then calling readFields to parse the values. Since the values are not assigned during construction, the object must be mutable.
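A self-contained sketch of that contract (the FakeIntWritable below mirrors the write/readFields shape of Hadoop's Writable interface but is not the real org.apache.hadoop.io API): the instance is constructed empty, and its fields are only assigned inside readFields.

```java
import java.io.*;

// Construct empty first, mutate later: the essence of the Writable contract.
class FakeIntWritable {
    private int value;
    public FakeIntWritable() {}                   // required no-arg constructor
    public void set(int v) { value = v; }
    public int get() { return value; }
    public void write(DataOutput out) throws IOException { out.writeInt(value); }
    public void readFields(DataInput in) throws IOException { value = in.readInt(); }
}

public class RoundTrip {
    public static void main(String[] args) throws IOException {
        // Serialize a value to bytes.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        FakeIntWritable out = new FakeIntWritable();
        out.set(7);
        out.write(new DataOutputStream(bytes));

        // Deserialize: empty instance first, fields assigned in readFields.
        FakeIntWritable in = new FakeIntWritable();
        in.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(in.get());             // prints 7
    }
}
```

The same empty instance could be handed to readFields again for the next record, which is exactly the reuse the first answer describes.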
