So I have Riak running on 2 EC2 servers, using Python to run JavaScript MapReduce. The servers have been clustered, and the setup is mainly a proof of concept.
There are 50 keys in the bucket, and all the map/reduce function does is re-format the data. This is only for testing the map/reduce functionality in Riak.
Problem: The output only shows [{u'e': 2, u'undefined': 2, u'w': 2}], which is completely wrong. The logs show that all the keys have been "processed", but only 2 get returned. So my question is: why is this happening, and am I missing something important?
Code:
import riak

client = riak.RiakClient()
query = riak.RiakMapReduce(client).add('raw_hits10')
query.map("""function(v) {
    var data = JSON.parse(v.values[0].data);
    return [[data, 1]];
}""")
query.reduce("""function(vk) {
    var res = {};
    for (var indx in vk) {
        var key_t = vk[indx][0];
        var val_t = vk[indx][1];
        ejsLog('/tmp/map_reduce.log', key_t + "--- " + val_t);
        res[key_t] = 2;
    }
    return [res];
}
""")
for res in query.run():
    print res
The results from printing:
[{u'e': 2, u'undefined': 2, u'w': 2}]
This makes no sense to me.
To avoid loading all data from the preceding phase into memory on the coordinating node before running the reduce phase (which would be problematic for large MapReduce jobs), Riak runs the reduce function multiple times. Each iteration receives a batch of results from the preceding phase together with any output from earlier reduce iterations. The default batch size is 20, but this is configurable. Because the results of one reduce iteration are fed in as input to the next, reduce functions need to be designed to handle this, and some strategies are described here. In your case, the reduce function treats every input as a [key, value] pair, so when the object returned by an earlier iteration comes back in as input, vk[indx][0] is undefined, which is where the 'undefined' key comes from.
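To make the re-reduce behaviour concrete, here is a minimal Python sketch. It is a simplified model of the batching described above, not Riak's actual implementation, and the names run_reduce_in_batches, rereduce_safe and the sample keys are made up for illustration. The reducer accepts both fresh [key, count] pairs and the dict it returned on an earlier call, so the batched result matches a single pass.

```python
def run_reduce_in_batches(reduce_fn, inputs, batch_size=20):
    """Simplified model of a batched reduce (an assumption, not Riak's code):
    each call receives the previous call's output followed by the next batch."""
    carried = []
    for start in range(0, len(inputs), batch_size):
        batch = inputs[start:start + batch_size]
        carried = reduce_fn(carried + batch)
    return carried


def rereduce_safe(values):
    """Handle fresh [key, count] pairs from the map phase as well as
    dict accumulators produced by an earlier call to this same function."""
    totals = {}
    for item in values:
        if isinstance(item, dict):
            # Output of an earlier reduce iteration: merge its counts.
            for key, count in item.items():
                totals[key] = totals.get(key, 0) + count
        else:
            # Fresh map output: a [key, count] pair.
            key, count = item
            totals[key] = totals.get(key, 0) + count
    return [totals]


# 50 hypothetical map outputs spread over 5 distinct keys.
map_output = [["key%d" % (i % 5), 1] for i in range(50)]

batched = run_reduce_in_batches(rereduce_safe, map_output, batch_size=20)
single_pass = rereduce_safe(map_output)
```

The same idea should carry over to a JavaScript reduce function: check whether each input is a pair from the map phase or an object from a previous iteration, and merge accordingly instead of starting from an empty result each time.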
It is also possible to force Riak to only run the reduce phase once for the entire input set by specifying the 'reduce_phase_only_1' parameter, but this is generally not recommended, especially for large jobs.
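As a rough sketch of what that could look like, the snippet below builds a JSON job description in Python. It assumes the option is passed as the reduce phase's static argument ("arg") in a job submitted to Riak's HTTP MapReduce interface; that placement, and the placeholder reduce function, are assumptions to check against the documentation for your Riak version.

```python
import json

# Placeholder reduce function, for illustration only.
reduce_source = "function(values) { return values; }"

reduce_phase = {
    "reduce": {
        "language": "javascript",
        "source": reduce_source,
        # Assumption: the option travels in the phase's static argument.
        "arg": {"reduce_phase_only_1": True},
    }
}

# Bucket name taken from the question; the job has a single reduce phase here.
job = {"inputs": "raw_hits10", "query": [reduce_phase]}
payload = json.dumps(job)
```

With only 50 keys this would sidestep the problem, but a reduce function that tolerates its own output remains the safer design.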