Avoid multiple sums in custom crossfilter reduce functions

Posted 2019-05-22 02:31

This question arises from some difficulties in creating a crossfilter dataset, in particular how to group the different dimensions and compute derived values. The final aim is to have a number of dc.js graphs using the dimensions and groups.

(Fiddle example https://jsfiddle.net/raino01r/0vjtqsjL/)

Question

Before going on with the explanation of the setting, the key question is the following:

How can I create custom add, remove, and init functions to pass to .reduce so that the first two do not sum the same feature multiple times?

Data

Let's say I want to monitor the failure rate of a number of machines (just an example). I do this using different dimensions: month, machine's location, and type of failure.

For example I have the data in the following form:

| month   | room | failureType | failCount | machineCount |
|---------|------|-------------|-----------|--------------|
| 2015-01 |  1   |  A          |  10       |  5           |
| 2015-01 |  1   |  B          |   2       |  5           |
| 2015-01 |  2   |  A          |   0       |  3           |
| 2015-01 |  2   |  B          |   1       |  3           |
| 2015-02 |  .   |  .          |   .       |  .           |

Expected

For the three given dimensions, I should have:

  • month_1_rate = $\frac{10+2+0+1}{5+3}$;
  • room_1_rate = $\frac{10+2}{5}$;
  • type_A_rate = $\frac{10+0}{5+3}$.
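As a quick check of the arithmetic above, here is a minimal plain-JavaScript sketch (no crossfilter involved). It sums failCount over every matching row but counts machineCount only once per (month, room) pair. The `rows` array copies the four 2015-01 rows of the table; the helper name `rate` is hypothetical.

```javascript
// The four 2015-01 rows from the table above.
var rows = [
  { month: '2015-01', room: 1, failureType: 'A', failCount: 10, machineCount: 5 },
  { month: '2015-01', room: 1, failureType: 'B', failCount: 2,  machineCount: 5 },
  { month: '2015-01', room: 2, failureType: 'A', failCount: 0,  machineCount: 3 },
  { month: '2015-01', room: 2, failureType: 'B', failCount: 1,  machineCount: 3 }
];

// Hypothetical helper: failures / machines for the rows accepted by `keep`,
// counting the machines of each (month, room) pair only once.
function rate(rows, keep) {
  var seen = {};
  var fails = 0;
  var machines = 0;
  rows.filter(keep).forEach(function (d) {
    var pair = d.month + '_' + d.room;
    fails += d.failCount;
    if (!seen[pair]) {
      seen[pair] = true;
      machines += d.machineCount;
    }
  });
  return machines === 0 ? 0 : fails / machines;
}

var month1Rate = rate(rows, function (d) { return d.month === '2015-01'; }); // 13 / 8
var room1Rate  = rate(rows, function (d) { return d.room === 1; });          // 12 / 5
var typeARate  = rate(rows, function (d) { return d.failureType === 'A'; }); // 10 / 8
```

The crossfilter reducers need to reproduce these numbers incrementally, as records are added to and removed from each group.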

Idea

Essentially, what counts in this setting is the pair (month, room). I.e. given a month and a room there should be a rate attached to them (then crossfilter should act to take the other filters into account).

Therefore, one way to go could be to store the pairs that have already been used and not sum machineCount for them again, while still updating the failCount value.

Attempt (failing)

My attempt was to create custom reduce functions that do not sum machineCount values that were already taken into account.

However, there is some unexpected behaviour. I'm sure this is not the way to go, so I hope to get some suggestions on this.

// A dimension is one of:
//   ndx = crossfilter(data);
//   ndx.dimension(function(d){return d.month;})
//   ndx.dimension(function(d){return d.room;})
//   ndx.dimension(function(d){return d.failureType;})
// Goal: have a general way to get the group given the dimension:

function get_group(dim){
    return dim.group().reduce(add_rate, remove_rate, initial_rate);
}

// month is given as datetime object
var monthNameFormat = d3.time.format("%Y-%m");
//
function check_done(p, v){
    return p.done.indexOf(v.room+'_'+monthNameFormat(v.month))==-1;
}    

// The three functions needed for the custom `.reduce` block.
function add_rate(p, v){
    var index = check_done(p, v);
    if (index) p.done.push(v.room+'_'+monthNameFormat(v.month));
    var count_to_sum = (index)? v.machineCount:0;
    p.mach_count += count_to_sum;
    p.fail_count += v.failCount;
    p.rate = (p.mach_count==0) ? 0 : p.fail_count*1000/p.mach_count;
    return p;
}
function remove_rate(p, v){
    var index = check_done(p, v);
    var count_to_subtract = (index)? v.machineCount:0;
    if (index) p.done.push(v.room+'_'+monthNameFormat(v.month));
    p.mach_count -= count_to_subtract;
    p.fail_count -= v.failCount;
    p.rate = (p.mach_count==0) ? 0 : p.fail_count*1000/p.mach_count;
    return p;
}
function initial_rate(){
    return {rate: 0, mach_count:0, fail_count:0, done: new Array()};
}

Connection with dc.js

As mentioned, the previous code is needed to create the dimensions and groups to be passed to three different bar graphs using dc.js.

Each graph will have .valueAccessor(function(d){return d.value.rate;}).

See the jsfiddle (https://jsfiddle.net/raino01r/0vjtqsjL/) for an implementation. The numbers are different, but the data structure is the same. Notice that in the fiddle you would expect a machine count of 18 (in both months), but you always get double that (because of the 2 different locations).


Edit

Reductio + dc.js

Following Ethan Jewett's answer, I used reductio to take care of the grouping. The updated fiddle is here: https://jsfiddle.net/raino01r/dpa3vv69/

My reducer object needs two exceptions (month, room) when summing the machineCount values. Hence it is built as follows:

var reducer = reductio()
reducer.value('mach_count')
       .exception(function(d) { return d.room; })
       .exception(function(d) { return d.month; })
       .exceptionSum(function(d) { return d.machineCount; })
reducer.value('fail_count')
       .sum(function(d) { return d.failCount; })

This seems to fix the numbers when the graphs are rendered.

However, I get strange behaviour when filtering on a single month and looking at the numbers in the type graph.

Possible solution

Rather than creating two exceptions, I could merge the two fields when processing the data. I.e. as soon as the data is defined I could do:

data.forEach(function(x){
    x['room_month'] = x['room'] + '_' + x['month'];
})

Then the above reduction code should become:

var reducer = reductio()
reducer.value('mach_count')
       .exception(function(d) { return d.room_month; })
       .exceptionSum(function(d) { return d.machineCount; })
reducer.value('fail_count')
       .sum(function(d) { return d.failCount; })

This solution seems to work. However, I am not sure whether this is a sensible thing to do: if the dataset is large, adding a new feature could slow things down quite a lot!
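For reference, the merge step can be sketched as a single linear pass that runs once, before the data is handed to crossfilter, rather than on every filter change. The sample records and the `monthKey` helper below are hypothetical; the helper is only there because `month` is a Date object in this setup, and concatenating a raw Date would produce a long, locale-dependent string. Whether the extra field matters for a very large dataset would need to be measured.

```javascript
// Hypothetical sample records; `month` is a Date, as in the question.
var data = [
  { month: new Date(2015, 0, 1), room: 1, failureType: 'A', failCount: 10, machineCount: 5 },
  { month: new Date(2015, 0, 1), room: 2, failureType: 'B', failCount: 1,  machineCount: 3 }
];

// Hypothetical helper: formats a Date as 'YYYY-MM' using local time.
function monthKey(date) {
  var m = date.getMonth() + 1;
  return date.getFullYear() + '-' + (m < 10 ? '0' + m : String(m));
}

// Single pass over the records: store the composite key on each one.
data.forEach(function (d) {
  d.room_month = d.room + '_' + monthKey(d.month);
});
```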

1 answer
够拽才男人
#2 · 2019-05-22 02:36

A few things:

  1. Don't calculate rates in your Crossfilter reducers. Calculate the components of the rates. This will keep the reducers both simpler and faster. Do the actual division in your value accessor.

  2. You've basically got the right idea. I think there are two problems that I see immediately:

    • In your remove_rate you are not removing the key from the p.done array. You should be doing something like if (index) p.done.splice(p.done.indexOf(v.room+'_'+monthNameFormat(v.month)), 1); to remove it.

    • In your reduce functions, index is a boolean. (index == -1) will never evaluate to true, IIRC. So your added machine count will always be 0. Use var count_to_sum = index ? v.machineCount:0; instead.
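Putting both points together, here is a rough sketch rather than the fiddle's code. It is a variant of the fix described above: instead of an array of seen keys it keeps a counter per (room, month) pair, so a pair's machines are subtracted only when its last record leaves the group. The names `pairKey`, `addRate`, `removeRate`, `initialRate` and `rateAccessor` are hypothetical, and `month` is assumed to be a string already (with a Date you would format it first).

```javascript
// Assumption: `month` is already a string such as '2015-01'.
function pairKey(v) {
  return v.room + '_' + v.month;
}

function initialRate() {
  return { mach_count: 0, fail_count: 0, seen: {} };
}

function addRate(p, v) {
  var k = pairKey(v);
  p.seen[k] = (p.seen[k] || 0) + 1;
  // Count the machines only for the first record of this pair.
  if (p.seen[k] === 1) p.mach_count += v.machineCount;
  p.fail_count += v.failCount;
  return p;
}

function removeRate(p, v) {
  var k = pairKey(v);
  p.seen[k] -= 1;
  // Subtract the machines only when the last record of this pair is gone.
  if (p.seen[k] === 0) {
    delete p.seen[k];
    p.mach_count -= v.machineCount;
  }
  p.fail_count -= v.failCount;
  return p;
}

// The division happens only when the value is read, not in the reducers.
function rateAccessor(d) {
  var v = d.value;
  return v.mach_count === 0 ? 0 : v.fail_count * 1000 / v.mach_count;
}
```

These would be wired up as dim.group().reduce(addRate, removeRate, initialRate), with .valueAccessor(rateAccessor) on each chart.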

If you want to put together a working example, I or someone else will be happy to get it going for you, I'm sure.

You may also want to try Reductio. Crossfilter reducers are difficult to do right and efficiently, so it may make sense to use a library to help. With Reductio, creating a group that calculates your machine count and failure count looks like this:

var reducer = reductio()
reducer.value('mach_count')
  .exception(function(d) { return d.room; })
  .exceptionSum(function(d) { return d.machineCount; })
reducer.value('fail_count')
  .sum(function(d) { return d.failCount; })

var dim = ndx.dimension(...)
var grp = dim.group()
reducer(grp)
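With a group built this way, the rate can again be computed in the value accessor. The sketch below assumes that Reductio nests its results under the names given to value(), i.e. d.value.mach_count.exceptionSum and d.value.fail_count.sum; please check that against the output of grp.all() in your own setup. `sampleEntry` is mock data with that assumed shape, and `reductioRate` is a hypothetical name.

```javascript
// Mock group entry with the ASSUMED Reductio output shape (not real output).
var sampleEntry = {
  key: '2015-01',
  value: {
    mach_count: { exceptionSum: 8 },
    fail_count: { sum: 13 }
  }
};

// Hypothetical accessor: failures per 1000 machines, 0 for an empty group.
function reductioRate(d) {
  var machines = d.value.mach_count.exceptionSum;
  var fails = d.value.fail_count.sum;
  return machines === 0 ? 0 : fails * 1000 / machines;
}

var sampleRate = reductioRate(sampleEntry);
```

Each chart would then use .valueAccessor(reductioRate) instead of reading a precomputed rate.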