
crossfilter “double grouping” where key is the val

Posted 2019-07-27 09:34

Question:

Here is my data on MAC addresses. It is recorded once per minute, and each minute contains many unique MAC addresses.

mac_add,created_time
18:59:36:12:23:33,2016-12-07 00:00:00.000
1c:e1:92:34:d7:46,2016-12-07 00:00:00.000
2c:f0:ee:86:bd:51,2016-12-07 00:00:00.000
5c:cf:7f:d3:2e:ce,2016-12-07 00:00:00.000
...
18:59:36:12:23:33,2016-12-07 00:01:00.000
1c:cd:e5:1e:99:78,2016-12-07 00:01:00.000
1c:e1:92:34:d7:46,2016-12-07 00:01:00.000
5c:cf:7f:22:01:df,2016-12-07 00:01:00.000
5c:cf:7f:d3:2e:ce,2016-12-07 00:01:00.000
...

I would like to create 2 bar charts using dc.js and crossfilter. Please refer to the image for the charts.

The first bar chart is easy enough to create, and it is brushable. I created a dimension on "created_time", then a group using reduceCount on "mac_add", as shown below:

var moveTime = ndx.dimension(function (d) {
                    return d.dd; //# this is the created_time
                });
var timeGroup = moveTime.group().reduceCount(function (d) {
                    return d.mac_add;
                });
var visitorChart = dc.barChart('#visitor-no-bar');
visitorChart.width(990) 
                .height(350)
                .margins({ top: 0, right: 50, bottom: 20, left: 40 })
                .dimension(moveTime)
                .group(timeGroup)
                .centerBar(true)
                .gap(1)
                .elasticY(true)
                .x(d3.time.scale().domain([new Date(2016, 11, 7), new Date(2016, 11, 13)]))
                .round(d3.time.minute.round)
                .xUnits(d3.time.minute);

visitorChart.render();

The problem is with the second bar chart. Each row of data represents one minute, so I can count the rows for each MAC address to get its total time length, by creating another dimension on "mac_add" and calling reduceCount on it. The goal is then to group those time lengths into 30-minute bins, so we can see how many MAC addresses have a time length of 30 minutes or less, how many are between 30 minutes and 1 hour, how many are between 1 hour and 1.5 hours, and so on.

Please correct me if I am wrong. Logically, I was thinking the dimension of the second bar chart should be the time-length bins (such as <30 min, <1 hr, <1.5 hr, etc.). But the bins themselves are not fixed: they depend on the brush selection in the first chart. The selection might contain only the 30-minute bin, or only the 1.5-hour bin, or both the 1.5-hour and 2-hour bins, and so on.

So I am really confused about what parameters to pass to the second bar chart, and how to obtain them (how to group already-grouped data). Please help me by explaining the solution.

Regards, Marvin

回答1:

I think we've called this a "double grouping" in the past, but I can't find the previous questions.

Setting up the groups

I'd start with a regular crossfilter group for the mac addresses, and then produce a fake group to aggregate by count of minutes.

var minutesPerMacDim = ndx.dimension(function(d) { return d.mac_add; }),
    minutesPerMacGroup = minutesPerMacDim.group();

function bin_keys_by_value(group, bin_value) {
    var _bins;
    return {
        all: function() {
            var bins = {};
            group.all().forEach(function(kv) {
                var valk = bin_value(kv.value);
                bins[valk] = bins[valk] || [];
                bins[valk].push(kv.key);
            });
            _bins = bins;
            // note: Object.keys returning numerical order here might not
            // work everywhere, but I couldn't find a browser where it didn't
            return Object.keys(bins).map(function(bin) {
                return {key: bin, value: bins[bin].length};
            })
        },
        bins: function() {
            return _bins;
        }
    };
}

function bin_30_mins(v) {
    // round a minute count up to the next multiple of 30
    return 30 * Math.ceil(v/30);
}

var macsPerMinuteCount = bin_keys_by_value(minutesPerMacGroup, bin_30_mins);

This will retain the mac addresses for each time bin, which we'll need for filtering later. It's uncommon to add a non-standard method like bins to a fake group, but I can't think of another efficient way to retain that information, given that the filtering interface will only give us access to the keys.
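To see what the fake group produces, here is a small sketch that runs without crossfilter. It assumes a stub object standing in for the real group (only all() is needed), and the mac addresses and minute counts are made up for illustration.

```javascript
// Stand-in for a crossfilter group; the keys and values are invented.
var stubGroup = {
    all: function() {
        return [
            {key: 'aa:aa', value: 12},
            {key: 'bb:bb', value: 30},
            {key: 'cc:cc', value: 31},
            {key: 'dd:dd', value: 95}
        ];
    }
};

// Same fake group as above, repeated so this snippet runs on its own.
function bin_keys_by_value(group, bin_value) {
    var _bins;
    return {
        all: function() {
            var bins = {};
            group.all().forEach(function(kv) {
                var valk = bin_value(kv.value);
                bins[valk] = bins[valk] || [];
                bins[valk].push(kv.key);
            });
            _bins = bins;
            return Object.keys(bins).map(function(bin) {
                return {key: bin, value: bins[bin].length};
            });
        },
        bins: function() {
            return _bins;
        }
    };
}

function bin_30_mins(v) {
    return 30 * Math.ceil(v / 30);
}

var fake = bin_keys_by_value(stubGroup, bin_30_mins);
var rows = fake.all();   // one {key, value} row per 30-minute bin
var bins = fake.bins();  // bin key -> array of mac addresses in that bin
console.log(rows, bins);
```

Note that the bin keys come back from Object.keys as strings ('30', '60', ...), and that bins() only has a value after all() has been called at least once.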

Since the function takes a binning function, we could even use a threshold scale if we wanted more complicated bins than just rounding up to the nearest 30 minutes. A quantize scale is a more general way to do the rounding shown above.
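As a rough illustration of uneven bins, here is a plain-JavaScript stand-in for a threshold scale. The helper make_threshold_bins, the thresholds, and the overflow key are all hypothetical. This is not d3's API, and it deliberately treats each threshold as an inclusive upper bound so that it matches the Math.ceil rounding used above.

```javascript
// Hypothetical helper: returns a binning function that maps a minute count
// to the smallest threshold that is >= the count, or `overflow` if none is.
function make_threshold_bins(thresholds, overflow) {
    var sorted = thresholds.slice().sort(function(a, b) { return a - b; });
    return function(v) {
        for (var i = 0; i < sorted.length; ++i) {
            if (v <= sorted[i]) return sorted[i];
        }
        return overflow;
    };
}

// Uneven bins: up to 30 min, up to 1 h, up to 4 h, and everything longer.
var bin_uneven = make_threshold_bins([30, 60, 240], 1440);
var examples = [5, 30, 31, 200, 900].map(bin_uneven);
console.log(examples);
```

A function like bin_uneven could be passed as the second argument to bin_keys_by_value in place of bin_30_mins.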

Setting up the chart

Using this data to drive a chart is simple: we can use the dimension and fake group as usual.

chart
    .dimension(minutesPerMacDim)
    .group(macsPerMinuteCount)

Setting up the chart so that it can filter is a bit more complicated:

chart.filterHandler(function(dimension, filters) {
    if(filters.length === 0)
        dimension.filter(null);
    else {
        var bins = chart.group().bins(); // retrieve cached bins
        // look up the mac addresses in each selected bin, then flatten
        var macs = filters.map(function(key) { return bins[key]; });
        macs = Array.prototype.concat.apply([], macs);
        var macset = d3.set(macs);
        dimension.filterFunction(function(key) {
            return macset.has(key);
        });
    }
    return filters;
});

Recall that we're using a dimension which is keyed on mac addresses; this is good because we want to filter on mac addresses. But the chart is receiving minute-counts for its keys, and the filters will contain those keys, like 30, 60, 90, etc. So we need to supply a filterHandler which takes minute-count keys and filters the dimension based on those.
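The key translation can be tried out in isolation. The sketch below assumes a made-up bins object of the shape returned by the fake group's bins() method, and a hypothetical helper mac_predicate that builds the function one would hand to dimension.filterFunction. It uses a plain object instead of d3.set so that it runs without d3.

```javascript
// Cached bins as the fake group would produce them (invented mac addresses).
var bins = {
    '30': ['aa:aa', 'bb:bb'],
    '60': ['cc:cc'],
    '120': ['dd:dd']
};

// Hypothetical helper: turn selected minute-count keys into a predicate
// over mac addresses.
function mac_predicate(bins, filters) {
    var macs = {};
    filters.forEach(function(key) {
        // a selected key with no cached bin contributes nothing
        (bins[key] || []).forEach(function(mac) { macs[mac] = true; });
    });
    return function(mac) { return macs.hasOwnProperty(mac); };
}

// Suppose the user selected the 30 and 120 bars.
var keep = mac_predicate(bins, [30, 120]);
var kept = ['aa:aa', 'bb:bb', 'cc:cc', 'dd:dd', 'ee:ee'].filter(keep);
console.log(kept);
```

Numeric filter keys such as 30 still find the string-keyed bins, because property lookup coerces the key to a string.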

Note 1: This is all untested, so if it doesn't work, please post an example as a fiddle or bl.ock - there are fiddles and blocks you can fork to get started on the main page.

Note 2: Strictly speaking, this is not measuring the length of connections: it's counting the total number of minutes connected. Not sure if this matters to you. If a user disconnects and then reconnects within the timeframe, the two sessions will be counted as one. I think you'd have to preprocess to get duration.
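If separate sessions do matter, one possible preprocessing step is sketched below. It assumes each mac address's sightings have already been converted to minute offsets and sorted ascending. The helper session_lengths and the maxGap parameter are hypothetical names, not part of crossfilter or dc.js.

```javascript
// Hypothetical preprocessing: split one mac address's sightings into sessions.
// `minutes` holds the minute offsets at which the address was seen, sorted
// ascending; a gap of more than `maxGap` minutes starts a new session.
function session_lengths(minutes, maxGap) {
    var lengths = [];
    var start = null, prev = null;
    minutes.forEach(function(m) {
        if (prev === null || m - prev > maxGap) {
            // close the previous session, if there was one
            if (prev !== null) lengths.push(prev - start + 1);
            start = m;
        }
        prev = m;
    });
    // close the final session
    if (prev !== null) lengths.push(prev - start + 1);
    return lengths;
}

// Seen at minutes 0-2, then again at minutes 10-11.
var lengths = session_lengths([0, 1, 2, 10, 11], 1);
console.log(lengths);
```

With the simple count used in the answer, this address would land in a single bin with 5 minutes. With session splitting it yields two separate durations, which could each be binned.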

EDIT: Based on your fiddle (thank you!) the code above does seem to work. It's just a matter of setting up the x scale and xUnits properly.

  chart2
      .x(d3.scale.linear().domain([60,1440]))
      .xUnits(function(start, end) {
          return (end-start)/30;
      })

A linear scale will do just fine here - I wouldn't try to quantize that scale, since the 30-minute divisions are already set up. We do need to set the xUnits so that dc.js knows how wide to make the bars.
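As a quick check on that xUnits function, the arithmetic for the domain used here can be run on its own. The name units_of_30 is just a label for this sketch.

```javascript
// Same calculation as the xUnits callback above: one unit per 30 minutes.
function units_of_30(start, end) {
    return (end - start) / 30;
}

// For the domain [60, 1440] the chart is told to expect this many bars.
var barCount = units_of_30(60, 1440);
console.log(barCount);
```

So across that domain the bars are sized as if 46 of them had to fit side by side.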

I'm not sure why elasticX didn't work here, but the <30 bin completely dwarfed everything else, so I thought it was best to leave that out.

Fork of your fiddle: https://jsfiddle.net/gordonwoodhull/2a8ow1ay/2/