
Generating histogram for highly skewed data

Posted 2019-04-12 11:16

Question:

I'm using dc.js, crossfilter.js and d3.js to generate a barchart.

The barchart represents data for credit card transactions. It plots number of transactions (y-axis) over transaction dollar amount (x-axis).

It looks like this:

The data array basically looks like:

[
  ...
  {
    txn_id: 1,
    txn_amount: 20
  },
  ...
]

The data varies widely from merchant to merchant, and I can't make any assumptions about its distribution.

As you can see this graph isn't all that useful because of the data itself. In this case there is 1 transaction for -$7500 and 2 at around $7500.

In between there are other amounts, but most transactions cluster around $0 - $100 where you can see the spike.

Unfortunately there is enough variance that you can't even see the bars for the less frequent transaction amounts.

This answer seems close, but not quite there.

What I'd really like to do is break the x-axis ticks into 10 reasonably-sized chunks that group the transaction amounts sensibly to make the graph more useful.

For example, let's say in this case the average transaction amount is $20, and the extreme min and max values are -$7500 and $7500.

So in this particular example I might like to have the x-axis chunked up as so:

Bin 1: -$1000 >= transaction amount
Bin 2: -$100 >= transaction amount > -$1000
Bin 3: -$50 >= transaction amount > -$100
Bin 4: $0 >= transaction amount > -$50
Bin 5: $15 >= transaction amount > $0
Bin 6: $25 >= transaction amount > $15
Bin 7: $40 >= transaction amount > $25
Bin 8: $100 >= transaction amount > $40
Bin 9: $1000 >= transaction amount > $100
Bin 10: transaction amount > $1000

(the chunk/bin size gets smaller and smaller the closer to the average we get).
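To make the grouping step concrete, here is a minimal sketch in plain JavaScript of assigning amounts to the ten bins listed above. The helper names `binIndex` and `countByBin` and the sample amounts are made up for illustration; only the bin boundaries come from the list.

```javascript
// Upper bounds of Bins 1-9, copied from the list above; Bin 10 is open-ended.
var upperBounds = [-1000, -100, -50, 0, 15, 25, 40, 100, 1000];

// Hypothetical helper: 0-based index of the first bin whose upper bound
// is >= amount (binary search). Amounts above every bound land in the last bin.
function binIndex(amount, bounds) {
    var lo = 0, hi = bounds.length;
    while (lo < hi) {
        var mid = (lo + hi) >>> 1;
        if (bounds[mid] < amount) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return lo;
}

// Hypothetical helper: count how many amounts fall into each bin.
function countByBin(amounts, bounds) {
    var counts = [];
    for (var b = 0; b <= bounds.length; b++) {
        counts.push(0);
    }
    amounts.forEach(function (a) {
        counts[binIndex(a, bounds)] += 1;
    });
    return counts;
}

// Made-up sample amounts, only to exercise the helpers.
var sample = [-7500, -20, 0, 5, 20, 20, 30, 99, 7500, 7500];
console.log(countByBin(sample, upperBounds));
```

Because each bin is identified by its index (0 to 9) rather than by a dollar value, plotting the counts against the index would keep the ticks evenly spaced, with the dollar ranges used only as tick labels.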

Admittedly it's been ages since I've done any serious study of statistics, so I'm quite rusty. But it does seem that the way I break my data up into bins/chunks will have a lot to do with the standard deviation of my data.

I guess I have a good feel for what I want, I'm just a bit lost on how to use d3.js (d3.mean(), d3.quantile() ?) and dc.js to get a histogram similar to what I've described.

So what's the correct way, or what libraries should I be using to:

  1. Create 10 'reasonably' sized bins according to an arbitrarily given data set
  2. Group the data into those bins (actually, this part should be pretty straightforward)

In terms of the physical spacing of the histogram's x-axis, I don't think it's necessary or desired for the ticks to be unevenly spaced (thus perhaps it is no longer a histogram).

I'd prefer the ticks stay evenly spaced despite the fact that chunk sizes are not equal. I will just be sure to label the ticks appropriately.

Any pointers in the right direction would be much appreciated.

Update:

So it seems that d3.js is several steps ahead of me as usual and already has my back. I believe I can use d3.scale.quantile() to break the x-axis up into 10 quantiles (deciles). Indeed, I've set up my quantile scale and it seems to be doing the right thing: when I input numbers directly into the quantile scale function (via the JS console), it outputs the correct bucket (out of the 10).
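For reference, the bucketing a quantile scale performs can be sketched in plain JavaScript. This is only an approximation of the idea (sort the values, take the 10%, 20%, ..., 90% quantiles as thresholds, then count how many thresholds a value has reached), not d3's actual code. The function names are made up for illustration, and the linear interpolation between neighbouring values is an assumption about how the quantiles are computed.

```javascript
// Hypothetical helper: p-quantile (0 <= p <= 1) of an ascending sorted array,
// interpolating linearly between the two nearest values.
function quantileSorted(sorted, p) {
    var pos = (sorted.length - 1) * p;
    var lower = Math.floor(pos);
    var frac = pos - lower;
    var next = sorted[Math.min(lower + 1, sorted.length - 1)];
    return sorted[lower] + frac * (next - sorted[lower]);
}

// Hypothetical helper: the nine thresholds that split the values into deciles.
function decileThresholds(values) {
    var sorted = values.slice().sort(function (a, b) { return a - b; });
    var thresholds = [];
    for (var k = 1; k < 10; k++) {
        thresholds.push(quantileSorted(sorted, k / 10));
    }
    return thresholds;
}

// Hypothetical helper: bucket index 0..9, i.e. the number of thresholds
// that are less than or equal to the value.
function bucketOf(value, thresholds) {
    var i = 0;
    while (i < thresholds.length && value >= thresholds[i]) {
        i++;
    }
    return i;
}
```

Since the thresholds follow the data, skewed amounts give narrow buckets where transactions are dense and wide buckets in the tails, which is roughly the shape of the hand-picked bins above.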

But unfortunately my graph is still messed up. Here is my code:

var datum = crossfilter(data),
    amount = datum.dimension(function(d) { return +d.txn_amount; }),
    amounts = amount.group();

amountsChart = dc.barChart("#dc-amounts-chart");
amountsChart
  .width(defaultWidth)
  .height(defaultHeight)
  .margins({top: 20, right: 20, bottom: 20, left: 50})
  .dimension(amount)
  .group(amounts)
  .centerBar(true)
  .gap(5)
  .elasticY(true)
  .x(d3.scale.quantile().domain(amounts.all().map(function(d) {
                          // d.key is the transaction dollar amount,
                          // d.value is the number of transactions at that amount
                          return d.key;
                        }))
                        .range([0,1,2,3,4,5,6,7,8,9]));

amountsChart.yAxis().ticks(5);

dc.renderAll();

and the resulting chart:

I think I'm getting close, but still not sure where I'm taking a wrong turn.

Answer 1:

You could use an outlier test to trim out your, well, outliers and then add them back into the extreme bins. I'd also change the text on those bins to y, but that can easily be done by passing a custom set of ticks to the axis.

I've mocked up an example using Chauvenet's criterion, one of a number of outlier tests. I'd originally thought to use the Grubbs test (or even better, the multiple Grubbs-Beck test), but there's a bit of work to code that. As used here, Chauvenet's criterion works quite simply by assuming that any value more than m standard deviations from your mean (dMax in the code) is an outlier.

I've put this all together here and the function is:

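function chauvenet(x) {
    var dMax = 3;                          // cutoff, in standard deviations
    var mean = d3.mean(x);
    // d3.variance is available in d3 3.5 and later; the original code called
    // a variance() helper whose definition is not shown here.
    var stdv = Math.sqrt(d3.variance(x));
    var kept = [];

    for (var i = 0; i < x.length; i++) {
        // Keep only values within dMax standard deviations of the mean.
        if (Math.abs(x[i] - mean) / stdv < dMax) {
            kept.push(x[i]);
        }
    }

    return kept;
}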

The terms are all fairly obvious, dMax is the number of standard deviations, mean is the mean and stdv is the standard deviation (or square root of the variance).

Note I've not added the outliers back into the histogram, but that should be quite easy to do.
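One possible way to add the outliers back, sketched here under the assumption that "the extreme bins" means the bins at either end of the trimmed range, is to clamp each outlier to the cutoff instead of dropping it. The helper names are made up for illustration, and plain JavaScript is used for the mean and sample standard deviation so the sketch does not depend on d3.

```javascript
// Hypothetical helper: arithmetic mean of a non-empty array.
function mean(x) {
    var sum = 0;
    for (var i = 0; i < x.length; i++) {
        sum += x[i];
    }
    return sum / x.length;
}

// Hypothetical helper: sample standard deviation (divides by n - 1).
function sampleStdDev(x) {
    var m = mean(x);
    var sumSq = 0;
    for (var i = 0; i < x.length; i++) {
        sumSq += (x[i] - m) * (x[i] - m);
    }
    return Math.sqrt(sumSq / (x.length - 1));
}

// Hypothetical helper: pull every value that lies more than dMax standard
// deviations from the mean back to the nearest cutoff, so it is counted
// in the first or last bin rather than removed.
function clampOutliers(x, dMax) {
    var m = mean(x);
    var s = sampleStdDev(x);
    var lo = m - dMax * s;
    var hi = m + dMax * s;
    return x.map(function (v) {
        return Math.max(lo, Math.min(hi, v));
    });
}
```

The clamped array has the same length as the input, so the transaction counts in the histogram still add up to the total; only the labels of the two end bins need to say that they are open-ended.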



Answer 2:

If d3 is giving you a hard time, try uvCharts: http://imaginea.github.com/uvCharts :) You are probably already aware of nvd3.