How can I calculate the point between two overlapping data sets?

Published: 2019-03-28 01:12

Question:

I have two sets of data that overlap a bit (see plot below). I need to find the point between these sets where one would guess an unknown data point would belong in a particular category.

If I have a new data point (let's say 5000), and had to bet $$$ on whether it belongs in Group A or Group B, how can I calculate the point that makes my bet most sure?

See sample dataset and accompanying plot below with approximated point between these groups (calculated by eye).

GROUP A
[385,515,975,1136,2394,2436,4051,4399,4484,4768,4768,4849,4856,4954,5020,5020,5020,5020,5020,5020,5020,5020,5020,5052,5163,5200,5271,5421,5421,5442,5746,5765,5903,5992,5992,6046,6122,6205,6208,6239,6310,6360,6416,6512,6536,6543,6581,6609,6696,6699,6752,6796,6806,6855,6859,6886,6906,6911,6923,6953,7016,7072,7086,7089,7110,7232,7278,7293,7304,7309,7348,7367,7378,7380,7419,7453,7454,7492,7506,7549,7563,7721,7723,7731,7745,7750,7751,7783,7791,7813,7813,7814,7818,7833,7863,7875,7886,7887,7902,7907,7935,7942,7942,7948,7973,7995,8002,8013,8013,8015,8024,8025,8030,8038,8041,8050,8056,8060,8064,8071,8081,8082,8085,8093,8124,8139,8142,8167,8179,8204,8214,8223,8225,8247,8248,8253,8258,8264,8265,8265,8269,8277,8278,8289,8300,8312,8314,8323,8328,8334,8363,8369,8390,8397,8399,8399,8401,8436,8442,8456,8457,8471,8474,8483,8503,8511,8516,8533,8560,8571,8575,8583,8592,8593,8626,8635,8635,8644,8659,8685,8695,8695,8702,8714,8715,8717,8729,8732,8740,8743,8750,8756,8772,8772,8778,8797,8828,8840,8840,8843,8856,8865,8874,8876,8878,8885,8887,8893,8896,8905,8910,8955,8970,8971,8991,8995,9014,9016,9042,9043,9063,9069,9104,9106,9107,9116,9131,9157,9227,9359,9471]

GROUP B
[12,16,29,32,33,35,39,42,44,44,44,45,45,45,45,45,45,45,45,45,47,51,51,51,57,57,60,61,61,62,71,75,75,75,75,75,75,76,76,76,76,76,76,79,84,84,85,89,93,93,95,96,97,98,100,100,100,100,100,102,102,103,105,108,109,109,109,109,109,109,109,109,109,109,109,109,110,110,112,113,114,114,116,116,118,119,120,121,122,124,125,128,129,130,131,132,133,133,137,138,144,144,146,146,146,148,149,149,150,150,150,151,153,155,157,159,164,164,164,167,169,170,171,171,171,171,173,174,175,176,176,177,178,179,180,181,181,183,184,185,187,191,193,199,203,203,205,205,206,212,213,214,214,219,224,224,224,225,225,226,227,227,228,231,234,234,235,237,240,244,245,245,246,246,246,248,249,250,250,251,255,255,257,264,264,267,270,271,271,281,282,286,286,291,291,292,292,294,295,299,301,302,304,304,304,304,304,306,308,314,318,329,340,344,345,356,359,363,368,368,371,375,379,386,389,390,392,394,408,418,438,440,456,456,458,460,461,467,491,503,505,508,524,557,558,568,591,609,622,656,665,668,687,705,728,817,839,965,1013,1093,1126,1512,1935,2159,2384,2424,2426,2484,2738,2746,2751,3006,3184,3184,3184,3184,3184,4023,5842,5842,6502,7443,7781,8132,8237,8501]

Array Stats:

                      Group A    Group B
Total Numbers             231        286
Mean                  7534.71     575.56
Standard Deviation    1595.04    1316.03

Answer 1:

I just want to point out another approach using density estimation.

Given your data, it's easy to fit a smoothed pdf using kernel density estimation. The Python code below shows how to use gaussian_kde from scipy.stats.

from scipy.stats import gaussian_kde
from numpy import linspace
import matplotlib.pyplot as plt
data1 = [385,515,975,1136,2394,2436,4051,4399,4484,4768,4768,4849,4856,4954,5020,5020,5020,5020,5020,5020,5020,5020,5020,5052,5163,5200,5271,5421,5421,5442,5746,5765,5903,5992,5992,6046,6122,6205,6208,6239,6310,6360,6416,6512,6536,6543,6581,6609,6696,6699,6752,6796,6806,6855,6859,6886,6906,6911,6923,6953,7016,7072,7086,7089,7110,7232,7278,7293,7304,7309,7348,7367,7378,7380,7419,7453,7454,7492,7506,7549,7563,7721,7723,7731,7745,7750,7751,7783,7791,7813,7813,7814,7818,7833,7863,7875,7886,7887,7902,7907,7935,7942,7942,7948,7973,7995,8002,8013,8013,8015,8024,8025,8030,8038,8041,8050,8056,8060,8064,8071,8081,8082,8085,8093,8124,8139,8142,8167,8179,8204,8214,8223,8225,8247,8248,8253,8258,8264,8265,8265,8269,8277,8278,8289,8300,8312,8314,8323,8328,8334,8363,8369,8390,8397,8399,8399,8401,8436,8442,8456,8457,8471,8474,8483,8503,8511,8516,8533,8560,8571,8575,8583,8592,8593,8626,8635,8635,8644,8659,8685,8695,8695,8702,8714,8715,8717,8729,8732,8740,8743,8750,8756,8772,8772,8778,8797,8828,8840,8840,8843,8856,8865,8874,8876,8878,8885,8887,8893,8896,8905,8910,8955,8970,8971,8991,8995,9014,9016,9042,9043,9063,9069,9104,9106,9107,9116,9131,9157,9227,9359,9471]
data2 = [12,16,29,32,33,35,39,42,44,44,44,45,45,45,45,45,45,45,45,45,47,51,51,51,57,57,60,61,61,62,71,75,75,75,75,75,75,76,76,76,76,76,76,79,84,84,85,89,93,93,95,96,97,98,100,100,100,100,100,102,102,103,105,108,109,109,109,109,109,109,109,109,109,109,109,109,110,110,112,113,114,114,116,116,118,119,120,121,122,124,125,128,129,130,131,132,133,133,137,138,144,144,146,146,146,148,149,149,150,150,150,151,153,155,157,159,164,164,164,167,169,170,171,171,171,171,173,174,175,176,176,177,178,179,180,181,181,183,184,185,187,191,193,199,203,203,205,205,206,212,213,214,214,219,224,224,224,225,225,226,227,227,228,231,234,234,235,237,240,244,245,245,246,246,246,248,249,250,250,251,255,255,257,264,264,267,270,271,271,281,282,286,286,291,291,292,292,294,295,299,301,302,304,304,304,304,304,306,308,314,318,329,340,344,345,356,359,363,368,368,371,375,379,386,389,390,392,394,408,418,438,440,456,456,458,460,461,467,491,503,505,508,524,557,558,568,591,609,622,656,665,668,687,705,728,817,839,965,1013,1093,1126,1512,1935,2159,2384,2424,2426,2484,2738,2746,2751,3006,3184,3184,3184,3184,3184,4023,5842,5842,6502,7443,7781,8132,8237,8501]

pdf1 = gaussian_kde(data1)
pdf2 = gaussian_kde(data2)

x = linspace(0, 9500, 1000)
plt.plot(x, pdf1(x),'r')
plt.plot(x, pdf2(x),'g')
plt.legend(['data1 pdf', 'data2 pdf'])

plt.show()

In the graph, the green is the pdf for the second dataset; the red is the pdf for the first dataset. Clearly the decision boundary is the vertical line that passes through the point where the green intersects with the red.

To find the boundary numerically, we can do something like the following (assuming there is only one intersection, otherwise it does not make sense):

min_diff = 10000
min_diff_x = -1
for x in linspace(3600, 4000, 400):
    # gaussian_kde returns an array; take the scalar density value
    diff = abs(pdf1(x)[0] - pdf2(x)[0])
    if diff < min_diff:
        min_diff = diff
        min_diff_x = x
print(min_diff, min_diff_x)

We found out that the boundary is located approximately at 3762.

If there are multiple intersections of the two pdfs, then to predict which class a data point x falls into, we calculate pdf1(x) and pdf2(x); the larger value indicates the class that minimizes the Bayes risk. See here for more details on the topic of Bayes risk and evaluation of the probability of prediction error.
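
As a concrete illustration of that rule, here is a minimal sketch reusing pdf1 and pdf2 from the code above (the function name classify is just for illustration):

def classify(x):
    # Assign x to whichever group has the higher estimated density at x.
    return 'Group A' if pdf1(x)[0] > pdf2(x)[0] else 'Group B'

print(classify(5000))   # e.g. the new data point from the question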

The example below actually includes three pdfs; at any query point x, we evaluate each of the three pdfs separately and pick the one with the maximum value of pdf(x) as the predicted class.



Answer 2:

This can be viewed as a binary classification problem with a single continuous predictor. One approach is to fit a simple one-split decision tree: find a threshold t such that you predict Group A whenever a value is >= t.

For this you pick the t that minimizes the entropy of the resulting splits. Let's say you have the following counts for some t:

                    < t     >= t
Group A              X        Y
Group B              Z        W

The entropy of the < split is -(X/(X+Z))*log(X/(X+Z)) - (Z/(X+Z))*log(Z/(X+Z)). The entropy of the >= split is -(Y/(Y+W))*log(Y/(Y+W)) - (W/(Y+W))*log(W/(Y+W)). This looks messier than it is; it's just the sum of -p*log(p) for the proportion p of each group within a split.

You take the weighted average of the two, weighted by the overall size of the split. So the first term is weighted by (X+Z)/(X+Y+Z+W) and the other by (Y+W)/(X+Y+Z+W).
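
A minimal sketch of that search in Python (assuming data1 and data2 hold the Group A and Group B values from the question; the helper names are just for illustration):

import math

def entropy(counts):
    # Entropy of a split: sum of -p*log(p) over the group proportions.
    total = sum(counts)
    return -sum(c / total * math.log(c / total) for c in counts if c > 0)

def best_threshold(data1, data2):
    # Try each observed value as threshold t and keep the one that
    # minimizes the size-weighted average entropy of the two splits.
    candidates = sorted(set(data1) | set(data2))
    best_t, best_score = None, float('inf')
    n = len(data1) + len(data2)
    for t in candidates:
        x = sum(1 for v in data1 if v < t)   # Group A below t
        y = len(data1) - x                   # Group A at or above t
        z = sum(1 for v in data2 if v < t)   # Group B below t
        w = len(data2) - z                   # Group B at or above t
        if x + z == 0 or y + w == 0:
            continue
        score = (x + z) / n * entropy([x, z]) + (y + w) / n * entropy([y, w])
        if score < best_score:
            best_t, best_score = t, score
    return best_t

print(best_threshold(data1, data2))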



Answer 3:

With reasonable assumptions, a good discriminant is the unique data value that causes the area of B's probability density to the left of the split point to equal the area of A's to the right (or vice versa, which gives the same point).

A simple way to find this is to compute the two empirical cumulative distribution functions (CDFs) as shown here and search them for the split point. This is the point where the two CDFs sum to 1.

Stated briefly, building the empirical CDFs is just sorting each data set and using the data as the x-axis values. Drawing the curve left to right, start at y=0 and take a 1/n step upward at each x value. Such a curve rises monotonically from 0 for x <= data[1] to y = CDF(x) = 1 for x >= data[n]. There is a slightly more complicated method that gives a continuous, piecewise linear curve rather than stair steps, which under certain assumptions is a better estimator of the true CDF.

Note the discussion above is only to provide intuition. The CDF is represented perfectly by the sorted array of data. No new data structure is needed; i.e. x[i], i=1,2,...,n is the x value at which the curve reaches y = i/n.

With the two CDFs, R(x) and B(x), according to your diagram, you want to find the unique point x such that |1 - R(x) - B(x)| is minimized (with the piecewise linear CDF you'll always be able to make this zero). This can be done quite easily by binary search.
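
For intuition, here is a compact sketch of the same idea using the simple step-function CDFs and a grid scan instead of binary search (assuming data1 and data2 hold the Group A and Group B values from the question):

import numpy as np

a = np.sort(np.asarray(data1, dtype=float))   # Group A, sorted
b = np.sort(np.asarray(data2, dtype=float))   # Group B, sorted

def ecdf(sorted_arr, x):
    # Step-function empirical CDF: fraction of samples <= x.
    return np.searchsorted(sorted_arr, x, side='right') / len(sorted_arr)

# Scan a fine grid and keep the x where the two CDFs sum closest to 1.
grid = np.linspace(min(a[0], b[0]), max(a[-1], b[-1]), 100_000)
split = grid[np.argmin(np.abs(1.0 - ecdf(a, grid) - ecdf(b, grid)))]
print(split)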

A nice thing about this method is that you can make it dynamic by maintaining the two CDFs in sorted sets (balanced binary search trees). As points are added, the new dividing point is easily found.

The ordered sets need the "order statistic". Here is a reference. By that I mean that you'll need to be able to query the sorted set to retrieve the ordinal of any stored x value in the CDF. This can be done with skip lists as well as trees.

I coded up one variant of this algorithm. It uses the piecewise CDF approximation, but also allows for "vertical steps" at repeated data points. This complicates the algorithm somewhat, but it's not too bad. Then I used bisection (rather than combinatorial binary search) to find the split point. The normal bisection algorithm needs modification to accommodate vertical "steps" in the CDF. I think I have all this right, but it's lightly tested.

One edge case that is not handled is if the data sets have disjoint ranges. This will find a point between the top of the lower and bottom of the higher, which is a perfectly valid discriminator. But you may want to do something fancier like return some kind of weighted average.

Note that if you have a good notion of the actual min and max values the data can achieve and they don't occur in the data, you should consider adding them so that the CDFs are not inadvertently biased.

On your example data the code produces 4184.76, which looks pretty close to the value you chose in your diagram (somewhat below halfway between min and max data).

Note that I didn't sort the data because it was already sorted. Sorting is definitely necessary in general.

public class SplitData {

    // Return: i such that a[i] <= x < a[i+1] if i,i+1 in range
    // else -1 if x < a[0]
    // else a.length if x >= a[a.length - 1]
    static int hi_bracket(double[] a, double x) {
        if (x < a[0]) return -1;
        if (x >= a[a.length - 1]) return a.length;
        int lo = 0, hi = a.length - 1;
        while (lo + 1 < hi) {
            int mid = (lo + hi) / 2;
            if (x < a[mid])
                hi = mid;
            else 
                lo = mid;
        }
        return lo;
    }

    // Return: i such that a[i-1] < x <= a[i] if i-1,i in range
    // else -1 if x <= a[0]
    // else a.length if x > a[a.length - 1]
    static int lo_bracket(double[] a, double x) {
        if (x <= a[0]) return -1;
        if (x > a[a.length - 1]) return a.length;
        int lo = 0, hi = a.length - 1;
        while (lo + 1 < hi) {
            int mid = (lo + hi) / 2;
            if (x <= a[mid])
                hi = mid;
            else
                lo = mid;
        }
        return hi;
    }

    // Interpolate the CDF value for the data a at value x.  Returns a range.
    static void interpolate_cdf(double[] a, double x, double[] rtn) {
        int lo_i1 = lo_bracket(a, x);
        if (lo_i1 == -1) {
            rtn[0] = rtn[1] = 0;
            return;
        }
        int hi_i0 = hi_bracket(a, x);
        if (hi_i0 == a.length) {
            rtn[0] = rtn[1] = 1;
            return;
        }
        if (hi_i0 + 1 == lo_i1) {  // normal interpolation
            rtn[0] = rtn[1] 
                = (hi_i0 + (x - a[hi_i0]) / (a[lo_i1] - a[hi_i0])) 
                    / (a.length - 1);
            return;
        }
        // we're on a joint or step; return range answer
        rtn[0] = (double)lo_i1 / (a.length - 1);  
        rtn[1] = (double)hi_i0 / (a.length - 1);
        assert rtn[0] <= rtn[1];
    }

    // Find the data value where the two given data set's empirical CDFs
    // sum to 1. This is a good discrimination value for new data.
    // This deals with the case where there's a step in either or both CDFs.
    static double find_bisector(double[] a, double[] b) {
        assert a.length > 0;
        assert b.length > 0;
        double lo = Math.min(a[0], b[0]);
        double hi = Math.max(a[a.length - 1], b[b.length - 1]);
        double eps = (hi - lo) * 1e-7;
        double[] a_rtn = new double[2], b_rtn = new double[2];
        while (hi - lo > eps) {
            double mid = 0.5 * (lo + hi);
            interpolate_cdf(a, mid, a_rtn);
            interpolate_cdf(b, mid, b_rtn);
            if (1 < a_rtn[0] + b_rtn[0])
                hi = mid;
            else if (a_rtn[1] + b_rtn[1] < 1)
                lo = mid;
            else
                return mid;  // 1 is included in the interpolated range
        }
        return 0.5 * (lo + hi);
    }

    public static void main(String[] args) {
        double split = find_bisector(a, b);
        System.err.println("Split at x = " + split);
    }

    static final double[] a = {
        385, 515, 975, 1136, 2394, 2436, 4051, 4399, 4484, 4768, 4768, 4849,
        4856, 4954, 5020, 5020, 5020, 5020, 5020, 5020, 5020, 5020, 5020, 5052,
        5163, 5200, 5271, 5421, 5421, 5442, 5746, 5765, 5903, 5992, 5992, 6046,
        6122, 6205, 6208, 6239, 6310, 6360, 6416, 6512, 6536, 6543, 6581, 6609,
        6696, 6699, 6752, 6796, 6806, 6855, 6859, 6886, 6906, 6911, 6923, 6953,
        7016, 7072, 7086, 7089, 7110, 7232, 7278, 7293, 7304, 7309, 7348, 7367,
        7378, 7380, 7419, 7453, 7454, 7492, 7506, 7549, 7563, 7721, 7723, 7731,
        7745, 7750, 7751, 7783, 7791, 7813, 7813, 7814, 7818, 7833, 7863, 7875,
        7886, 7887, 7902, 7907, 7935, 7942, 7942, 7948, 7973, 7995, 8002, 8013,
        8013, 8015, 8024, 8025, 8030, 8038, 8041, 8050, 8056, 8060, 8064, 8071,
        8081, 8082, 8085, 8093, 8124, 8139, 8142, 8167, 8179, 8204, 8214, 8223,
        8225, 8247, 8248, 8253, 8258, 8264, 8265, 8265, 8269, 8277, 8278, 8289,
        8300, 8312, 8314, 8323, 8328, 8334, 8363, 8369, 8390, 8397, 8399, 8399,
        8401, 8436, 8442, 8456, 8457, 8471, 8474, 8483, 8503, 8511, 8516, 8533,
        8560, 8571, 8575, 8583, 8592, 8593, 8626, 8635, 8635, 8644, 8659, 8685,
        8695, 8695, 8702, 8714, 8715, 8717, 8729, 8732, 8740, 8743, 8750, 8756,
        8772, 8772, 8778, 8797, 8828, 8840, 8840, 8843, 8856, 8865, 8874, 8876,
        8878, 8885, 8887, 8893, 8896, 8905, 8910, 8955, 8970, 8971, 8991, 8995,
        9014, 9016, 9042, 9043, 9063, 9069, 9104, 9106, 9107, 9116, 9131, 9157,
        9227, 9359, 9471
    };
    static final double[] b = {
        12, 16, 29, 32, 33, 35, 39, 42, 44, 44, 44, 45, 45, 45, 45, 45, 45, 45,
        45, 45, 47, 51, 51, 51, 57, 57, 60, 61, 61, 62, 71, 75, 75, 75, 75, 75,
        75, 76, 76, 76, 76, 76, 76, 79, 84, 84, 85, 89, 93, 93, 95, 96, 97, 98,
        100, 100, 100, 100, 100, 102, 102, 103, 105, 108, 109, 109, 109, 109,
        109, 109, 109, 109, 109, 109, 109, 109, 110, 110, 112, 113, 114, 114,
        116, 116, 118, 119, 120, 121, 122, 124, 125, 128, 129, 130, 131, 132,
        133, 133, 137, 138, 144, 144, 146, 146, 146, 148, 149, 149, 150, 150,
        150, 151, 153, 155, 157, 159, 164, 164, 164, 167, 169, 170, 171, 171,
        171, 171, 173, 174, 175, 176, 176, 177, 178, 179, 180, 181, 181, 183,
        184, 185, 187, 191, 193, 199, 203, 203, 205, 205, 206, 212, 213, 214,
        214, 219, 224, 224, 224, 225, 225, 226, 227, 227, 228, 231, 234, 234,
        235, 237, 240, 244, 245, 245, 246, 246, 246, 248, 249, 250, 250, 251,
        255, 255, 257, 264, 264, 267, 270, 271, 271, 281, 282, 286, 286, 291,
        291, 292, 292, 294, 295, 299, 301, 302, 304, 304, 304, 304, 304, 306,
        308, 314, 318, 329, 340, 344, 345, 356, 359, 363, 368, 368, 371, 375,
        379, 386, 389, 390, 392, 394, 408, 418, 438, 440, 456, 456, 458, 460,
        461, 467, 491, 503, 505, 508, 524, 557, 558, 568, 591, 609, 622, 656,
        665, 668, 687, 705, 728, 817, 839, 965, 1013, 1093, 1126, 1512, 1935,
        2159, 2384, 2424, 2426, 2484, 2738, 2746, 2751, 3006, 3184, 3184, 3184,
        3184, 3184, 4023, 5842, 5842, 6502, 7443, 7781, 8132, 8237, 8501
    };
}


Answer 4:

You can calculate the Mahalanobis distance of the new point with respect to each set. The set to which the new point has the lowest distance is the most likely match.

The Mahalanobis distance is a measure of the distance between a point P and a distribution D, introduced by P. C. Mahalanobis in 1936. It is a multi-dimensional generalization of the idea of measuring how many standard deviations away P is from the mean of D. This distance is zero if P is at the mean of D, and grows as P moves away from the mean.

Since your space is one dimensional, the calculation should simplify to the following (a short sketch follows the list):

  1. Calculate the standard deviation of each distribution
  2. Calculate the mean of each distribution
  3. For each distribution, calculate how many standard deviations the point is away from the mean of the distribution.
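
A minimal sketch of that one-dimensional comparison (assuming data1 and data2 hold the Group A and Group B values from the question):

import statistics

def mahalanobis_1d(x, data):
    # In one dimension this is just |x - mean| / standard deviation.
    return abs(x - statistics.mean(data)) / statistics.stdev(data)

x = 5000
group = 'Group A' if mahalanobis_1d(x, data1) < mahalanobis_1d(x, data2) else 'Group B'
print(group)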


Answer 5:

You are describing a one-dimensional statistical classification problem where you are looking for the 'decision boundary'. You have plenty of options to choose from:

  • logistic regression
  • nearest-neighbour classifier
  • support vector machines
  • multi-layer perceptrons
  • ...

but as the problem is simple (one dimensional, two well-separated classes) and the decision boundary is a rather empty region, I suspect that no heavy statistical method will significantly outperform a simple eye-based guesstimate.
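
For completeness, here is a minimal sketch of the first option using scikit-learn (assuming data1 and data2 hold the Group A and Group B values from the question); with a single feature, the decision boundary is the x where the predicted probability crosses 0.5:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Stack the two groups into a single 1-D feature with 0/1 labels.
X = np.array(data1 + data2, dtype=float).reshape(-1, 1)
y = np.array([0] * len(data1) + [1] * len(data2))

# max_iter raised because the feature is unscaled (values in the thousands).
clf = LogisticRegression(max_iter=10_000).fit(X, y)

# For one feature, p = 0.5 where coef * x + intercept = 0.
boundary = -clf.intercept_[0] / clf.coef_[0][0]
print(boundary)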