I have two sets of data that overlap a bit (see plot below). I need to find the point between these sets where one would guess an unknown data point would belong in a particular category.
If I have a new data point (let's say 5000), and had to bet $$$ on whether it belongs in Group A or Group B, how can I calculate the point that makes my bet most sure?
See sample dataset and accompanying plot below with approximated point between these groups (calculated by eye).
GROUP A
[385,515,975,1136,2394,2436,4051,4399,4484,4768,4768,4849,4856,4954,5020,5020,5020,5020,5020,5020,5020,5020,5020,5052,5163,5200,5271,5421,5421,5442,5746,5765,5903,5992,5992,6046,6122,6205,6208,6239,6310,6360,6416,6512,6536,6543,6581,6609,6696,6699,6752,6796,6806,6855,6859,6886,6906,6911,6923,6953,7016,7072,7086,7089,7110,7232,7278,7293,7304,7309,7348,7367,7378,7380,7419,7453,7454,7492,7506,7549,7563,7721,7723,7731,7745,7750,7751,7783,7791,7813,7813,7814,7818,7833,7863,7875,7886,7887,7902,7907,7935,7942,7942,7948,7973,7995,8002,8013,8013,8015,8024,8025,8030,8038,8041,8050,8056,8060,8064,8071,8081,8082,8085,8093,8124,8139,8142,8167,8179,8204,8214,8223,8225,8247,8248,8253,8258,8264,8265,8265,8269,8277,8278,8289,8300,8312,8314,8323,8328,8334,8363,8369,8390,8397,8399,8399,8401,8436,8442,8456,8457,8471,8474,8483,8503,8511,8516,8533,8560,8571,8575,8583,8592,8593,8626,8635,8635,8644,8659,8685,8695,8695,8702,8714,8715,8717,8729,8732,8740,8743,8750,8756,8772,8772,8778,8797,8828,8840,8840,8843,8856,8865,8874,8876,8878,8885,8887,8893,8896,8905,8910,8955,8970,8971,8991,8995,9014,9016,9042,9043,9063,9069,9104,9106,9107,9116,9131,9157,9227,9359,9471]
GROUP B
[12,16,29,32,33,35,39,42,44,44,44,45,45,45,45,45,45,45,45,45,47,51,51,51,57,57,60,61,61,62,71,75,75,75,75,75,75,76,76,76,76,76,76,79,84,84,85,89,93,93,95,96,97,98,100,100,100,100,100,102,102,103,105,108,109,109,109,109,109,109,109,109,109,109,109,109,110,110,112,113,114,114,116,116,118,119,120,121,122,124,125,128,129,130,131,132,133,133,137,138,144,144,146,146,146,148,149,149,150,150,150,151,153,155,157,159,164,164,164,167,169,170,171,171,171,171,173,174,175,176,176,177,178,179,180,181,181,183,184,185,187,191,193,199,203,203,205,205,206,212,213,214,214,219,224,224,224,225,225,226,227,227,228,231,234,234,235,237,240,244,245,245,246,246,246,248,249,250,250,251,255,255,257,264,264,267,270,271,271,281,282,286,286,291,291,292,292,294,295,299,301,302,304,304,304,304,304,306,308,314,318,329,340,344,345,356,359,363,368,368,371,375,379,386,389,390,392,394,408,418,438,440,456,456,458,460,461,467,491,503,505,508,524,557,558,568,591,609,622,656,665,668,687,705,728,817,839,965,1013,1093,1126,1512,1935,2159,2384,2424,2426,2484,2738,2746,2751,3006,3184,3184,3184,3184,3184,4023,5842,5842,6502,7443,7781,8132,8237,8501]
Array Stats:
|                    | Group A | Group B |
|--------------------|---------|---------|
| Total Numbers      | 231     | 286     |
| Mean               | 7534.71 | 575.56  |
| Standard Deviation | 1595.04 | 1316.03 |
You can calculate the Mahalanobis distance of the new point with respect to each set. The set to which the new point has the lowest distance is the most likely match.
Since your space is one dimensional, the calculation simplifies to the distance of the new point from each group's mean, measured in that group's standard deviations:
d(x, S) = |x - mean(S)| / std(S)
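For example, plugging in the stats above for the hypothetical new point of 5000 (a quick sketch; the function and variable names are just illustrative):

```python
def mahalanobis_1d(x, mean, std):
    """One-dimensional Mahalanobis distance: how many standard
    deviations the point x lies from the group mean."""
    return abs(x - mean) / std

x_new = 5000
d_a = mahalanobis_1d(x_new, 7534.71, 1595.04)  # ~1.59
d_b = mahalanobis_1d(x_new, 575.56, 1316.03)   # ~3.36
print("Group A" if d_a < d_b else "Group B")   # closer to Group A
```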
This can be viewed as a binary classification problem with a single continuous predictor. One approach is to fit a simple decision tree with a single split (a decision stump), finding a threshold t such that you predict Group A when a value is >= t.
For this you pick the t that minimizes the entropy of the resulting splits. Let's say you have the following counts for some t:
|         | < t | >= t |
|---------|-----|------|
| Group A | X   | Y    |
| Group B | Z   | W    |
The entropy of the < split is -(X/(X+Z))*log(X/(X+Z)) - (Z/(X+Z))*log(Z/(X+Z)). The entropy of the >= split is -(Y/(Y+W))*log(Y/(Y+W)) - (W/(Y+W))*log(W/(Y+W)). This looks messier than it is; it's just the sum of -p*log(p) for the proportion p of each group within a split.
You take the weighted average of the two, weighted by the overall size of the split. So the first term is weighted by (X+Z)/(X+Y+Z+W) and the other by (Y+W)/(X+Y+Z+W).
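Here is a minimal sketch of that search in Python, taking candidate thresholds as midpoints between consecutive distinct values (names are illustrative):

```python
import math

def split_entropy(a, b, t):
    """Weighted average entropy of the two sides of a split at threshold t."""
    def entropy(n_a, n_b):
        total = n_a + n_b
        if total == 0:
            return 0.0
        h = 0.0
        for n in (n_a, n_b):
            p = n / total
            if p > 0:
                h -= p * math.log2(p)
        return h

    x = sum(v < t for v in a)   # Group A below t
    y = len(a) - x              # Group A at/above t
    z = sum(v < t for v in b)   # Group B below t
    w = len(b) - z              # Group B at/above t
    n = len(a) + len(b)
    return (x + z) / n * entropy(x, z) + (y + w) / n * entropy(y, w)

def best_threshold(a, b):
    """Return the candidate threshold with the lowest weighted entropy."""
    values = sorted(set(a) | set(b))
    candidates = [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]
    return min(candidates, key=lambda t: split_entropy(a, b, t))
```

`best_threshold(group_a, group_b)` then gives the split point; a new value is predicted as Group A if it is >= that threshold (since Group A holds the larger values).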
With reasonable assumptions, a good discriminant is the unique data value that causes the area of B's probability density to the left of the split point to equal the area of A's to the right (or vice versa, which gives the same point).
A simple way to find this is to compute the two empirical cumulative distribution functions (CDFs) as shown here and search them for the split point. This is the point where the two CDFs sum to 1.
Stated briefly, building the empirical CDFs is just sorting each data set and using the data as the x-axis values. Drawing the curve left-to-right, start at y = 0 and take a 1/n step upward at each x value. Such a curve rises from y = 0 for x < data[1] to y = CDF(x) = 1 for x >= data[n]. There is a slightly more complicated method that gives a continuous, piecewise linear curve rather than stair steps, which under certain assumptions is a better estimator of the true CDF.
Note the discussion above is only to provide intuition. The CDF is represented perfectly by the sorted array of data. No new data structure is needed; i.e. x[i], i=1,2,...,n is the x value at which the curve reaches y = i/n.
With the two CDFs, R(x) and B(x), according to your diagram, you want to find the unique point x such that |1 - R(x) - B(x)| is minimized (with the piecewise linear CDF you'll always be able to make this zero). This can be done quite easily by binary search.
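As an illustration only (not this answer's original code), here is a sketch that linearly interpolates the two empirical CDFs and bisects on R(x) + B(x) - 1; the exact result depends on the interpolation convention, so it may differ slightly from the value reported below:

```python
import numpy as np

def ecdf(data):
    """Piecewise linear approximation of the empirical CDF:
    x[i] is the value at which the curve reaches (i + 1) / n."""
    xs = np.sort(np.asarray(data, dtype=float))
    ys = np.arange(1, len(xs) + 1) / len(xs)
    return lambda x: np.interp(x, xs, ys, left=0.0, right=1.0)

def split_point(a, b, tol=1e-6):
    """Bisect for the x where CDF_A(x) + CDF_B(x) = 1.
    The sum is nondecreasing in x, so bisection applies."""
    cdf_a, cdf_b = ecdf(a), ecdf(b)
    lo, hi = min(min(a), min(b)), max(max(a), max(b))
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf_a(mid) + cdf_b(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```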
A nice thing about this method is that you can make it dynamic by maintaining the two CDFs in sorted sets (balanced binary search trees). As points are added, the new dividing point is easily found.
The ordered sets need to support order-statistic queries. Here is a reference. By that I mean that you'll need to be able to query the sorted set for the ordinal (rank) of any stored x value, which gives its CDF value directly. This can be done with skip lists as well as trees.
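As a sketch of the dynamic version (an illustration using the third-party sortedcontainers package, which provides O(log n) inserts and rank queries; this is not part of the original answer):

```python
from sortedcontainers import SortedList  # third-party package

class DynamicECDF:
    """Empirical CDF that supports inserts and rank (CDF) queries."""
    def __init__(self, data=()):
        self._xs = SortedList(data)

    def add(self, x):
        self._xs.add(x)

    def cdf(self, x):
        # Fraction of stored points <= x, i.e. the step-function ECDF.
        return self._xs.bisect_right(x) / len(self._xs)
```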
I coded up one variant of this algorithm. It uses the piecewise CDF approximation, but also allows for "vertical steps" at repeated data points. This complicates the algorithm somewhat, but it's not too bad. Then I used bisection (rather than combinatorial binary search) to find the split point. The normal bisection algorithm needs modification to accommodate vertical "steps" in the CDF. I think I have all this right, but it's lightly tested.
One edge case that is not handled is if the data sets have disjoint ranges. This will find a point between the top of the lower and bottom of the higher, which is a perfectly valid discriminator. But you may want to do something fancier like return some kind of weighted average.
Note that if you have a good notion of the actual min and max values the data can achieve and they don't occur in the data, you should consider adding them so that the CDFs are not inadvertently biased.
On your example data the code produces 4184.76, which looks pretty close to the value you chose in your diagram (somewhat below halfway between min and max data).
Note I didn't sort the data in the code because it was already sorted, but sorting is necessary in general.
You are describing a one-dimensional statistical classification problem where you are looking for the 'decision boundary'. You have plenty of options to choose from, but as the problem is simple (one dimensional, two well-separated classes) and the decision boundary is a rather empty region, I suspect that no heavy statistical method will significantly outperform a simple eye-based guesstimate.
I just want to point out another approach using density estimation.
Given your data, it's easy to fit a smoothed pdf using kernel density estimation. The Python code below shows how to use the kde module in scipy.
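The original listing isn't reproduced here, but a sketch along those lines with scipy.stats.gaussian_kde looks like this (group_a and group_b are the arrays from the question):

```python
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

# group_a and group_b are the arrays listed in the question
pdf_a = gaussian_kde(np.asarray(group_a, dtype=float))
pdf_b = gaussian_kde(np.asarray(group_b, dtype=float))

xs = np.linspace(0, 10000, 1000)
plt.plot(xs, pdf_a(xs), 'r', label='Group A (first dataset)')
plt.plot(xs, pdf_b(xs), 'g', label='Group B (second dataset)')
plt.legend()
plt.show()
```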
In the graph, the green is the pdf for the second dataset; the red is the pdf for the first dataset. Clearly the decision boundary is the vertical line that passes through the point where the green intersects with the red.
To find the boundary numerically, we can do something like the following (assuming there is only one intersection in the overlap region; otherwise the boundary is not well defined):
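Again as a sketch rather than the original listing, a root finder on the difference of the two density estimates works, assuming the densities cross exactly once between the two group means:

```python
from scipy.optimize import brentq

# Find where the two estimated densities are equal, searching between
# the two group means (pdf_a - pdf_b changes sign on that interval).
boundary = brentq(lambda x: pdf_a(x)[0] - pdf_b(x)[0],
                  np.mean(group_b), np.mean(group_a))
print(boundary)
```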
We found out that the boundary is located approximately at 3762.
If there are multiple intersections of the two pdfs, then to predict which class a data point `x` falls into, we calculate `pdf1(x)` and `pdf2(x)`; the larger one gives the class that minimizes the Bayes risk. See here for more details on the topic of Bayes risk and evaluation of the probability of prediction error.
Below is an example that includes three pdfs; at any query point `x`, we ask the three pdfs separately and pick the one with the maximum value of `pdf(x)` as the predicted class.
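A minimal sketch of that argmax rule, using made-up placeholder data for the three classes (the original answer's example data isn't reproduced here):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Placeholder data: three classes, each with its own kernel density estimate.
rng = np.random.default_rng(0)
samples = {
    "class 1": rng.normal(0, 1, 200),
    "class 2": rng.normal(4, 1.5, 200),
    "class 3": rng.normal(9, 2, 200),
}
pdfs = {name: gaussian_kde(data) for name, data in samples.items()}

def predict(x):
    """Return the class whose estimated pdf is largest at x."""
    return max(pdfs, key=lambda name: pdfs[name](x)[0])

print(predict(2.5))
```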