How to find the best straight line separating two

Posted 2019-05-20 02:47

I have a set of points in a 2D plot. The red points mark conditions where my experiment is stable, the black points where it is unstable. The two regions are clearly separated by a line in this log-log plot, and I would like to find the best "separating line", i.e. the line that gives the criterion separating the two regions with the minimum error on that criterion. I searched various books and online, but could not find an approach to this problem. Are you aware of any tool?

First of all, one has to define the error. One thing that comes to mind: if the unknown line is a*x+b*y+c=0, then for each point (x0,y0) we define an error function as follows:

E = 0 if the point lies on the correct side of the line.
E = distance(a*x+b*y+c=0, (x0,y0)) = |a*x0+b*y0+c|/sqrt(a^2+b^2)   if the point
   lies on the wrong side.

and we minimize the sum of the errors over all points. This is not simple, though, because the error has a threshold. Any references or links to approaches that solve this problem would be appreciated. Cheers, A.
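The error defined above can be written down directly; the following is only a sketch, assuming NumPy and a labelling convention that is not in the original post: a label of +1 means the point should satisfy a*x+b*y+c > 0, and -1 means it should satisfy a*x+b*y+c < 0. The function name `separation_error` is hypothetical.

```python
import numpy as np

def separation_error(a, b, c, points, labels):
    """Sum of perpendicular distances of the points on the wrong side of a*x + b*y + c = 0."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Signed distance of each point to the line
    signed = (a * points[:, 0] + b * points[:, 1] + c) / np.hypot(a, b)
    # labels * signed is negative exactly when a point is on the wrong side,
    # so the max(0, ...) reproduces the threshold in the definition of E
    return np.maximum(0.0, -labels * signed).sum()

# Assumed toy data: the line y = 0, with the third point on the wrong side
points = [(0.0, 1.0), (0.0, -1.0), (0.0, 2.0)]
labels = [+1, -1, -1]
total = separation_error(0.0, 1.0, 0.0, points, labels)
```

A sum of this kind could be handed to a general-purpose numerical minimizer over (a, b, c), although the kink at zero means it is not smooth.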

[Figure: log-log plot of the red (stable) and black (unstable) points]

1 answer
虎瘦雄心在
#2 · 2019-05-20 03:28

Some refs: Wikipedia Linear classifier and Support vector machine (SVM),
scikit-learn SVM, an example with 3 classes,
questions/tagged/classification on SO,
3000 more questions/tagged/classification on stats.stackexchange,
400 more questions/tagged/classification on datascience.stackexchange.

For your 2-class problem, do these steps:

  1. find the midpoints (centroids, i.e. the mean positions): Rmid of the red points, Bmid of the black, Mid of the lot

  2. draw the line L from Rmid to Bmid

  3. the (hyper)plane through Mid, perpendicular to line L, is what you want: a linear classifier.
    Or you can just compare the distances |x - Rmid| and |x - Bmid|: call x nearer Rmid red, nearer Bmid black.
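
The three steps can be sketched in a few lines of NumPy. This is a minimal illustration, not a definitive implementation: the function names `fit_midpoint_classifier` and `predict` and the toy data are assumptions, and "midpoint" is taken to mean the mean of the points. Note that the plane through Mid and the nearer-midpoint rule give the same boundary only when Mid is halfway between Rmid and Bmid, e.g. with equally many red and black points.

```python
import numpy as np

def fit_midpoint_classifier(red, black):
    """Return (w, offset) of the hyperplane through Mid, perpendicular to line L."""
    red = np.asarray(red, dtype=float)
    black = np.asarray(black, dtype=float)
    r_mid = red.mean(axis=0)                     # step 1: Rmid
    b_mid = black.mean(axis=0)                   # step 1: Bmid
    mid = np.vstack([red, black]).mean(axis=0)   # step 1: Mid of the lot
    w = r_mid - b_mid                            # step 2: direction of line L
    offset = w @ mid                             # step 3: the plane is w . x = offset
    return w, offset

def predict(w, offset, x):
    """Label each row of x 'red' if it falls on the Rmid side of the plane."""
    return np.where(np.asarray(x, dtype=float) @ w > offset, "red", "black")

# Assumed toy data: two well-separated squares of four points each
red = [[0, 0], [0, 2], [2, 0], [2, 2]]
black = [[4, 4], [4, 6], [6, 4], [6, 6]]
w, offset = fit_midpoint_classifier(red, black)
labels = predict(w, offset, [[1, 2], [5, 4]])
```

For log-log data like the question's, the inputs would presumably be the logarithms of the measured quantities, so that the boundary is a straight line in the plot.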

But there's more to be said. Projecting all the data points onto line L gives a 1-dimensional problem:

rrrrrrrrrrbrrrrrrrrbbrrr | rrbbbbbbbbbbbbbbb

It's a good idea to plot all the points on this line, to see and better understand the data.
(For point clouds in say 5 or 10 dimensions, it might be fun and/or informative to look at 2d or 3d slices from different angles.)
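
A rough way to produce such a strip of r's and b's, assuming NumPy; the helper name `strip_along_line` and the toy data are made up for illustration:

```python
import numpy as np

def strip_along_line(red, black):
    """Project all points onto line L (Rmid to Bmid) and list their colours in order."""
    red = np.asarray(red, dtype=float)
    black = np.asarray(black, dtype=float)
    r_mid, b_mid = red.mean(axis=0), black.mean(axis=0)
    direction = (b_mid - r_mid) / np.linalg.norm(b_mid - r_mid)  # unit vector along L
    # Signed position of every point along L, measured from Rmid
    t_red = (red - r_mid) @ direction
    t_black = (black - r_mid) @ direction
    t = np.concatenate([t_red, t_black])
    tags = np.array(["r"] * len(red) + ["b"] * len(black))
    strip = "".join(tags[np.argsort(t, kind="stable")])
    return t_red, t_black, strip

# Assumed toy data lying on the x-axis, with one red point among the black ones
red = [[0, 0], [1, 0], [2, 0], [5, 0]]
black = [[4, 0], [6, 0], [7, 0], [8, 0]]
t_red, t_black, strip = strip_along_line(red, black)
print(strip)
```

The arrays `t_red` and `t_black` are the 1-dimensional coordinates; a histogram of each is often more readable than the character strip when there are many points.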

Each cut, "|" above, gives a "confusion matrix" of 4 numbers:

R-correct   R-called-B  e.g.  490   10
B-called-R  B-correct          50  450

This gives a rough idea of the error rate of your predictions red / black; print it, discuss it.
The best cut depends on costs, e.g. if calling an R a B is 10 times or 100 times worse than calling a B an R.
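
As a sketch of how the cut might be chosen once costs are assigned, assuming NumPy and 1-dimensional projections where red lies on the low side: the names `confusion` and `best_cut` and the cost parameters are hypothetical, and the candidate cuts are simply the midpoints between neighbouring projected points.

```python
import numpy as np

def confusion(t_red, t_black, cut):
    """2x2 confusion matrix when points with projection < cut are called red."""
    r_correct = int(np.sum(t_red < cut))
    r_called_b = len(t_red) - r_correct
    b_called_r = int(np.sum(t_black < cut))
    b_correct = len(t_black) - b_called_r
    return np.array([[r_correct, r_called_b],
                     [b_called_r, b_correct]])

def best_cut(t_red, t_black, cost_r_as_b=1.0, cost_b_as_r=1.0):
    """Try a cut between every pair of neighbouring points; return the cheapest."""
    t = np.sort(np.concatenate([t_red, t_black]))
    candidates = np.concatenate([[t[0] - 1.0], (t[:-1] + t[1:]) / 2.0, [t[-1] + 1.0]])
    costs = []
    for cut in candidates:
        m = confusion(t_red, t_black, cut)
        costs.append(cost_r_as_b * m[0, 1] + cost_b_as_r * m[1, 0])
    return candidates[int(np.argmin(costs))]  # first minimum if there are ties

# Assumed toy projections: one red point sits among the black ones
t_red = np.array([-2.0, -1.0, 0.0, 3.0])
t_black = np.array([2.0, 4.0, 5.0, 6.0])
equal_cost_cut = best_cut(t_red, t_black)
red_costly_cut = best_cut(t_red, t_black, cost_r_as_b=10.0)
print(confusion(t_red, t_black, equal_cost_cut))
print(confusion(t_red, t_black, red_costly_cut))
```

With equal costs the cut lands before the stray red point; making "R called B" ten times more expensive moves the cut past it, at the price of one black point being called red.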

If the red points and the black points have different scatter / covariance, see Fisher's linear discriminant.
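
Fisher's linear discriminant replaces the direction Rmid - Bmid by S_w^{-1} (Rmid - Bmid), where S_w is the sum of the two within-class scatter matrices. A minimal NumPy sketch, in which the function name `fisher_direction` and the toy data are assumptions:

```python
import numpy as np

def fisher_direction(red, black):
    """Unit vector w proportional to S_w^{-1} (Rmid - Bmid)."""
    red = np.asarray(red, dtype=float)
    black = np.asarray(black, dtype=float)
    # Within-class scatter: each class's covariance times (n - 1), summed
    s_w = (np.cov(red, rowvar=False) * (len(red) - 1)
           + np.cov(black, rowvar=False) * (len(black) - 1))
    w = np.linalg.solve(s_w, red.mean(axis=0) - black.mean(axis=0))
    return w / np.linalg.norm(w)

# Assumed toy data: both classes are spread widely in x and narrowly in y,
# so the discriminant should point almost entirely along y
red = np.array([[0, 0], [10, 0], [0, 1], [10, 1]], dtype=float)
black = np.array([[2, 3], [12, 3], [2, 4], [12, 4]], dtype=float)
w = fisher_direction(red, black)
print(w)
```

Projecting onto `w` instead of onto the line between the midpoints gives the same kind of 1-dimensional picture, with the noisy direction down-weighted.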

("SVM" is jargon for a class of methods for "good" separating hyperplanes / hypersurfaces -- there's no "machine".)
