It's been a while since I was in college and knew how to calculate a best fit line, but I find myself needing to. Suppose I have a set of points, and I want to find the line that is the best fit for those points.
What is the equation to determine a best fit line?
How would I do that with PHP?
Of additional interest is probably how good of a fit the line is.
For that, use the Pearson correlation, here in a PHP function:
/**
 * Returns the Pearson correlation coefficient r for paired x and y values.
 * r measures how well a least-squares best fit line describes the data.
 *
 * @param array $x array of all x vals
 * @param array $y array of all y vals
 * @return float r between -1 and 1, or 0 when it cannot be computed
 */
function pearson(array $x, array $y)
{
    // only keys present in both arrays form (x, y) pairs
    $keys = array_keys(array_intersect_key($x, $y));
    // number of paired values (not count($x), which may include unpaired keys)
    $n = count($keys);
    if ($n == 0) {
        return 0;
    }
    // get all needed values as we step through the common keys
    $x_sum = 0;
    $y_sum = 0;
    $x_sum_sq = 0;
    $y_sum_sq = 0;
    $prod_sum = 0;
    foreach ($keys as $k) {
        $x_sum += $x[$k];
        $y_sum += $y[$k];
        $x_sum_sq += pow($x[$k], 2);
        $y_sum_sq += pow($y[$k], 2);
        $prod_sum += $x[$k] * $y[$k];
    }
    $numerator = $prod_sum - ($x_sum * $y_sum / $n);
    $denominator = sqrt(($x_sum_sq - pow($x_sum, 2) / $n) * ($y_sum_sq - pow($y_sum, 2) / $n));
    return $denominator == 0 ? 0 : $numerator / $denominator;
}
Here's an article comparing two ways to fit a line to data. One thing to watch out for is that there is a direct solution that is correct in theory but can have numerical problems. The article shows why that method can fail and gives another method that is better.
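To illustrate the kind of numerical issue mentioned above, here is a minimal sketch of one commonly used alternative to the direct sums-of-squares formula: subtract the means first, then accumulate the centered products. This is not necessarily the method the article recommends. The function name `fitLineCentered` and its return keys are made up for this example.

```php
<?php
/**
 * Least-squares line y = a + b*x computed from mean-centered sums.
 * Hypothetical helper for illustration only.
 *
 * @param array $xs x values
 * @param array $ys y values, same length as $xs
 * @return array ['intercept' => a, 'slope' => b]
 */
function fitLineCentered(array $xs, array $ys): array
{
    $xs = array_values($xs);
    $ys = array_values($ys);
    $n = count($xs);
    if ($n < 2 || $n !== count($ys)) {
        throw new InvalidArgumentException('Need at least two (x, y) pairs.');
    }

    // first pass: means
    $xMean = array_sum($xs) / $n;
    $yMean = array_sum($ys) / $n;

    // second pass: sums of centered products, which stay small
    // even when the raw x values are very large
    $sxx = 0.0;
    $sxy = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $dx = $xs[$i] - $xMean;
        $sxx += $dx * $dx;
        $sxy += $dx * ($ys[$i] - $yMean);
    }
    if ($sxx == 0.0) {
        throw new InvalidArgumentException('All x values are equal; the slope is undefined.');
    }

    $slope = $sxy / $sxx;
    return ['intercept' => $yMean - $slope * $xMean, 'slope' => $slope];
}

// x values with a large offset, where n*sum(x^2) - sum(x)^2 would lose precision
$fit = fitLineCentered([1e8 + 1, 1e8 + 2, 1e8 + 3], [3.0, 5.0, 7.0]);
printf("slope = %.6f, intercept = %.6f\n", $fit['slope'], $fit['intercept']);
```

The trade-off is a second pass over the data in exchange for not subtracting two huge, nearly equal numbers.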
Method of Least Squares: http://en.wikipedia.org/wiki/Least_squares
The book Numerical Recipes, 3rd Edition: The Art of Scientific Computing has the algorithms you need to implement least squares and other techniques.
Although you can use an iterative approach, you can directly calculate the slope and intercept of a line given a set of observations using a least-squares approach. See the "Univariate Linear Case" section of the Wikipedia article on linear regression for how to calculate the coefficients a and b in y = a + bx given a set of (x, y) points.
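For reference, the standard closed-form least-squares solution for y = a + bx over n points (x_i, y_i) can be written as follows. The notation is mine and may differ from the Wikipedia article's.

```latex
b = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}
         {n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2},
\qquad
a = \frac{1}{n} \left( \sum_{i=1}^{n} y_i - b \sum_{i=1}^{n} x_i \right) = \bar{y} - b\,\bar{x}
```

The denominator of b is zero when all x_i are equal, in which case no finite slope exists.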
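Implemented from the Wikipedia page, untested.
// $data is an array of x => y pairs. PHP truncates float array keys to
// integers, so this form only suits integer x values.
$sx = 0;
$sy = 0;
$sxy = 0;
$sx2 = 0;
$n = count($data);
foreach ($data as $x => $y) {
    $sx += $x;
    $sy += $y;
    $sxy += $x * $y;
    $sx2 += $x * $x;
}
$denominator = $n*$sx2 - $sx*$sx;
if ($denominator == 0) {
    // happens with no data, a single point, or identical x values
    throw new InvalidArgumentException('Cannot fit a line: the slope is undefined.');
}
$beta = ($n*$sxy - $sx*$sy) / $denominator;
$alpha = $sy/$n - $sx*$beta/$n;
echo "y = $alpha + $beta x";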
You may want to check out linear regression, or more generally, curve fitting.
A commonly used approach is to iteratively minimize the sum of squared y-differences between your points and the fit function.
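As a rough sketch of what iterative minimization can look like for a straight line, here is plain gradient descent on the mean squared y-difference. The function name, the default learning rate and the iteration count are assumptions chosen for this small example. They would need tuning for data on a different scale, and for a straight line the closed-form solution is usually the simpler choice.

```php
<?php
/**
 * Fits y = a + b*x by gradient descent on the mean squared error.
 * Hypothetical helper for illustration only.
 *
 * @param array $points list of [x, y] pairs
 * @param float $learningRate step size (assumed value, data dependent)
 * @param int $iterations number of update steps (assumed value)
 * @return array ['a' => intercept, 'b' => slope]
 */
function fitLineGradientDescent(array $points, float $learningRate = 0.05, int $iterations = 5000): array
{
    $n = count($points);
    if ($n === 0) {
        throw new InvalidArgumentException('Need at least one point.');
    }

    $a = 0.0; // intercept, starting guess
    $b = 0.0; // slope, starting guess

    for ($i = 0; $i < $iterations; $i++) {
        $gradA = 0.0;
        $gradB = 0.0;
        foreach ($points as [$x, $y]) {
            $error = ($a + $b * $x) - $y;   // signed y-difference for this point
            $gradA += 2 * $error / $n;      // d(MSE)/da
            $gradB += 2 * $error * $x / $n; // d(MSE)/db
        }
        // step against the gradient
        $a -= $learningRate * $gradA;
        $b -= $learningRate * $gradB;
    }

    return ['a' => $a, 'b' => $b];
}

// points lying exactly on y = 1 + 2x
$fit = fitLineGradientDescent([[0, 1], [1, 3], [2, 5], [3, 7], [4, 9]]);
printf("y = %.4f + %.4f x\n", $fit['a'], $fit['b']);
```

If the learning rate is too large for the data, the updates diverge instead of converging, which is one reason the direct formulas are preferred when they apply.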
To add on to FryGuy's answer, if you need a function that also gives R^2 (to show how good the fit is):
function mathTrend($data) {
    // $data is an array of rows; the first two values of each row are x and y
    $sx = 0;
    $sy = 0;
    $sxy = 0;
    $sx2 = 0;
    $n = count($data);
    if ($n <= 1) {
        return false;
    }
    foreach ($data as $row) {
        $row = array_values($row);
        $x = $row[0];
        $y = $row[1];
        $sx += $x;
        $sy += $y;
        $sxy += $x * $y;
        $sx2 += $x * $x;
    }
    $denominator = $n*$sx2 - $sx*$sx;
    if ($denominator == 0) {
        // all x values are equal, so the slope is undefined
        return false;
    }
    $yAvg = $sy / $n;
    $m = ($n*$sxy - $sx*$sy) / $denominator;
    $b = $sy/$n - $sx*$m/$n;
    //Go through again to determine rSquared
    //Using method from https://www.youtube.com/watch?v=w2FKXOa0HGA
    $diffActual = 0;
    $diffEstimated = 0;
    foreach ($data as $row) {
        $row = array_values($row);
        $x = $row[0];
        $y = $row[1];
        $expectedY = $m*$x+$b;
        $diffActual += ($y - $yAvg)**2;
        $diffEstimated += ($expectedY-$yAvg)**2;
    }
    // if every y is identical, rSquared is 0/0 and therefore undefined
    $rSquared = $diffActual == 0 ? null : $diffEstimated / $diffActual;
    $result = ['m'=> $m, 'b' => $b, 'rSquared' => $rSquared];
    return $result;
}
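Written out, the quantity this function returns as rSquared is the ratio of explained to total variation in y. The notation below is mine, with \hat{y}_i = m x_i + b the fitted value and \bar{y} the mean of the y values.

```latex
R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

For a least-squares line with an intercept, this equals the square of the Pearson correlation coefficient, so it should agree with squaring the result of the pearson function earlier in the thread, up to rounding.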