My goal is to solve:

Kc = y

with the pseudo-inverse (i.e. the minimum-norm solution):

c = K^+ y

so that the model is a (hopefully) high-degree polynomial f(x) = sum_i c_i x^i. I am especially interested in the underdetermined case, where we have more polynomial features than data points (few equations, too many unknowns), i.e. columns = deg + 1 > N = rows. Note that K is the Vandermonde matrix of polynomial features.
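To make the setup concrete, here is a tiny underdetermined example of my own (the degree and sample count are just illustrative) checking that the explicit pseudo-inverse and an SVD-based least-squares solve agree on the minimum-norm solution:

import numpy as np

# tiny underdetermined Vandermonde system: 5 samples, 10 unknowns (degree 9)
N, deg = 5, 9
x = np.linspace(-1, 1, N)
y = np.cos(2 * np.pi * x)
K = np.vander(x, deg + 1, increasing=True)  # columns [1, x, x**2, ...]

c_pinv = np.dot(np.linalg.pinv(K), y)           # explicit pseudo-inverse
c_lstsq = np.linalg.lstsq(K, y, rcond=None)[0]  # SVD-based, also minimum-norm

print(np.allclose(np.dot(K, c_pinv), y))   # both fit the data exactly (underdetermined)
print(np.allclose(c_pinv, c_lstsq))        # and agree on the minimum-norm coefficients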
I was initially using the Python function np.linalg.pinv, but then I noticed something funky was going on, as I noted here: Why do different methods for solving Xc=y in python give different solution when they should not?. In that question I use a square matrix to learn a function on the interval [-1, +1] with a high-degree polynomial. The answer there suggested that I lower the degree of the polynomial and/or increase the interval size. The main issue is that it's not clear to me how to choose the interval or the maximum degree before things become unreliable. I think my main issue is that choosing such a numerically stable range depends on the method that I may use. In the end, what I truly care about is that
- the method I use is exactly (or really close to) the pseudo-inverse for this polynomial fitting problem
- the method is numerically stable
Ideally I want to try a large-degree polynomial, but that might be limited by my machine precision. Is it possible to increase the numerical precision of the machine by using something more accurate than floats?
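On the precision question, one option I have been considering (my own sketch, not something I have benchmarked) is computing the pseudo-inverse in exact rational arithmetic with sympy. It removes floating-point round-off entirely, but it is far too slow for anything near degree 120:

import numpy as np
import sympy as sp

# exact-arithmetic pseudo-inverse on a small instance (very slow; for checking only)
N, deg = 5, 9
x = np.linspace(-1, 1, N)
y = np.cos(2 * np.pi * x)

to_rat = lambda v: sp.Rational(*float(v).as_integer_ratio())  # exact rational from the float's bits
K = sp.Matrix(N, deg + 1, [to_rat(v) for v in np.vander(x, deg + 1, increasing=True).ravel()])
b = sp.Matrix([to_rat(v) for v in y])

c_exact = K.pinv() * b                         # exact minimum-norm least-squares solution
print([float(c_exact[i]) for i in range(4)])   # compare against the floating-point coefficients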
Also, I really care that whatever Python function I use, it provides the answer closest to the pseudo-inverse (and, hopefully, that it's numerically stable so I can actually use it). To check which answer is closest to the pseudo-inverse, I wrote the following script:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
def l2_loss(y,y_):
    N = y.shape[0]
    return (1/N)*np.linalg.norm(y-y_)
## some parameters
lb,ub = -200,200
N=100
D0=1
degree_mdl = 120
## target function
freq_cos = 2
f_target = lambda x: np.cos(freq_cos*2*np.pi*x)
## evaluate target_f on x_points
X = np.linspace(lb,ub,N) # [N,]
Y = f_target(X) # [N,]
# get pinv solution
poly_feat = PolynomialFeatures(degree=degree_mdl)
Kern = poly_feat.fit_transform( X.reshape(N,D0) ) # low degrees first [1,x,x**2,...]
c_pinv = np.dot(np.linalg.pinv( Kern ), Y)
## get polyfit solution
c_polyfit = np.polyfit(X,Y,degree_mdl)[::-1] # need to reverse to get low degrees first [1,x,x**2,...]
##
c_lstsq,_,_,_ = np.linalg.lstsq(Kern,Y,rcond=None) # keep Y 1-D so c_lstsq has the same shape as c_pinv
##
print('lb,ub = {} '.format((lb,ub)))
print('differences with c_pinv')
print( '||c_pinv-c_pinv|| = {}'.format( np.linalg.norm(c_pinv-c_pinv) ))
print( '||c_pinv-c_polyfit|| = {}'.format( np.linalg.norm(c_pinv-c_polyfit) ))
print( '||c_pinv-c_lstsq|| = {}'.format( np.linalg.norm(c_pinv-c_lstsq) ))
##
print('differences with c_polyfit')
print( '||c_polyfit-c_pinv|| = {}'.format( np.linalg.norm(c_polyfit-c_pinv) ))
print( '||c_polyfit-c_polyfit|| = {}'.format( np.linalg.norm(c_polyfit-c_polyfit) ))
print( '||c_polyfit-c_lstsq|| = {}'.format( np.linalg.norm(c_polyfit-c_lstsq) ))
##
print('differences with c_lstsq')
print( '||c_lstsq-c_pinv|| = {}'.format( np.linalg.norm(c_lstsq-c_pinv) ))
print( '||c_lstsq-c_polyfit|| = {}'.format( np.linalg.norm(c_lstsq-c_polyfit) ))
print( '||c_lstsq-c_lstsq|| = {}'.format( np.linalg.norm(c_lstsq-c_lstsq) ))
##
print('Data set errors')
y_polyfit = np.dot(Kern,c_polyfit)
print( 'J_data(c_polyfit) = {}'.format( l2_loss(y_polyfit,Y) ) )
y_pinv = np.dot(Kern,c_pinv)
print( 'J_data(c_pinv) = {}'.format( l2_loss(y_pinv,Y) ) )
y_lstsq = np.dot(Kern,c_lstsq)
print( 'J_data(c_lstsq) = {}'.format( l2_loss(y_lstsq,Y) ) )
Using that, I noticed that polyfit rarely ever matches the parameters that pinv gives. I know pinv definitely returns the pseudo-inverse, so I guess if my main goal is to "make sure I use the pseudo-inverse" then it's a good idea to use np.linalg.pinv. However, I also know that, mathematically, the pseudo-inverse always minimizes the least-squares error J(c) = || Kc - y ||^2 no matter what (proof here, theorem 11.1.2, page 446). Thus, maybe my goal should be to just use the Python function that returns the smallest least-squares error J.
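Just to convince myself of that property numerically, here is a small check of my own (sizes are illustrative) that c_pinv attains the least-squares error and has the smallest norm among all least-squares solutions: adding a null-space direction of K leaves Kc unchanged but increases ||c||:

import numpy as np
from scipy.linalg import null_space

# small underdetermined system (sizes are just illustrative)
N, D = 5, 10
x = np.linspace(-1, 1, N)
K = np.vander(x, D, increasing=True)
y = np.cos(2 * np.pi * x)

c_pinv = np.dot(np.linalg.pinv(K), y)
Z = null_space(K)                                   # basis of the null space of K
c_other = c_pinv + np.dot(Z, np.ones(Z.shape[1]))   # a different least-squares solution

# same data error ...
print(np.linalg.norm(np.dot(K, c_pinv) - y), np.linalg.norm(np.dot(K, c_other) - y))
# ... but c_pinv has the smaller coefficient norm
print(np.linalg.norm(c_pinv), np.linalg.norm(c_other))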
Thus, I ran (in the underdetermined case) a comparison of the three methods:

- polyfit, np.polyfit
- pseudo-inverse, np.linalg.pinv
- least squares, np.linalg.lstsq

and compared the least-squares error they gave me on the data (a rough sketch of the comparison loop I used is shown after the example numbers below):
Then I inspected the weird dips the error curve seems to experience (which, by the way, seems like a complete mystery, since the algorithms are not stochastic), and the numbers were usually smaller for polyfit, for example:
lb,ub = (-100, 100)
Data set errors
J_data(c_polyfit) = 5.329753025633029e-12
J_data(c_pinv) = 0.06670557822873546
J_data(c_lstsq) = 0.7479733306782645
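The comparison loop was roughly like the following sketch of mine (the interval bounds listed here are just illustrative):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# rough sketch of the sweep: data error of the three solvers on a few intervals
N, D0, degree_mdl, freq_cos = 100, 1, 120, 2
f_target = lambda x: np.cos(freq_cos * 2 * np.pi * x)
for lb, ub in [(-1, 1), (-10, 10), (-100, 100), (-200, 200)]:
    X = np.linspace(lb, ub, N)
    Y = f_target(X)
    Kern = PolynomialFeatures(degree=degree_mdl).fit_transform(X.reshape(N, D0))
    c_pinv = np.dot(np.linalg.pinv(Kern), Y)
    c_polyfit = np.polyfit(X, Y, degree_mdl)[::-1]  # np.polyfit warns about conditioning at this degree
    c_lstsq = np.linalg.lstsq(Kern, Y, rcond=None)[0]
    for name, c in [('polyfit', c_polyfit), ('pinv', c_pinv), ('lstsq', c_lstsq)]:
        err = (1 / N) * np.linalg.norm(np.dot(Kern, c) - Y)
        print('lb,ub = {} J_data(c_{}) = {}'.format((lb, ub), name, err))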
Given these results, and given that the pseudo-inverse is a minimizer of the least-squares error, it seems that the best thing to do is to ignore np.linalg.pinv. Is that the best thing to do? Or am I missing something obvious?
As an extra note, I went into the polyfit code to see what exactly makes it give better least-squares errors (which, right now, I am using as a proxy for being the best approximation of the pseudo-inverse), and it seems it has some condition-number / numerical-stability code:
# scale lhs to improve condition number and solve
scale = NX.sqrt((lhs*lhs).sum(axis=0))
lhs /= scale
c, resids, rank, s = lstsq(lhs, rhs, rcond)
c = (c.T/scale).T # broadcast scale coefficients
which, I assume, is what gives polyfit the extra stability that pinv does not have?
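To test that assumption, I suppose one could apply the same column scaling by hand before calling pinv and undo it afterwards (a sketch of mine, reusing Kern, Y and l2_loss from the script above):

# scale every column of the Vandermonde matrix to unit Euclidean norm,
# solve in the scaled basis, then undo the scaling on the coefficients
# (mirrors the scaling step inside np.polyfit)
scale = np.sqrt((Kern * Kern).sum(axis=0))          # Euclidean norm of each column
c_scaled = np.dot(np.linalg.pinv(Kern / scale), Y)  # solve in the scaled basis
c_pinv_scaled = c_scaled / scale                    # undo the scaling on the coefficients
print('J_data(c_pinv_scaled) = {}'.format(l2_loss(np.dot(Kern, c_pinv_scaled), Y)))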
Is it the right decision to use polyfit for my task of approximating this linear system with high-degree polynomials?
Also, at this point I'm willing to use other software like MATLAB if it provides the correct pseudo-inverse AND more numerical stability (for most degrees and any bounds).
Another random idea I just had: maybe there is a nice way to sample the function such that the stability of the pseudo-inverse is good. My guess is that approximating a cosine with a polynomial requires some minimum number of samples or spacing between them (like the Nyquist-Shannon sampling theorem says when the basis functions are sinusoids...).
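One standard idea in that direction (a swap I have not verified for my exact setting) is to sample at Chebyshev nodes and work in the Chebyshev basis, which is known to behave much better numerically than equispaced points with raw monomials; a sketch:

import numpy as np
from numpy.polynomial import chebyshev as C

# sample at Chebyshev nodes and fit in the Chebyshev basis instead of raw monomials
lb, ub, N, deg = -1.0, 1.0, 100, 120
k = np.arange(N)
x_cheb = 0.5 * (lb + ub) + 0.5 * (ub - lb) * np.cos((2 * k + 1) * np.pi / (2 * N))
y = np.cos(2 * 2 * np.pi * x_cheb)

t = (2 * x_cheb - (lb + ub)) / (ub - lb)    # map the samples to [-1, 1] for the Chebyshev basis
K_cheb = C.chebvander(t, deg)               # design matrix of Chebyshev features
c_cheb = np.dot(np.linalg.pinv(K_cheb), y)  # minimum-norm solution in Chebyshev coefficients
print((1 / N) * np.linalg.norm(np.dot(K_cheb, c_cheb) - y))

The caveat is that the minimum-norm solution is then measured in Chebyshev coefficients rather than in the monomial coefficients, so it is not literally the same c = K^+ y I started from.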
It should be pointed out that explicitly (pseudo-)inverting a matrix and then multiplying is probably a bad idea. See:
https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/
That post only talks about the inverse, but I assume it extends to pseudo-inverses as well.
Right now my confusion is that usually we don't want to compute the pseudo-inverse explicitly and do A^+ y = x_min_norm to get the minimum-norm solution. However, I would have thought that np.linalg.lstsq would yield the answer that I wanted, but its error differs greatly from the other two. I find that extremely confusing... it makes me think I'm using the wrong way to get the minimum-norm solution in Python.
I am not trying to get a regularized solution. I am trying to get the minimum norm solution and nothing else, as numerically accurately as possible.
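For reference, the way I would try to get the minimum-norm solution without explicitly forming the pseudo-inverse is an SVD-based least-squares solver; a sketch, reusing Kern, Y and l2_loss from the script above (cond is left at its default here):

import numpy as np
from scipy.linalg import lstsq

# 'gelsd' is SVD-based, so for a rank-deficient / underdetermined system it
# returns the minimum-norm least-squares solution without forming pinv(Kern)
c_min_norm, residues, rank, sv = lstsq(Kern, Y, cond=None, lapack_driver='gelsd')
print('rank of Kern = {}'.format(rank))
print('J_data(c_min_norm) = {}'.format(l2_loss(np.dot(Kern, c_min_norm), Y)))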