My goal is to solve
Kc = y
with the pseudo-inverse (i.e. the minimum norm solution):
c = K^{+} y
such that the model is a (hopefully) high degree polynomial f(x) = sum_i c_i x^i. I am especially interested in the underdetermined case, where we have more polynomial features than data points (few equations, too many unknowns), i.e. columns = deg+1 > N = rows. Note that K is the Vandermonde matrix of polynomial features.
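As a small note (my own addition, assuming K has full row rank): in that case the minimum norm solution has the closed form c = K^T (K K^T)^{-1} y, which is what c = K^{+} y reduces to. A tiny sanity check of that identity:

import numpy as np

np.random.seed(0)
K_toy = np.random.randn(5, 8)   # underdetermined toy system: 5 equations, 8 unknowns
y_toy = np.random.randn(5)
c_a = np.dot(np.linalg.pinv(K_toy), y_toy)
c_b = np.dot(K_toy.T, np.linalg.solve(np.dot(K_toy, K_toy.T), y_toy))
print(np.allclose(c_a, c_b))    # True, both are the minimum norm solution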
I was initially using the python function np.linalg.pinv, but then I noticed something funky was going on, as I noted here: Why do different methods for solving Xc=y in python give different solution when they should not?. In that question I use a square matrix to learn a function on the interval [-1, +1] with a high degree polynomial. The answer there suggested that I lower the degree of the polynomial and/or increase the interval size. The main issue is that it's not clear to me how to choose the interval or the max degree before things become unreliable. I think my main issue is that choosing such a numerically stable range depends on the method I may use. In the end what I truly care about is that
- the method I use is exactly (or really close to) the pseudo-inverse for this polynomial fitting problem
- it is numerically stable
Ideally I want to try a large degree polynomial, but that might be limited by my machine precision. Is it possible to increase the numerical precision of the machine by using something more accurate than floats?
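For example (this is just a sketch of mine that I have not pushed very far), I could do the arithmetic in software floating point with mpmath, building the Vandermonde matrix at, say, 100 digits and using the full row rank formula c = K^T (K K^T)^{-1} y for the minimum norm solution. It is very slow, so I keep the sizes small:

from mpmath import mp

mp.dps = 100                                     # work with ~100 significant digits
N, deg = 30, 50                                  # keep it small, this is slow
xs = [mp.mpf(-1) + 2*mp.mpf(i)/(N-1) for i in range(N)]
ys = mp.matrix([mp.cos(2*2*mp.pi*x) for x in xs])

K = mp.matrix(N, deg+1)
for i, x in enumerate(xs):
    for j in range(deg+1):
        K[i, j] = x**j                           # Vandermonde row [1, x, x**2, ...]

z = mp.lu_solve(K*K.T, ys)                       # solve (K K^T) z = y
c = K.T*z                                        # minimum norm coefficients in high precision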
Also, I really care that whatever python function I use provides the closest answer to the pseudo-inverse (and hopefully that it's numerically stable so I can actually use it). To check which answer is closest to the pseudo-inverse I wrote the following script:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
def l2_loss(y,y_):
    N = y.shape[0]
    return (1/N)*np.linalg.norm(y-y_)
## some parameters
lb,ub = -200,200
N=100
D0=1
degree_mdl = 120
## target function
freq_cos = 2
f_target = lambda x: np.cos(freq_cos*2*np.pi*x)
## evaluate target_f on x_points
X = np.linspace(lb,ub,N) # [N,]
Y = f_target(X) # [N,]
# get pinv solution
poly_feat = PolynomialFeatures(degree=degree_mdl)
Kern = poly_feat.fit_transform( X.reshape(N,D0) ) # low degrees first [1,x,x**2,...]
c_pinv = np.dot(np.linalg.pinv( Kern ), Y)
## get polyfit solution
c_polyfit = np.polyfit(X,Y,degree_mdl)[::-1] # need to reverse to get low degrees first [1,x,x**2,...]
##
c_lstsq,_,_,_ = np.linalg.lstsq(Kern,Y,rcond=None) # keep Y as shape [N,] so c_lstsq has the same shape as c_pinv
##
print('lb,ub = {} '.format((lb,ub)))
print('differences with c_pinv')
print( '||c_pinv-c_pinv|| = {}'.format( np.linalg.norm(c_pinv-c_pinv) ))
print( '||c_pinv-c_polyfit|| = {}'.format( np.linalg.norm(c_pinv-c_polyfit) ))
print( '||c_pinv-c_lstsq|| = {}'.format( np.linalg.norm(c_pinv-c_lstsq) ))
##
print('differences with c_polyfit')
print( '||c_polyfit-c_pinv|| = {}'.format( np.linalg.norm(c_polyfit-c_pinv) ))
print( '||c_polyfit-c_polyfit|| = {}'.format( np.linalg.norm(c_polyfit-c_polyfit) ))
print( '||c_polyfit-c_lstsq|| = {}'.format( np.linalg.norm(c_polyfit-c_lstsq) ))
##
print('differences with c_lstsq')
print( '||c_lstsq-c_pinv|| = {}'.format( np.linalg.norm(c_lstsq-c_pinv) ))
print( '||c_lstsq-c_polyfit|| = {}'.format( np.linalg.norm(c_lstsq-c_polyfit) ))
print( '||c_lstsq-c_lstsq|| = {}'.format( np.linalg.norm(c_lstsq-c_lstsq) ))
##
print('Data set errors')
y_polyfit = np.dot(Kern,c_polyfit)
print( 'J_data(c_polyfit) = {}'.format( l2_loss(y_polyfit,Y) ) )
y_pinv = np.dot(Kern,c_pinv)
print( 'J_data(c_pinv) = {}'.format( l2_loss(y_pinv,Y) ) )
y_lstsq = np.dot(Kern,c_lstsq)
print( 'J_data(c_lstsq) = {}'.format( l2_loss(y_lstsq,Y) ) )
Using that I managed to notice that polyfit rarely ever matches the coefficients that pinv returns. I know pinv definitely returns the pseudo-inverse, so I guess if my main goal is to "make sure I use the pseudo-inverse" then it's a good idea to use np.linalg.pinv. However, I also know mathematically that the pseudo-inverse always minimizes the least squares error J(c) = || Kc - y ||^2 no matter what (proof here, theorem 11.1.2, page 446). Thus, maybe my goal should be to just use the python function that returns the smallest least squares error J.
Thus, I ran (in the underdetermined case) a comparison of the three methods:
- polyfit, np.polyfit
- pseudo-inverse, np.linalg.pinv
- least squares, np.linalg.lstsq
and compared what least squares error they gave me on the data:
Then I inspected the weird dips that show up (which, btw, seem like a complete mystery, since the algorithms are not stochastic), and the numbers were usually smaller for polyfit, for example:
lb,ub = (-100, 100)
Data set errors
J_data(c_polyfit) = 5.329753025633029e-12
J_data(c_pinv) = 0.06670557822873546
J_data(c_lstsq) = 0.7479733306782645
Given these results, and the fact that the pseudo-inverse is a minimizer of the least squares error, it seems the best thing to do is to ignore np.linalg.pinv. Is that the best thing to do? Or am I missing something obvious?
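Before discarding np.linalg.pinv I figured I should also check the other defining property of the minimum norm solution: among all minimizers of the residual it has the smallest ||c||_2. So (continuing from my script above) I can also print the coefficient norms:

print('||c_polyfit|| = {}'.format(np.linalg.norm(c_polyfit)))
print('||c_pinv||    = {}'.format(np.linalg.norm(c_pinv)))
print('||c_lstsq||   = {}'.format(np.linalg.norm(c_lstsq)))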
As an extra note, I went into the polyfit code to see what exactly is making it have better least squares errors (which right now I am using as a proxy for being the best approximation to the pseudo-inverse), and it seems it has some conditioning/numerical stability code:
# scale lhs to improve condition number and solve
scale = NX.sqrt((lhs*lhs).sum(axis=0))
lhs /= scale
c, resids, rank, s = lstsq(lhs, rhs, rcond)
c = (c.T/scale).T # broadcast scale coefficients
which, I assume, is what gives polyfit the extra stability that pinv does not have?
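To test that hypothesis I could apply the same column scaling trick to Kern by hand before calling pinv (a sketch of mine, continuing from the script above). Though, if I understand correctly, the minimum norm solution of the scaled problem is in general not the minimum norm solution of the original problem, so this might trade away exactly the property I said I care about:

scale = np.sqrt((Kern*Kern).sum(axis=0))           # column norms, as in polyfit
Kern_s = Kern/scale
c_s = np.dot(np.linalg.pinv(Kern_s), Y)            # solve the better conditioned problem
c_pinv_scaled = c_s/scale                          # map coefficients back to the original basis
print('J_data(c_pinv_scaled) = {}'.format( l2_loss(np.dot(Kern, c_pinv_scaled), Y) ))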
Is using polyfit the right decision for my task of fitting high degree polynomials via this linear system?
Also, at this point I'm willing to use other software like MATLAB if it provides the correct pseudo-inverse AND more numerical stability (for most degrees and any bounds).
Another random idea I just had: maybe there is a nice way to sample the function such that the stability of the pseudo-inverse is good. My guess is that approximating a cosine with a polynomial requires some minimum number of samples or spacing between them (like the Nyquist-Shannon sampling theorem says when the basis functions are sinusoids...).
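For example (just a sketch of that idea, not something I have verified carefully), I could sample at Chebyshev nodes instead of equispaced points, which as far as I know usually improves the conditioning of polynomial fitting, and even fit in the Chebyshev basis via numpy.polynomial instead of the raw monomial basis:

import numpy as np

n_samples = 100
lb_c, ub_c = -1.0, 1.0
k = np.arange(n_samples)
X_cheb = 0.5*(lb_c+ub_c) + 0.5*(ub_c-lb_c)*np.cos((2*k+1)*np.pi/(2*n_samples))  # Chebyshev nodes mapped to [lb_c, ub_c]
Y_cheb = np.cos(2*2*np.pi*X_cheb)                                               # same target as above, freq_cos = 2
# fitting in the Chebyshev basis (instead of the monomial/Vandermonde basis) also helps,
# at the price of changing what the coefficients mean
c_cheb = np.polynomial.chebyshev.chebfit(X_cheb, Y_cheb, 50)
resid = Y_cheb - np.polynomial.chebyshev.chebval(X_cheb, c_cheb)
print(np.max(np.abs(resid)))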
It should be pointed out that inverting (or pseudo-inverting) and then multiplying is probably a bad idea. See:
https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/
That one only talks about the inverse, but I assume it extends to pseudo-inverses as well.
Right now my confusion is that usually we don't want to compute the pseudo-inverse explicitly and do A^{+} y = x_min_norm to get the minimum norm solution. However, I would have thought that np.linalg.lstsq would yield the answer that I want, but its error differs greatly from the other two. I find that extremely confusing... it makes me think I'm using the wrong way to get the minimum norm solution in python.
I am not trying to get a regularized solution. I am trying to get the minimum norm solution and nothing else, as numerically accurately as possible.
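To make that concrete, here is a sketch of mine (it adds scipy as a dependency and continues from the script above) that gets the minimum norm solution without explicitly forming the pseudo-inverse, by asking scipy.linalg.lstsq for its SVD based LAPACK driver, and compares it against c_pinv:

import scipy.linalg

c_gelsd,_,_,_ = scipy.linalg.lstsq(Kern, Y, lapack_driver='gelsd')  # SVD based, returns the minimum norm solution
print('||c_pinv - c_gelsd|| = {}'.format(np.linalg.norm(c_pinv - c_gelsd)))
print('J_data(c_gelsd) = {}'.format( l2_loss(np.dot(Kern, c_gelsd), Y) ))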
My area of research involves a compression algorithm essentially called Fourier extension. Which method is most accurate? I believe it depends heavily on the vector, due to smoothness properties. Over the summer I used something called Savitzky-Golay filtering; there are fairly numerically stable and accurate ways of filtering with it. However, my adviser has a method that is relatively fast and numerically stable; the area is called Fourier extension or Fourier continuation. I don't know if I am allowed to post it; here is the article. I believe I may have already partially posted about it here in python over the summer.
This has little to do with python specifically, because python uses the same underlying libraries as most numerical computing environments, namely BLAS and LAPACK. Netlib is online.
There are a number of other similar fast and numerically stable ideas that may be suitable and that I would recommend. There is an entire book devoted to this by Boyd; chapters 6 and 7 are on this. It is about total variation with regularization, due to the underlying noise you may have in the signal, I imagine.
Other aspects: you may need to regularize the SVD due to ill-conditioning; there are entire books devoted to this. To answer your question "what is the best algorithm": there are multiple dimensions to algorithms, and you haven't really stated the properties of the problem. You should also know about Runge's phenomenon: using high degree polynomials is adverse.
There is a whole class of Hermite polynomials to deal with the Gibbs phenomenon, and other filtering techniques, but the problem isn't posed well. You're using generic functions. I'd recommend getting Trefethen and Bau. Sometimes they do a Chebyshev reprojection.
What is the condition number of K? Additionally, there is something that happens when fitting polynomials called Runge's phenomenon; you should consider this. With generic functions you need to do a low rank approximation to regularize if the condition number is too high. I actually just read it: you are using a Vandermonde matrix. I am going to demonstrate this pretty easily. Vandermonde matrices are bad; don't use them. They have knots.
c1 =
6.0469e+12
c2 =
9.3987e+32
I attempted a low rank approximation, but the Vandermonde matrices are not nice: it does nothing, really. If you don't know what is happening when you take that inverse: the condition number is the maximum singular value over the minimum, so you have some extremely small singular values at machine precision.
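To show this in python, since that is what you are using (this is a sketch of mine; the matlab code behind the numbers above is not shown): watch how fast the condition number of the Vandermonde matrix grows with the degree, and note that the rcond argument of np.linalg.pinv is exactly the low rank regularization I am talking about, since it zeros out singular values below rcond times the largest one.

import numpy as np

x = np.linspace(-1, 1, 100)
for deg in (10, 20, 40):
    V = np.vander(x, deg+1, increasing=True)
    s = np.linalg.svd(V, compute_uv=False)
    print(deg, s[0]/s[-1])                        # condition number = sigma_max / sigma_min

# truncated SVD pseudo-inverse: singular values below rcond*sigma_max are discarded
y = np.cos(2*2*np.pi*x)
V = np.vander(x, 41, increasing=True)
c_reg = np.dot(np.linalg.pinv(V, rcond=1e-10), y)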
Additionally, I think you have some confusion about minimum norm and regularization. You said you want the minimum norm solution in the least squares sense. The SVD gives the least squares solution (it's property nine: A is constructed from a basis by the SVD). This is covered in Trefethen, but the Vandermonde matrix is ill-conditioned.
Even small, poorly constructed Vandermonde matrices are going to lose it. Now, about your solution: don't use Vandermonde matrices. Construct the polynomial some other way. A better idea is barycentric Lagrange interpolation. A library is here
Here is an example in matlab; barylag is taken from the matlab website. As for your discrepancies, I can't really comment, but you should realize how lsqr is actually done: LSQR algorithms are Krylov methods. This is covered in Trefethen, as is the SVD. I have an example on my quora page about numerical stability with the QR factorization, which is how these algorithms are actually constructed.
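Since your question is in python, here is a rough python analogue of the barylag idea using scipy (my sketch, not the matlab code): interpolate at Chebyshev points with the barycentric formula, so no Vandermonde matrix is ever formed.

import numpy as np
from scipy.interpolate import BarycentricInterpolator

n = 60
k = np.arange(n+1)
x_nodes = np.cos(np.pi*k/n)                       # Chebyshev-Lobatto points on [-1, 1]
y_nodes = np.cos(2*2*np.pi*x_nodes)
p = BarycentricInterpolator(x_nodes, y_nodes)

x_test = np.linspace(-1, 1, 1000)
print(np.max(np.abs(p(x_test) - np.cos(2*2*np.pi*x_test))))   # interpolation error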