Fast linear interpolation in Numpy / Scipy "along a path"

Posted 2020-05-20 04:08

Question:

Let's say that I have data from weather stations at 3 (known) altitudes on a mountain. Specifically, each station records a temperature measurement at its location every minute. I have two kinds of interpolation I'd like to perform. And I'd like to be able to perform each quickly.

So let's set up some data:

import numpy as np
from scipy.interpolate import interp1d
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(0)
N, sigma = 1000, 5  # N must be an int (np.random.randn rejects floats)

basetemps = 70 + (np.random.randn(N) * sigma)
midtemps = 50 + (np.random.randn(N) * sigma)
toptemps = 40 + (np.random.randn(N) * sigma)
alltemps = np.array([basetemps, midtemps, toptemps]).T # note transpose!
trend = np.sin(4 / N * np.arange(N)) * 30
trend = trend[:, np.newaxis]

altitudes = np.array([500, 1500, 4000]).astype(float)

finaltemps = pd.DataFrame(alltemps + trend, columns=altitudes)
finaltemps.index.names, finaltemps.columns.names = ['Time'], ['Altitude']
finaltemps.plot()

Great, so our temperatures look like this:

Interpolate all times to the same altitude:

I think this one is pretty straightforward. Say I want to get the temperature at an altitude of 1,000 at each time. I can just use the built-in scipy interpolation methods:

interping_function = interp1d(altitudes, finaltemps.values)
interped_to_1000 = interping_function(1000)

fig, ax = plt.subplots(1, 1, figsize=(8, 5))
finaltemps.plot(ax=ax, alpha=0.15)
ax.plot(interped_to_1000, label='Interped')
ax.legend(loc='best', title=finaltemps.columns.name)

This works nicely. And let's see about speed:

%%timeit
res = interp1d(altitudes, finaltemps.values)(1000)
#-> 1000 loops, best of 3: 207 µs per loop

Interpolate "along a path":

So now I have a second, related problem. Say I know the altitude of a hiking party as a function of time, and I want to compute the temperature at their (moving) location by linearly interpolating my data through time. In particular, the times at which I know the location of the hiking party are the same times at which I know the temperatures at my weather stations. I can do this without too much effort:

location = np.linspace(altitudes[0], altitudes[-1], N)
interped_along_path = np.array([interp1d(altitudes, finaltemps.values[i, :])(loc) 
                                             for i, loc in enumerate(location)])

fig, ax = plt.subplots(1, 1, figsize=(8, 5))
finaltemps.plot(ax=ax, alpha=0.15)
ax.plot(interped_along_path, label='Interped')
ax.legend(loc='best', title=finaltemps.columns.name)

So this works really nicely, but it's important to note that the key line above is using a list comprehension to hide an enormous amount of work. In the previous case, scipy created a single interpolation function for us and evaluated it once on a large amount of data. In this case, scipy is actually constructing N individual interpolating functions and evaluating each one once on a small amount of data. This feels inherently inefficient: there is a for loop lurking here (in the list comprehension) and moreover, this just feels flabby.
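Spelled out as an explicit loop, that comprehension is doing the equivalent of this (a sketch for illustration only):

# Equivalent to the comprehension above: one new interpolator per time step.
interped_along_path = np.empty(N)
for i, loc in enumerate(location):
    f = interp1d(altitudes, finaltemps.values[i, :])  # construct (expensive)
    interped_along_path[i] = f(loc)                   # evaluate one scalar (cheap)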

Not surprisingly, this is much slower than the previous case:

%%timeit
res = np.array([interp1d(altitudes, finaltemps.values[i, :])(loc) 
                            for i, loc in enumerate(location)])
#-> 10 loops, best of 3: 145 ms per loop

So the second example runs about 1,000× slower than the first, i.e. consistent with the idea that the heavy lifting is the "make a linear interpolation function" step, which is happening 1,000 times in the second example but only once in the first.
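One way to sanity-check that claim (a hypothetical micro-benchmark; results not shown) is to time construction and evaluation separately:

f = interp1d(altitudes, finaltemps.values)
%timeit interp1d(altitudes, finaltemps.values)  # construction cost
%timeit f(1000)                                 # evaluation cost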

So, the question: is there a better way to approach the second problem? For example, is there a good way to set it up with 2-dimensional interpolation (which could perhaps handle the case where the times at which the hiking party's locations are known are not the times at which the temperatures were sampled)? Or is there a particularly slick way to handle things when the times do line up? Or something else?

Answer 1:

For a fixed point in time, you can utilize the following interpolation function:

g(a) = cc[0]*abs(a-aa[0]) + cc[1]*abs(a-aa[1]) + cc[2]*abs(a-aa[2])

where a is the hiker's altitude, aa is the vector of the 3 measurement altitudes, and cc is a vector of coefficients. There are three things to note:

  1. For given temperatures (alltemps) corresponding to aa, determining cc amounts to solving a linear matrix equation, e.g. with np.linalg.solve().
  2. g(a) is easy to vectorize for an (N,) dimensional a and an (N, 3) dimensional cc (including the np.linalg.solve() step, correspondingly).
  3. g(a) is called a first order univariate spline kernel (for three points). Using abs(a-aa[i])**(2*d-1) would change the spline order to d (see the sketch at the end of this answer). This approach could be interpreted as a simplified version of a Gaussian process in machine learning.

So the code would be:

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# generate temperatures
np.random.seed(0)
N, sigma = 1000, 5
trend = np.sin(4 / N * np.arange(N)) * 30
alltemps = np.array([tmp0 + trend + sigma*np.random.randn(N)
                     for tmp0 in [70, 50, 40]])

# generate altitudes:
altitudes = np.array([500, 1500, 4000]).astype(float)
location = np.linspace(altitudes[0], altitudes[-1], N)


def doit():
    """ do the interpolation, improved version for speed """
    AA = np.abs(altitudes[:, np.newaxis] - altitudes)
    # This is slightly faster than np.linalg.solve(), because AA is small:
    cc = np.dot(np.linalg.inv(AA), alltemps)

    return (cc[0]*np.abs(location-altitudes[0]) +
            cc[1]*np.abs(location-altitudes[1]) +
            cc[2]*np.abs(location-altitudes[2]))


t_loc = doit()  # call interpolator

# do the plotting:
fg, ax = plt.subplots(num=1)
for alt, t in zip(altitudes, alltemps):
    ax.plot(t, label="%d feet" % alt, alpha=.5)
ax.plot(t_loc, label="Interpolation")
ax.legend(loc="best", title="Altitude:")
ax.set_xlabel("Time")
ax.set_ylabel("Temperature")
fg.canvas.draw()

Measuring the time gives:

In [2]: %timeit doit()
10000 loops, best of 3: 107 µs per loop

Update: I replaced the original list comprehensions in doit() to improve speed by 30% (for N=1000).

Furthermore, as requested, here are the results of @moarningsun's benchmark code block on my machine:

original
10 loops, best of 3: 110 ms per loop
interp_checked
10000 loops, best of 3: 83.9 µs per loop
scipy_interpn
1000 loops, best of 3: 678 µs per loop
Output allclose:
[True, True, True]

Note that N=1000 is a relatively small number. Using N=100000 produces the results:

interp_checked
100 loops, best of 3: 8.37 ms per loop

%timeit doit()
100 loops, best of 3: 5.31 ms per loop

This shows that this approach scales better for large N than the interp_checked approach.
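As a side note on point 3 above, here is a minimal sketch of how the same construction generalizes to higher spline orders (kernel_interp is a hypothetical helper name, not part of the code benchmarked here; d=1 reproduces doit() up to the solve/inv choice):

def kernel_interp(altitudes, alltemps, location, d=1):
    """Kernel interpolation with |a - a_i|**(2*d - 1); d=1 is linear."""
    p = 2*d - 1
    AA = np.abs(altitudes[:, np.newaxis] - altitudes)**p  # (3, 3) kernel matrix
    cc = np.linalg.solve(AA, alltemps)                    # (3, N) coefficients
    K = np.abs(location[:, np.newaxis] - altitudes)**p    # (N, 3) kernel values
    return np.einsum('nj,jn->n', K, cc)                   # one temperature per time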



Answer 2:

A linear interpolation between two values y1, y2 at locations x1 and x2, evaluated at a point xi, is simply:

yi = y1 + (y2-y1) * (xi-x1) / (x2-x1)
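For example, with stations at x1 = 500 and x2 = 1500 reading (say) y1 = 70 and y2 = 50, the temperature at xi = 1000 would be 70 + (50 - 70) * (1000 - 500) / (1500 - 500) = 60.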

With some vectorized Numpy expressions we can select the relevant points from the dataset and apply the above function:

I = np.searchsorted(altitudes, location)

x1 = altitudes[I-1]
x2 = altitudes[I]

time = np.arange(len(alltemps))
y1 = alltemps[time,I-1]
y2 = alltemps[time,I]

xI = location

yI = y1 + (y2-y1) * (xI-x1) / (x2-x1)

The trouble is that some points lie on the boundaries of (or even outside of) the known range, which should be taken into account:

I = np.searchsorted(altitudes, location)
same = (location == altitudes.take(I, mode='clip'))
out_of_range = ~same & ((I == 0) | (I == altitudes.size))
I[out_of_range] = 1  # Prevent index-errors

x1 = altitudes[I-1]
x2 = altitudes[I]

time = np.arange(len(alltemps))
y1 = alltemps[time,I-1]
y2 = alltemps[time,I]

xI = location

yI = y1 + (y2-y1) * (xI-x1) / (x2-x1)
yI[out_of_range] = np.nan

Luckily, Scipy already provides N-dimensional interpolation, which also takes care of mismatched times just as easily, for example:

from scipy.interpolate import interpn

time = np.arange(len(alltemps))

M = 150
hiketime = np.linspace(time[0], time[-1], M)
location = np.linspace(altitudes[0], altitudes[-1], M)
xI = np.column_stack((hiketime, location))

yI = interpn((time, altitudes), alltemps, xI)
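If the same gridded data will be queried many times, the interpolation is also available as a reusable object (interpn is essentially a convenience wrapper around it). A minimal sketch, assuming the same time, altitudes, alltemps, hiketime and location as above:

from scipy.interpolate import RegularGridInterpolator

# Build once, then evaluate on arbitrary (time, altitude) query points.
interpolator = RegularGridInterpolator((time, altitudes), alltemps)
yI = interpolator(np.column_stack((hiketime, location)))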

Here's the benchmark code (without any pandas actually, but I did include the solution from the other answer):

import numpy as np
from scipy.interpolate import interp1d, interpn

def original():
    return np.array([interp1d(altitudes, alltemps[i, :])(loc)
                                for i, loc in enumerate(location)])

def OP_self_answer():
    return np.diagonal(interp1d(altitudes, alltemps)(location))

def interp_checked():
    I = np.searchsorted(altitudes, location)
    same = (location == altitudes.take(I, mode='clip'))
    out_of_range = ~same & ((I == 0) | (I == altitudes.size))
    I[out_of_range] = 1  # Prevent index-errors

    x1 = altitudes[I-1]
    x2 = altitudes[I]

    time = np.arange(len(alltemps))
    y1 = alltemps[time,I-1]
    y2 = alltemps[time,I]

    xI = location

    yI = y1 + (y2-y1) * (xI-x1) / (x2-x1)
    yI[out_of_range] = np.nan

    return yI

def scipy_interpn():
    time = np.arange(len(alltemps))
    xI = np.column_stack((time, location))
    yI = interpn((time, altitudes), alltemps, xI)
    return yI

N, sigma = 1000, 5

basetemps = 70 + (np.random.randn(N) * sigma)
midtemps = 50 + (np.random.randn(N) * sigma)
toptemps = 40 + (np.random.randn(N) * sigma)
trend = np.sin(4 / N * np.arange(N)) * 30
trend = trend[:, np.newaxis]
alltemps = np.array([basetemps, midtemps, toptemps]).T + trend
altitudes = np.array([500, 1500, 4000], dtype=float)
location = np.linspace(altitudes[0], altitudes[-1], N)

funcs = [original, OP_self_answer, interp_checked, scipy_interpn]
for func in funcs:
    print(func.__name__)
    %timeit func()

from itertools import combinations
outs = [func() for func in funcs]
print('Output allclose:')
print([np.allclose(out1, out2) for out1, out2 in combinations(outs, 2)])

With the following result on my system:

original
10 loops, best of 3: 184 ms per loop
OP_self_answer
10 loops, best of 3: 89.3 ms per loop
interp_checked
1000 loops, best of 3: 224 µs per loop
scipy_interpn
1000 loops, best of 3: 1.36 ms per loop
Output allclose:
[True, True, True, True, True, True]

Scipy's interpn suffers somewhat in terms of speed compared to the very fastest method, but for its generality and ease of use it's definitely the way to go.



Answer 3:

I'll offer one bit of progress. In the second case (interpolating "along a path") we are making many different interpolation functions. One thing we could try is to make just one interpolation function (one which does interpolation in the altitude dimension over all times as in the first case above) and evaluate that function over and over (in a vectorized way). That would give us way more data than we want (it would give us a 1,000 x 1,000 matrix instead of a 1,000-element vector). But then our target result would just be along the diagonal. So the question is, does calling a single function on way more complex arguments run faster than making many functions and calling them with simple arguments?

The answer is yes!

The key is that the interpolating function returned by scipy.interpolate.interp1d is able to accept a numpy.ndarray as its input. So you can effectively call the interpolating function many times at C-speed by feeding in a vector input. I.e. this is way, way faster than writing a for loop which calls the interpolating function over and over again on a scalar input. So while we compute many, many data points that we wind up throwing away, we save even more time by not constructing many different interpolating functions that we barely use.

old_way = interped_along_path = np.array([interp1d(altitudes, finaltemps.values[i, :])(loc) 
                                                      for i, loc in enumerate(location)])
# look ma, no for loops!
new_way = np.diagonal(interp1d(altitudes, finaltemps.values)(location)) 
# note, `location` is a vector!
abs(old_way - new_way).max()
#-> 0.0

and yet:

%%timeit
res = np.diagonal(interp1d(altitudes, finaltemps.values)(location))
#-> 100 loops, best of 3: 16.7 ms per loop

So this approach gives us a factor of 10 improvement! Can anyone do better? Or suggest an entirely different approach?