Best way to pass repeated parameter to a Numpy vec

Posted 2020-04-10 17:03

Question:

So, continuing from the discussion @TheBlackCat and I were having in this answer, I would like to know the best way to pass arguments to a Numpy vectorized function. The function in question is defined thus:

vect_dist_funct = np.vectorize(lambda p1, p2: vincenty(p1, p2).meters)

where vincenty comes from the Geopy package.

I currently call vect_dist_funct in this manner:

def pointer(point, centroid, tree_idx):
    intersect = list(tree_idx.intersection(point))
    if len(intersect) > 0:
        points = pd.Series([point]*len(intersect)).values
        polygons = centroid.loc[intersect].values
        dist = vect_dist_funct(points, polygons)
        return pd.Series(dist, index=intersect, name='Dist').sort_values()
    else:
        return pd.Series(np.nan, index=[0], name='Dist')

points['geometry'].apply(lambda x: pointer(point=x.coords[0], centroid=line['centroid'], tree_idx=tree_idx))

(Please refer to the question here: Labelled datatypes Python)

My question pertains to what happens inside the function pointer. The reason I am converting points to a pandas.Series and then getting the values (in the 4th line, just under the if statement) is to make it in the same shape as polygons. If I merely call points either as points = [point]*len(intersect) or as points = itertools.repeat(point, len(intersect)), Numpy complains that it "cannot broadcast arrays of size (n,2) and size (n,) together" (n is the length of intersect).

If I call vect_dist_funct like so: dist = vect_dist_funct(itertools.repeat(points, len(intersect)), polygons), vincenty complains that I have passed it too many arguments. I am at a complete loss to understand the difference between the two.

Note that these are coordinates, and therefore will always be in pairs. Here are examples of what point and polygons look like:

point = (-104.950752, 39.854744)  # Passed directly to the function like this.
polygons = array([(-104.21750802451864, 37.84052458697633),
                  (-105.01017084789603, 39.82012158954065),
                  (-105.03965315742742, 40.669867471420886),
                  (-104.90353460825702, 39.837631505433706),
                  (-104.8650601872832, 39.870796282334744)], dtype=object)
           # As returned by statement centroid.loc[intersect].values

What is the best way to call vect_dist_funct in this circumstance, so that the call is vectorized and neither Numpy nor vincenty complains about wrong arguments? Techniques that minimize memory consumption and increase speed are also welcome. The goal is to compute the distance from the point to each polygon centroid.

Answer 1:

np.vectorize doesn't really help you here. As per the documentation:

The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.

In fact, vectorize actively hurts you, since it converts the inputs into numpy arrays, doing an unnecessary and expensive type conversion and producing the errors you are seeing. You are much better off using a function with a for loop.
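
To make the conversion problem concrete, here is a minimal sketch of the shapes involved. It assumes polygons is a 1-D object array of tuples, as the question's sample output suggests, and builds one by hand; the names centroids and repeated are only illustrative.

```python
import numpy as np

point = (-104.950752, 39.854744)
centroids = [(-104.21750802451864, 37.84052458697633),
             (-105.01017084789603, 39.82012158954065),
             (-105.03965315742742, 40.669867471420886)]

# A 1-D object array whose elements are tuples, mimicking (by assumption)
# what centroid.loc[intersect].values returns in the question.
polygons = np.empty(len(centroids), dtype=object)
for i, c in enumerate(centroids):
    polygons[i] = c

# A plain list of tuples is converted to a 2-D float array instead,
# so each coordinate pair is split into two separate numbers.
repeated = np.asarray([point] * len(centroids))

print(polygons.shape)  # (3,)
print(repeated.shape)  # (3, 2)

# Shapes (3, 2) and (3,) cannot be broadcast together.
try:
    np.broadcast(repeated, polygons)
    broadcast_ok = True
except ValueError:
    broadcast_ok = False
print(broadcast_ok)  # False
```

This is presumably why the pandas.Series round trip in the question works: its values come back as a 1-D object array of tuples, which matches the shape of polygons.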

It is also better to use a function rather than a lambda for a top-level function, since it lets you have a docstring.

So this is how I would implement what you are doing:

def vect_dist_funct(p1, p2):
    """Apply `vincenty` to `p1` and each element of `p2`.

    Iterate over `p2`, returning `vincenty` with the first argument
    as `p1` and the second as the current element of `p2`.  Returns
    a list where each value is the result of the `vincenty` function
    call for the corresponding element of `p2`.
    """
    return [vincenty(p1, p2i).meters for p2i in p2]
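
As a usage sketch, the same loop can be tried on the sample data from the question. So that it runs without Geopy installed, the hypothetical haversine_m below stands in for vincenty(p1, p2).meters; it assumes a spherical Earth and (longitude, latitude) pairs in degrees, so its results will differ slightly from vincenty's. The name dist_to_each is also only illustrative.

```python
import math

def haversine_m(p1, p2):
    """Great-circle distance in metres between two (lon, lat) pairs.

    Hypothetical stand-in for vincenty(p1, p2).meters; assumes a
    spherical Earth with a radius of 6,371 km.
    """
    lon1, lat1 = map(math.radians, p1)
    lon2, lat2 = map(math.radians, p2)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(a))

def dist_to_each(p1, p2):
    """Return a list of distances from p1 to every element of p2."""
    return [haversine_m(p1, p2i) for p2i in p2]

point = (-104.950752, 39.854744)
polygons = [(-104.21750802451864, 37.84052458697633),
            (-105.01017084789603, 39.82012158954065),
            (-105.03965315742742, 40.669867471420886),
            (-104.90353460825702, 39.837631505433706),
            (-104.8650601872832, 39.870796282334744)]

dist = dist_to_each(point, polygons)

# Position of the closest centroid in the polygons list.
nearest = min(range(len(dist)), key=dist.__getitem__)
print(nearest, round(dist[nearest]))
```

The point is passed once and never repeated, so no extra array of copies is built.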

If you really want to use vectorize, you can use the excluded argument to not vectorize the p1 argument, or better yet set up a lambda that wraps vincenty and only vectorizes the second argument:

def vect_dist_funct(p1, p2):
    """Apply `vincenty` to `p1` and each element of `p2`.

    Iterate over `p2`, returning `vincenty` with the first argument
    as `p1` and the second as the current element of `p2`.  Returns
    a numpy array where each value is the result of the `vincenty` function
    call for the corresponding element of `p2`.
    """
    vinc_p = lambda x: vincenty(p1, x).meters
    return np.vectorize(vinc_p)(p2)
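
The excluded route mentioned above could look roughly like the following sketch. As an assumption, the hypothetical haversine_m again stands in for vincenty(p1, p2).meters so the example runs without Geopy, and polygons is built by hand as a 1-D object array of (longitude, latitude) tuples.

```python
import math
import numpy as np

def haversine_m(p1, p2):
    """Hypothetical stand-in for vincenty(p1, p2).meters (spherical Earth)."""
    lon1, lat1 = map(math.radians, p1)
    lon2, lat2 = map(math.radians, p2)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(a))

# excluded={0} passes the first positional argument through untouched,
# so the point stays a single tuple and only the second argument is looped over.
vect_dist = np.vectorize(haversine_m, excluded={0})

point = (-104.950752, 39.854744)
centroids = [(-104.21750802451864, 37.84052458697633),
             (-105.01017084789603, 39.82012158954065),
             (-105.03965315742742, 40.669867471420886)]

# The second argument must be a 1-D object array of tuples; a plain list
# of tuples would be turned into a 2-D float array and split into scalars.
polygons = np.empty(len(centroids), dtype=object)
for i, c in enumerate(centroids):
    polygons[i] = c

dist = vect_dist(point, polygons)
print(dist.shape)  # (3,)
```

This still runs a Python-level loop internally, so it is a convenience rather than a speed-up over the list comprehension.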