How to vectorize a function which contains an if s

2019-03-26 05:09发布

问题:

Let's say we have the following function:

def f(x, y):
    if y == 0:
        return 0
    return x/y

This works fine with scalar values. Unfortunately when I try to use numpy arrays for x and y the comparison y == 0 is treated as an array operation which results in an error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-9884e2c3d1cd> in <module>()
----> 1 f(np.arange(1,10), np.arange(10,20))

<ipython-input-10-fbd24f17ea07> in f(x, y)
      1 def f(x, y):
----> 2     if y == 0:
      3         return 0
      4     return x/y

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

I tried to use np.vectorize but it doesn't make a difference, the code still fails with the same error. np.vectorize is one option which gives the result I expect.

The only solution that I can think of is to use np.where on the y array with something like:

def f(x, y):
    np.where(y == 0, 0, x/y)

which doesn't work for scalars.

Is there a better way to write a function which contains an if statement? It should work with both scalars and arrays.

回答1:

One way is to convert x and y to numpy arrays inside your function:

def f(x, y):
    x = np.array(x)
    y = np.array(y)
    return np.where(y == 0, 0, x/y)

This will work when one of x or y is a scalar and the other is a numpy array. It will also work if they are both arrays that can be broadcast. It won't work if they're arrays of incompatible shapes (e.g., 1D arrays of different lengths), but it's not clear what the desired behavior would be in that case anyway.



回答2:

I wonder what the problem is you're facing with np.vectorize. It works fine on my system:

In [145]: def f(x, y):
     ...:     if y == 0:
     ...:         return 0
     ...:     return x/y

In [146]: vf = np.vectorize(f)

In [147]: vf([[3],[10]], [0,1,2,0])
Out[147]: 
array([[ 0,  3,  1,  0],
       [ 0, 10,  5,  0]])

Note that the result dtype is determined by the result of the first element. You can also set the desired output yourself:

In [148]: vf = np.vectorize(f, otypes=[np.float])

In [149]: vf([[3],[10]], [0,1,2,0])
Out[149]: 
array([[  0. ,   3. ,   1.5,   0. ],
       [  0. ,  10. ,   5. ,   0. ]])

There are more examples in the docs.



回答3:

You can use a masked array that will perform the division only where y!=0:

def f(x, y):
    x = np.atleast_1d(np.array(x))
    y = np.atleast_1d(np.ma.array(y, mask=(y==0)))
    ans = x/y
    ans[ans.mask]=0
    return np.asarray(ans)


回答4:

A kind of clunky but effective way is to basically pre-process the data:

def f(x, y):
    if type(x) == int and type(y) == int: return x/y # Will it ever be used for this?

    # Change scalars to arrays
    if type(x) == int: x = np.full(y.shape, x, dtype=y.dtype)
    if type(y) == int: y = np.full(x.shape, y, dtype=x.dtype)

    # Change all divide by zero operations to 0/1
    div_zero_idx = (y==0)
    x[div_zero_idx] = 0
    y[div_zero_idx] = 1

    return x/y

I timed all the different approaches:

def f_mask(x, y):
    x = np.ma.array(x, mask=(y==0))
    y = np.array(y)
    ans = x/y
    ans[ans.mask]=0
    return np.asarray(ans)

def f_where(x, y):
    x = np.array(x)
    y = np.array(y)
    return np.where(y == 0, 0, x/y)

def f_vect(x, y):
    if y == 0:
        return 0
    return x/y

vf = np.vectorize(f_vect)

print timeit.timeit('f(np.random.randint(10, size=array_length), np.random.randint(10, size=array_length))', number=10000, setup="from __main__ import f; import numpy as np; array_length=1000")
print timeit.timeit('f_mask(np.random.randint(10, size=array_length), np.random.randint(10, size=array_length))', number=10000, setup="from __main__ import f_mask; import numpy as np; array_length=1000")
print timeit.timeit('f_where(np.random.randint(10, size=array_length), np.random.randint(10, size=array_length))', number=10000, setup="from __main__ import f_where; import numpy as np; array_length=1000")
print timeit.timeit('vf(np.random.randint(10, size=array_length), np.random.randint(10, size=array_length))', number=10000, setup="from __main__ import vf; import numpy as np; array_length=(1000)")

# f
# 0.760189056396

# f_mask
# 2.24414896965

# f_where
# RuntimeWarning: divide by zero encountered in divide return np.where(y == 0, 0, x/y)
# 1.08176398277

# f_vect
# 3.45374488831

The first function is the quickest, and has no warnings. The time ratios are similar if x or y are scalars. For higher dimensional arrays, the masked array approach gets relatively faster (it's still the slowest though).