How to vectorize (make use of pandas/numpy) instead of a nested for loop

Published 2019-04-14 22:19

Question:

I wish to efficiently use pandas (or numpy) instead of a nested for loop with an if statement to solve a particular problem. Here is a toy version:

Suppose I have the following two DataFrames

import pandas as pd
import numpy as np

dict1 = {'vals': [100,200], 'in': [0,1], 'out' :[1,3]}
df1 = pd.DataFrame(data=dict1)

dict2 = {'vals': [500,800,300,200], 'in': [0.1,0.5,2,4], 'out' :[0.5,2,4,5]}
df2 = pd.DataFrame(data=dict2)

Now I wish to loop through each row of each DataFrame and multiply the vals if a particular condition is met. This code does what I want:

ans = []

for i in range(len(df1)):
    for j in range(len(df2)):
        if (df1['in'][i] <= df2['out'][j] and df1['out'][i] >= df2['in'][j]):
            ans.append(df1['vals'][i]*df2['vals'][j])

np.sum(ans)

However, this is clearly very inefficient, and in reality my DataFrames can have millions of entries, making this unusable. I am also not making use of pandas' or numpy's efficient vectorized implementations. Does anyone have any ideas how to vectorize this nested loop efficiently?

I feel like this code is akin to matrix multiplication, so could progress be made using outer? It's the if condition that I'm finding hard to wedge in, as the logic needs to compare each entry in df1 against every entry in df2.
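For context, the pairwise condition can indeed be expressed without an explicit loop. Here is a minimal sketch on the toy data using NumPy broadcasting (the answers below develop the same idea with `ufunc.outer`):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'vals': [100, 200], 'in': [0, 1], 'out': [1, 3]})
df2 = pd.DataFrame({'vals': [500, 800, 300, 200],
                    'in': [0.1, 0.5, 2, 4], 'out': [0.5, 2, 4, 5]})

# Broadcasting builds the full pairwise condition matrix (rows of df1
# against columns of df2), so the "if" becomes a boolean mask.
mask = (df1['in'].values[:, None] <= df2['out'].values[None, :]) & \
       (df1['out'].values[:, None] >= df2['in'].values[None, :])
total = (df1['vals'].values[:, None] * df2['vals'].values[None, :])[mask].sum()
print(total)  # 350000
```

This matches the nested-loop result on the toy data; the trade-off is that it materializes a len(df1) × len(df2) matrix.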

Answer 1:

You can also use a compiler like Numba to do this job. This would also outperform the vectorized solution and doesn't need a temporary array.

Example

import numba as nb
import numpy as np
import pandas as pd
import time

@nb.njit(fastmath=True, parallel=True, error_model='numpy')
def your_function(df1_in, df1_out, df1_vals, df2_in, df2_out, df2_vals):
    # Avoid shadowing the built-in `sum`
    total = 0.
    for i in nb.prange(len(df1_in)):
        for j in range(len(df2_in)):
            if df1_in[i] <= df2_out[j] and df1_out[i] >= df2_in[j]:
                total += df1_vals[i] * df2_vals[j]
    return total

Testing

dict1 = {'vals': np.random.randint(1,100,1000), 'in': np.random.randint(1,10,1000), 'out': np.random.randint(1,10,1000)}
df1 = pd.DataFrame(data=dict1)
dict2 = {'vals': np.random.randint(1,100,1500), 'in': 5*np.random.random(1500), 'out': 5*np.random.random(1500)}
df2 = pd.DataFrame(data=dict2)

# The first call has some compilation overhead
res = your_function(df1['in'].values, df1['out'].values, df1['vals'].values,
                    df2['in'].values, df2['out'].values, df2['vals'].values)

t1 = time.time()
for i in range(1000):
    res = your_function(df1['in'].values, df1['out'].values, df1['vals'].values,
                        df2['in'].values, df2['out'].values, df2['vals'].values)

print(time.time() - t1)

Timings

vectorized solution by @AGN Gazer: 9.15 ms
parallelized Numba version: 0.7 ms


Answer 2:

m1 = np.less_equal.outer(df1['in'], df2['out']) 
m2 = np.greater_equal.outer(df1['out'], df2['in'])
m = np.logical_and(m1, m2)
v12 = np.outer(df1['vals'], df2['vals'])
print(v12[m].sum())

Or, more compactly, fold the mask into a single expression:

m = np.less_equal.outer(df1['in'], df2['out']) & np.greater_equal.outer(df1['out'], df2['in'])
s = np.outer(df1['vals'], df2['vals'])[m].sum()

For very large problems, dask is recommended.
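If the full len(df1) × len(df2) outer products do not fit in memory, the same idea can be applied in chunks of df1 so that only a chunk × len(df2) matrix is live at any time. This is only a sketch; the function name and default chunk size here are made up:

```python
import numpy as np

def masked_outer_sum(in1, out1, v1, in2, out2, v2, chunk=10000):
    """Sum v1[i]*v2[j] over all pairs with in1[i] <= out2[j] and
    out1[i] >= in2[j], processing the first array in blocks."""
    total = 0.0
    for start in range(0, len(in1), chunk):
        s = slice(start, start + chunk)
        # Same pairwise condition as the outer-product solution,
        # but restricted to one block of rows.
        m = (in1[s, None] <= out2[None, :]) & (out1[s, None] >= in2[None, :])
        total += (v1[s, None] * v2[None, :])[m].sum()
    return total

# Toy data from the question:
print(masked_outer_sum(np.array([0, 1]), np.array([1, 3]), np.array([100, 200]),
                       np.array([0.1, 0.5, 2, 4]), np.array([0.5, 2, 4, 5]),
                       np.array([500, 800, 300, 200])))  # 350000.0
```

The chunk size trades memory for the number of Python-level iterations; dask automates this kind of blocking at scale.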

Timing Tests:

Here is a timing comparison using 1000- and 1500-element arrays:

In [166]: dict1 = {'vals': np.random.randint(1,100,1000), 'in': np.random.randint(1,10,1000), 'out': np.random.randint(1,10,1000)}
     ...: df1 = pd.DataFrame(data=dict1)
     ...: 
     ...: dict2 = {'vals': np.random.randint(1,100,1500), 'in': 5*np.random.random(1500), 'out': 5*np.random.random(1500)}
     ...: df2 = pd.DataFrame(data=dict2)

Author's original method (Python loops):

In [167]: def f(df1, df2):
     ...:     ans = []
     ...:     for i in range(len(df1)):
     ...:         for j in range(len(df2)):
     ...:             if (df1['in'][i] <= df2['out'][j] and df1['out'][i] >= df2['in'][j]):
     ...:                 ans.append(df1['vals'][i]*df2['vals'][j])
     ...:     return np.sum(ans)
     ...: 
     ...: 

In [168]: %timeit f(df1, df2)
47.3 s ± 1.02 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

@Ben.T method:

In [170]: %timeit df2['ans'] = df2.apply(lambda row: df1['vals'][(df1['in'] <= row['out']) & (df1['out'] >= row['in'])].sum()*row['vals'], 1); df2['ans'].sum()
2.22 s ± 40.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Vectorized solution proposed here:

In [171]: def g(df1, df2):
     ...:     m = np.less_equal.outer(df1['in'], df2['out']) & np.greater_equal.outer(df1['out'], df2['in'])
     ...:     return np.outer(df1['vals'], df2['vals'])[m].sum()
     ...: 
     ...: 

In [172]: %timeit g(df1, df2)

7.81 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Answer 3:

Your answer:

471 µs ± 35.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Method 1 (3+ times slower):

df1.apply(lambda row: list((df2['vals'][(row['in'] <= df2['out']) & (row['out'] >= df2['in'])] * row['vals'])), axis=1).sum()

1.56 ms ± 7.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Method 2 (2+ times slower):

ans = []
for name, row in df1.iterrows():
    _in = row['in']
    _out = row['out']
    _vals = row['vals']
    ans.append(df2['vals'].loc[(df2['in'] <= _out) & (df2['out'] >= _in)].values * _vals)

1.01 ms ± 8.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Method 3 (3+ times faster):

df1_vals = df1.values
ans = np.zeros(shape=(len(df1_vals), len(df2.values)))
for i in range(df1_vals.shape[0]):
    df2_vals = df2.values
    df2_vals[:, 2][~np.logical_and(df1_vals[i, 1] >= df2_vals[:, 0], df1_vals[i, 0] <= df2_vals[:, 1])] = 0
    ans[i, :] = df2_vals[:, 2] * df1_vals[i, 2]

144 µs ± 3.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In Method 3 you can view the solution by performing:

ans[ans.nonzero()]

Out[]: array([ 50000.,  80000., 160000.,  60000.])

I wasn't able to think of a way to remove the underlying loop :( but I learnt a lot about numpy in the process! (yay for learning)
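One caveat with Method 3: it indexes `.values` by position, so it assumes the columns appear in the order in, out, vals, and it zeroes entries of the array returned by `df2.values` in place. A sketch of the same per-row masking with explicit column access, which depends on neither (variable names here are my own):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'vals': [100, 200], 'in': [0, 1], 'out': [1, 3]})
df2 = pd.DataFrame({'vals': [500, 800, 300, 200],
                    'in': [0.1, 0.5, 2, 4], 'out': [0.5, 2, 4, 5]})

in1, out1, v1 = df1['in'].values, df1['out'].values, df1['vals'].values
in2, out2, v2 = df2['in'].values, df2['out'].values, df2['vals'].values

ans = np.zeros((len(df1), len(df2)))
for i in range(len(in1)):
    # Same condition as the original loop, selected by column name
    mask = (in1[i] <= out2) & (out1[i] >= in2)
    ans[i, :] = np.where(mask, v2, 0) * v1[i]

print(ans.sum())  # 350000.0
```

On the toy data, `ans[ans.nonzero()]` gives the same four products as Method 3.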



Answer 4:

One way to do it is by using apply: create a column in df2 containing the sum of the vals in df1 that meet your criteria on in and out, multiplied by the vals of the corresponding row of df2:

df2['ans']= df2.apply(lambda row: df1['vals'][(df1['in'] <= row['out']) & 
                                              (df1['out'] >= row['in'])].sum()*row['vals'],1)

Then just sum this column:

df2['ans'].sum()
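Applied to the toy DataFrames from the question, this gives the same total as the nested loop:

```python
import pandas as pd

df1 = pd.DataFrame({'vals': [100, 200], 'in': [0, 1], 'out': [1, 3]})
df2 = pd.DataFrame({'vals': [500, 800, 300, 200],
                    'in': [0.1, 0.5, 2, 4], 'out': [0.5, 2, 4, 5]})

# For each df2 row, sum the matching df1 vals, then scale by that row's vals
df2['ans'] = df2.apply(lambda row: df1['vals'][(df1['in'] <= row['out']) &
                                               (df1['out'] >= row['in'])].sum() * row['vals'],
                       axis=1)
print(df2['ans'].sum())  # 350000.0
```

Note this still does a full scan of df1 for every row of df2, so it is faster than the pure Python loop but slower than the outer-product or Numba solutions above.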