可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I wish to efficiently use pandas
(or numpy
) instead of a nested for
loop with an if
statement to solve a particular problem. Here is a toy version:
Suppose I have the following two DataFrames
import pandas as pd
import numpy as np
dict1 = {'vals': [100,200], 'in': [0,1], 'out' :[1,3]}
df1 = pd.DataFrame(data=dict1)
dict2 = {'vals': [500,800,300,200], 'in': [0.1,0.5,2,4], 'out' :[0.5,2,4,5]}
df2 = pd.DataFrame(data=dict2)
Now I wish to loop through each row each dataframe and multiply the vals if a particular condition is met. This code works for what I want
ans = []
for i in range(len(df1)):
for j in range(len(df2)):
if (df1['in'][i] <= df2['out'][j] and df1['out'][i] >= df2['in'][j]):
ans.append(df1['vals'][i]*df2['vals'][j])
np.sum(ans)
However, clearly this is very inefficient and in reality my DataFrames can have millions of entries making this unusable. I am also not making us of pandas
or numpy
efficient vector implementations. Does anyone have any ideas how to efficiently vectorize this nested loop?
I feel like this code is something akin to matrix multiplication so could progress be made utilising outer
? It's the if
condition that I'm finding hard to wedge in, as the if
logic needs to compare each entry in df1
against all entries in df2
.
回答1:
You can also use a compiler like Numba to do this job. This would also outperform the vectorized solution and doesn't need a temporary array.
Example
import numba as nb
import numpy as np
import pandas as pd
import time
@nb.njit(fastmath=True,parallel=True,error_model='numpy')
def your_function(df1_in,df1_out,df1_vals,df2_in,df2_out,df2_vals):
sum=0.
for i in nb.prange(len(df1_in)):
for j in range(len(df2_in)):
if (df1_in[i] <= df2_out[j] and df1_out[i] >= df2_in[j]):
sum+=df1_vals[i]*df2_vals[j]
return sum
Testing
dict1 = {'vals': np.random.randint(1,100,1000), 'in': np.random.randint(1,10,1000), 'out': np.random.randint(1,10,1000)}
df1 = pd.DataFrame(data=dict1)
dict2 = {'vals': np.random.randint(1,100,1500), 'in': 5*np.random.random(1500), 'out': 5*np.random.random(1500)}
df2 = pd.DataFrame(data=dict2)
#first call has some compilation overhead
res=your_function(df1['in'].values,df1['out'].values,df1['vals'].values,df2['in'].values,df2['out'].values,df2['vals'].values)
t1=time.time()
for i in range(1000):
res=your_function(df1['in'].values,df1['out'].values,df1['vals'].values,df2['in'].values,df2['out'].values,df2['vals'].values)
#res_2=g(df1, df2)
print(time.time()-t1)
Timings
vectorized solution @AGN Gazer: 9.15ms
parallelized Numba Version: 0.7ms
回答2:
m1 = np.less_equal.outer(df1['in'], df2['out'])
m2 = np.greater_equal.outer(df1['out'], df2['in'])
m = np.logical_and(m1, m2)
v12 = np.outer(df1['vals'], df2['vals'])
print(v12[m].sum())
Or, replace first three lines with this long line:
m = np.less_equal.outer(df1['in'], df2['out']) & np.greater_equal.outer(df1['out'], df2['in'])
s = np.outer(df1['vals'], df2['vals'])[m].sum()
For very large problems, dask
is recommended.
Timing Tests:
Here is a timing comparison when using 1000 and 1500-long arrays:
In [166]: dict1 = {'vals': np.random.randint(1,100,1000), 'in': np.random.randint(1,10,1000), 'out': np.random.randint(1,10,1000)}
...: df1 = pd.DataFrame(data=dict1)
...:
...: dict2 = {'vals': np.random.randint(1,100,1500), 'in': 5*np.random.random(1500), 'out': 5*np.random.random(1500)}
...: df2 = pd.DataFrame(data=dict2)
Author's original method (Python loops):
In [167]: def f(df1, df2):
...: ans = []
...: for i in range(len(df1)):
...: for j in range(len(df2)):
...: if (df1['in'][i] <= df2['out'][j] and df1['out'][i] >= df2['in'][j]):
...: ans.append(df1['vals'][i]*df2['vals'][j])
...: return np.sum(ans)
...:
...:
In [168]: %timeit f(df1, df2)
47.3 s ± 1.02 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
@Ben.T method:
In [170]: %timeit df2['ans']= df2.apply(lambda row: df1['vals'][(df1['in'] <= row['out']) & (df1['out'] >= row['in'])].sum()*row['vals'],1); df2['a
...: ns'].sum()
2.22 s ± 40.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Vectorized solution proposed here:
In [171]: def g(df1, df2):
...: m = np.less_equal.outer(df1['in'], df2['out']) & np.greater_equal.outer(df1['out'], df2['in'])
...: return np.outer(df1['vals'], df2['vals'])[m].sum()
...:
...:
In [172]: %timeit g(df1, df2)
7.81 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
回答3:
Your answer:
471 µs ± 35.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Method 1 (3+ times slower):
df1.apply(lambda row: list((df2['vals'][(row['in'] <= df2['out']) & (row['out'] >= df2['in'])] * row['vals'])), axis=1).sum()
1.56 ms ± 7.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Method 2 (2+ times slower):
ans = []
for name, row in df1.iterrows():
_in = row['in']
_out = row['out']
_vals = row['vals']
ans.append(df2['vals'].loc[(df2['in'] <= _out) & (df2['out'] >= _in)].values * _vals)
1.01 ms ± 8.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Method 3 (3+ times faster):
df1_vals = df1.values
ans = np.zeros(shape=(len(df1_vals), len(df2.values)))
for i in range(df1_vals.shape[0]):
df2_vals = df2.values
df2_vals[:, 2][~np.logical_and(df1_vals[i, 1] >= df2_vals[:, 0], df1_vals[i, 0] <= df2_vals[:, 1])] = 0
ans[i, :] = df2_vals[:, 2] * df1_vals[i, 2]
144 µs ± 3.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In Method 3 you can view the solution by performing:
ans[ans.nonzero()]
Out[]: array([ 50000., 80000., 160000., 60000.]
I wasn't able to think of a way to remove the underlying loop :( but I learnt a lot about numpy in the process! (yay for learning)
回答4:
One way to do it is by using apply
. Create a column in df2
containing the sum of vals in df1, meeting your criteria on in and out, multiplied by the vals of the row of df2
df2['ans']= df2.apply(lambda row: df1['vals'][(df1['in'] <= row['out']) &
(df1['out'] >= row['in'])].sum()*row['vals'],1)
then just sum this column
df2['ans'].sum()