I have a very large pandas DataFrame with approximately 500,000 columns. Each column is about 500 elements long. For each column, I need to retrieve the (column, index) locations of the top-k elements in that column.
So, if k were equal to 2, and this were my data frame:
   A   B   C   D
w  4   8  10   2
x  5   1   1   6
y  9  22  25   7
z 15   5   7   2
I would want to return:
[(A,y),(A,z),(B,w),(B,y),(C,w),(C,y),(D,x),(D,y)]
Keep in mind that I have approximately 500,000 columns, so speed is my primary concern. Is there a reasonable way of doing this that will not take an entire week on my machine? What is the fastest way, even if it still won't be fast enough for the amount of data I have?
Thanks for the help!
I think numpy has a good solution for this that's fast, and you can format the output however you want.
In [1]: import numpy as np; import pandas as pd

In [2]: df = pd.DataFrame(data=np.random.randint(0, 1000, (200, 500000)),
                          columns=range(500000), index=range(200))
In [3]: def top_k(x, k):
            # positions of the k largest values; argpartition avoids a full sort
            ind = np.argpartition(x, -k)[-k:]
            # order those k positions by value, ascending
            return ind[np.argsort(x[ind])]
In [69]: %time np.apply_along_axis(lambda x: top_k(x, 2), 0, df.values)
CPU times: user 5.91 s, sys: 40.7 ms, total: 5.95 s
Wall time: 6 s
Out[69]:
array([[ 14, 54],
[178, 141],
[ 49, 111],
...,
[ 24, 122],
[ 55, 89],
[ 9, 175]])
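Mapping those positional indices back to labels gives the (column, index) pairs the question asks for. Here is a minimal sketch on the small example from the question, reusing the top_k function above (df_small and pairs are illustrative names):

import numpy as np
import pandas as pd

df_small = pd.DataFrame({'A': [4, 5, 9, 15], 'B': [8, 1, 22, 5],
                         'C': [10, 1, 25, 7], 'D': [2, 6, 7, 2]},
                        index=['w', 'x', 'y', 'z'])

# (k, n_cols) array of row positions of the top-2 values per column
pos = np.apply_along_axis(lambda x: top_k(x, 2), 0, df_small.values)

# translate positions into (column, row-label) pairs
pairs = [(c, df_small.index[i])
         for j, c in enumerate(df_small.columns) for i in pos[:, j]]
# [('A', 'y'), ('A', 'z'), ('B', 'w'), ('B', 'y'),
#  ('C', 'w'), ('C', 'y'), ('D', 'x'), ('D', 'y')]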
Pretty fast compared to the pandas solution (which is cleaner IMO but we're going for speed here):
In [41]: %time np.array([df[c].nlargest(2).index.values for c in df])
CPU times: user 3min 43s, sys: 6.58 s, total: 3min 49s
Wall time: 4min 8s
Out[41]:
array([[ 54, 14],
[141, 178],
[111, 49],
...,
[122, 24],
[ 89, 55],
[175, 9]])
The two outputs list each column's indices in reverse order of each other (you can easily fix this by reversing the sort in the numpy version).
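For example, a variant that returns largest-first, matching nlargest (top_k_desc is a hypothetical name; a sketch of the tweak rather than a tested drop-in):

import numpy as np

def top_k_desc(x, k):
    # same partition step as top_k above
    ind = np.argpartition(x, -k)[-k:]
    # negate before argsort so the k positions come out largest-first
    return ind[np.argsort(-x[ind])]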
Note that because the example uses random integers, a column can contain more than k values tied at the maximum, so the indices returned may not agree between the methods; all of them still yield a valid result (you get k indices that match the top values in the column).
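Since np.argpartition and np.argsort both accept an axis argument, the per-column Python-level call that apply_along_axis makes can be avoided entirely. A fully vectorized sketch, which should cut the Python overhead although I haven't timed it here (it assumes a NumPy new enough to have np.take_along_axis):

import numpy as np

def top_k_all_columns(a, k):
    # partition every column at once; the last k rows hold the top-k positions
    part = np.argpartition(a, -k, axis=0)[-k:]        # shape (k, n_cols)
    # gather those candidate values and sort the k of them within each column
    vals = np.take_along_axis(a, part, axis=0)
    order = np.argsort(vals, axis=0)
    # reorder the positions by value, ascending, like top_k above
    return np.take_along_axis(part, order, axis=0)

top_k_all_columns(df.values, 2) should produce the same (2, 500000)-shaped result as the apply_along_axis call above.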
Pandas has an efficient nlargest operation you can use that is faster than a full sort. It will still take a while to apply across 500,000 columns.
In [1]: df = pd.DataFrame(data=np.random.randint(0, 100, (200, 500000)),
columns=range(500000), index=range(200))
In [2]: %time np.array([df[c].nlargest(2).index.values for c in df])
Wall time: 2min 57s
Out[2]:
array([[171,   1],
       [ 42,  78],
       ...])
As @EdChum noted, you probably don't want to store the results as tuples; it would be a lot more efficient to use two arrays or some other strategy.
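A minimal sketch of that two-array strategy, reusing the top_k helper from the first answer (top_k_two_arrays is an illustrative name):

import numpy as np

def top_k_two_arrays(df, k):
    # (k, n_cols) row positions of the top-k values per column
    pos = np.apply_along_axis(lambda x: top_k(x, k), 0, df.values)
    # two aligned flat arrays instead of a list of a million tuples
    cols = np.repeat(df.columns.values, k)   # column label of each result slot
    rows = df.index.values[pos.T.ravel()]    # matching row label
    return cols, rows                        # result i is (cols[i], rows[i])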