fast way to get index of top-k elements of every c

I have a very large pandas dataframe with approximately 500,000 columns. Each column is about 500 elements long. For each column, I need to retrieve the (index, column) location of the top-k elements in the column.

So, if k were equal to 2, and this were my data frame:

  A  B  C  D
w 4  8  10 2
x 5  1  1  6 
y 9  22 25 7 
z 15 5  7  2

I would want to return:

[(A,y),(A,z),(B,w),(B,y),(C,w),(C,y),(D,x),(D,y)]

Keep in mind that I have approximately 500,000 columns, so speed is my primary concern. Is there a reasonable way of doing this that will not take an entire week on my machine? What is the fastest way - even if it will be fast enough for the amount of data I have?

Thanks for the help!

标签： sorting pandas indexing columnsorting

2条回答

放我归山

2楼-- · 2019-04-09 07:53

Pandas has an efficient nlargest operation you can use that is faster than a full sort. It will still take awhile to apply across 500,000 columns.

In [1]: df = pd.DataFrame(data=np.random.randint(0, 100, (200, 500000)), 
                          columns=range(500000), index=range(200))

In [2]: %time np.array([df[c].nlargest(2).index.values for c in df])
Wall time: 2min 57s
Out[2]: 
array([[171,   1],
       [ 42,  78],

As @EdChum noted, you probably don't want to store as tuples, it would be a lot more efficient to use two arrays or some other strategy.

0人赞添加讨论(0) 举报

我想做一个坏孩纸

3楼-- · 2019-04-09 07:56

I think numpy has a good solution for this that's fast and you can format the output however you want.

In [2]: df = pd.DataFrame(data=np.random.randint(0, 1000, (200, 500000)), 
                      columns=range(500000), index=range(200))

In [3]: def top_k(x,k):
             ind=np.argpartition(x,-1*k)[-1*k:]
             return ind[np.argsort(x[ind])]

In [69]: %time np.apply_along_axis(lambda x: top_k(x,2),0,df.as_matrix())
CPU times: user 5.91 s, sys: 40.7 ms, total: 5.95 s
Wall time: 6 s

Out[69]:
array([[ 14,  54],
       [178, 141],
       [ 49, 111],
       ...,
       [ 24, 122],
       [ 55,  89],
       [  9, 175]])

Pretty fast compared to the pandas solution (which is cleaner IMO but we're going for speed here):

In [41]: %time np.array([df[c].nlargest(2).index.values for c in df])
CPU times: user 3min 43s, sys: 6.58 s, total: 3min 49s
Wall time: 4min 8s

Out[41]:
array([[ 54,  14],
       [141, 178],
       [111,  49],
       ...,
       [122,  24],
       [ 89,  55],
       [175,   9]])

The lists are in reverse order of each other (you can easily fix this by reversing sort in the numpy version)

Note that in the example due to random int generation we can likely have more than k values that are equal and max so indices returned may not agree among all methods but all will yield a valid result (you will get k indices that match the max values in the column)

0人赞添加讨论(0) 举报

fast way to get index of top-k elements of every c

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间