I have a very large pandas dataframe with approximately 500,000 columns. Each column is about 500 elements long. For each column, I need to retrieve the (index, column) location of the top-k elements in the column.
So, if k were equal to 2, and this were my data frame:
A B C D
w 4 8 10 2
x 5 1 1 6
y 9 22 25 7
z 15 5 7 2
I would want to return:
[(A,y),(A,z),(B,w),(B,y),(C,w),(C,y),(D,x),(D,y)]
Keep in mind that I have approximately 500,000 columns, so speed is my primary concern. Is there a reasonable way of doing this that will not take an entire week on my machine? What is the fastest way - even if it will be fast enough for the amount of data I have?
Thanks for the help!
Pandas has an efficient
nlargest
operation you can use that is faster than a full sort. It will still take awhile to apply across 500,000 columns.As @EdChum noted, you probably don't want to store as tuples, it would be a lot more efficient to use two arrays or some other strategy.
I think
numpy
has a good solution for this that's fast and you can format the output however you want.Pretty fast compared to the pandas solution (which is cleaner IMO but we're going for speed here):
The lists are in reverse order of each other (you can easily fix this by reversing sort in the
numpy
version)Note that in the example due to random int generation we can likely have more than
k
values that are equal and max so indices returned may not agree among all methods but all will yield a valid result (you will getk
indices that match the max values in the column)