Now there are a lot of similar questions, but most of them answer how to delete the duplicate columns. However, I want to know how I can make a list of tuples where each tuple contains the names of the columns that are duplicates of each other. I am assuming that each column has a unique name. Just to further illustrate my question:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [2, 4, 2, 1, 9],
                   'C': [1, 2, 3, 4, 5], 'D': [2, 4, 2, 1, 9],
                   'E': [3, 4, 2, 1, 2], 'F': [1, 1, 1, 1, 1]},
                  index=['a1', 'a2', 'a3', 'a4', 'a5'])
then I want the output:
[('A', 'C'), ('B', 'D')]
And if you are feeling great today, then also extend the same question to rows: how to get a list of tuples where each tuple contains the labels of duplicate rows.
Here's one NumPy approach -
Sample runs -
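The code for this answer did not survive extraction. A minimal sketch of such a NumPy approach, with a sample run (the function name `group_duplicate_cols` is my own): lexsort the columns so that duplicates become adjacent, then compare each column with its neighbour and read off the runs of equal columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [2, 4, 2, 1, 9],
                   'C': [1, 2, 3, 4, 5], 'D': [2, 4, 2, 1, 9],
                   'E': [3, 4, 2, 1, 2], 'F': [1, 1, 1, 1, 1]},
                  index=['a1', 'a2', 'a3', 'a4', 'a5'])

def group_duplicate_cols(df):
    a = df.values
    sidx = np.lexsort(a)                  # column order that sorts columns lexicographically
    b = a[:, sidx]
    # True where a sorted column equals its right-hand neighbour
    m = np.concatenate(([False], (b[:, 1:] == b[:, :-1]).all(0), [False]))
    # transitions mark the start/stop of each run of equal columns
    idx = np.flatnonzero(m[1:] != m[:-1])
    cols = df.columns[sidx].tolist()
    return [tuple(cols[i:j]) for i, j in zip(idx[::2], idx[1::2] + 1)]

print(group_duplicate_cols(df))  # [('A', 'C'), ('B', 'D')]
```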
To do the same for rows (index) instead, we just need to switch the operations to work along the other axis, like so -
Sample run -
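Again the original code is missing; a sketch of the row-wise version (my own reconstruction, same idea transposed): lexsort the rows, compare adjacent rows, and collect the index labels of each run.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [2, 4, 2, 1, 9],
                   'C': [1, 2, 3, 4, 5], 'D': [2, 4, 2, 1, 9],
                   'E': [3, 4, 2, 1, 2], 'F': [1, 1, 1, 1, 1]},
                  index=['a1', 'a2', 'a3', 'a4', 'a5'])

def group_duplicate_rows(df):
    a = df.values
    sidx = np.lexsort(a.T)                # row order that sorts rows lexicographically
    b = a[sidx]
    # True where a sorted row equals the row below it
    m = np.concatenate(([False], (b[1:] == b[:-1]).all(1), [False]))
    idx = np.flatnonzero(m[1:] != m[:-1])
    rows = df.index[sidx].tolist()
    return [tuple(rows[i:j]) for i, j in zip(idx[::2], idx[1::2] + 1)]

print(group_duplicate_rows(df))    # [] -- the sample df has no duplicate rows
print(group_duplicate_rows(df.T))  # [('A', 'C'), ('B', 'D')]
```

Since the sample df has no duplicate rows, the transposed frame is used to show a non-empty result.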
Benchmarking
Approaches -
Note: @John Galt's soln-2 wasn't included, because with inputs of size (8000, 500) the broadcasting proposed there would blow up memory.

Timings -
Super boost with NumPy's view functionality
Leveraging NumPy's view functionality that lets us view each group of elements as one dtype, we could gain further noticeable performance boost, like so -
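The implementation was lost in extraction; a sketch of the view-based variant under the same assumptions as before (function name is mine). Each column's raw bytes are viewed as a single opaque `np.void` element, so whole columns compare and sort as scalars instead of element-by-element:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [2, 4, 2, 1, 9],
                   'C': [1, 2, 3, 4, 5], 'D': [2, 4, 2, 1, 9],
                   'E': [3, 4, 2, 1, 2], 'F': [1, 1, 1, 1, 1]},
                  index=['a1', 'a2', 'a3', 'a4', 'a5'])

def group_duplicate_cols_view(df):
    a = np.ascontiguousarray(df.values.T)    # one contiguous row per df column
    # view each row's bytes as a single void scalar (byte-level trick:
    # fine for ints; beware -0.0 and NaN with float data)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    a1d = a.view(void_dt).ravel()
    sidx = a1d.argsort(kind='stable')
    b = a1d[sidx]
    m = np.concatenate(([False], b[1:] == b[:-1], [False]))
    idx = np.flatnonzero(m[1:] != m[:-1])
    cols = df.columns[sidx].tolist()
    return [tuple(cols[i:j]) for i, j in zip(idx[::2], idx[1::2] + 1)]

print(group_duplicate_cols_view(df))  # [('A', 'C'), ('B', 'D')]
```

The speedup comes from the sort and the neighbour comparison each working on one scalar per column rather than one element per cell.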
Timings -
Just crazy speedups!
Not using pandas, just pure Python:
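The snippet itself is missing; a sketch of a pure-Python version, grouping column names by their (hashable) contents in one pass over a plain dict of lists:

```python
from collections import defaultdict

# the question's df as a plain dict (df.to_dict('list'))
data = {'A': [1, 2, 3, 4, 5], 'B': [2, 4, 2, 1, 9],
        'C': [1, 2, 3, 4, 5], 'D': [2, 4, 2, 1, 9],
        'E': [3, 4, 2, 1, 2], 'F': [1, 1, 1, 1, 1]}

groups = defaultdict(list)
for name, values in data.items():
    groups[tuple(values)].append(name)   # hashable key = column contents

result = [tuple(names) for names in groups.values() if len(names) > 1]
print(result)  # [('A', 'C'), ('B', 'D')]
```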
Using pandas :
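The pandas snippet is also missing; one way to do it with pandas alone (a sketch, not necessarily the original): group the transposed frame by all of its columns, i.e. by the full contents of each original column, and keep groups of size greater than one.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [2, 4, 2, 1, 9],
                   'C': [1, 2, 3, 4, 5], 'D': [2, 4, 2, 1, 9],
                   'E': [3, 4, 2, 1, 2], 'F': [1, 1, 1, 1, 1]},
                  index=['a1', 'a2', 'a3', 'a4', 'a5'])

# group df.T by every one of its columns; each group collects the names
# of original columns with identical contents
dupes = df.T.groupby(list(df.T.columns)).groups
result = [tuple(cols) for cols in dupes.values() if len(cols) > 1]
print(result)  # [('A', 'C'), ('B', 'D')]
```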
Not really nice, but it may be quicker, since everything is done in one iteration over the data.
Based on @John Galt's one-liner, you can get the result_row as follows, using the transpose (df.T):
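The one-liner referred to here was lost in extraction; the underlying idea is simply that any column-duplicate finder applied to df.T finds duplicate rows. A hedged sketch (helper name is mine, not @John Galt's code):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [2, 4, 2, 1, 9],
                   'C': [1, 2, 3, 4, 5], 'D': [2, 4, 2, 1, 9],
                   'E': [3, 4, 2, 1, 2], 'F': [1, 1, 1, 1, 1]},
                  index=['a1', 'a2', 'a3', 'a4', 'a5'])

def duplicate_columns(frame):
    # group column names by their (hashable) contents
    groups = {}
    for name in frame.columns:
        groups.setdefault(tuple(frame[name]), []).append(name)
    return [tuple(v) for v in groups.values() if len(v) > 1]

result = duplicate_columns(df)        # [('A', 'C'), ('B', 'D')]
result_row = duplicate_columns(df.T)  # [] -- no duplicate rows in the sample df
```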
This should also do:
Yields:
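The code and its output are missing above; one approach that "should also do", assuming pairwise `Series.equals` checks over all column pairs (simple, though quadratic in the number of columns):

```python
import itertools

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [2, 4, 2, 1, 9],
                   'C': [1, 2, 3, 4, 5], 'D': [2, 4, 2, 1, 9],
                   'E': [3, 4, 2, 1, 2], 'F': [1, 1, 1, 1, 1]},
                  index=['a1', 'a2', 'a3', 'a4', 'a5'])

# compare every unordered pair of columns once
result = [(c1, c2) for c1, c2 in itertools.combinations(df.columns, 2)
          if df[c1].equals(df[c2])]
print(result)  # [('A', 'C'), ('B', 'D')]
```

Note that for a group of three identical columns this yields all three pairs rather than one triple.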
Here's one more option using only comprehensions/built-ins:
Result:
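The option itself is missing; a sketch in the same spirit, using only comprehensions and built-ins on a plain dict of the data:

```python
# the question's df as a plain dict (df.to_dict('list'))
data = {'A': [1, 2, 3, 4, 5], 'B': [2, 4, 2, 1, 9],
        'C': [1, 2, 3, 4, 5], 'D': [2, 4, 2, 1, 9],
        'E': [3, 4, 2, 1, 2], 'F': [1, 1, 1, 1, 1]}

cols = list(data)
vals = [tuple(data[c]) for c in cols]
# unique column contents in first-occurrence order, kept only if repeated
result = [tuple(c for c, v in zip(cols, vals) if v == target)
          for target in sorted(set(vals), key=vals.index)
          if vals.count(target) > 1]
print(result)  # [('A', 'C'), ('B', 'D')]
```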
Here's a single-liner
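The line itself did not survive extraction; one plausible single-liner of this shape (a sketch, not necessarily the original), again grouping the transposed frame by all of its columns:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [2, 4, 2, 1, 9],
                   'C': [1, 2, 3, 4, 5], 'D': [2, 4, 2, 1, 9],
                   'E': [3, 4, 2, 1, 2], 'F': [1, 1, 1, 1, 1]},
                  index=['a1', 'a2', 'a3', 'a4', 'a5'])

result = [tuple(grp.index) for _, grp in df.T.groupby(list(df.T.columns)) if len(grp) > 1]
print(result)  # [('A', 'C'), ('B', 'D')]
```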
Alternatively, using NumPy broadcasting. Better still, look at Divakar's solution.
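The broadcasting code is missing here; a sketch of that idea: compare every column against every other column in one broadcast operation and take the upper triangle of the resulting equality matrix. This is the variant the benchmarking note above warns about, since it materialises an (nrows, ncols, ncols) array.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [2, 4, 2, 1, 9],
                   'C': [1, 2, 3, 4, 5], 'D': [2, 4, 2, 1, 9],
                   'E': [3, 4, 2, 1, 2], 'F': [1, 1, 1, 1, 1]},
                  index=['a1', 'a2', 'a3', 'a4', 'a5'])

a = df.values
# pairwise column equality: (nrows, ncols, 1) against (nrows, 1, ncols)
eq = (a[:, :, None] == a[:, None, :]).all(axis=0)
i, j = np.where(np.triu(eq, k=1))     # upper triangle: each pair once
result = list(zip(df.columns[i], df.columns[j]))
print(result)  # [('A', 'C'), ('B', 'D')]
```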