Group duplicate column IDs in pandas dataframe

Posted 2019-01-06 19:38

There are a lot of similar questions, but most of them answer how to delete duplicate columns. Instead, I want to know how I can build a list of tuples, where each tuple contains the names of columns that are duplicates of one another. I am assuming that each column has a unique name. Just to illustrate my question:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [2, 4, 2, 1, 9],
                   'C': [1, 2, 3, 4, 5], 'D': [2, 4, 2, 1, 9],
                   'E': [3, 4, 2, 1, 2], 'F': [1, 1, 1, 1, 1]},
                  index=['a1', 'a2', 'a3', 'a4', 'a5'])

then I want the output:

[('A', 'C'), ('B', 'D')]

And if you are feeling great today, then also extend the same question to rows: how do I get a list of tuples where each tuple contains the index labels of duplicate rows?

7 Answers

Answer 1 · 2019-01-06 20:16

Here's one NumPy approach -

import numpy as np

def group_duplicate_cols(df):
    a = df.values
    sidx = np.lexsort(a)      # sort columns so that duplicates end up adjacent
    b = a[:, sidx]

    # True where a column equals its left neighbour; padded so runs have clean edges
    m = np.concatenate(([False], (b[:, 1:] == b[:, :-1]).all(0), [False]))
    idx = np.flatnonzero(m[1:] != m[:-1])   # start/stop boundaries of each run
    C = df.columns[sidx].tolist()
    return [C[i:j] for i, j in zip(idx[::2], idx[1::2] + 1)]
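
To see the mechanics, here is a hand trace on the question's sample frame (a sketch; the intermediate values are what I expect these steps to produce):

a = df.values
sidx = np.lexsort(a)                    # [5, 4, 0, 2, 1, 3] -> columns F, E, A, C, B, D
b = a[:, sidx]                          # identical columns now sit next to each other
inner = (b[:, 1:] == b[:, :-1]).all(0)  # [False, False, True, False, True]: A==C, B==D
m = np.concatenate(([False], inner, [False]))
idx = np.flatnonzero(m[1:] != m[:-1])   # [2, 3, 4, 5]
# zip([2, 4], [4, 6]) slices ['F', 'E', 'A', 'C', 'B', 'D'] into [['A','C'], ['B','D']]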

Sample runs -

In [100]: df
Out[100]: 
    A  B  C  D  E  F
a1  1  2  1  2  3  1
a2  2  4  2  4  4  1
a3  3  2  3  2  2  1
a4  4  1  4  1  1  1
a5  5  9  5  9  2  1

In [101]: group_duplicate_cols(df)
Out[101]: [['A', 'C'], ['B', 'D']]

# Let's add one more duplicate into group containing 'A'
In [102]: df.F = df.A

In [103]: group_duplicate_cols(df)
Out[103]: [['A', 'C', 'F'], ['B', 'D']]

To do the same for rows (index) instead, we just need to switch the operations to the other axis, like so -

def group_duplicate_rows(df):
    a = df.values
    sidx = np.lexsort(a.T)    # sort rows so that duplicates end up adjacent
    b = a[sidx]

    m = np.concatenate(([False], (b[1:] == b[:-1]).all(1), [False]))
    idx = np.flatnonzero(m[1:] != m[:-1])
    C = df.index[sidx].tolist()
    return [C[i:j] for i, j in zip(idx[::2], idx[1::2] + 1)]

Sample run -

In [260]: df2
Out[260]: 
   a1  a2  a3  a4  a5
A   3   5   3   4   5
B   1   1   1   1   1
C   3   5   3   4   5
D   2   9   2   1   9
E   2   2   2   1   2
F   1   1   1   1   1

In [261]: group_duplicate_rows(df2)
Out[261]: [['B', 'F'], ['A', 'C']]

Benchmarking

Approaches -

# @John Galt's soln-1
from itertools import combinations
def combinations_app(df):
    return [x for x in combinations(df.columns, 2) if (df[x[0]] == df[x[-1]]).all()]

# @Abdou's soln
def pandas_groupby_app(df):
    return [tuple(d.index) for _,d in df.T.groupby(list(df.T.columns)) if len(d) > 1]                        

# @COLDSPEED's soln
def triu_app(df):
    c = df.columns.tolist()
    i, j = np.triu_indices(len(c), 1)
    x = [(c[_i], c[_j]) for _i, _j in zip(i, j) if (df[c[_i]] == df[c[_j]]).all()]
    return x

# @cmaher's soln
def lambda_set_app(df):
    return list(filter(lambda x: len(x) > 1, list(set([tuple([x for x in df.columns if all(df[x] == df[y])]) for y in df.columns]))))

Note: @John Galt's soln-2 wasn't included because, with inputs of size (8000,500), the proposed broadcasting would blow up memory.
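
To put a rough number on that (my own back-of-the-envelope, not from any answer): the broadcasted comparison would materialize a boolean array of shape (500, 500, 8000) before the reduction -

n_cols, n_rows = 500, 8000
bool_bytes = n_cols * n_cols * n_rows   # NumPy bools take one byte each
print(bool_bytes / 1024**3)             # ~1.86 GiB for the intermediate alone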

Timings -

In [179]: # Setup inputs with sizes as mentioned in the question
     ...: df = pd.DataFrame(np.random.randint(0,10,(8000,500)))
     ...: df.columns = ['C'+str(i) for i in range(df.shape[1])]
     ...: idx0 = np.random.choice(df.shape[1], df.shape[1]//2,replace=0)
     ...: idx1 = np.random.choice(df.shape[1], df.shape[1]//2,replace=0)
     ...: df.iloc[:,idx0] = df.iloc[:,idx1].values
     ...: 

# @John Galt's soln-1
In [180]: %timeit combinations_app(df)
1 loops, best of 3: 24.6 s per loop

# @Abdou's soln
In [181]: %timeit pandas_groupby_app(df)
1 loops, best of 3: 3.81 s per loop

# @COLDSPEED's soln
In [182]: %timeit triu_app(df)
1 loops, best of 3: 25.5 s per loop

# @cmaher's soln
In [183]: %timeit lambda_set_app(df)
1 loops, best of 3: 27.1 s per loop

# Proposed in this post
In [184]: %timeit group_duplicate_cols(df)
10 loops, best of 3: 188 ms per loop

Super boost with NumPy's view functionality

Leveraging NumPy's view functionality, which lets us view each column's elements as a single element of a wider dtype, we can gain a further noticeable performance boost, like so -

def view1D(a):  # a is a 2D array
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()   # one void scalar per row of a

def group_duplicate_cols_v2(df):
    a = df.values
    sidx = view1D(a.T).argsort()     # argsort the per-column void scalars
    b = a[:, sidx]

    m = np.concatenate(([False], (b[:, 1:] == b[:, :-1]).all(0), [False]))
    idx = np.flatnonzero(m[1:] != m[:-1])
    C = df.columns[sidx].tolist()
    return [C[i:j] for i, j in zip(idx[::2], idx[1::2] + 1)]
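
As a quick sanity check (my own toy example, not from the post), view1D collapses each row of its input into one void scalar, so identical columns of the transposed array get identical byte patterns:

arr = np.array([[1, 2, 1],
                [3, 4, 3]])                # columns 0 and 2 are identical
v = view1D(arr.T)                          # one 16-byte void scalar per column
print(v[0].tobytes() == v[2].tobytes())    # True: same raw bytes
print(v.argsort())                         # [0 2 1]: the duplicates sort adjacent

One caveat (again my own note): the view compares raw bytes, so for float data 0.0 and -0.0 would count as different even though == treats them as equal.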

Timings -

In [322]: %timeit group_duplicate_cols(df)
10 loops, best of 3: 185 ms per loop

In [323]: %timeit group_duplicate_cols_v2(df)
10 loops, best of 3: 69.3 ms per loop

Just crazy speedups!

Answer 2 · 2019-01-06 20:20

Not using pandas, just pure Python:

data = {'A': [1, 2, 3, 4, 5],'B': [2, 4, 2, 1, 9],
        'C': [1, 2, 3, 4, 5],'D': [2, 4, 2, 1, 9],
        'E': [3, 4, 2, 1, 2],'F': [1, 1, 1, 1, 1]}
from collections import defaultdict

deduplicate = defaultdict(list)


for key, items in data.items():
    deduplicate[tuple(items)].append(key)  # cast to tuple: tuples are hashable, lists are not

duplicates = list()
for vector, letters in deduplicate.items():
    if len(letters) > 1:
        duplicates.append(letters)

print(duplicates)
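
With the sample data this prints [['A', 'C'], ['B', 'D']] (dicts preserve insertion order on Python 3.7+, so the grouping order is deterministic).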

Using pandas:

import pandas

df = pandas.DataFrame(data)
duplicates = []

dedup2 = defaultdict(list)

for key in df.columns:
    dedup2[tuple(df[key])].append(key)

duplicates = list()
for vector, letters in dedup2.items():
    if len(letters) > 1:
        duplicates.append(letters)

print(duplicates)

Not really pretty, but it may be quicker, since everything is done in a single pass over the data.

dedup2 = defaultdict(list)

duplicates = {}

for key in df.columns:
    astup = tuple(df[key])
    duplic = dedup2[astup] 
    duplic.append(key)
    if len(duplic) > 1:
        duplicates[astup] = duplic

duplicates = duplicates.values()
print(duplicates)
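
Note that this still handles groups of three or more duplicates: duplicates[astup] stores the very same list object that lives in dedup2, so later appends to it show up automatically.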
Answer 3 · 2019-01-06 20:28

Based on @John Galt's one-liner, which looks like this:

from itertools import combinations

result_col = [x for x in combinations(df.columns, 2) if (df[x[0]] == df[x[-1]]).all()]

you can get the result_row as follows:

result_row = [x for x in combinations(df.T.columns,2) if (df.T[x[0]] == df.T[x[-1]]).all()]

using the transpose (df.T).
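
One small tweak (my suggestion, not part of the original answer): df.T is re-evaluated inside the comprehension above, so transposing once up front should be cheaper -

dfT = df.T
result_row = [x for x in combinations(dfT.columns, 2) if (dfT[x[0]] == dfT[x[-1]]).all()]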

Answer 4 · 2019-01-06 20:29

This should do it as well:

[tuple(d.index) for _,d in df.T.groupby(list(df.T.columns)) if len(d) > 1]

Yields:

# [('A', 'C'), ('B', 'D')]
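
The same trick answers the bonus row question without any transpose (my extension, not the answerer's): grouping the frame by all of its columns clusters identical rows together -

[tuple(d.index) for _, d in df.groupby(list(df.columns)) if len(d) > 1]

(The question's sample frame has no duplicate rows, so this returns [] there.)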
Answer 5 · 2019-01-06 20:32

Here's one more option using only comprehensions/built-ins (wrapped in list() so the result materializes on Python 3, where filter returns an iterator):

list(filter(lambda x: len(x) > 1, set(tuple(x for x in df.columns if all(df[x] == df[y])) for y in df.columns)))

Result:

[('A', 'C'), ('B', 'D')]
Answer 6 · 2019-01-06 20:39

Here's a one-liner -

In [22]: from itertools import combinations

In [23]: [x for x in combinations(df.columns, 2) if (df[x[0]] == df[x[-1]]).all()]
Out[23]: [('A', 'C'), ('B', 'D')]

Alternatively, using NumPy broadcasting (better still, look at Divakar's solution above) -

In [124]: cols = df.columns

In [125]: dftv = df.T.values

In [126]: cross = pd.DataFrame((dftv == dftv[:, None]).all(-1), cols, cols)

In [127]: cross
Out[127]:
       A      B      C      D      E      F
A   True  False   True  False  False  False
B  False   True  False   True  False  False
C   True  False   True  False  False  False
D  False   True  False   True  False  False
E  False  False  False  False   True  False
F  False  False  False  False  False   True

# Only take values from lower triangle
In [128]: s = cross.where(np.tri(*cross.shape, k=-1)).unstack()

In [129]: s[s == 1].index.tolist()
Out[129]: [('A', 'C'), ('B', 'D')]
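
For the row half of the question, the same broadcasting pattern works on the un-transposed values (my adaptation, with the same memory caveat noted in the benchmarks above) -

dfv = df.values
cross_rows = pd.DataFrame((dfv == dfv[:, None]).all(-1), df.index, df.index)
s = cross_rows.where(np.tri(*cross_rows.shape, k=-1)).unstack()
s[s == 1].index.tolist()   # [] here, since the sample frame has no duplicate rows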