Group list of tuples efficiently

2020-07-22 18:42发布

问题:

I have a large list of tuples e.g. [ (1,2), (1,3), (1,4), (2,1), (2,3) ] etc. I want to convert it to [ (1, [1,2,3,4]), (2, [1,3] ) ] efficiently. I am grouping tuples by the first element of each tuple i.e. (1,2), (1,3), (1,4) becomes (1, [2,3,4]) (also see the Haskell version below). I doubt that this can be done in one pass? The input list is always ordered.

In python in tried using defaultdict which I thought was the natural solution without reinventing the wheel. It works well but it does not preserve the order of keys. One solution is to use ordered defaultdict as explained here.

Anyway, I'd like to know the language independent and efficient solution to this problem. My current solution requires two passes and one call to set( ) on a list.

Update

I am thinking of implementing following Haskell version:

a = [ (1,2), (1,3), (1,4), (2,1), (2,3) ] 
b = groupBy (\ x y -> fst x == fst y ) 
b 
[[(1,2),(1,3),(1,4)],[(2,1),(2,3)]]  
map (\x -> (fst .head $ x, map snd x ) ) b 
[(1,[2,3,4]),(2,[1,3])]

Performance of answers

I implemented two answers (coldspeed and pm2ring). On moderate size lists (upto 10^4 elements) PM2 ring solution is faster; at size 10^5, both takes same time, on larger list COLDSPEED starts winning. Below are the numbers (with python3).

First column is number of entries in list, second is time taken by coldspeed and third column has time taken by pm2 ring solutions. All times are in second.

10 0.0001 0.0000
100 0.0001 0.0000
1000 0.0005 0.0001
10000 0.0044 0.0014
100000 0.0517 0.0452
1000000 0.5579 1.5249

Script is here http://github.com/dilawar/playground/raw/master/Python/so_group_tuple.py

With Ashwini optimization

PM 2Ring solution is even faster (roughly 3x - 5x) with Ashwini's suggestions.

10 4.887580871582031e-05 1.2636184692382812e-05
100 0.00010132789611816406 2.0742416381835938e-05
1000 0.0005109310150146484 0.000110626220703125
10000 0.004467487335205078 0.0009067058563232422
100000 0.05056118965148926 0.017516136169433594
1000000 0.6100358963012695 0.26450490951538086
10000000 6.092756509780884 2.8253660202026367

With PYPY

Somewhat mixed results. Last column is ratio of column 2 and column 3.

pypy so_group_tuple.py 
(10, [1.6927719116210938e-05, 3.409385681152344e-05], 0.4965034965034965)
(100, [4.601478576660156e-05, 8.296966552734375e-05], 0.5545977011494253)
(1000, [0.010258913040161133, 0.0019040107727050781], 5.388054094665665)
(10000, [0.0002448558807373047, 0.00021600723266601562], 1.1335540838852096)
(100000, [0.002658843994140625, 0.0018231868743896484], 1.45834967961292)
(1000000, [0.0833890438079834, 0.02979302406311035], 2.7989452709245284)
(10000000, [1.0556740760803223, 0.6789278984069824], 1.5549133841124023)

I am going with PM 2Ring solution since it is much faster till list size 10^5.

回答1:

You can do this with itertools.groupby, and using zip to rearrange the data from the collected groups:

from itertools import groupby
from operator import itemgetter

a = [(1, 2), (1, 3), (1, 4), (2, 1), (2, 3)]
b = [(k, list(list(zip(*g))[1])) for k, g in groupby(a, itemgetter(0))]
print(b)

output

[(1, [2, 3, 4]), (2, [1, 3])]

That list comp is a bit dense. Here's a variation using a traditional for loop that prints an intermediate result to make it a little easier to see what's going on.

b = []
for k, g in groupby(a, itemgetter(0)):
    t = list(zip(*g))
    print(t)
    b.append(list(t[1]))

print('Output', b)

output

[(1, 1, 1), (2, 3, 4)]
[(2, 2), (1, 3)]
Output [[2, 3, 4], [1, 3]]

As Ashwini Chaudhary mentions in the comments, nesting another list comp in there makes the code much more readable, it's probably also more efficient, since it avoids a couple of calls.

b = [(k, [x for _, x in g]) for k, g in groupby(a, itemgetter(0))]


回答2:

You may use the collections.OrderedDict (import collections first):

o = collections.OrderedDict()

for x in t:
    o.setdefault(x[0], []).append(x[1])  

Now, convert o.items() to a list:

list(o.items())
# [(1, [2, 3, 4]), (2, [1, 3])]


回答3:

May be if the input list is already ordered, it is not required to use any other ordering function or feature to again reorder the list. Below code will automatically give the output as you've shown.

mylistarr = ((1, 2), (1, 3), (1, 4), (2, 1), (2, 3))
output = dict()
for tuple in mylistarr:
    if tuple[0] not in anotherlist:
        output[tuple[0]] = list()
        output[tuple[0]].append(tuple[0])
    output[tuple[0]].append(tuple[1])
print output

Output: {1: [1, 2, 3, 4], 2: [2, 1, 3]}