I have a large list of tuples, e.g. [ (1,2), (1,3), (1,4), (2,1), (2,3) ], and I want to convert it to [ (1, [2,3,4]), (2, [1,3]) ] efficiently. That is, I am grouping the tuples by their first element, so (1,2), (1,3), (1,4) becomes (1, [2,3,4]) (also see the Haskell version below). I doubt this can be done in a single pass, or can it? The input list is always ordered.
In Python I tried using defaultdict, which I thought was the natural solution without reinventing the wheel. It works well, but it does not preserve the order of the keys. One solution is to use an ordered defaultdict, as explained here.
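For reference, here is a minimal sketch of that ordered-defaultdict idea (the class and function names are mine, not taken from the linked answer):

from collections import OrderedDict

class OrderedDefaultDict(OrderedDict):
    # Missing keys get a fresh empty list; insertion order is preserved.
    def __missing__(self, key):
        self[key] = value = []
        return value

def group_ordered(pairs):
    d = OrderedDefaultDict()
    for k, v in pairs:
        d[k].append(v)
    return list(d.items())

# group_ordered([(1,2), (1,3), (1,4), (2,1), (2,3)]) -> [(1, [2, 3, 4]), (2, [1, 3])]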
Anyway, I'd like to know a language-independent, efficient solution to this problem. My current solution requires two passes and one call to set() on a list.
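For concreteness, the two-pass approach looks roughly like this (a simplified sketch, assuming the keys come in ascending order; not the exact code):

def group_two_pass(pairs):
    # Pass 1: collect the distinct keys; set() drops duplicates,
    # sorted() restores the (already ascending) key order.
    keys = sorted(set(k for k, _ in pairs))
    groups = {k: [] for k in keys}
    # Pass 2: distribute each second element into its key's list.
    for k, v in pairs:
        groups[k].append(v)
    return [(k, groups[k]) for k in keys]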
Update
I am thinking of implementing the following Haskell version (run in GHCi, with Data.List imported for groupBy):
a = [ (1,2), (1,3), (1,4), (2,1), (2,3) ]
b = groupBy (\x y -> fst x == fst y) a
b
[[(1,2),(1,3),(1,4)],[(2,1),(2,3)]]
map (\x -> (fst . head $ x, map snd x)) b
[(1,[2,3,4]),(2,[1,3])]
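A Python counterpart of this Haskell version, assuming (as stated above) that the input is already sorted by its first elements, is a single pass with itertools.groupby:

from itertools import groupby
from operator import itemgetter

a = [(1, 2), (1, 3), (1, 4), (2, 1), (2, 3)]
# groupby only merges runs of equal keys, which is enough for an ordered input
b = [(k, [v for _, v in grp]) for k, grp in groupby(a, key=itemgetter(0))]
# b == [(1, [2, 3, 4]), (2, [1, 3])]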
Performance of answers
I implemented two answers (coldspeed's and PM 2Ring's). On moderate-size lists (up to 10^4 elements) PM 2Ring's solution is faster; at size 10^5 both take about the same time, and on larger lists coldspeed's starts winning. Below are the numbers (with python3). The first column is the number of entries in the list, the second is the time taken by coldspeed's solution, and the third is the time taken by PM 2Ring's solution. All times are in seconds.
entries  coldspeed  PM 2Ring
10       0.0001     0.0000
100      0.0001     0.0000
1000     0.0005     0.0001
10000    0.0044     0.0014
100000   0.0517     0.0452
1000000  0.5579     1.5249
The script is here: http://github.com/dilawar/playground/raw/master/Python/so_group_tuple.py
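The timing loop in that script has roughly this shape (a simplified sketch, not the exact script; group_stand_in is only a placeholder for the two answers being compared):

import time
from collections import defaultdict

def group_stand_in(pairs):
    # Stand-in for the timed solutions; the real script times the two answers' code.
    d = defaultdict(list)
    for k, v in pairs:
        d[k].append(v)
    return list(d.items())

for n in (10, 100, 1000, 10**4, 10**5, 10**6):
    data = [(i // 3, i % 3) for i in range(n)]   # input sorted by the first element
    t0 = time.time()
    group_stand_in(data)
    print(n, time.time() - t0)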
With Ashwini's optimization
With Ashwini's suggestions, PM 2Ring's solution is even faster (roughly 3x-5x).
entries   coldspeed                PM 2Ring
10        4.887580871582031e-05    1.2636184692382812e-05
100       0.00010132789611816406   2.0742416381835938e-05
1000      0.0005109310150146484    0.000110626220703125
10000     0.004467487335205078     0.0009067058563232422
100000    0.05056118965148926      0.017516136169433594
1000000   0.6100358963012695       0.26450490951538086
10000000  6.092756509780884        2.8253660202026367
With PyPy
The results are somewhat mixed. Each line below is (entries, [coldspeed time, PM 2Ring time], ratio); the last value is the ratio of the two times (coldspeed / PM 2Ring).
pypy so_group_tuple.py
(10, [1.6927719116210938e-05, 3.409385681152344e-05], 0.4965034965034965)
(100, [4.601478576660156e-05, 8.296966552734375e-05], 0.5545977011494253)
(1000, [0.010258913040161133, 0.0019040107727050781], 5.388054094665665)
(10000, [0.0002448558807373047, 0.00021600723266601562], 1.1335540838852096)
(100000, [0.002658843994140625, 0.0018231868743896484], 1.45834967961292)
(1000000, [0.0833890438079834, 0.02979302406311035], 2.7989452709245284)
(10000000, [1.0556740760803223, 0.6789278984069824], 1.5549133841124023)
I am going with PM 2Ring's solution since it is much faster up to list size 10^5.