I was playing around to get a better feel for itertools.groupby, so I grouped a list of tuples by the number and tried to get a list of the resulting groups. When I convert the result of groupby to a list, however, I get a strange result: all but the last group are empty. Why is that? I assumed that turning an iterator into a list would be less efficient but would never change behavior. I guess the lists are empty because the inner iterators have already been traversed, but when and where does that happen?
import itertools
l=list(zip([1,2,2,3,3,3],['a','b','c','d','e','f']))
#[(1, 'a'), (2, 'b'), (2, 'c'), (3, 'd'), (3, 'e'), (3, 'f')]
grouped_l = list(itertools.groupby(l, key=lambda x:x[0]))
#[(1, <itertools._grouper at ...>), (2, <itertools._grouper at ...>), (3, <itertools._grouper at ...>)]
[list(x[1]) for x in grouped_l]
#[[], [], [(3, 'f')]]
grouped_i = itertools.groupby(l, key=lambda x:x[0])
#<itertools.groupby at ...>
[list(x[1]) for x in grouped_i]
#[[(1, 'a')], [(2, 'b'), (2, 'c')], [(3, 'd'), (3, 'e'), (3, 'f')]]
From the itertools.groupby() documentation:
The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible.
Turning the output from groupby() into a list advances the groupby() object all the way to the end before any of the groups has been read.
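To see that effect in isolation, here is a minimal sketch (the `pairs` data and the variable names are made up for illustration). It holds on to the first group, advances groupby() once, and only then tries to read the first group:

```python
from itertools import groupby

# Hypothetical sample data, already ordered by the grouping key.
pairs = [(1, 'a'), (2, 'b'), (2, 'c'), (3, 'd')]
groups = groupby(pairs, key=lambda x: x[0])

key1, group1 = next(groups)   # first group, members not read yet
key2, group2 = next(groups)   # advancing groupby() moves past group 1

stale = list(group1)          # group 1 is no longer visible
fresh = list(group2)          # group 2 is the current group, so it is intact

print(stale)
print(fresh)
```

Reading `group1` before the second `next(groups)` call would have returned its single member instead.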
Hence, you shouldn't convert the itertools.groupby object to a list directly. If you want to store the groups as lists, use something like this list comprehension, which copies each group out of the groupby object as it is produced:
grouped_l = [(a, list(b)) for a, b in itertools.groupby(l, key=lambda x:x[0])]
This lets you iterate over the resulting list (built from the groupby object) multiple times. However, if you only need to iterate over the result once, the second approach from your question is sufficient.
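As a quick check that the materialized version really can be reused, here is a small sketch built on the data from the question. The names `first_pass` and `second_pass` are only for illustration:

```python
import itertools

l = list(zip([1, 2, 2, 3, 3, 3], ['a', 'b', 'c', 'd', 'e', 'f']))

# Copy every group into a real list while that group is still current.
grouped_l = [(k, list(g)) for k, g in itertools.groupby(l, key=lambda x: x[0])]

# The stored lists are independent of groupby, so both passes see the same data.
first_pass = [members for _, members in grouped_l]
second_pass = [members for _, members in grouped_l]

print(first_pass == second_pass)
print([(k, len(members)) for k, members in grouped_l])
```

The cost is that all groups are held in memory at once, which is the trade-off groupby avoids by default.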
groupby is super lazy. Here's an illuminating demo. Let's group three a-values and four b-values, and print out what's happening:
>>> from itertools import groupby
>>> def letters():
...     for letter in 'a', 'a', 'a', 'b', 'b', 'b', 'b':
...         print('yielding', letter)
...         yield letter
Going through the groups WITHOUT looking at their members
Let's roll:
>>> groups = groupby(letters())
>>>
Nothing got printed yet! So until now, groupby did nothing. What a lazy bum. Let's ask it for the first group:
>>> next(groups)
yielding a
('a', <itertools._grouper object at 0x05A16050>)
So groupby tells us that this is a group of a-values, and we could go through that _grouper object to get them all. But wait, why did "yielding a" get printed only once? Our generator is yielding three of them, isn't it? Well, that's because groupby is lazy. It did read one value to identify the group, because it needs to tell us what the group is about, i.e., that it's a group of a-values. And it offers us that _grouper object so we can get all the group's members if we want to. But we didn't ask to go through the members, so the lazy bum didn't go any further. It simply didn't have a reason to. Let's ask for the next group:
>>> next(groups)
yielding a
yielding a
yielding b
('b', <itertools._grouper object at 0x05A00FD0>)
Wait, what? Why "yielding a" when we're now dealing with the second group, the group of b-values? Well, because groupby previously stopped after the first a, since that was enough to give us all we had asked for. But now, to tell us about the second group, it has to find the second group, and for this it asks our generator until it sees something other than a. Note that "yielding b" is again only printed once, even though our generator yields four of them. Let's ask for the third group:
>>> next(groups)
yielding b
yielding b
yielding b
Traceback (most recent call last):
  File "<pyshell#32>", line 1, in <module>
    next(groups)
StopIteration
Ok, so there is no third group and thus groupby raises a StopIteration so the consumer (e.g., a loop or list comprehension) would know to stop. But before that, the remaining "yielding b" get printed, because groupby got off its lazy butt and walked over the remaining values in hopes of finding a new group.
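If you'd rather measure this than read print output, here's a rough sketch of the same experiment. It assumes a hypothetical `consumed` list that records every letter the generator hands out, and it notes the list's length each time groupby gives us a group, without ever touching the members:

```python
from itertools import groupby

consumed = []  # hypothetical log: every letter the source has handed out

def letters():
    for letter in 'aaabbbb':
        consumed.append(letter)
        yield letter

groups = groupby(letters())
snapshots = [len(consumed)]          # right after creating groupby

for key, _members in groups:         # the members are never read
    snapshots.append(len(consumed))  # pulled so far when this group is announced

snapshots.append(len(consumed))      # after groupby signalled the end
print(snapshots)
```

The jumps between snapshots match the print output above: one value to identify the first group, three more to find the second, and the last three while looking for a third group that doesn't exist.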
Going through the groups WITH looking at their members
Let's try again, this time let's ask for the members:
>>> groups = groupby(letters())
>>> key, members = next(groups)
yielding a
>>> key
'a'
Again, groupby asked our generator for just a single value, in order to identify the group so it can tell us that it's an a-group. But this time, we'll also ask for the group members:
>>> list(members)
yielding a
yielding a
yielding b
['a', 'a', 'a']
Aha! There are the remaining "yielding a". Also, already the first "yielding b"! Even though we didn't even ask for the second group yet! But of course groupby has to go this far because we asked for the group members, so it has to keep looking until it gets a non-member. Let's get the next group:
>>> key, members = next(groups)
>>>
Wait, what? Nothing got printed at all? Is groupby sleeping? Wake up! Oh wait... that's right... it already found out that the next group is b-values. Let's ask for all of them:
>>> list(members)
yielding b
yielding b
yielding b
['b', 'b', 'b', 'b']
Now the remaining three "yielding b" happen, because we asked for them so groupby has to get them.
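Here's the same walk-through as a small script, in case you want to run it in one go. Again the `consumed` list and the `log` tuples are my own scaffolding, not anything groupby provides. For every group it records how many letters had been pulled when the group was announced, and how many after its members were read:

```python
from itertools import groupby

consumed = []  # hypothetical log: every letter the source has handed out

def letters():
    for letter in 'aaabbbb':
        consumed.append(letter)
        yield letter

log = []
for key, members in groupby(letters()):
    pulled_before = len(consumed)   # pulled so far when the group is announced
    items = list(members)           # now actually read the members
    log.append((key, pulled_before, len(consumed), items))

for entry in log:
    print(entry)
```

Note that the b-group is announced without pulling anything new: reading the a-members already had to pull the first b to know where the a-group ends.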
Why doesn't it work to get the group members afterwards?
Let's try it your initial way with list(groupby(...)):
>>> groups = list(groupby(letters()))
yielding a
yielding a
yielding a
yielding b
yielding b
yielding b
yielding b
>>> [list(members) for key, members in groups]
[[], ['b']]
Note that not only is the first group empty, but also, the second group only has one element (you didn't mention that).
Why?
Again: groupby is super lazy. It offers you those _grouper objects so you can go through each group's members. But if you don't ask to see the group members and instead just ask for the next group to be identified, then groupby just shrugs and is like "Ok, you're the boss, I'll just go find the next group".
What your list(groupby(...)) does is ask groupby to identify all groups. So it does that. But if you then at the end ask for the members of each group, then groupby is like "Dude... I'm sorry, I offered them to you but you didn't want them. And I'm lazy, so I don't keep things around for no good reason. I can give you the last member of the last group, because I still remember that one, but for everything before that... sorry, I just don't have them anymore, you should've told me that you wanted them".
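A compact way to compare the two orders side by side is sketched below; `late` and `eager` are names I made up. What, if anything, is left over in the last group seems to depend on the Python version, so the sketch only looks at the groups before it:

```python
from itertools import groupby

data = 'aaabbbb'

# Identify all groups first, read the members afterwards (too late).
late = [list(members) for _key, members in list(groupby(data))]

# Read each group's members while that group is still the current one.
eager = [list(members) for _key, members in groupby(data)]

print(late[:-1])   # every group before the last one
print(eager)
```

The only difference between the two lines is the inner list(...) call around groupby, which is exactly what forces all groups to be identified before any members are requested.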
P.S. In all of this, of course "lazy" really means "efficient". Not something bad but something good!
Summary: The reason is that itertools generally do not store data. They just consume an iterator. So when the outer iterator advances, the inner iterator must as well.
Analogy: Imagine you are a flight attendant standing at the door, admitting a single line of passengers to an aircraft. The passengers are arranged by boarding group, but you can only see and admit them one at a time. As people enter, you learn when one boarding group has ended and the next has begun.
To advance to the next group, you're going to have to admit all the remaining passengers in the current group. You can't see what is downstream in line without letting all the current passengers through.
Unix comparison: The design of groupby() is algorithmically similar to the Unix uniq utility.
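To make the uniq comparison concrete, here is a small sketch (the sample `lines` list is made up). Taking only the keys behaves like plain uniq, and counting each group's members behaves like uniq -c:

```python
from itertools import groupby

# Hypothetical input: note that 'a' appears in two separate runs.
lines = ['a', 'a', 'b', 'a', 'a', 'a', 'c']

# Like `uniq`: one entry per run of equal adjacent items.
uniq = [key for key, _group in groupby(lines)]

# Like `uniq -c`: the run length next to each entry.
uniq_c = [(sum(1 for _ in group), key) for key, group in groupby(lines)]

print(uniq)
print(uniq_c)
```

As with uniq, only adjacent duplicates are merged, so the two runs of 'a' stay separate. That is why unsorted input is usually sorted by the same key first.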
What the docs say: "The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible."
How to use it: If the data is needed later, it should be stored as a list:
groups = []
uniquekeys = []
data = sorted(data, key=keyfunc)
for k, g in groupby(data, keyfunc):
    groups.append(list(g))    # Store group iterator as a list
    uniquekeys.append(k)
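For a runnable version of that recipe, here is a sketch with hypothetical sample data (`data` as a list of words and `keyfunc = len` are assumptions, not part of the snippet above). It also shows what the keys look like if the sort is skipped:

```python
from itertools import groupby

# Hypothetical sample data and key function, chosen only for illustration.
data = ['pear', 'fig', 'apple', 'kiwi', 'plum', 'date']
keyfunc = len

groups = []
uniquekeys = []
for k, g in groupby(sorted(data, key=keyfunc), keyfunc):
    groups.append(list(g))    # store the group's members while it is current
    uniquekeys.append(k)

# Without sorting, equal keys that are not adjacent form separate groups.
unsorted_keys = [k for k, _g in groupby(data, keyfunc)]

print(uniquekeys, groups)
print(unsorted_keys)
```

Because sorted() is stable, the words inside each group keep their original relative order.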