I found this comprehension that works perfectly for flattening a list of lists:
>>> list_of_lists = [(1,2,3),(2,3,4),(3,4,5)]
>>> [item for sublist in list_of_lists for item in sublist]
[1, 2, 3, 2, 3, 4, 3, 4, 5]
I like this better than using itertools.chain(), but I just can't understand it. I've tried surrounding parts with parentheses to see if I could reduce the complexity, but now I'm just more confused:
>>> [(item for sublist in list_of_lists) for item in sublist]
[<generator object <genexpr> at 0x7ff919fdfd20>, <generator object <genexpr> at 0x7ff919fdfd70>, <generator object <genexpr> at 0x7ff919fdfdc0>]
>>> [item for sublist in (list_of_lists for item in sublist)]
[5, 5, 5]
I get the feeling that I'm having a hard time because I don't quite understand how generators work... I mean, I thought I did, but now I'm seriously in doubt. Like I said, I love how compact this idiom is, and it's exactly what I need, but I'm loath to use code that I don't understand.
Can anyone explain what exactly is happening here?
Read the for loops as if they were nested, from left to right. The expression on the left is the one that produces each value in the final list:
for sublist in list_of_lists:
    for item in sublist:
        item # added to the list
List comprehensions also support if tests to filter which elements are used; these can also be seen as nested statements, in the same way as the for loops.
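To make the equivalence concrete, here is a small sketch that builds the same list twice, once with a comprehension and once with explicit nested statements. The `item % 2 == 0` filter and the names `evens` and `evens_loop` are made up for illustration; they are not from the question.

```python
list_of_lists = [(1, 2, 3), (2, 3, 4), (3, 4, 5)]

# Comprehension: the for loops and the if test are read left to right.
evens = [item for sublist in list_of_lists for item in sublist if item % 2 == 0]

# The same thing written as nested statements.
evens_loop = []
for sublist in list_of_lists:
    for item in sublist:
        if item % 2 == 0:
            evens_loop.append(item)  # the left-hand expression goes here

print(evens)       # [2, 2, 4, 4]
print(evens_loop)  # [2, 2, 4, 4]
```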
By adding parentheses, you changed the expression; everything in parentheses is now the left-hand expression to add:
for item in sublist:
    (item for sublist in list_of_lists) # added to the list
A for loop in parentheses like that is a generator expression. It works exactly like a list comprehension except that it doesn't build a list; the elements are instead produced on demand. You can ask a generator expression for the next value, then the next value, and so on.
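A minimal sketch of that on-demand behaviour, using the data from the question. The variable names `gen`, `first`, `second` and `rest` are only illustrative.

```python
list_of_lists = [(1, 2, 3), (2, 3, 4), (3, 4, 5)]

# Parentheses instead of brackets: no items are produced yet.
gen = (item for sublist in list_of_lists for item in sublist)

first = next(gen)   # 1
second = next(gen)  # 2

# Whatever is left can still be consumed, for example by list().
rest = list(gen)    # [3, 2, 3, 4, 3, 4, 5]
```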
In this case, there must be a pre-existing sublist object for this to work at all; the outer loop is not over list_of_lists anymore, after all.
Your last attempt translates to:
for sublist in (list_of_lists for item in sublist):
    item # added to the list
Here list_of_lists is the value produced by a generator expression that loops over for item in sublist. Again, sublist must exist already for this to work. The loop then adds a pre-existing item to the final list output.
In your case, apparently sublist is a list with 3 items in it; your final list produced 3 elements. item was bound to 5, so you got 3 times 5 in your output.
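The [5, 5, 5] result can be reproduced in a fresh session by defining the two names by hand. This is a sketch under an assumption: that `sublist` and `item` were left over from the earlier comprehension (Python 2 list comprehensions leak their loop variables; Python 3 ones do not, so there the names have to be created explicitly).

```python
list_of_lists = [(1, 2, 3), (2, 3, 4), (3, 4, 5)]

# Assumed leftovers from an earlier comprehension in the same session.
sublist = (3, 4, 5)
item = 5

# The generator expression loops over the pre-existing sublist (3 elements)
# and produces list_of_lists each time; the list comprehension then adds
# the pre-existing item once per produced value.
result = [item for sublist in (list_of_lists for item in sublist)]
print(result)  # [5, 5, 5]
```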
List Comprehension
When I first started with list comprehensions, I read them like English sentences and was able to understand them easily. For example,
[item for sublist in list_of_lists for item in sublist]
can be read like
for each sublist in list_of_lists and for each item in sublist add item
Also, the filtering part can be read as
for each sublist in list_of_lists and for each item in sublist add item only if it is valid
And the corresponding comprehension would be
[item for sublist in list_of_lists for item in sublist if valid(item)]
Generators
They are like land mines, triggered only when invoked with the next protocol. They are similar to functions, but until an exception is raised or the end of the function is reached, they are not exhausted and can be invoked again and again. The important thing is that they retain their state between the previous invocation and the current one.
The difference between a generator and a function is that generators use the yield keyword to give a value back to the invoker. Generator expressions are similar to list comprehensions: the first expression is the actual value being "yielded".
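As a rough illustration of yield and of the retained state, here is a hypothetical generator function named `flatten` (not something from the question) that does the same job as the flattening expression.

```python
def flatten(list_of_lists):
    """Hypothetical generator function; yields every item of every sublist."""
    for sublist in list_of_lists:
        for item in sublist:
            yield item  # hand one value back, then pause right here


gen = flatten([(1, 2), (3,)])
a = next(gen)  # 1
b = next(gen)  # 2, the generator resumed where it left off
c = next(gen)  # 3

try:
    next(gen)
    exhausted = False
except StopIteration:
    exhausted = True  # the end of the function was reached

print(a, b, c, exhausted)  # 1 2 3 True
```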
With this basic understanding, if we look at your expressions in the question,
[(item for sublist in list_of_lists) for item in sublist]
You are mixing a list comprehension with a generator expression. This reads as:
for each item in sublist add a generator expression which is defined as, for every sublist in list_of_lists yield item
which is not what you had in mind. Since the generator expression is never iterated, the generator object itself is added to the list as it is. Generator expressions are not evaluated until they are invoked with the next protocol, so errors in their bodies (other than syntax errors) do not show up when they are created. In this case, though, the code will produce a runtime error if sublist is not defined yet.
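A small sketch of this case, assuming a `sublist` already exists (without it the line raises a NameError). The names `gens`, `all_generators` and `values` are illustrative only.

```python
import types

list_of_lists = [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
sublist = (3, 4, 5)  # assumed to exist already

gens = [(item for sublist in list_of_lists) for item in sublist]

# One generator object per element of the pre-existing sublist; none has run yet.
all_generators = all(isinstance(g, types.GeneratorType) for g in gens)

# Iterating one of them finally runs its loop: one value per sublist in list_of_lists.
values = list(gens[0])

print(len(gens), all_generators, len(values))  # 3 True 3
```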
Also, in the last case,
[item for sublist in (list_of_lists for item in sublist)]
This reads as: for each sublist in the generator expression, add item, where the generator expression is defined as: for each item in sublist, yield list_of_lists.
The for loop will iterate over any iterable using the next protocol. So the generator expression will be evaluated, and the item you add to the list is always whatever item was already bound to (here, the last element of the sublist). This will also produce a runtime error if sublist is not defined yet.
The list comprehension works like this:
[<what I want> <for loops in the order you'd write them naturally>]
In this case, <what I want> is every item in every sublist. To get those items, you just loop over the sublists in the original list, and save/yield each item in the sublist. Thus, the order of the for loops in the list comprehension is the same order you would have used if you did not use a list comprehension. The only confusing part is that the <what I want> comes first, and not inside the body of the last loop.