Why the following two decoding methods return different results?
>>> import codecs
>>>
>>> data = ['', '', 'a', '']
>>> list(codecs.iterdecode(data, 'utf-8'))
[u'a']
>>> [codecs.decode(i, 'utf-8') for i in data]
[u'', u'', u'a', u'']
Is this a bug or expected behavior? My Python version 2.7.13.
This is normal.
iterdecode
takes an iterator over encoded chunks and returns an iterator over decoded chunks, but it doesn't promise a one-to-one correspondence. All it guarantees is that the concatenation of all output chunks is a valid decoding of the concatenation of all input chunks.If you look at the source code, you'll see it's explicitly discarding empty output chunks:
Be aware that the reason
iterdecode
exists, and the reason you wouldn't just calldecode
on all the chunks yourself, is that the decoding process is stateful. The UTF-8 encoded form of one character might be split over multiple chunks. Other codecs might have really weird stateful behavior, like maybe a byte sequence that inverts the case of all characters until you see that byte sequence again.