How can I determine the display width of a Unicode string in Python 3.x, and is there a way to use that information to align those strings with str.format()?
Motivating example: Printing a table of strings to the console. Some of the strings contain non-ASCII characters.
>>> for title in d.keys():
...     print("{:<20} | {}".format(title, d[title]))
zootehni-            | zooteh.
zootekni-            | zootek.
zoothèque           | zooth.
zooveterinar-        | zoovet.
zoovetinstitut-      | zoovetinst.
母                    | 母母
>>> s = 'è'
>>> len(s)
2
>>> [ord(c) for c in s]
[101, 768]
>>> import unicodedata
>>> unicodedata.name(s[1])
'COMBINING GRAVE ACCENT'
>>> s2 = '母'
>>> len(s2)
1
As can be seen, str.format() simply takes the number of code points in the string (len(s)) as its width, leading to skewed columns in the output. Searching through the unicodedata module, I have not found anything suggesting a solution.
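To make the padding behaviour concrete, here is a minimal sketch. The helper name pad_report is made up for this illustration, and the sample strings are taken from the table above with the combining accent and the CJK character written as escapes. It reports the number of code points in each string next to the length of the padded field:

```python
def pad_report(text, width=20):
    """Return (code points in text, code points in the left-aligned padded field)."""
    padded = "{:<{w}}".format(text, w=width)
    return len(text), len(padded)


# 'e' + U+0300 COMBINING GRAVE ACCENT renders as one column but counts as two code points.
# U+6BCD typically renders two columns wide but counts as one code point.
samples = ["zootehni-", "zoothe\u0300que", "\u6bcd"]

for text in samples:
    code_points, padded_length = pad_report(text)
    print(repr(text), code_points, padded_length)
```

Every padded field comes out at exactly 20 code points, regardless of how many terminal columns the text occupies, which is why the separator drifts left for the combining accent and right for the wide character.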
Unicode normalization can fix the problem for è, but not for East Asian characters, which are often displayed wider than one column. Similarly, zero-width Unicode characters exist (e.g. the zero-width space, which allows line breaks within words). You can't work around these issues with normalization, so please do not suggest "normalize your strings".
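A small sketch of why normalization alone does not help, using only unicodedata.normalize. The helper nfc_len and the variable names are just for this illustration:

```python
import unicodedata


def nfc_len(text):
    """Number of code points left after NFC normalization."""
    return len(unicodedata.normalize("NFC", text))


decomposed = "e\u0300"  # 'e' + COMBINING GRAVE ACCENT
cjk = "\u6bcd"          # a CJK ideograph, usually drawn two columns wide
zwsp = "\u200b"         # ZERO WIDTH SPACE, drawn zero columns wide

for label, text in [("decomposed", decomposed), ("cjk", cjk), ("zwsp", zwsp)]:
    print(label, len(text), nfc_len(text))
```

Only the first string changes (from 2 code points to 1). The wide character and the zero-width space each remain a single code point, so len() still disagrees with what the terminal shows.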
Edit: Added info about normalization.
Edit 2: My original dataset also contains some European combining characters that do not collapse into a single code point even after normalization:
zwemwater            | zwemw.
zwia̢z-              | zw.
>>> s3 = 'a\u0322' # The 'a + combining retroflex hook below' from zwiaz
>>> len(unicodedata.normalize('NFC', s3))
2