Display width of unicode strings in Python [duplic

2020-02-13 14:06发布

问题:

This question already has answers here:
Closed 5 years ago.

How can I determine the display width of a Unicode string in Python 3.x, and is there a way to use that information to align those strings with str.format()?

Motivating example: Printing a table of strings to the console. Some of the strings contain non-ASCII characters.

>>> for title in d.keys():
>>>     print("{:<20} | {}".format(title, d[title]))

    zootehni-           | zooteh.
    zootekni-           | zootek.
    zoothèque          | zooth.
    zooveterinar-       | zoovet.
    zoovetinstitut-     | zoovetinst.
    母                   | 母母

>>> s = 'è'
>>> len(s)
    2
>>> [ord(c) for c in s]
    [101, 768]
>>> unicodedata.name(s[1])
    'COMBINING GRAVE ACCENT'
>>> s2 = '母'
>>> len(s2)
    1

As can be seen, str.format() simply takes the number of code-points in the string (len(s)) as its width, leading to skewed columns in the output. Searching through the unicodedata module, I have not found anything suggesting a solution.

Unicode normalization can fix the problem for è, but not for Asian characters, which often have larger display width. Similarly, zero-width unicode characters exist (e.g. zero-width space for allowing line breaks within words). You can't work around these issues with normalization, so please do not suggest "normalize your strings".

Edit: Added info about normalization.

Edit 2: In my original dataset also have some European combining characters that don't result in a single code-point even after normalization:

    zwemwater     | zwemw.
    zwia̢z-       | zw.

>>> s3 = 'a\u0322'   # The 'a + combining retroflex hook below' from zwiaz
>>> len(unicodedata.normalize('NFC', s3))
    2

回答1:

You have several options:

  1. Some consoles support escape sequences for pixel-exact positioning of the cursor. Might cause some overprinting, though.

    Historical note: This approach was used in the Amiga terminal to display images in a console window by printing a line of text and then advancing the cursor down by one pixel. The leftover pixels of the text line slowly built an image.

  2. Create a table in your code which contains the real (pixel) widths of all Unicode characters in the font that is used in the console / terminal window. Use a UI framework and a small Python script to generate this table.

    Then add code which calculates the real width of the text using this table. The result might not be a multiple of the character width in the console, though. Together with pixel-exact cursor movement, this might solve your issue.

    Note: You'll have to add special handling for ligatures (fi, fl) and composites. Alternatively, you can load a UI framework without opening a window and use the graphics primitives to calculate the string widths.

  3. Use the tab character (\t) to indent. But that will only help if your shell actually uses the real text width to place the cursor. Many terminals will simply count characters.

  4. Create a HTML file with a table and look at it in a browser.