Is there a Python constant for Unicode whitespace?

2019-04-18 10:51发布

问题:

The string module contains a whitespace attribute, which is a string consisting of all the ASCII characters that are considered whitespace. Is there a corresponding constant that includes Unicode spaces too, such as the no-break space (U+00A0)? We can see from the question "strip() and strip(string.whitespace) give different results" that at least strip is aware of additional Unicode whitespace characters.

This question was identified as a duplicate of In Python, how to list all characters matched by POSIX extended regex [:space:]?, but the answers to that question identify ways of searching for whitespace characters to generate your own list. This is a time-consuming process. My question was specifically about a constant.

回答1:

Is there a Python constant for Unicode whitespace?

Short answer: No. I have personally grepped for these characters (specifically, the numeric code points) in the Python code base, and such a constant is not there.

The sections below explains why it is not necessary, and how it is implemented without this information being available as a constant. But having such a constant would also be a really bad idea.

If the Unicode Consortium added another character/code-point that is semantically whitespace, the maintainers of Python would have a poor choice between continuing to support semantically incorrect code or changing the constant and possibly breaking pre-existing code that might (inadvisably) make assumptions about the constant not changing.

How could it add these character code-points? There are 1,111,998 possible characters in Unicode. But only 120,672 are occupied as of version 8. Each new version of Unicode may add additional characters. One of these new characters might be a form of whitespace.

The information is stored in a dynamically generated C function

The code that determines what is whitespace in unicode is the following dynamically generated code.

# Generate code for _PyUnicode_IsWhitespace()
print("/* Returns 1 for Unicode characters having the bidirectional", file=fp)
print(" * type 'WS', 'B' or 'S' or the category 'Zs', 0 otherwise.", file=fp)
print(" */", file=fp)
print('int _PyUnicode_IsWhitespace(const Py_UCS4 ch)', file=fp)
print('{', file=fp)
print('    switch (ch) {', file=fp)
for codepoint in sorted(spaces):
    print('    case 0x%04X:' % (codepoint,), file=fp)
print('        return 1;', file=fp)
print('    }', file=fp)
print('    return 0;', file=fp)
print('}', file=fp)
print(file=fp)

This is a switch statement, which is a constant code block, but this information is not available as a module "constant" like the string module has. It is instead buried in the function compiled from C and not directly accessible from Python.

This is likely because as more code points are added to Unicode, we would not be able to change constants for backwards compatibility reasons.

The Generated Code

Here's the generated code currently at the tip:

int _PyUnicode_IsWhitespace(const Py_UCS4 ch)
{
    switch (ch) {
    case 0x0009:
    case 0x000A:
    case 0x000B:
    case 0x000C:
    case 0x000D:
    case 0x001C:
    case 0x001D:
    case 0x001E:
    case 0x001F:
    case 0x0020:
    case 0x0085:
    case 0x00A0:
    case 0x1680:
    case 0x2000:
    case 0x2001:
    case 0x2002:
    case 0x2003:
    case 0x2004:
    case 0x2005:
    case 0x2006:
    case 0x2007:
    case 0x2008:
    case 0x2009:
    case 0x200A:
    case 0x2028:
    case 0x2029:
    case 0x202F:
    case 0x205F:
    case 0x3000:
        return 1;
    }
    return 0;
}

Making your own constant:

The following code (from my answer here), in Python 3, generates a constant of all whitespace:

import re
import sys

s = ''.join(chr(c) for c in range(sys.maxunicode+1))
ws = ''.join(re.findall(r'\s', s))

As an optimization, you could store this in a code base, instead of auto-generating it every new process, but I would caution against assuming that it would never change.

>>> ws
'\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f \x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000'

(Other answers to the question linked show how to get that for Python 2.)

Remember that at one point, some people probably thought 256 character encodings was all that we'd ever need.

>>> import string
>>> string.whitespace
' \t\n\r\x0b\x0c'

If you're insisting on keeping a constant in your code base, just generate the constant for your version of Python, and store it as a literal:

unicode_whitespace = u'\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f \x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000'

The u prefix makes it unicode in Python 2 (2.7 happens to recognize the entire string above as whitespace too), and in Python 3 it is ignored as string literals are unicode by default.