Python Interpreter String Pooling Optimization [duplicate]

Posted 2019-04-28 15:33

Question:

This question already has an answer here:

  • When does python choose to intern a string [duplicate] 3 answers

After seeing this question and its duplicate, one question still remained for me.

I understand what is and == do, and why, if I run

a = "ab"
b = "ab"

a == b

I get True. The question here would be WHY this happens:

a = "ab"
b = "ab"
a is b # Returns True

So I did my research and I found this. The answer says the Python interpreter uses string pooling: if it sees that two strings are the same, it reuses the existing object for the new one as an optimization, so both names end up with the same id.

Until here everything is alright and answered. My real question is why this pooling only happens for some strings. Here is an example:

a = "ab"
b = "ab"
a is b # Returns True, as expected knowing Interpreter uses string pooling

a = "a_b"
b = "a_b"
a is b # Returns True, again, as expected knowing Interpreter uses string pooling

a = "a b"
b = "a b"
a is b # Returns False, why??

a = "a-b"
b = "a-b"
a is b # Returns False, WHY??

So it seems that for some characters string pooling isn't working. I used Python 2.7.6 for these examples, so I thought this would be fixed in Python 3. But after trying the same examples in Python 3, I get the same results.

Question: Why isn't string pooling applied in these examples? Wouldn't it be better for Python to optimize these as well?


Edit: If I run "a b" is "a b", it returns True. The question is why, when using variables, it returns False for some characters but True for others.
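One way to probe the difference described in the edit is to compare literals that are compiled together with literals that are compiled separately, which is roughly what happens when each line is typed at the interactive prompt. This is a minimal sketch; the function names are made up for illustration, and the results are CPython implementation details that may differ between versions:

```python
def same_code_object():
    # Both literals are compiled as part of one function body, so the
    # CPython compiler can store a single constant and reuse it.
    a = "a b"
    b = "a b"
    return a is b


def separate_compilation():
    # Each exec() call compiles its source on its own, much like each
    # statement entered separately at the interactive prompt.
    ns = {}
    exec('a = "a b"', ns)
    exec('b = "a b"', ns)
    return ns["a"] is ns["b"]


together = same_code_object()
apart = separate_compilation()
print("compiled together:", together)
print("compiled separately:", apart)
```

On CPython the first call typically reports True because equal constants within one compiled unit are merged, independently of interning; the second result depends on whether the interpreter chose to intern the string.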

Answer 1:

Your question is a duplicate of a more general question, "When does python choose to intern a string", the correct answer to which is that string interning is implementation-specific.

Interning of strings in CPython 2.7.7 is described very well in this article: The internals of Python string interning. The information there is enough to explain your examples.

The reason that the strings "ab" and "a_b" are interned, whereas "a b" and "a-b" aren't, is that the former look like Python identifiers and the latter don't.

Naturally, interning every single string would incur a runtime cost. Therefore the interpreter must decide whether a given string is worth interning. Since the names of identifiers used in a Python program are embedded in the program's bytecode as strings, identifier-like strings have a higher chance of benefiting from interning.
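When identity comparison on arbitrary strings is actually wanted, interning can be requested explicitly instead of being left to the interpreter's heuristics. The following is a minimal sketch, assuming Python 3, where the function is sys.intern (in Python 2 it was the builtin intern); the helper name build is made up for illustration:

```python
import sys


def build(parts):
    # Join at runtime so the result is a freshly created string object
    # rather than a constant produced by the compiler.
    return "-".join(parts)


a = build(["a", "b"])
b = build(["a", "b"])

equal = a == b            # compares contents
same_before = a is b      # implementation-specific, do not rely on it

# sys.intern returns the one canonical object for a given string value.
ia = sys.intern(a)
ib = sys.intern(b)
same_after = ia is ib

print(equal, same_before, same_after)
```

Comparing with == remains the right tool for checking string contents; explicit interning mainly helps when many equal strings are kept alive or compared repeatedly.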

A short excerpt from the above article:

The function all_name_chars rules out strings that are not composed solely of ASCII letters, digits or underscores, i.e. strings that do not look like identifiers:

#define NAME_CHARS \
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"

/* all_name_chars(s): true iff all chars in s are valid NAME_CHARS */

static int
all_name_chars(unsigned char *s)
{
    static char ok_name_char[256];
    static unsigned char *name_chars = (unsigned char *)NAME_CHARS;

    if (ok_name_char[*name_chars] == 0) {
        unsigned char *p;
        for (p = name_chars; *p; p++)
            ok_name_char[*p] = 1;
    }
    while (*s) {
        if (ok_name_char[*s++] == 0)
            return 0;
    }
    return 1;
}

With all these explanations in mind, we now understand why 'foo!' is 'foo!' evaluates to False whereas 'foo' is 'foo' evaluates to True.
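As a rough illustration, the character test made by all_name_chars can be mirrored in Python and applied to the strings from the question. This is only a sketch of that one check, not of the rest of CPython's interning logic; the name all_name_chars_py is hypothetical:

```python
import string

# Same character set as NAME_CHARS in the C excerpt.
NAME_CHARS = frozenset(string.digits + string.ascii_letters + "_")


def all_name_chars_py(s):
    """Return True if every character of s is an ASCII letter, digit or underscore."""
    return all(ch in NAME_CHARS for ch in s)


results = {s: all_name_chars_py(s) for s in ["ab", "a_b", "a b", "a-b"]}
for s, ok in results.items():
    print(repr(s), ok)
```

The strings that pass this check, "ab" and "a_b", are the ones the question saw being shared, while "a b" and "a-b" fail because of the space and the hyphen.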