What are the rules for CPython's string interning?

Posted 2019-02-12 15:34

In Python 3.5, is it possible to predict when we will get an interned string and when we will get a copy? After reading a few Stack Overflow answers on this issue, I found this one the most helpful, but still not comprehensive. Then I looked at the Python docs, but according to them interning is not guaranteed by default:

Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys.

So my question is about the internal conditions for interning, i.e. the decision-making (whether to intern a string literal or not): why does the same piece of code work on one system and not on another, and what rules did the author of the answer in the mentioned topic mean when saying:

the rules for when this happens are quite convoluted

2 answers
beautiful°
#2 · 2019-02-12 15:57

From what I understood from the post you linked:

When you use if a == b, you are checking whether the value of a equals the value of b, whereas when you use if a is b, you are checking whether a and b are the same object (that is, whether they share the same spot in memory).

Now, CPython interns identifier-like string constants (literals such as "blabla"). So:

>>> a = "abcdef"
>>> a is "abcdef"
True

But when you do:

>>> a = "".join([chr(i) for i in range(ord('a'), ord('g'))])
>>> a
'abcdef'
>>> a is "abcdef"
False

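In the C programming language, identical string literals written with "" may be placed in the same read-only storage, so two literals can end up sharing an address. I think something similar is happening here: the literal is a compile-time constant, while the joined string is built at run time.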
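To make the difference concrete, here is a minimal sketch of the same comparison in script form. It assumes CPython; the variable names are only illustrative, and the result of the identity check on the run-time string is an implementation detail, so no output is claimed for it. The one thing that can be relied on is that sys.intern returns a single canonical object per string value.

```python
import sys

literal = "abcdef"  # compile-time constant stored in the code object
built = "".join(chr(i) for i in range(ord("a"), ord("g")))  # built at run time

# Equal values, regardless of how each string was created.
print(literal == built)

# Identity depends on the implementation; do not rely on either outcome.
print(literal is built)

# sys.intern returns the canonical object for a given string value,
# so interning both sides always yields the same object.
canonical = sys.intern(built)
print(canonical is sys.intern(literal))
```

If identity comparisons are needed (for example, for fast dictionary lookups), passing every string through sys.intern explicitly is the only approach that does not depend on these implementation details.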

Bombasti
#3 · 2019-02-12 15:59

You think there are rules?

The only rule for interning is that the return value of intern (sys.intern in Python 3) is interned. Everything else is up to the whims of whoever decided some piece of code should or shouldn't do interning. For example, "left" gets interned by PyCode_New:

/* Intern selected string constants */
for (i = PyTuple_GET_SIZE(consts); --i >= 0; ) {
    PyObject *v = PyTuple_GetItem(consts, i);
    if (!all_name_chars(v))
        continue;
    PyUnicode_InternInPlace(&PyTuple_GET_ITEM(consts, i));
}

The "rule" here is that a string object in the co_consts of a Python code object gets interned if it consists purely of ASCII characters that are legal in a Python identifier. "left" gets interned, but "as,df" wouldn't be, and "1234" would be interned even though an identifier can't start with a digit. While identifiers can contain non-ASCII characters, such characters are still rejected by this check. Actual identifiers don't ever pass through this code; they get unconditionally interned a few lines up, ASCII or not. This code is subject to change, and there's plenty of other code that does interning or interning-like things.
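The check described above can be approximated in Python. This is only a sketch: would_be_interned_as_constant and constant_from_fresh_compile are hypothetical helpers written for illustration, not CPython APIs, and the identity result printed in the last column depends on the CPython version and build, so no expected output is given for it.

```python
import string

# Hypothetical helper: a Python approximation of the C-level
# all_name_chars() check. It is not part of any CPython API.
NAME_CHARS = frozenset(string.ascii_letters + string.digits + "_")


def would_be_interned_as_constant(s):
    """Return True if every character is an ASCII letter, digit, or underscore."""
    return all(ch in NAME_CHARS for ch in s)


def constant_from_fresh_compile(literal_source):
    """Compile a lone string literal and return the str found in co_consts."""
    code = compile(literal_source, "<example>", "eval")
    return next(c for c in code.co_consts if isinstance(c, str))


for text in ("left", "as,df", "1234", "caf\u00e9"):
    # Two independent compilations: they yield the same object only if
    # the constant was interned (or otherwise shared) by the interpreter.
    first = constant_from_fresh_compile(repr(text))
    second = constant_from_fresh_compile(repr(text))
    print(ascii(text), would_be_interned_as_constant(text), first is second)
```

Comparing the second and third columns on a given interpreter shows how closely that build follows the rule; any disagreement is exactly the kind of surprise this answer warns about.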

Asking us for the "rules" for string interning is like asking a meteorologist what the rules are for whether it rains on your wedding. We can tell you quite a lot about how it works, but it won't be much use to you, and you'll always get surprises.
