Same memory address for different strings in Cython

Posted 2019-05-31 16:20

Question:

I wrote a tree object in Cython that has many nodes, each containing a single Unicode character. I wanted to test whether the character gets interned if I use Py_UNICODE or str as the variable type. I'm trying to test this by creating multiple instances of the node class and getting the memory address of the character for each, but somehow I end up with the same memory address, even when the instances contain different characters. Here is my code:

from libc.stdint cimport uintptr_t

cdef class Node():
    cdef:
        public str character
        public unsigned int count
        public Node lo, eq, hi

    def __init__(self, str character):
        self.character = character

    def memory(self):
        return <uintptr_t>&self.character[0]

I am trying to compare the memory locations like so, from Python:

a = Node("a")
a2 = Node("a")
b = Node("b")
print(a.memory(), a2.memory(), b.memory())

But the memory addresses that get printed are all the same. What am I doing wrong?

Answer 1:

What your code does is not what you think it does.

self.character[0] doesn't return the address of (or a reference to) the first character, as it would for an array, for example. Instead it returns a Py_UCS4 value (i.e. an unsigned 32-bit integer), which is copied to a local, temporary variable on the stack.

In your function, <uintptr_t>&self.character[0] gets you the address of that local variable on the stack, which happens to be the same every time because the stack layout is always the same when memory is called.

To make it clearer, here is the difference from a char * c_string, where &c_string[0] gives you the address of the first character in c_string.

Compare:

%%cython
from libc.stdint cimport uintptr_t

cdef char *c_string = "name";
def get_addresses_from_chars():
    for i in range(4):
        print(<uintptr_t>&c_string[i])

cdef str py_string="name";
def get_addresses_from_pystr():
    for i in range(4):
        print(<uintptr_t>&py_string[i])

And now:

>>> get_addresses_from_chars() # works - a different address for each character
# ...7752
# ...7753
# ...7754
# ...7755
>>> get_addresses_from_pystr() # works differently - the same address.
# ...0672 
# ...0672
# ...0672
# ...0672

You can see it this way: c_string[...] is C-level (cdef) functionality, but py_string[...] is Python functionality and thus, by construction, cannot return an address.
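The same contrast can be approximated from plain Python with ctypes. This is only a rough sketch and not what Cython generates: the c_uint32 copies stand in for the temporary Py_UCS4 value (an assumption made for illustration), and the variable names are hypothetical.

```python
import ctypes

# C-style buffer: every element lives at its own address inside the array
c_string = ctypes.create_string_buffer(b"name")
char_addresses = [
    ctypes.addressof(ctypes.c_char.from_buffer(c_string, i))
    for i in range(4)
]

# Python str: indexing yields a value. Here each code point is copied into
# a fresh 32-bit integer, loosely mimicking the temporary Py_UCS4 variable.
py_string = "name"
copies = [ctypes.c_uint32(ord(py_string[i])) for i in range(4)]
copy_values = [c.value for c in copies]

print(char_addresses)  # four consecutive addresses inside the buffer
print(copy_values)     # [110, 97, 109, 101]
```

The addresses in char_addresses point into the buffer itself, while the address of any of the copies would only tell you where that copy happens to live, not where the string's data is.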

To influence the stack-layout, you could use a recursive function:

def memory(self, level):
    if level==0 :
        return <uintptr_t>&self.character[0]
    else:
        return self.memory(level-1)

Now calling it with a.memory(0), a.memory(1) and so on will give you different addresses, because depending on the recursion depth, the local variable whose address is returned sits in a different place on the stack. (This holds unless tail-call optimization kicks in. I don't believe it will, but you could disable optimization with -O0 just to be sure.)


To see whether Unicode objects are interned, it is enough to use id, which yields the address of the object (this is a CPython implementation detail), so you don't need Cython at all:

>>> id(a.character) == id(a2.character)
# True
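As a minimal sketch of what this identity check measures, the following pure-Python snippet builds two equal strings at run time and then interns them explicitly with sys.intern. The helper same_object and the variable names are hypothetical. Whether the two strings share an object before interning depends on the interpreter, so no result is claimed for that case.

```python
import sys

def same_object(x, y):
    """Return True if x and y are one and the same object (what comparing id() checks)."""
    return x is y

# Build equal strings at run time so the compiler cannot merge the constants
first = "".join(["tree", "_", "node"])
second = "".join(["tree", "_", "node"])

equal_values = first == second               # equal content
shared_before = same_object(first, second)   # implementation-dependent

# sys.intern returns the canonical object for a given string value
interned_first = sys.intern(first)
interned_second = sys.intern(second)
shared_after = same_object(interned_first, interned_second)

print(equal_values, shared_after)  # True True
```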

or in Cython, doing the same thing id does (a little bit faster):

%%cython
from libc.stdint cimport uintptr_t
from cpython cimport PyObject
...
    def memory(self):
        # cast from object to PyObject* so the address can be used
        return <uintptr_t>(<PyObject*>self.character)

You need to cast the object to PyObject * so that Cython lets you use its address.

And now:

 >>> ...
 >>> print(a.memory(), a2.memory(), b.memory())
 # ...5800 ...5800 ...5000
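To check from plain Python that id really is the object's address, one option is to turn the number back into an object with ctypes. This is a sketch that assumes CPython (other interpreters may behave differently), and object_at is a hypothetical helper name.

```python
import ctypes

def object_at(address):
    """Return the Python object located at the given address (CPython only)."""
    return ctypes.cast(address, ctypes.py_object).value

text = "".join(["no", "de"])   # a string created at run time
address = id(text)             # on CPython: the address of the object
recovered = object_at(address)

print(recovered is text)       # True on CPython
```

This is only safe while the object is still alive. Passing an address that no longer belongs to a live object can crash the interpreter.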

If you want to get the address of the first code point in the Unicode object (which is not the same as the address of the string object), you can use <Py_UNICODE*>self.character, which Cython will replace with a call to PyUnicode_AsUnicode, e.g.:

%%cython
...   
def memory(self):
    return <uintptr_t>(<Py_UNICODE*>self.character), id(self.character)

And now:

>>> ...
>>> print(a.memory(), a2.memory(), b.memory())
# (...768, ...800) (...768, ...800) (...144, ...000)

i.e. "a" is interned and has a different address than "b", and the code-point buffer has a different address than the object containing it (as one would expect).