I wrote a tree object in Cython that has many nodes, each containing a single Unicode character. I wanted to test whether the character gets interned if I use Py_UNICODE or str as the variable type. I'm trying to test this by creating multiple instances of the node class and getting the memory address of the character for each, but somehow I end up with the same memory address, even when the instances contain different characters. Here is my code:
from libc.stdint cimport uintptr_t

cdef class Node():
    cdef:
        public str character
        public unsigned int count
        public Node lo, eq, hi

    def __init__(self, str character):
        self.character = character

    def memory(self):
        return <uintptr_t>&self.character[0]
I am trying to compare the memory locations like so, from Python:
a = Node("a")
a2 = Node("a")
b = Node("b")
print(a.memory(), a2.memory(), b.memory())
But the memory addresses that get printed are all the same. What am I doing wrong?
Your code is not doing what you think it does. self.character[0] doesn't return the address of (or a reference to) the first character, as it would for a C array, but a Py_UCS4 value (i.e. an unsigned 32-bit integer), which is copied to a local, temporary variable on the stack.

In your function, <uintptr_t>&self.character[0] gets you the address of that local variable on the stack, which happens to be the same every time because the stack layout is always the same when memory() is called.
To make it clearer, compare this with a char *c_string, where &c_string[0] does give you the address of the first character in c_string:
%%cython
from libc.stdint cimport uintptr_t

cdef char *c_string = "name"
def get_addresses_from_chars():
    for i in range(4):
        print(<uintptr_t>&c_string[i])

cdef str py_string = "name"
def get_addresses_from_pystr():
    for i in range(4):
        print(<uintptr_t>&py_string[i])
And now:
>>> get_addresses_from_chars() # works - different addresses every time
# ...7752
# ...7753
# ...7754
# ...7755
>>> get_addresses_from_pystr() # works differently - the same address.
# ...0672
# ...0672
# ...0672
# ...0672
You can see it this way: c_string[...] is a C-level (cdef) operation, but py_string[...] is a Python-level operation and thus, by construction, cannot return an address.
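The same contrast can be sketched from plain Python with ctypes. This is only an illustration, not part of the Cython code above: the names buf, base and offsets are made up for the example. Elements of a C char buffer live at consecutive addresses, whereas indexing a Python str hands back a value (a one-character str object) rather than a pointer into the string's storage.

```python
import ctypes

# A mutable C char buffer holding b"name" plus the trailing NUL byte.
buf = ctypes.create_string_buffer(b"name")
base = ctypes.addressof(buf)

# View each element in place (no copy is made) and ask for its address.
element_addresses = [
    ctypes.addressof(ctypes.c_char.from_buffer(buf, i)) for i in range(4)
]

# Offsets of the elements relative to the start of the buffer.
offsets = [addr - base for addr in element_addresses]
print(offsets)  # [0, 1, 2, 3]

# Indexing a Python str yields a value (a one-character str object),
# not a view into the original string's storage.
py_string = "name"
first = py_string[0]
print(type(first).__name__, first)  # str n
```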
To influence the stack-layout, you could use a recursive function:
def memory(self, level):
    if level == 0:
        return <uintptr_t>&self.character[0]
    else:
        return self.memory(level - 1)
Now calling it with a.memory(0), a.memory(1) and so on will give you different addresses, because depending on the recursion depth (level), the local variable whose address is returned sits in a different place on the stack. This holds unless tail-call optimization kicks in; I don't believe it will, but you could disable optimization (-O0) just to be sure.
To see whether Unicode objects are interned, it is enough to use id, which yields the address of the object (a CPython implementation detail), so you don't need Cython at all:
>>> id(a.character) == id(a2.character)
# True
Or, in Cython, you can do the same thing id does (a little bit faster):
%%cython
from libc.stdint cimport uintptr_t
from cpython cimport PyObject
...
def memory(self):
    # cast from object to PyObject*, so the address can be used
    return <uintptr_t>(<PyObject*>self.character)
You need to cast the object to PyObject * so that Cython will let you take the address of the object.
And now:
>>> ...
>>> print(a.memory(), a2.memory(), b.memory())
# ...5800 ...5800 ...5000
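As a pure-Python cross-check of what "interned" means here, the following minimal sketch compares object identity before and after sys.intern. The helper same_object and the sample strings are made up for illustration. The strings are built at run time so that equal literals are not simply merged by the compiler; whether the two un-interned strings are distinct objects is a CPython implementation detail, so no output is claimed for that line.

```python
import sys

def same_object(x, y):
    """Return True if x and y are the very same object (identity, not equality)."""
    return x is y  # same as id(x) == id(y) while both objects are alive

# Build the strings at run time so equal literals are not merged at compile time.
s1 = "".join(["tree", "_", "node"])
s2 = "".join(["tree", "_", "node"])

print(s1 == s2)             # True: equal contents
print(same_object(s1, s2))  # implementation-dependent; typically two separate objects

# sys.intern returns the one shared copy for equal strings.
i1 = sys.intern(s1)
i2 = sys.intern(s2)
print(same_object(i1, i2))  # True
```

If two nodes report the same id for their character, they share one string object; different ids for equal characters would mean the strings were not shared.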
If you want the address of the first code point in the Unicode object (which is not the same as the address of the string object), you can use <Py_UNICODE *>self.character, which Cython will replace with a call to PyUnicode_AsUnicode, e.g.:
%%cython
...
def memory(self):
    return <uintptr_t>(<Py_UNICODE*>self.character), id(self.character)
And now:
>>> ...
>>> print(a.memory(), a2.memory(), b.memory())
# (...768, ...800) (...768, ...800) (...144, ...000)
i.e. "a" is interned and has a different address than "b", and the code-point buffer has a different address than the object containing it (as one would expect).