Problem
Void pointer to a cdef class is pointing to the same memory address without forcing the the python reference counter.
Description
I have a simple class that I want to store in a cpp vector by casting it to a void pointer. However, after printing the memory addresses the pointer is pointing to, it repeats after the second iteration, unless I force the reference counter to be increased by adding the new object to a list. Can somebody why the memory loops back without the reference counter enforcement?
# distutils: language = c++
# distutils: extra_compile_args = -std=c++11
from libcpp.vector cimport vector
from libc.stdio cimport printf
cdef class Temp:
cdef int a
def __init__(self, a):
self.a = a
def f():
cdef vector[void *] vec
cdef int i, n = 3
cdef Temp tmp
cdef list ids = []
# cdef list classes = [] # force reference counter?
for i in range(n):
tmp = Temp(1)
# classes.append(tmp)
vec.push_back(<void *> tmp)
printf('%p ', <void *> tmp)
ids.append(id(tmp))
print(ids)
f()
Which outputs:
[140137023037824, 140137023037848, 140137023037824]
However if I force the reference counter by adding it to the classes list:
[140663518040448, 140663518040472, 140663518040496]
This answer became quite long, so there is a quick overview of the content:
- Explanation of the observed behavior
- Naive approach to avoid the problem
- A more systematic and c++-typical solution
- Illustrating problem for multi-threaded code in "nogil"-mode
- Extending c++-typical solution for nogil-mode
Explanation of the observed behavior
The deal with Cython: as long as your variables are of type object
or inherit from it (in your case cdef Temp
) cython manages the reference counting for you. As soon as you cast it to PyObject *
or any other pointer - the reference counting is your responsibility.
Obviously, the only reference to the created object is the variable tmp
, as soon as you rebind it to the newly created Temp
-object, the reference-counter of the old object becomes 0
and it is destroyed - the pointers in the vector becomes dangling. However, the same memory can be reused (it is quite probably) and thus you see always the same reused address.
Naive solution
How could you do the reference counting? For example (I use rather PyObject *
than void *
):
...
from cpython cimport PyObject,Py_XINCREF, Py_XDECREF
...
def f():
cdef vector[PyObject *] vec
cdef int i, n = 3
cdef Temp tmp
cdef PyObject *tmp_ptr
cdef list ids = []
for i in range(n):
tmp = Temp(1)
tmp_ptr = <PyObject *> tmp
Py_XINCREF(tmp_ptr) # ensure it is not destroyed
vec.push_back(tmp_ptr)
printf('%p ', tmp_ptr)
ids.append(id(tmp))
#free memory:
for i in range(n):
Py_XDECREF(vec.at(i))
print(ids)
Now all objects stay alive and "die" only after Py_XDECREF
is called explicitly.
C++-typical solution
The above is not a very typical c++-way of doing things, I would rather introduce a wrapper which manages the reference counting automatically (not unlike std::shared_ptr
):
...
cdef extern from *:
"""
#include <Python.h>
class PyObjectHolder{
public:
PyObject *ptr;
PyObjectHolder():ptr(nullptr){}
PyObjectHolder(PyObject *o):ptr(o){
Py_XINCREF(ptr);
}
//rule of 3
~PyObjectHolder(){
Py_XDECREF(ptr);
}
PyObjectHolder(const PyObjectHolder &h):
PyObjectHolder(h.ptr){}
PyObjectHolder& operator=(const PyObjectHolder &other){
Py_XDECREF(ptr);
ptr=other.ptr;
Py_XINCREF(ptr);
return *this;
}
};
"""
cdef cppclass PyObjectHolder:
PyObjectHolder(PyObject *o)
...
def f():
cdef vector[PyObjectHolder] vec
cdef int i, n = 3
cdef Temp tmp
cdef PyObject *tmp_ptr
cdef list ids = []
for i in range(n):
tmp = Temp(1)
vec.push_back(PyObjectHolder(<PyObject *> tmp)) # vector::emplace_back is missing in Cython-wrappers
printf('%p ', <PyObject *> tmp)
ids.append(id(tmp))
print(ids)
# PyObjectHolder automatically decreases ref-counter as soon
# vec is out of scope, no need to take additional care
Noteworthy things:
PyObjectHolder
increases ref-counter as soon as it take possession of a PyObject
-pointer and decreases it as soon as it releases the pointer.
- Rule of three means we also have to take care in copy-constructor and assignment operator
- I've omitted move-stuff for c++11, but you need to take care of it as well.
Problems with nogil-mode
There is however one very important thing: You shouldn't release GIL with the above implementation (i.e. import it as PyObjectHolder(PyObject *o) nogil
but there are also problems when C++ copies the vectors and similar) - because otherwise Py_XINCREF
and Py_XDECREF
might not work correctly.
To illustrate that let's take a look at the following code, which releases gil and does some stupid calculations in parallel (the whole magic cell is in listings at the end of the answer):
%%cython --cplus -c=/openmp
...
# importing as nogil - A BAD THING
cdef cppclass PyObjectHolder:
PyObjectHolder(PyObject *o) nogil
# some functionality using a lot of incref/decref
cdef int create_vectors(PyObject *o) nogil:
cdef vector[PyObjectHolder] vec
cdef int i
for i in range(100):
vec.push_back(PyObjectHolder(o))
return vec.size()
# using PyObjectHolder without gil - A BAD THING
def run(object o):
cdef PyObject *ptr=<PyObject*>o;
cdef int i
for i in prange(10, nogil=True):
create_vectors(ptr)
And now:
import sys
a=[1000]*1000
print("Starts with", sys.getrefcount(a[0]))
# prints: Starts with 1002
run(a[0])
print("Ends with", sys.getrefcount(a[0]))
#prints: Ends with 1177
We got lucky, the program didn't crash (but could!). However due to race conditions, we ended up with memory leak - a[0]
has reference count of 1177
but there are only 1000 references(+2 inside of sys.getrefcount
) references alive, so this object will never be destroyed.
Making PyObjectHolder
thread-safe
So what to do? The simplest solution is to use a mutex to protect the accesses to ref-counter(i.e. every time Py_XINCREF
or Py_XDECREF
is called). The downside of this approach is that it might slowdown the single core code considerable (see for example this old article about an older try to replace GIL by mutex-similar approach).
Here is a prototype:
%%cython --cplus -c=/openmp
...
cdef extern from *:
"""
#include <Python.h>
#include <mutex>
std::mutex ref_mutex;
class PyObjectHolder{
public:
PyObject *ptr;
PyObjectHolder():ptr(nullptr){}
PyObjectHolder(PyObject *o):ptr(o){
std::lock_guard<std::mutex> guard(ref_mutex);
Py_XINCREF(ptr);
}
//rule of 3
~PyObjectHolder(){
std::lock_guard<std::mutex> guard(ref_mutex);
Py_XDECREF(ptr);
}
PyObjectHolder(const PyObjectHolder &h):
PyObjectHolder(h.ptr){}
PyObjectHolder& operator=(const PyObjectHolder &other){
{
std::lock_guard<std::mutex> guard(ref_mutex);
Py_XDECREF(ptr);
ptr=other.ptr;
Py_XINCREF(ptr);
}
return *this;
}
};
"""
cdef cppclass PyObjectHolder:
PyObjectHolder(PyObject *o) nogil
...
And now, running the code snipped from above yields the expected/right behavior:
import sys
a=[1000]*1000
print("Starts with", sys.getrefcount(a[0]))
# prints: Starts with 1002
run(a[0])
print("Ends with", sys.getrefcount(a[0]))
#prints: Ends with 1002
However, as @DavidW has pointed out, using std::mutex
works only for openmp-threads, but not threads created by the Python-interpreter.
Here is an example for which the mutex-solution will fail.
First, wrapping nogil-function as def
-function:
%%cython --cplus -c=/openmp
...
def single_create_vectors(object o):
cdef PyObject *ptr=<PyObject *>o
with nogil:
create_vectors(ptr)
And now using threading
-module to create
import sys
a=[1000]*10000 # some safety, so chances are high python will not crash
print(sys.getrefcount(a[0]))
#output: 10002
from threading import Thread
threads = []
for i in range(100):
t = Thread(target=single_create_vectors, args=(a[0],))
threads.append(t)
t.start()
for t in threads:
t.join()
print(sys.getrefcount(a[0]))
#output: 10015 but should be 10002!
An alternative to using std::mutex
would be to use the Python-machinery, i.e. PyGILState_STATE
, which would lead to code similar to
...
PyObjectHolderPy(PyObject *o):ptr(o){
PyGILState_STATE gstate;
gstate = PyGILState_Ensure();
Py_XINCREF(ptr);
PyGILState_Release(gstate);
}
...
This would also work for the threading
-example above. However, PyGILState_Ensure
has just too much overhead - for the example above it would be about 100 times slower than the mutex-solution. One more lightweight solution with Python-machinery would mean also much more hassle.
Listing complete thread-unsafe version:
%%cython --cplus -c=/openmp
from libcpp.vector cimport vector
from libc.stdio cimport printf
from cpython cimport PyObject
from cython.parallel import prange
import sys
cdef extern from *:
"""
#include <Python.h>
class PyObjectHolder{
public:
PyObject *ptr;
PyObjectHolder():ptr(nullptr){}
PyObjectHolder(PyObject *o):ptr(o){
Py_XINCREF(ptr);
}
//rule of 3
~PyObjectHolder(){
Py_XDECREF(ptr);
}
PyObjectHolder(const PyObjectHolder &h):
PyObjectHolder(h.ptr){}
PyObjectHolder& operator=(const PyObjectHolder &other){
{
Py_XDECREF(ptr);
ptr=other.ptr;
Py_XINCREF(ptr);
}
return *this;
}
};
"""
cdef cppclass PyObjectHolder:
PyObjectHolder(PyObject *o) nogil
cdef int create_vectors(PyObject *o) nogil:
cdef vector[PyObjectHolder] vec
cdef int i
for i in range(100):
vec.push_back(PyObjectHolder(o))
return vec.size()
def run(object o):
cdef PyObject *ptr=<PyObject*>o;
cdef int i
for i in prange(10, nogil=True):
create_vectors(ptr)
The fact that your objects end up at the same address is coincidence. Your issue is that your python objects get destroyed when the last python reference to them goes away. If you want to keep python objects alive, you will need to hold a reference to them somewhere.
In your case, since tmp
is the only reference to the Temp
object you create within your loop, every time you re-assign tmp
, the object it was previously referencing gets destroyed. That leaves blank space in memory that's conveniently exactly the right size to hold the Temp
object that gets created in the next iteration of the loop, leading to the alternating pattern you see in your pointers.