Fixed-size sequence of bytestrings in Cython

Posted 2019-07-19 05:09

Question:

I am new to Cython and have very little experience with C so bear with me.

I want to store a fixed-size sequence of immutable byte objects. The object would look like:

obj = (b'abc', b'1234', b'^&$#%')

The elements in the tuple are immutable, but their length is arbitrary.

What I tried was something along the lines of:

cdef char[3] *obj
cdef char* a, b, c
a = b'abc'
b = b'1234'
c = b'^&$#%'
obj = (a, b, c)

But I get:

Storing unsafe C derivative of temporary Python reference

Can someone please point me in the right direction?

Bonus question: how do I type an arbitrarily long sequence of those 3-tuples?

Thanks!

Answer 1:

You are definitely close! There appear to be two issues.

First, the declaration of obj needs to say that we want an array of char* with a fixed size of 3. In Cython (as in C), that means writing the type, then the variable name, and only then the size of the array: cdef char* obj[3]. This gives you the desired fixed-size array of char* (on the stack when declared inside a function).

Second, when you declare char* a, b, c, only a is a char*, while b and c are just char! Cython makes this clear at compile time, where it outputs the following warning for me:

Non-trivial type declarators in shared declaration (e.g. mix of pointers and values). Each pointer declaration should be on its own line.

So you should do this instead:

cdef char* obj[3]
cdef char* a
cdef char* b
cdef char* c
a = b'abc'
b = b'1234'
c = b'^&$#%'
obj = [a, b, c]
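
Both declaration rules above come straight from C, so they can be checked outside Cython. Here is a minimal C++ sketch (not part of the original code) that lets the compiler confirm the types. The helper slot_count is a hypothetical name used only for illustration, and const char* is used because C++ does not allow a string literal to initialize a plain char*.

```cpp
#include <type_traits>

// Same declarator rule as in the Cython warning: the '*' binds to the
// name 'a' only, so 'b' and 'c' are plain chars.
char *a, b, c;
static_assert(std::is_same<decltype(a), char*>::value, "a is a pointer");
static_assert(std::is_same<decltype(b), char>::value, "b is a single char");
static_assert(std::is_same<decltype(c), char>::value, "c is a single char");

// Type, then name, then size: an array of three pointers to char.
// (const is required here because the initializers are string literals.)
const char* obj[3] = {"abc", "1234", "^&$#%"};
static_assert(std::is_same<decltype(obj), const char*[3]>::value,
              "obj is an array of 3 pointers");

// Hypothetical helper: number of slots in the fixed-size array.
constexpr int slot_count() { return sizeof(obj) / sizeof(obj[0]); }
```

If any of the static_assert lines were wrong, the file would fail to compile, which is a cheap way to convince yourself of what a declaration really means.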

As a side note, you can avoid repeating cdef by grouping the declarations in a cdef block:

cdef:
    char* obj[3]
    char* a
    char* b
    char* c
a = b'abc'
b = b'1234'
c = b'^&$#%'
obj = [a, b, c]

Bonus:

Given your level of experience with C and pointers, I will show the more beginner-friendly approach using C++ data structures. The C++ standard library has simple containers such as vector, which is roughly the equivalent of a Python list. The C alternative would be a pointer to a heap-allocated block of structs, acting as an "array" of triplets. You would then be responsible for managing that memory yourself with functions like malloc, realloc, and free.
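
To give a feel for the bookkeeping the manual route involves, here is a rough sketch of a hand-managed, growable array of triplets. It is written as C-style C++ and is only an illustration: the names Triplet, TripletArray, triplets_init, triplets_push and triplets_free are all made up for this example, and the rows only borrow their strings (nothing is copied).

```cpp
#include <cstdlib>

// One row: three borrowed pointers to NUL-terminated strings.
struct Triplet {
    const char* data[3];
};

// Hypothetical growable array of rows, managed by hand.
struct TripletArray {
    Triplet* items;
    std::size_t size;      // rows currently stored
    std::size_t capacity;  // rows the buffer can hold
};

void triplets_init(TripletArray* arr) {
    arr->items = nullptr;
    arr->size = 0;
    arr->capacity = 0;
}

// Appends a row, doubling the buffer when it is full.
// Returns false if realloc fails; the old buffer stays valid in that case.
bool triplets_push(TripletArray* arr, Triplet row) {
    if (arr->size == arr->capacity) {
        std::size_t new_cap = arr->capacity ? arr->capacity * 2 : 4;
        void* grown = std::realloc(arr->items, new_cap * sizeof(Triplet));
        if (grown == nullptr) return false;
        arr->items = static_cast<Triplet*>(grown);
        arr->capacity = new_cap;
    }
    arr->items[arr->size++] = row;
    return true;
}

// Releases the buffer and resets the bookkeeping fields.
void triplets_free(TripletArray* arr) {
    std::free(arr->items);
    triplets_init(arr);
}
```

Every push has to check capacity, and forgetting the final triplets_free leaks the buffer. That is exactly the work vector does for you.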

Here is something to get you started; I strongly suggest you follow some online C or C++ tutorials on your own and adapt them to Cython, which should be fairly straightforward after some practice. I am showing both a test.pyx file and the setup.py file that compiles it with C++ support.

test.pyx

from libcpp.vector cimport vector

"""
While it can be discouraged to mix raw char* C "strings" with C++ data types,
the code here is pretty simple.
Fixed arrays cannot be used directly for vector type, so we use a struct.
Ideally, you would use an std::array of std::string, but std::array does not 
exist in cython's libcpp. It should be easy to add support for this with an
extern statement though (out of the scope of this mini-tutorial).
"""
ctypedef struct triplet:
    char* data[3]

cdef:
    vector[triplet] obj
    triplet abc
    triplet xyz

abc.data = [b"abc", b"1234", b"^&$#%"]#explicit bytes literals, as in the question
xyz.data = [b"xyz", b"5678", b"%#$&^"]
obj.push_back(abc)#pretty much like python's list.append
obj.push_back(xyz)

"""
Loops through the vector.
Cython can automagically print structs so long as their members can be 
converted trivially to python types.
"""
for o in obj:
    print(o)
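
For reference, the "ideal" layout mentioned in the comment above (a vector of std::array of std::string) looks roughly like this in plain C++. This is only a sketch of the data structure, not something wired into Cython; Row and make_rows are hypothetical names chosen for the example.

```cpp
#include <array>
#include <string>
#include <vector>

// Each row owns its three byte strings; std::string can hold arbitrary
// bytes and frees its own memory, unlike a raw char*.
using Row = std::array<std::string, 3>;

// Hypothetical helper that builds the same two rows as test.pyx.
std::vector<Row> make_rows() {
    std::vector<Row> rows;
    rows.push_back(Row{{"abc", "1234", "^&$#%"}});
    rows.push_back(Row{{"xyz", "5678", "%#$&^"}});
    return rows;
}
```

Because every row owns copies of its strings, nothing here depends on a Python bytes object staying alive, which is the main advantage over storing char* pointers.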

setup.py

from setuptools import setup, Extension#distutils was removed in Python 3.12
from Cython.Build import cythonize

def create_extension(ext_name):
    global language, libs, args, link_args
    path_parts = ext_name.split(".")
    path = "./{0}.pyx".format("/".join(path_parts))
    ext = Extension(ext_name, sources=[path], libraries=libs, language=language,
            extra_compile_args=args, extra_link_args=link_args)
    return ext

if __name__ == "__main__":
    libs = []#no external c libraries in this case
    language = "c++"#chooses c++ rather than c since STL is used
    args = ["-w", "-O3", "-ffast-math", "-march=native", "-fopenmp"]#assumes gcc is the compiler
    link_args = ["-fopenmp"]#only needed for parallel (OpenMP) code; not required by this example
    annotate = True#autogenerates .html files per .pyx
    directives = {#saves typing @cython decorators and applies them globally
        "boundscheck": False,
        "wraparound": False,
        "initializedcheck": False,
        "cdivision": True,
        "nonecheck": False,
    }

    ext_names = [
        "test",
    ]

    extensions = [create_extension(ext_name) for ext_name in ext_names]
    setup(ext_modules = cythonize(
            extensions, 
            annotate=annotate, 
            compiler_directives=directives,
        )
    )