I have embedded a Python interpreter in a C program. Suppose the C program reads some bytes from a file into a char array and learns (somehow) that the bytes represent text with a certain encoding (e.g., ISO 8859-1, Windows-1252, or UTF-8). How do I decode the contents of this char array into a Python string?
The Python string should in general be of type unicode; for instance, a 0x93 in Windows-1252 encoded input becomes u'\u201c'.
I have attempted to use PyString_Decode, but it always fails when there are non-ASCII characters in the string. Here is an example that fails:
#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    char c_string[] = { (char)0x93, 0 };
    PyObject *py_string;

    Py_Initialize();
    py_string = PyString_Decode(c_string, 1, "windows_1252", "replace");
    if (!py_string) {
        PyErr_Print();
        return 1;
    }
    return 0;
}
The error message is UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128), which indicates that the ascii encoding is used even though we specify windows_1252 in the call to PyString_Decode.
The following code works around the problem by using PyString_FromString to create a Python string of the undecoded bytes, then calling its decode method:
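#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    char c_string[] = { (char)0x93, 0 };
    PyObject *raw, *decoded;

    Py_Initialize();
    /* Wrap the undecoded bytes in a Python str object. */
    raw = PyString_FromString(c_string);
    if (!raw) {
        PyErr_Print();
        return 1;
    }
    printf("Undecoded: ");
    PyObject_Print(raw, stdout, 0);
    printf("\n");
    /* Equivalent to raw.decode("windows_1252") in Python. */
    decoded = PyObject_CallMethod(raw, "decode", "s", "windows_1252");
    Py_DECREF(raw);
    if (!decoded) {
        PyErr_Print();
        return 1;
    }
    printf("Decoded: ");
    PyObject_Print(decoded, stdout, 0);
    printf("\n");
    Py_DECREF(decoded);
    return 0;
}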

Try calling PyErr_Print() in the "if (!py_string)" clause. Perhaps the Python exception will give you some more information.

You don't want to decode the string into a Unicode representation, you just want to treat it as an array of bytes, right?
Just use PyString_FromString. That's all. Now you have a Python str object. See docs here: https://docs.python.org/2/c-api/string.html
I'm a little confused about whether you want a "str" or a "unicode" object. They are quite different if you have non-ASCII characters. If you want to decode a C string and you know exactly what character set it's in, then yes, PyString_Decode is a good place to start.

PyString_Decode does this:
In other words, it does basically what you're doing in your second example: it converts the bytes to a string, then decodes the string. The problem arises because it goes through PyString_AsDecodedString rather than PyString_AsDecodedObject. PyString_AsDecodedString calls PyString_AsDecodedObject, but then tries to convert the resulting unicode object into a str object with the default encoding (for you, that looks like ASCII). That's where it fails.
I believe you'll need to make two calls, but you can use PyString_AsDecodedObject rather than calling the Python "decode" method. Something like:
I'm not entirely sure what the reasoning behind PyString_Decode working this way is. A very old thread on python-dev seems to indicate that it has something to do with chaining the output, but since the Python methods don't do the same, I'm not sure if that's still relevant.