TypeError: expected bytes, str found in custom pyt

Posted 2019-07-29 03:15

Question:

I am using a new bioinformatics tool called Giggle and have installed its Python wrapper on my system. Although the scenario is quite specific, I think the underlying problem is general. This function:

index = Giggle.create("index", "HMEC_hg19_BroadHMM_ALL.bed")

should create an index from one or more .bed files (in this case, one). The .bed file looks like this:

chr1    10000   10600   15_Repetitive/CNV   0   .   10000   10600   245,245,245
chr1    10600   11137   13_Heterochrom/lo   0   .   10600   11137   245,245,245
chr1    11137   11737   8_Insulator 0   .   11137   11737   10,190,254
chr1    11737   11937   11_Weak_Txn 0   .   11737   11937   153,255,102
chr1    11937   12137   7_Weak_Enhancer 0   .   11937   12137   255,252,4
chr1    12137   14537   11_Weak_Txn 0   .   12137   14537   153,255,102
chr1    14537   20337   10_Txn_Elongation   0   .   14537   20337   0,176,80

It is basically a large tab-delimited file of genomic intervals and the chromosome each one belongs to. When I run the command above, I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "giggle/giggle.pyx", line 25, in giggle.giggle.Giggle.create
TypeError: expected bytes, str found

I have no clue why this is happening. I tried converting the files to other encodings, but nothing worked. The code that the error refers to is:

def create(self, char *path, char *glob):
    giggle_bulk_insert(to_bytes(glob), to_bytes(path), 1)
    return Giggle(path)

I am using Python 3.6 on the Windows Subsystem for Linux on Windows 10.

Answer 1:

The problem is that in Python 3, strings are Unicode strings, not byte strings as they were in Python 2. If you install giggle and run your code under Python 2, everything works. Under Python 3, you can encode the arguments yourself:

index = Giggle.create("index".encode('utf-8'), "HMEC_hg19_BroadHMM_ALL.bed".encode('utf-8'))

or alternatively

index = Giggle.create(b"index", b"HMEC_hg19_BroadHMM_ALL.bed")

to pass explicit byte strings. This worked for me, up to the point where giggle complained that the .bed file was incorrectly formatted (I probably messed up the format when copying it).
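As a minimal illustration of the two forms above (plain Python 3, no Giggle required), the sketch below checks that `.encode('utf-8')` and a `b"..."` literal yield the same bytes object for ASCII text, and that a str is a different type from bytes. The variable names are only for illustration.

```python
# Python 3: str holds Unicode text, bytes holds raw bytes; they are distinct types.
text = "index"
encoded = text.encode("utf-8")  # explicit conversion from str to bytes
literal = b"index"              # bytes literal

print(type(text).__name__)     # str
print(type(encoded).__name__)  # bytes
print(encoded == literal)      # True: identical bytes for ASCII text
print(text == literal)         # False: a str never compares equal to bytes
```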

Update: Another issue comes up when calling it as described above:

File type not supported 'HMEC_hg19_BroadHMM_ALL.bed'

This happens because the underlying giggle library only accepts .bed.gz files, as can be seen in python-giggle/lib/giggle/src/file_read.c:

if ( (strlen(i->file_name) > 7) &&
    strcmp(".bed.gz", file_name + strlen(i->file_name) - 7) == 0) {
    i->type = BED;
}

So I assume the README at the python-giggle site is incorrect in claiming that you can call it with .bed files.

I tested it with one of the files provided in python-giggle\lib\giggle\test\data, and it ran without an error.
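To catch this before calling into the library, a small pre-check can mirror the C condition quoted above. This is only a sketch: `looks_like_bed_gz` is a hypothetical helper, not part of python-giggle, and it reproduces only the one check shown here, so it says nothing about any other file types the library may accept.

```python
def looks_like_bed_gz(file_name: str) -> bool:
    """Mirror the quoted C check: name longer than 7 characters and ending in '.bed.gz'."""
    suffix = ".bed.gz"
    return len(file_name) > len(suffix) and file_name.endswith(suffix)


print(looks_like_bed_gz("HMEC_hg19_BroadHMM_ALL.bed"))     # False
print(looks_like_bed_gz("HMEC_hg19_BroadHMM_ALL.bed.gz"))  # True
```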



Answer 2:

The create() method expects byte strings:

create(self, char *path, char *glob):

When converting an argument to a char array automatically, Cython accepts only bytes objects in Python 3 (str in Python 2).

Either pass in bytes objects when you call the method (encoding your str objects first), or alter the method signature to accept str (Unicode) strings. See "Accepting strings from Python code" in the Cython tutorial.
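If you would rather not change the Cython source, the first option can be wrapped on the Python side. The sketch below is an assumption-laden illustration: `to_bytes_arg`, `create_with_str`, and `fake_create` are hypothetical names, and `fake_create` is only a stand-in for the compiled `Giggle.create` so that the example runs without Giggle installed. It uses `os.fsencode`, which encodes str with the filesystem encoding and returns bytes unchanged.

```python
import os


def to_bytes_arg(value):
    """Return value as bytes; str is encoded with the filesystem encoding."""
    return os.fsencode(value)


def create_with_str(create_fn, path, glob):
    """Hypothetical helper: encode both arguments, then call a char*-based function."""
    return create_fn(to_bytes_arg(path), to_bytes_arg(glob))


def fake_create(path, glob):
    """Stand-in for the compiled function: rejects str the way the traceback shows."""
    if not isinstance(path, bytes) or not isinstance(glob, bytes):
        raise TypeError("expected bytes, str found")
    return (path, glob)


result = create_with_str(fake_create, "index", "HMEC_hg19_BroadHMM_ALL.bed")
print(result)
```

With the real wrapper, you would pass `Giggle.create` in place of `fake_create`. Whether that call then succeeds still depends on the file-type check in the underlying library.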



Answer 3:

Encoding your string as UTF-8 will solve your problem:

yourstr.encode('utf-8')