how to convert a python/cython unicode string to an array

Posted 2019-04-01 01:02

Question:

Possible Duplicate:
How to correct bugs in this Damerau-Levenshtein implementation?

I have the following Cython code (adapted from the bpbio project) that does Damerau-Levenshtein edit-distance calculation:

#---------------------------------------------------------------------------
cdef extern from "stdlib.h":
  ctypedef unsigned int size_t
  void *malloc(size_t size)
  void *calloc(size_t n, size_t size)
  void free(void *ptr)

#---------------------------------------------------------------------------
cdef extern from "string.h":
  size_t strlen(char *s)
  int strcmp(char *a, char *b)
  void *memcpy(void *dest, void *src, size_t n)

#---------------------------------------------------------------------------
cdef extern from "Python.h":
  object PyTuple_GET_ITEM(object, int)
  void Py_INCREF(object)

#---------------------------------------------------------------------------
cdef inline size_t imin(int a, int b, int c):
  if a < b:
    if c < a:
      return c
    return a
  if c < b:
    return c
  return b

#---------------------------------------------------------------------------
cpdef int editdistance( char *a, char *b ):
  """Given two byte strings ``a`` and ``b``, return their absolute Damerau-
  Levenshtein distance. Each deletion, insertion, substitution, and
  transposition is counted as one difference, so the edit distance between
  ``abc`` and ``ab``, ``abcx``, ``abx``, ``acb``, respectively, is ``1``."""

  #.........................................................................
  if strcmp( a, b ) == 0: return 0
  #.........................................................................
  cdef int    alen    = strlen( a )
  cdef int    blen    = strlen( b )
  cdef int    R
  cdef char   *ctmp
  cdef size_t i
  cdef size_t j
  cdef size_t achr
  cdef size_t bchr
  #.........................................................................
  if alen > blen:
    ctmp = a;
    a = b;
    b = ctmp;
    alen, blen = blen, alen
  #.........................................................................
  cdef char   *m1     = <char *>calloc(   blen + 2,    sizeof( char ) )
  cdef char   *m2     = <char *>calloc(   blen + 2,    sizeof( char ) )
  cdef char   *m3     = <char *>malloc( ( blen + 2 ) * sizeof( char ) )
  #.........................................................................
  for i from 0 <= i <= blen:
    m2[ i ] = i
  #.........................................................................
  for i from 1 <= i <= alen:
    m1[ 0 ] =    i             # d[i][0] is i deletions
    achr    = a[ i - 1 ]
    for j from 1 <= j <= blen:
      bchr = b[ j - 1 ]
      if achr == bchr:
        m1[ j ] = m2[ j - 1 ]
      else:
        m1[ j ] = 1 + imin( m1[ j - 1 ], m2[ j - 1 ], m2[ j ] )
      # transposition: d[i][j] may improve to d[i-2][j-2] + 1
      if i != 1 and j != 1 and achr == b[ j - 2 ] and bchr == a[ i - 2 ]:
        if m3[ j - 2 ] + 1 < m1[ j ]:
          m1[ j ] = m3[ j - 2 ] + 1
    #.......................................................................
    # save the previous row d[i-1] into m3 before its buffer is recycled;
    # memcpy rather than strcpy, since a 0 distance would stop strcpy early
    memcpy( m3, m2, ( blen + 1 ) * sizeof( char ) )
    m1, m2 = m2, m1
  #.........................................................................
  R = <int>m2[ blen ]
  #.........................................................................
  # cleanup:
  free( m3 )
  free( m1 )
  free( m2 )
  #.........................................................................
  return R

The code runs fine and fast (300,000 to 400,000 comparisons per second on my PC).
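for reference, a rough timing harness along these lines could reproduce such numbers (the module name editdist is made up here; use whatever name your .pyx compiles to):

import timeit

n = 300000
secs = timeit.timeit(
    "editdistance( b'pleasure', b'pressure' )",
    setup="from editdist import editdistance",
    number=n )
print( n / secs, 'comparisons per second' )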

the challenge is to make this code work with unicode strings as well. i am running Python 3.1 and retrieving texts from a database that are then matched against a query text.

encoding these strings to bytes before passing them to the Cython function for comparison would not be a good idea, since performance would suffer considerably (tested) and results would likely be wrong for any text containing characters outside of 7-bit US-ASCII.

the (very terse) Cython manual does mention unicode strings, but is hardly helpful for the problem at hand.

as i see it, a unicode string can be conceived of as an array of integers, each representing a single codepoint, and the code above is basically operating on arrays of chars already, so my guess is that i should (1) extend it to handle C arrays of integers; (2) add code to convert a python unicode string to a C array; (3) profit!

(Note: there are two potential issues with this approach: one is handling unicode surrogate characters, but i guess i know what to do with those. the other problem is that unicode codepoints do not really map 1:1 to the concept of 'characters'. i am well aware of that, but i consider it outside the scope of this question. please assume that one unicode codepoint is one unit of comparison.)
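for illustration, this is what the surrogate issue looks like on a narrow (UCS2) build; a wide (UCS4) build would report a length of 1 and a single code point:

>>> s = '\U0001d11e'                # MUSICAL SYMBOL G CLEF, U+1D11E
>>> len( s )                        # 2 on a narrow build, 1 on a wide one
2
>>> [ hex( ord( c ) ) for c in s ]  # the two UTF-16 surrogate code units
['0xd834', '0xdd1e']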

so i am asking for suggestions on how to

  • write a fast Cython function that accepts a python unicode string and returns a C array of Cython unsigned ints (4 bytes);

  • modify the code shown to handle those arrays and do the correct memory allocations / deallocations (this is pretty foreign stuff to me); a tentative sketch of both pieces follows below.
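a minimal sketch of what i have in mind, assuming the PyUnicode_AsUnicode / PyUnicode_GetSize route suggested in the answers below. unicode_to_uints, editdistance_unicode and editdistance_uints are names made up for illustration, error checking is omitted, and on a narrow build surrogate pairs would be copied through as two code units:

#---------------------------------------------------------------------------
cdef extern from "Python.h":
  Py_UNICODE *PyUnicode_AsUnicode( object o ) except NULL
  Py_ssize_t  PyUnicode_GetSize( object o ) except -1

#---------------------------------------------------------------------------
# copy the code units of a python unicode string into a malloc'ed array of
# unsigned ints; the caller owns the array and must free() it:
cdef unsigned int *unicode_to_uints( object text, Py_ssize_t *length_out ):
  cdef Py_ssize_t    n   = PyUnicode_GetSize( text )
  cdef Py_UNICODE    *u  = PyUnicode_AsUnicode( text )
  cdef unsigned int  *r  = <unsigned int *>malloc( n * sizeof( unsigned int ) )
  cdef Py_ssize_t    i
  for i from 0 <= i < n:
    r[ i ] = <unsigned int>u[ i ]
  length_out[ 0 ] = n
  return r

#---------------------------------------------------------------------------
# wrapper callable from python; editdistance_uints is hypothetical: it would
# be the body of editdistance() above with `char *` replaced by
# `unsigned int *` (for the m1/m2/m3 rows too) and strlen/strcmp replaced by
# the explicit lengths:
cpdef int editdistance_unicode( object a, object b ):
  cdef Py_ssize_t    alen, blen
  cdef unsigned int  *abuf = unicode_to_uints( a, &alen )
  cdef unsigned int  *bbuf = unicode_to_uints( b, &blen )
  cdef int R = editdistance_uints( abuf, alen, bbuf, blen )
  free( abuf )
  free( bbuf )
  return R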

Edit: John Machin has pointed out that the curious char *m1 etc. typecasts are probably done for speed and/or memory optimization; these variables are still treated as arrays of numbers. i realize that the code does nothing to prevent possible overflow with long strings; erroneous results may occur when an array element exceeds 127 or 255 (depending on the C compiler used), which is sort of surprising for code coming from a bioinformatics project.

that said, i am only interested in precise results for largely identical strings of less than, say, a hundred characters or so. results below 60% sameness could for my purposes be safely reported as 'completely different' (by returning the length of the longer text), so i guess it will be best to leave the char *m1 casts in place, but add some code to check against overflow and to abort early in case of rampant dissimilarity.
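a minimal sketch of that check, to run at the end of each pass of the outer i-loop once row m1 is filled (best and cutoff would be declared with the other cdefs; the cutoff value is arbitrary but should stay well below 127 so the char cells cannot overflow first):

  best = m1[ 0 ]
  for j from 1 <= j <= blen:
    if m1[ j ] < best:
      best = m1[ j ]
  # the minimum of a row never decreases from one row to the next, so once
  # it exceeds the cutoff the final distance must exceed it too:
  if best > cutoff:
    free( m3 )
    free( m1 )
    free( m2 )
    return blen   # length of the longer string, i.e. 'completely different'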

Answer 1:

Use ord() to convert characters to their integer code points. It works on characters of either the unicode or the str string type:

codepoints = [ord(c) for c in text]
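If a compact, C-compatible buffer is wanted rather than a python list, the standard array module can hold the same values (typecode 'I' is unsigned int; its exact width is platform-dependent, commonly 4 bytes):

from array import array

text = 'Maßstab'
codepoints = array( 'I', [ ord( c ) for c in text ] )
# codepoints.buffer_info() gives (address, length) of the underlying C array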


Answer 2:

Caveat lector: I've never done this. The following is a rough sketch of what I'd try.

You will need to use the PyUnicode_AsUnicode function and its companion, PyUnicode_GetSize. In declarations, where you currently have char, use Py_UNICODE instead. Presumably with a narrow (UCS2) build you would copy the internal structure, converting surrogate pairs as you go; with a wide (UCS4) build you might operate directly on the internal structure.
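For the narrow-build case, the surrogate folding would be the standard UTF-16 decoding arithmetic, something like this (a sketch; the function name is invented):

#---------------------------------------------------------------------------
# combine a high/low surrogate pair (0xD800-0xDBFF / 0xDC00-0xDFFF) into a
# single code point, per the UTF-16 rules:
cdef inline unsigned int combine_surrogates( unsigned int hi, unsigned int lo ):
  return 0x10000 + ( ( hi - 0xD800 ) << 10 ) + ( lo - 0xDC00 )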

Answer 3:

i am closing this question because i have found a better algorithm... with its own problems. see you over there.