Fast transliteration for Arabic Text with Python

2019-03-21 16:53发布

I always work on Arabic text files and to avoid problems with encoding I transliterate Arabic characters into English according to Buckwalter's scheme (http://www.qamus.org/transliteration.htm)

Here is my code to do so but it's very SLOW even with small files like 400 kb. Ideas to make it faster?

Thanks

     def transliterate(file):
          data = open(file).read()
          buckArab = {"'":"ء", "|":"آ", "?":"أ", "&":"ؤ", "<":"إ", "}":"ئ", "A":"ا", "b":"ب", "p":"ة", "t":"ت", "v":"ث", "g":"ج", "H":"ح", "x":"خ", "d":"د", "*":"ذ", "r":"ر", "z":"ز", "s":"س", "$":"ش", "S":"ص", "D":"ض", "T":"ط", "Z":"ظ", "E":"ع", "G":"غ", "_":"ـ", "f":"ف", "q":"ق", "k":"ك", "l":"ل", "m":"م", "n":"ن", "h":"ه", "w":"و", "Y":"ى", "y":"ي", "F":"ً", "N":"ٌ", "K":"ٍ", "~":"ّ", "o":"ْ", "u":"ُ", "a":"َ", "i":"ِ"}    
          for char in data: 
               for k, v in arabBuck.iteritems():
                     data = data.replace(k,v)                 
      return data

标签: python arabic
5条回答
Bombasti
2楼-- · 2019-03-21 17:40

You're redoing the same work for every character. When you do data = data.replace(k, v), that replaces all occurrences of the given character in the entire file. But you do this over and over in a loop, when you only need to do it once for each transliteration pair. Just remove your outermost loop and it should speed your code up immensely.

If you need to optimize it more you could look at the string translate method. I'm not sure how that is performance-wise.

查看更多
Deceive 欺骗
3楼-- · 2019-03-21 17:40

Whenever I use str.translate on unicode objects it returns the same exact object. Perhaps this is due to the change in behavior alluded to by Martijn Peters.

If anyone else out there is struggling to transliterate unicode such as arabic to ascii, I've found that mapping ordinals to unicode literals works well.

>>> buckArab = {"'":"ء", "|":"آ", "?":"أ", "&":"ؤ", "<":"إ", "}":"ئ", "A":"ا", "b":"ب", "p":"ة", "t":"ت", "v":"ث", "g":"ج", "H":"ح", "x":"خ", "d":"د", "*":"ذ", "r":"ر", "z":"ز", "s":"س", "$":"ش", "S":"ص", "D":"ض", "T":"ط", "Z":"ظ", "E":"ع", "G":"غ", "_":"ـ", "f":"ف", "q":"ق", "k":"ك", "l":"ل", "m":"م", "n":"ن", "h":"ه", "w":"و", "Y":"ى", "y":"ي", "F":"ً", "N":"ٌ", "K":"ٍ", "~":"ّ", "o":"ْ", "u":"ُ", "a":"َ", "i":"ِ"}
>>> ordbuckArab = {ord(v.decode('utf8')): unicode(k) for (k, v) in buckArab.iteritems()}
>>> ordbuckArab
{1569: u"'", 1570: u'|', 1571: u'?', 1572: u'&', 1573: u'<', 1574: u'}', 1575: u'A', 1576: u'b', 1577: u'p', 1578: u't', 1579: u'v', 1580: u'g', 1581: u'H', 1582: u'x', 1583: u'd', 1584: u'*', 1585: u'r', 1586: u'z', 1587: u's', 1588: u'$', 1589: u'S', 1590: u'D', 1591: u'T', 1592: u'Z', 1593: u'E', 1594: u'G', 1600: u'_', 1601: u'f', 1602: u'q', 1603: u'k', 1604: u'l', 1605: u'm', 1606: u'n', 1607: u'h', 1608: u'w', 1609: u'Y', 1610: u'y', 1611: u'F', 1612: u'N', 1613: u'K', 1614: u'a', 1615: u'u', 1616: u'i', 1617: u'~', 1618: u'o'}
>>> u'طعصط'.translate(ordbuckArab)
u'TEST'
查看更多
Deceive 欺骗
4楼-- · 2019-03-21 17:41

Extending @larapsodia's answer, here is the complete code with dictionary:

# -*- coding: utf-8 -*-

# Arabic Transliteration based on Buckwalter
# dictionary source is buckwalter2unicode.py http://www.redhat.com/archives/fedora-extras-commits/2007-June/msg03617.html 

buck2uni = {"'": u"\u0621", # hamza-on-the-line
            "|": u"\u0622", # madda
            ">": u"\u0623", # hamza-on-'alif
            "&": u"\u0624", # hamza-on-waaw
            "<": u"\u0625", # hamza-under-'alif
            "}": u"\u0626", # hamza-on-yaa'
            "A": u"\u0627", # bare 'alif
            "b": u"\u0628", # baa'
            "p": u"\u0629", # taa' marbuuTa
            "t": u"\u062A", # taa'
            "v": u"\u062B", # thaa'
            "j": u"\u062C", # jiim
            "H": u"\u062D", # Haa'
            "x": u"\u062E", # khaa'
            "d": u"\u062F", # daal
            "*": u"\u0630", # dhaal
            "r": u"\u0631", # raa'
            "z": u"\u0632", # zaay
            "s": u"\u0633", # siin
            "$": u"\u0634", # shiin
            "S": u"\u0635", # Saad
            "D": u"\u0636", # Daad
            "T": u"\u0637", # Taa'
            "Z": u"\u0638", # Zaa' (DHaa')
            "E": u"\u0639", # cayn
            "g": u"\u063A", # ghayn
            "_": u"\u0640", # taTwiil
            "f": u"\u0641", # faa'
            "q": u"\u0642", # qaaf
            "k": u"\u0643", # kaaf
            "l": u"\u0644", # laam
            "m": u"\u0645", # miim
            "n": u"\u0646", # nuun
            "h": u"\u0647", # haa'
            "w": u"\u0648", # waaw
            "Y": u"\u0649", # 'alif maqSuura
            "y": u"\u064A", # yaa'
            "F": u"\u064B", # fatHatayn
            "N": u"\u064C", # Dammatayn
            "K": u"\u064D", # kasratayn
            "a": u"\u064E", # fatHa
            "u": u"\u064F", # Damma
            "i": u"\u0650", # kasra
            "~": u"\u0651", # shaddah
            "o": u"\u0652", # sukuun
            "`": u"\u0670", # dagger 'alif
            "{": u"\u0671", # waSla
}

def transString(string, reverse=0):
    '''Given a Unicode string, transliterate into Buckwalter. To go from
    Buckwalter back to Unicode, set reverse=1'''

    for k, v in buck2uni.items():
      if not reverse:
            string = string.replace(v, k)
      else:
            string = string.replace(k, v)

    return string


>>> print(transString(u'مرحبا'))
mrHbA
>>> print(transString('mrHbA', 1))
مرحبا
>>>
查看更多
劳资没心,怎么记你
5楼-- · 2019-03-21 17:54

Incidentally, someone already wrote a script that does this, so you might want to check that out before spending too much time on your own: buckwalter2unicode.py

It probably does more than what you need, but you don't have to use all of it: I copied just the two dictionaries and the transliterateString function (with a few tweaks, I think), and use that on my site.

Edit: The script above is what I have been using, but I'm just discovered that it is much slower than using replace, especially for a large corpus. This is the code I finally ended up with, that seems to be simpler and faster (this references a dictionary buck2uni):

def transString(string, reverse=0):
    '''Given a Unicode string, transliterate into Buckwalter. To go from
    Buckwalter back to Unicode, set reverse=1'''

    for k, v in buck2uni.items():
        if not reverse:
            string = string.replace(v, k)
        else:
            string = string.replace(k, v)

    return string
查看更多
贼婆χ
6楼-- · 2019-03-21 17:56

Whenever you have to do transliteration str.translate is the method to use:

>>> import timeit
>>> buckArab = {"'":"ء", "|":"آ", "?":"أ", "&":"ؤ", "<":"إ", "}":"ئ", "A":"ا", "b":"ب", "p":"ة", "t":"ت", "v":"ث", "g":"ج", "H":"ح", "x":"خ", "d":"د", "*":"ذ", "r":"ر", "z":"ز", "s":"س", "$":"ش", "S":"ص", "D":"ض", "T":"ط", "Z":"ظ", "E":"ع", "G":"غ", "_":"ـ", "f":"ف", "q":"ق", "k":"ك", "l":"ل", "m":"م", "n":"ن", "h":"ه", "w":"و", "Y":"ى", "y":"ي", "F":"ً", "N":"ٌ", "K":"ٍ", "~":"ّ", "o":"ْ", "u":"ُ", "a":"َ", "i":"ِ"}
>>> def repl(data, table):
...     for k,v in table.iteritems():
...         data = data.replace(k, v)
... 
>>> def trans(data, table):
...     return data.translate(table)
... 
>>> T = u'This is a test to see how fast is translitteration'
>>> timeit.timeit('trans(T, buckArab)', 'from __main__ import trans, T, buckArab', number=10**6)
6.766200065612793
>>> T = 'This is a test to see how fast is translitteration' #in python2 requires ASCII string
>>> timeit.timeit('repl(T, buckArab)', 'from __main__ import repl, T, buckArab', number=10**6)
12.668706893920898

As you can see even for small strings str.translate is 2 times faster.

查看更多
登录 后发表回答