In Python, how do I most efficiently chunk a UTF-8

Posted 2019-06-14 04:37

Question:

  1. I'll start out by saying I sort of understand what 'UTF-8' encoding is, that it is basically but not exactly unicode, and that ASCII is a smaller character set. I also understand that if I have:

    se_body = "&gt; Genesis 2:2 וַיְכַל אֱלֹהִים בַּיֹּום הַשְּׁבִיעִי מְלַאכְתֹּו אֲשֶׁר עָשָׂה וַיִּשְׁבֹּת בַּיֹּום הַשְּׁבִיעִי מִכָּל־מְלַאכְתֹּו אֲשֶׁר עָשָֽׂה׃ The word tr <excess removed ...> JV"
    print len(se_body)              #will return the number of characters in the string, in my case '1500'
    print sys.getsizeof(se_body)    #will return the number of bytes, which will be 3050
    
  2. My code is leveraging a RESTful API that I do not control. Said RESTful API's job is to parse a passed parameter for bible references out of the text, and has an interesting quirk - it only accepts 2000 characters at a time. If more than 2000 characters are sent, my API call will return a 404. Again, to stress, I am leveraging someone else's API, so please don't tell me "fix the server side." I can't :)

  3. My solution is to take the string and chunk it into pieces of fewer than 2000 characters, let the service scan each chunk, and then reassemble and tag as needed. I'd like to be kind to said service and pass as few chunks as possible, meaning that each chunk should be large.

  4. My problem comes when I pass a string with Hebrew or Greek characters in it. (Yes, biblical answers often use Greek and Hebrew!) If I set the chunk size as low as 1000 characters, I can always safely pass it, but this just seems really small. In most cases, I should be able to chunk it larger.

  5. My question is this: Without resorting to too many heroics, what is the most efficient way to chunk a UTF-8 string into pieces of the correct size?

Here's the code:

# -*- coding: utf-8 -*-
import requests
import json

biblia_apikey = '************'
refparser_url = "http://api.biblia.com/v1/bible/scan/?"
se_body = "&gt; Genesis 2:2 וַיְכַל אֱלֹהִים בַּיֹּום הַשְּׁבִיעִי מְלַאכְתֹּו אֲשֶׁר עָשָׂה וַיִּשְׁבֹּת בַּיֹּום הַשְּׁבִיעִי מִכָּל־מְלַאכְתֹּו אֲשֶׁר עָשָֽׂה׃ The word translated as &quot;rest&quot; in English, is actually the conjugated word from which we get the English word `Sabbath`, which actually means to &quot;cease doing&quot;. &gt; וַיִּשְׁבֹּת or by its root: &gt; שָׁבַת Here&#39;s BlueletterBible&#39;s concordance entry: [Strong&#39;s H7673][1] It is actually the same root word that is conjugated to mean &quot;[to go on strike][2]&quot; in modern Hebrew. In Genesis it is used to refer to the fact that the creation process ceased, not that God &quot;rested&quot; in the sense of relieving exhaustion, as we would normally understand the term in English. The word &quot;rest&quot; in that sense is &gt; נוּחַ Which can be found in Genesis 8:9, for example (and is also where we get Noah&#39;s name). More here: [Strong&#39;s H5117][3] Jesus&#39; words are in reference to the fact that God is always at work, as the psalmist says in Psalm 54:4, He is the sustainer, something that implies a constant intervention (a &quot;work&quot; that does not cease). The institution of the Sabbath was not merely just so the Israelites would &quot;rest&quot; from their work but as with everything God institutes in the Bible, it had important theological significance (especially as can be gleaned from its prominence as one of the 10 commandments). The point of the Sabbath was to teach man that he should not think he is self-reliant (cf. instances such as Judges 7) and that instead they should rely upon God, but more specifically His mercy. 
The driving message throughout the Old Testament as well as the New (and this would be best extrapolated in c.se) is that man cannot, by his own efforts (&quot;works&quot;) reach God&#39;s standard: &gt; Ephesians 2:8 For by grace you have been saved through faith, and that not of yourselves; it is the gift of God, 9 not of works, lest anyone should boast. The Sabbath (and the penalty associated with breaking it) was a way for the Israelites to weekly remember this. See Hebrews 4 for a more in depth explanation of this concept. So there is no contradiction, since God never stopped &quot;working&quot;, being constantly active in sustaining His creation, and as Jesus also taught, the Sabbath was instituted for man, to rest, but also, to &quot;stop doing&quot; and remember that he is not self-reliant, whether for food, or for salvation. Hope that helps. [1]: http://www.blueletterbible.org/lang/lexicon/lexicon.cfm?Strongs=H7673&amp;t=KJV [2]: http://www.morfix.co.il/%D7%A9%D7%91%D7%99%D7%AA%D7%94 [3]: http://www.blueletterbible.org/lang/lexicon/lexicon.cfm?strongs=H5117&amp;t=KJV"

se_body = se_body.decode('utf-8')

nchunk_start=0
nchunk_size=1500
found_refs = []

while nchunk_start < len(se_body):
    # The slice end is an absolute index, not a length, so add nchunk_start to it
    body_chunk = se_body[nchunk_start:nchunk_start + nchunk_size]
    if len(body_chunk.strip()) < 4:
        break

    refparser_params = {'text': body_chunk, 'key': biblia_apikey }
    headers = {'content-type': 'text/plain; charset=utf-8', 'Accept-Encoding': 'gzip,deflate,sdch'}
    refparse = requests.get(refparser_url, params = refparser_params, headers=headers)

    if (refparse.status_code == 200):
        foundrefs = json.loads(refparse.text)
        for foundref in foundrefs['results']:
            foundref['textIndex'] += nchunk_start
            found_refs.append( foundref ) 
    else:
        print "Status Code {0}: Failed to retrieve valid parsing info at {1}".format(refparse.status_code, refparse.url)
        print "  returned text is: =>{0}<=".format(refparse.text)

    nchunk_start += (nchunk_size-50)
    #Note: I'm purposely backing up, so that I don't accidentally split a reference across chunks


for ref in found_refs:
    print ref
    print se_body[ref['textIndex']:ref['textIndex']+ref['textLength']]

I know how to slice a string (body_chunk = se_body[nchunk_start:nchunk_size]) but I'm not sure how I would go about slicing the same string according to its length in UTF-8 bytes.

When I'm done, I need to pull out the selected references (I'm actually going to add SPAN tags). This is what the output would look like for now though:

{u'textLength': 11, u'textIndex': 5, u'passage': u'Genesis 2:2'}
Genesis 2:2
{u'textLength': 11, u'textIndex': 841, u'passage': u'Genesis 8:9'}
Genesis 8:9

Answer 1:

There are several different "sizes" a string can have:

  1. The size in memory returned by sys.getsizeof(), e.g.:

    >>> import sys
    >>> sys.getsizeof(b'a')
    38
    >>> sys.getsizeof(u'Α')
    56
    

    That is, a bytestring containing the single byte b'a' may take 38 bytes in memory.
    You shouldn't care about this unless your local machine has memory problems.

  2. The number of bytes in the text encoded as utf-8:

    >>> unicode_text = u'Α' # greek letter
    >>> bytestring = unicode_text.encode('utf-8')
    >>> len(bytestring)
    2
    
  3. The number of Unicode codepoints in the text:

    >>> unicode_text = u'Α' # greek letter
    >>> len(unicode_text)
    1
    

    In general, you might also be interested in the number of grapheme clusters ("visual characters") in the text:

    >>> unicode_text = u'\u0435\u0308' # cyrillic letter ё written as е + combining diaeresis
    >>> len(unicode_text) # number of Unicode codepoints
    2
    >>> import regex # $ pip install regex
    >>> chars = regex.findall(u'\\X', unicode_text)
    >>> chars
    [u'\u0435\u0308']
    >>> len(chars) # number of "user-perceived characters"
    1
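
To see why a 1500-character chunk of pointed Hebrew can exceed a 2000-unit limit, it may help to measure one word all three ways. The following is only a sketch for Python 3 (the snippets above are Python 2); the sample word is spelled out with escapes so the combining marks survive copy and paste, and `quote` stands in for whatever percent-encoding the HTTP client applies.

```python
from urllib.parse import quote

# Hebrew shin + two marks, bet + one mark, tav (six code points in total)
word = "\u05e9\u05b8\u05c1\u05d1\u05b7\u05ea"

codepoints = len(word)                  # what len() and slicing count
utf8_bytes = len(word.encode("utf-8"))  # two bytes per Hebrew code point
url_chars = len(quote(word))            # three characters (%XX) per byte

print(codepoints, utf8_bytes, url_chars)  # prints: 6 12 36
```

So the same text is 6, 12, or 36 units long depending on which measure the service applies.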
    

If the API limit is defined by item 2 (the number of bytes in the UTF-8 encoded bytestring), you could use the answers to the question linked by @Martijn Pieters: "Truncating unicode so it fits a maximum size when encoded for wire transfer". The first answer should work; the 'ignore' error handler drops a trailing multi-byte character that the byte slice cut in half:

truncated = unicode_text.encode('utf-8')[:2000].decode('utf-8', 'ignore')
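
The one-liner above yields only the first chunk. A possible way to extend the idea to the whole text is sketched below for Python 3, assuming the limit really is counted in UTF-8 bytes. The function name `chunk_by_utf8_bytes` is made up for this illustration, and the sketch does not reproduce the 50-character overlap used in the question.

```python
def chunk_by_utf8_bytes(text, max_bytes=2000):
    """Yield pieces of text whose UTF-8 encoding is at most max_bytes long."""
    if max_bytes < 4:
        raise ValueError("max_bytes must be at least 4 to fit any code point")
    data = text.encode("utf-8")
    start = 0
    while start < len(data):
        end = min(start + max_bytes, len(data))
        # Back up while `end` points at a continuation byte (0b10xxxxxx),
        # so that a multi-byte character is never split across chunks.
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        yield data[start:end].decode("utf-8")
        start = end


# Three 2-byte Hebrew letters with a 5-byte budget: two letters, then one.
pieces = list(chunk_by_utf8_bytes("\u05d0\u05d0\u05d0", max_bytes=5))
print([len(p) for p in pieces])  # prints: [2, 1]
```

To map a reported textIndex back into the full text, you would keep a running total of `len(piece)` for the pieces already sent.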

It is also possible that the limit applies to the length of the URL, in which case what counts is the percent-encoded form of the text:

>>> import urllib
>>> urllib.quote(u'\u0435\u0308'.encode('utf-8'))
'%D0%B5%CC%88'

To truncate it:

import re
import urllib

urlencoded = urllib.quote(unicode_text.encode('utf-8'))[:2000]
# remove `%` or `%X` at the end
urlencoded = re.sub(r'%[0-9a-fA-F]?$', '', urlencoded) 
truncated = urllib.unquote(urlencoded).decode('utf-8', 'ignore')
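
If the limit does turn out to apply to the percent-encoded text, the same idea can be used to split the whole text rather than just truncate it. This is a rough Python 3 sketch under that assumption; `chunk_by_quoted_length` is a hypothetical helper name, and it counts every reserved character (including spaces) as three characters, which is conservative if the HTTP client encodes a space as a single `+`.

```python
from urllib.parse import quote


def chunk_by_quoted_length(text, max_quoted=2000):
    """Yield pieces of text whose percent-encoded form has at most max_quoted characters."""
    if max_quoted < 12:
        raise ValueError("max_quoted must be at least 12 (a 4-byte code point is 12 characters)")
    piece, size = [], 0
    for ch in text:
        # 1 for unreserved ASCII, otherwise 3 characters per UTF-8 byte
        cost = len(quote(ch, safe=""))
        if piece and size + cost > max_quoted:
            yield "".join(piece)
            piece, size = [], 0
        piece.append(ch)
        size += cost
    if piece:
        yield "".join(piece)


text = "Genesis 2:2 " + "\u05d0" * 1000
quoted_chunks = list(chunk_by_quoted_length(text, max_quoted=2000))
print([len(c) for c in quoted_chunks])
```

Because each Hebrew code point costs six encoded characters, a 2000-character budget holds only a little over 300 of them, which would explain why even modest chunk sizes can fail.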

The URL length issue might be solved with the 'X-HTTP-Method-Override' HTTP header, which lets you send the request as a POST while the service treats it as a GET, provided the service supports it. Here's a code example that uses the Google Translate API.
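
As a rough illustration of the header approach, the Python 3 sketch below only builds such a request with the standard library and does not send it. Whether the Biblia service honors 'X-HTTP-Method-Override' is not known here, so treat that as an assumption to test; `build_override_request` and the `YOUR_KEY` placeholder are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_override_request(url, params):
    """Build a POST that asks the server to treat it as a GET (if it supports that)."""
    body = urlencode(params).encode("ascii")  # parameters travel in the body, not the URL
    return Request(
        url,
        data=body,
        headers={
            "X-HTTP-Method-Override": "GET",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


req = build_override_request(
    "http://api.biblia.com/v1/bible/scan/",
    {"text": "Genesis 2:2", "key": "YOUR_KEY"},
)
print(req.get_method(), req.data)
```

Passing `req` to `urllib.request.urlopen` would actually send it; with this approach the text no longer counts toward the URL length.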

If it is allowed in your case, you could shorten the HTML text by decoding its HTML character references (such as &quot;) and applying the NFC Unicode normalization form, which combines some sequences of Unicode code points into single ones:

import unicodedata
from HTMLParser import HTMLParser

unicode_text = unicodedata.normalize('NFC', HTMLParser().unescape(unicode_text))
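
The snippet above is Python 2. On Python 3 the equivalent would presumably use `html.unescape`, as in this small sketch; the `compact` helper name is made up for illustration.

```python
import html
import unicodedata


def compact(text):
    """Decode HTML character references, then compose code points where NFC allows it."""
    return unicodedata.normalize("NFC", html.unescape(text))


sample = "&quot;rest&quot; \u0435\u0308"  # escaped quotes, then е + combining diaeresis
result = compact(sample)
print(len(sample), len(result))  # prints: 19 8
```

Note that any textIndex the service returns would then refer to the compacted text, not to the original HTML-escaped string, so slice the compacted text when pulling out references.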