Suppose I have a bunch of files in UTF-8 that I send to an external API as Unicode strings. The API operates on each string and returns a list of (character_offset, substr) tuples.
The output I need is the begin and end byte offset of each found substring. If I'm lucky the input text contains only ASCII characters (making character offsets and byte offsets identical), but this is not always the case. How can I find the begin and end byte offsets for a known begin character offset and substring?
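To illustrate the mismatch (the text and match are made up):

```python
text = "Héllo world"
# The API would report "world" at character offset 6,
# but "Héllo " encodes to 7 bytes because "é" takes 2 bytes in UTF-8.
print(text.index("world"))           # 6 (character offset)
print(len(text[:6].encode("utf8")))  # 7 (byte offset)
```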
I've answered this question myself, but look forward to other solutions to this problem that are more robust, more efficient, and/or more readable.
To convert character offsets to byte offsets when needed, I `encode('utf8')` the text leading up to the found substring (if there are any non-ASCII characters in the input text) and take its length as the begin offset. This implementation works, but it encodes a (large) part of the text for each found substring.
I'd solve this using a dictionary mapping character offsets to byte offsets and then looking up the offsets in that.
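A sketch of what I mean (assuming the same (character_offset, substr) tuples; the names are mine):

```python
def build_offset_map(text):
    """Map each character offset to the byte offset where that character starts."""
    offset_map = {}
    byte_pos = 0
    for char_pos, char in enumerate(text):
        offset_map[char_pos] = byte_pos
        byte_pos += len(char.encode("utf8"))
    offset_map[len(text)] = byte_pos  # one-past-the-end offset
    return offset_map

def byte_offsets(text, matches):
    offset_map = build_offset_map(text)
    return [
        (offset_map[begin], offset_map[begin + len(substr)])
        for begin, substr in matches
    ]
```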
Performance of this solution compared to yours depends a lot on the size of the input and the number of substrings involved. Local micro-benchmarking suggests that encoding each individual character in a text takes about 1000 times as long as encoding the entire text at once.