How to remove last utf8 char of a python string

Posted 2019-07-20 16:21

I have a string containing utf-8 encoded text. I need to remove the last utf-8 character.

So far I did

msg = msg[:-1]

but this only removes the last byte. That works as long as the last character is an ASCII character, but it breaks when the last character is a multibyte character.

Tags: python python-2.7 utf-8

1 answer

可以哭但决不认输i

Reply #2 · 2019-07-20 17:09

The simplest way is to decode your UTF-8 bytes to Unicode text:

without_last = msg.decode('utf8')[:-1]

You can always encode the result back to UTF-8 if you need bytes again.
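As a rough sketch of the full round trip (decode, slice, encode again), something like the following should work. The helper name drop_last_char and the sample string are made up for illustration and are not part of the question. Note that this removes one code point, which is not always the same thing as one user-visible character (combining accents, for example, are separate code points).

```python
# -*- coding: utf-8 -*-

def drop_last_char(data):
    """Hypothetical helper: remove the last code point from UTF-8 bytes."""
    text = data.decode('utf-8')        # UTF-8 bytes -> Unicode text
    return text[:-1].encode('utf-8')   # drop one code point, back to bytes

# Sample input (an assumption): the last character is encoded as 2 bytes.
msg = u'caf\u00e9'.encode('utf-8')
shorter = drop_last_char(msg)
```

Here shorter holds the bytes for 'caf'; both bytes of the final character are removed together.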

The alternative is to search backwards for a UTF-8 start byte. Every UTF-8 sequence starts with a byte whose most significant bit is 0 (a single-byte ASCII character) or whose two most significant bits are 11 (the lead byte of a multibyte sequence), while continuation bytes always start with the bits 10:

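# find the starting byte of the last codepoint
# (Python 2: msg is a byte string, so ord() turns each byte into an int)
pos = len(msg) - 1
# stop at index 0 so malformed input cannot push pos to -1
while pos > 0 and ord(msg[pos]) & 0xC0 == 0x80:
    # byte at pos is a continuation byte (bit 7 set, bit 6 not)
    pos -= 1
# pos now points at the start byte; slice it off together with its continuation bytes
msg = msg[:pos]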
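The same backward scan can be wrapped in a function. The version below is only a sketch: the name strip_last_utf8_char is hypothetical, and going through bytearray is my own choice so that indexing yields integers on both Python 2 and Python 3 (no ord() needed). It assumes the input is valid UTF-8 and does not validate it.

```python
def strip_last_utf8_char(data):
    """Hypothetical helper: drop the last UTF-8 sequence without decoding."""
    raw = bytearray(data)      # indexing a bytearray yields ints on Python 2 and 3
    pos = len(raw) - 1
    # Walk backwards over continuation bytes (bit pattern 10xxxxxx).
    while pos > 0 and raw[pos] & 0xC0 == 0x80:
        pos -= 1
    # pos is the index of the last start byte, or -1 if the input was empty.
    return bytes(raw[:max(pos, 0)])

# Sample input (an assumption): 'caf' followed by a 2-byte character.
result = strip_last_utf8_char(b'caf\xc3\xa9')
```

Compared with decoding, this only touches the last few bytes, which may matter for very large buffers; for ordinary strings the decode approach is easier to read.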
