How to split line at non-printing ascii character

2020-06-24 01:49发布

问题:

How can I split a line in Python at a non-printing ascii character (such as the long minus sign hex 0x97 , Octal 227)? I won't need the character itself. The information after it will be saved as a variable.

回答1:

You can use re.split.

>>> import re
>>> re.split('\W+', 'Words, words, words.')
['Words', 'words', 'words', '']

Adjust the pattern to only include the characters you want to keep.

See also: stripping-non-printable-characters-from-a-string-in-python


Example (w/ the long minus):

>>> # \xe2\x80\x93 represents a long dash (or long minus)
>>> s = 'hello – world'
>>> s
'hello \xe2\x80\x93 world'
>>> import re
>>> re.split("\xe2\x80\x93", s)
['hello ', ' world']

Or, the same with unicode:

>>> # \u2013 represents a long dash, long minus or so called en-dash
>>> s = u'hello – world'
>>> s
u'hello \u2013 world'
>>> import re
>>> re.split(u"\u2013", s)
[u'hello ', u' world']


回答2:

_, _, your_result= your_input_string.partition('\x97')

or

your_result= your_input_string.partition('\x97')[2]

If your_input_string does not contain a '\x97', then your_result will be empty. If your_input_string contains multiple '\x97' characters, your_result will contain everything after the first '\x97' character, including other '\x97' characters.



回答3:

Just use the string/unicode split method (They don't really care about the string you split upon (other than it is a constant. If you want to use a Regex then use re.split)

To get the split string either escape it like the other people have shown "\x97"

or

use chr(0x97) for strings (0-255) or unichr(0x97) for unicode

so an example would be

'will not be split'.split(chr(0x97))

'will be split here:\x97 and this is the second string'.split(chr(0x97))