Split string at commas except when in bracket envi

I would like to split a Python multiline string at its commas, except when the commas are inside a bracketed expression. For example, the string

{J. Doe, R. Starr}, {Lorem
{i}psum dolor }, Dol. sit., am. et.

should be split into

['{J. Doe, R. Starr}', '{Lorem\n{i}psum dolor }', 'Dol. sit.', 'am. et.']

This involves bracket matching, so regular expressions probably won't help here. PyParsing has commaSeparatedList, which almost does what I need, except that it protects quoted (") environments rather than {}-delimited ones.

Any hints?

Tags: python regex parsing pyparsing

3 Answers

聊天终结者

#2 · 2020-01-29 08:25

Write your own custom split-function:

 input_string = """{J. Doe, R. Starr}, {Lorem
 {i}psum dolor }, Dol. sit., am. et."""


 expected = ['{J. Doe, R. Starr}', '{Lorem\n{i}psum dolor }', 'Dol. sit.', 'am. et.']

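 def split(s):
     parts = []
     bracket_level = 0
     current = []
     # Append a trailing comma so the last part is flushed by the same branch
     for c in (s + ","):
         if c == "," and bracket_level == 0:
             # Top-level comma: finish the current part and drop the
             # whitespace that surrounds the separator
             parts.append("".join(current).strip())
             current = []
         else:
             if c == "{":
                 bracket_level += 1
             elif c == "}":
                 bracket_level -= 1
             current.append(c)
     return parts

 # Compare with ==; "assert x, y" only checks that x is truthy
 assert split(input_string) == expected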
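
As a minimal sketch of the same counting idea, the variant below also reports unbalanced braces instead of silently accepting them. The name split_top_level, its parameters, and the choice of ValueError are assumptions made for illustration; they are not part of the answer above.

```python
def split_top_level(s, sep=",", open_ch="{", close_ch="}"):
    """Split s at sep, ignoring separators nested inside open_ch/close_ch."""
    parts = []
    current = []
    depth = 0  # current nesting depth of brackets
    for c in s:
        if c == sep and depth == 0:
            # Top-level separator: close the current part
            parts.append("".join(current).strip())
            current = []
            continue
        if c == open_ch:
            depth += 1
        elif c == close_ch:
            depth -= 1
            if depth < 0:
                raise ValueError("unmatched closing bracket")
        current.append(c)
    if depth != 0:
        raise ValueError("unmatched opening bracket")
    # Flush whatever follows the last separator
    parts.append("".join(current).strip())
    return parts


text = "{J. Doe, R. Starr}, {Lorem\n{i}psum dolor }, Dol. sit., am. et."
print(split_top_level(text))
# ['{J. Doe, R. Starr}', '{Lorem\n{i}psum dolor }', 'Dol. sit.', 'am. et.']
```

Because the function tracks depth rather than a flag, commas inside nested braces such as "{b, {c, d}}" stay inside their part.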


我想做一个坏孩纸

#3 · 2020-01-29 08:27

Lucas Trzesniewski's comment can actually be used in Python with the PyPI regex module (I just replaced the named group with a numbered one to make it shorter):

>>> import regex
>>> r = regex.compile(r'({(?:[^{}]++|\g<1>)*})(*SKIP)(*FAIL)|\s*,\s*')
>>> s = """{J. Doe, R. Starr}, {Lorem
{i}psum dolor }, Dol. sit., am. et."""
>>> print(r.split(s))
['{J. Doe, R. Starr}', None, '{Lorem\n{i}psum dolor }', None, 'Dol. sit.', None, 'am. et.']

The first alternative, ({(?:[^{}]++|\g<1>)*})(*SKIP)(*FAIL), is meant to match nested structures like {...{...{}...}...}: { matches a literal {, then (?:[^{}]++|\g<1>)* matches zero or more occurrences of two alternatives, 1) one or more characters other than { and } (the [^{}]++) or 2) text matching the whole ({(?:[^{}]++|\g<1>)*}) subpattern, and finally } matches a literal }. The (*SKIP)(*FAIL) verbs make the engine discard the matched text and move the search position to the end of that match, so nothing is returned for it (we "skip" what we matched).

The second alternative, \s*,\s*, matches a comma surrounded by zero or more whitespace characters.

The None values appear because the capture group in the first branch does not participate when the second branch matches, and split includes every capture group in its result. The capture group in the first alternative is needed for the recursion. To remove these elements, use a list comprehension:

>>> print([x for x in r.split(s) if x])
['{J. Doe, R. Starr}', '{Lorem\n{i}psum dolor }', 'Dol. sit.', 'am. et.']
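
The None entries are not specific to this pattern. As a small illustration, the standard library re module shows the same effect whenever a split pattern contains a capture group that does not take part in a match. The pattern and sample strings below are made up for the demonstration.

```python
import re

# Group 1 only participates when the first branch (the semicolon) matches.
pattern = re.compile(r"(;)|,")

print(pattern.split("a,b;c"))
# ['a', None, 'b', ';', 'c']

# Filtering on truthiness drops None, but also drops empty strings.
pieces = pattern.split("a,,b")
print(pieces)                                # ['a', None, '', None, 'b']
print([x for x in pieces if x])              # ['a', 'b']
print([x for x in pieces if x is not None])  # ['a', '', 'b']
```

If empty fields between two adjacent separators matter, filtering with "is not None" keeps them, whereas the plain truthiness test removes them.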


三岁会撩人

#4 · 2020-01-29 08:33

You can use re.split in this case:

>>> from re import split
>>> data = '''\
... {J. Doe, R. Starr}, {Lorem
... {i}psum dolor }, Dol. sit., am. et.'''
>>> split(r',\s*(?![^{}]*\})', data)
['{J. Doe, R. Starr}', '{Lorem\n{i}psum dolor }', 'Dol. sit.', 'am. et.']
>>>

Below is an explanation of what the regex pattern matches:

,       # Matches ,
\s*     # Matches zero or more whitespace characters
(?!     # Starts a negative look-ahead assertion
[^{}]*  # Matches zero or more characters that are not { or }
\}      # Matches }
)       # Closes the look-ahead assertion
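
One caveat is worth checking before relying on this pattern: the look-ahead only inspects the text up to the next brace of either kind, so it does not track nesting depth. The sketch below uses made-up sample strings to show that a comma followed by a nested opening brace is still treated as a separator.

```python
import re

pattern = re.compile(r",\s*(?![^{}]*\})")

flat = "{a, b}, c"
nested = "{a, {b} c}, d"

# One level of braces: the inner comma is protected.
print(pattern.split(flat))
# ['{a, b}', 'c']

# The comma after "a" is followed by "{" before any "}", so the
# look-ahead does not protect it and the outer group is split.
print(pattern.split(nested))
# ['{a', '{b} c}', 'd']
```

The example from the question works because its nested {i} group is not preceded by a comma inside the outer braces. For arbitrarily nested input, a depth-counting approach is the safer choice.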
