I need to split a string like this, on semicolons. But I don't want to split on semicolons that are inside of a string (' or "). I'm not parsing a file; just a simple string with no line breaks.

part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5

Result should be:

part 1
"this is ; part 2;"
'this is ; part 3'
part 4
this "is ; part" 5

I suppose this can be done with a regex but if not; I'm open to another approach.

标签： python regex

16条回答

Anthone

2楼-- · 2020-01-23 04:20

>>> a='A,"B,C",D'
>>> a.split(',')
['A', '"B', 'C"', 'D']

It failed. Now try csv module
>>> import csv
>>> from StringIO import StringIO
>>> data = StringIO(a)
>>> data
<StringIO.StringIO instance at 0x107eaa368>
>>> reader = csv.reader(data, delimiter=',') 
>>> for row in reader: print row
... 
['A,"B,C",D']

0人赞添加讨论(0) 举报

冷血范

3楼-- · 2020-01-23 04:23

This regex will do that: (?:^|;)("(?:[^"]+|"")*"|[^;]*)

0人赞添加讨论(0) 举报

疯言疯语

4楼-- · 2020-01-23 04:27

since you do not have '\n', use it to replace any ';' that is not in a quote string

>>> new_s = ''
>>> is_open = False

>>> for c in s:
...     if c == ';' and not is_open:
...         c = '\n'
...     elif c in ('"',"'"):
...         is_open = not is_open
...     new_s += c

>>> result = new_s.split('\n')

>>> result
['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 'this "is ; part" 5']

0人赞添加讨论(0) 举报

祖国的老花朵

5楼-- · 2020-01-23 04:28

Most of the answers seem massively over complicated. You don't need back references. You don't need to depend on whether or not re.findall gives overlapping matches. Given that the input cannot be parsed with the csv module so a regular expression is pretty well the only way to go, all you need is to call re.split with a pattern that matches a field.

Note that it is much easier here to match a field than it is to match a separator:

import re
data = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""
PATTERN = re.compile(r'''((?:[^;"']|"[^"]*"|'[^']*')+)''')
print PATTERN.split(data)[1::2]

and the output is:

['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 'this "is ; part" 5']

As Jean-Luc Nacif Coelho correctly points out this won't handle empty groups correctly. Depending on the situation that may or may not matter. If it does matter it may be possible to handle it by, for example, replacing ';;' with ';<marker>;' where <marker> would have to be some string (without semicolons) that you know does not appear in the data before the split. Also you need to restore the data after:

>>> marker = ";!$%^&;"
>>> [r.replace(marker[1:-1],'') for r in PATTERN.split("aaa;;aaa;'b;;b'".replace(';;', marker))[1::2]]
['aaa', '', 'aaa', "'b;;b'"]

However this is a kludge. Any better suggestions?

0人赞添加讨论(0) 举报

手持菜刀，她持情操

6楼-- · 2020-01-23 04:30

While it could be done with PCRE via lookaheads/behinds/backreferences, it's not really actually a task that regex is designed for due to the need to match balanced pairs of quotes.

Instead it's probably best to just make a mini state machine and parse through the string like that.

Edit

As it turns out, due to the handy additional feature of Python re.findall which guarantees non-overlapping matches, this can be more straightforward to do with a regex in Python than it might otherwise be. See comments for details.

However, if you're curious about what a non-regex implementation might look like:

x = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""

results = [[]]
quote = None
for c in x:
  if c == "'" or c == '"':
    if c == quote:
      quote = None
    elif quote == None:
      quote = c
  elif c == ';':
    if quote == None:
      results.append([])
      continue
  results[-1].append(c)

results = [''.join(x) for x in results]

# results = ['part 1', '"this is ; part 2;"', "'this is ; part 3'",
#            'part 4', 'this "is ; part" 5']

0人赞添加讨论(0) 举报

狗以群分

7楼-- · 2020-01-23 04:31

A generalized solution:

import re
regex = '''(?:(?:[^{0}"']|"[^"]*(?:"|$)|'[^']*(?:'|$))+|(?={0}{0})|(?={0}$)|(?=^{0}))'''

delimiter = ';'
data2 = ''';field 1;"field 2";;'field;4';;;field';'7;'''
field = re.compile(regex.format(delimiter))
print(field.findall(data2))

Outputs:

['', 'field 1', '"field 2"', '', "'field;4'", '', '', "field';'7", '']

This solution:

captures all the empty groups (including at the beginning and at the end)
works for most popular delimiters including space, tab, and comma
treats quotes inside quotes of the other type as non-special characters
if an unmatched unquoted quote is encountered, treats the remainders of the line as quoted

0人赞添加讨论(0) 举报

How to split but ignore separators in quoted strin

Edit

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间