How to split but ignore separators in quoted strings

I need to split a string like this, on semicolons. But I don't want to split on semicolons that are inside of a string (' or "). I'm not parsing a file; just a simple string with no line breaks.

part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5

Result should be:

part 1
"this is ; part 2;"
'this is ; part 3'
part 4
this "is ; part" 5

I suppose this can be done with a regex, but if not, I'm open to another approach.

Tags: python regex

16 answers

成全新的幸福

Reply #2 · 2020-01-23 04:10

Here is an annotated pyparsing approach:

from pyparsing import (printables, originalTextFor, OneOrMore, 
    quotedString, Word, delimitedList)

# unquoted words can contain anything but a semicolon
printables_less_semicolon = printables.replace(';','')

# capture content between ';'s, and preserve original text
content = originalTextFor(
    OneOrMore(quotedString | Word(printables_less_semicolon)))

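# the sample string from the question
test = '''part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5'''

# process the string
print(delimitedList(content, ';').parseString(test))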

giving

['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 
 'this "is ; part" 5']

By using pyparsing's provided quotedString, you also get support for escaped quotes.

You were also unclear about how to handle whitespace before or after a semicolon delimiter, and none of the fields in your sample text has any. Pyparsing would parse "a; b ; c" as:

['a', 'b', 'c']


Luminary・发光体

Reply #3 · 2020-01-23 04:11

re.split(''';(?=(?:[^'"]|'[^']*'|"[^"]*")*$)''', data)

Each time it finds a semicolon, the lookahead scans the entire remaining string, making sure there's an even number of single-quotes and an even number of double-quotes. (Single-quotes inside double-quoted fields, or vice-versa, are ignored.) If the lookahead succeeds, the semicolon is a delimiter.

Unlike Duncan's solution, which matches the fields rather than the delimiters, this one has no problem with empty fields. (Not even the last one: unlike many other split implementations, Python's does not automatically discard trailing empty fields.)
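As a quick check, here is a minimal sketch that runs the split above against the sample string from the question and against a string with empty fields. The names pattern, data, fields and empty are only for illustration. Since the lookahead rescans the rest of the string at every semicolon, this is probably best kept to short inputs.

```python
import re

# A semicolon counts as a delimiter only if the rest of the string
# consists of unquoted characters and complete quoted sections.
pattern = re.compile(r''';(?=(?:[^'"]|'[^']*'|"[^"]*")*$)''')

data = '''part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5'''
fields = pattern.split(data)
print(fields)

# Empty fields, including a trailing one, are kept by re.split.
empty = pattern.split('a;;b;')
print(empty)
```

Note that a string with an unbalanced quote makes the lookahead fail everywhere after that quote, so nothing past it is split.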


老娘就宠你

Reply #4 · 2020-01-23 04:11

>>> x = '''part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5'''
>>> import re
>>> re.findall(r'''(?:[^;'"]+|'(?:[^'\\]|\\.)*'|"(?:[^"\\]|\\.)*")+''', x)
['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 'this "is ; part" 5']


Deceive 欺骗

Reply #5 · 2020-01-23 04:12

We can write a dedicated function for this:

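def split_with_commas_outside_of_quotes(string):
    # Splits on commas; compare against ';' instead for the question's input.
    # Only double quotes are tracked.
    arr = []
    start, in_quotes = 0, False
    for pos, x in enumerate(string):
        if x == '"':
            in_quotes = not in_quotes
        if not in_quotes and x == ',':
            arr.append(string[start:pos])
            start = pos + 1
    # append the final field (everything after the last separator)
    arr.append(string[start:])
    return arr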


兄弟一词,经得起流年.

Reply #6 · 2020-01-23 04:14

This seemed to me a semi-elegant solution.

New Solution:

import re
reg = re.compile('(\'|").*?\\1')
pp = re.compile('.*?;')
def splitter(string):
    # add a final semicolon so the last field is matched too
    string += ';'
    replaces = []
    s = string
    i = 1
    # replace each quoted section with a placeholder code
    for quote in reg.finditer(string):
        out = string[quote.start():quote.end()]
        s = s.replace(out, '**' + str(i) + '**')
        replaces.append(out)
        i += 1
    # split the string without quotes, dropping each trailing semicolon
    res = [part[:-1] for part in pp.findall(s)]

    # put the quoted sections back
    # TODO: this part could be faster
    # (linear instead of quadratic)
    i = 1
    for replace in replaces:
        for x in range(len(res)):
            res[x] = res[x].replace('**' + str(i) + '**', replace)
        i += 1
    return res

Old solution:

I chose to check for an opening quote, wait for it to close, and then match a terminating semicolon. Each "part" you want to match needs to end in a semicolon, so this matches things like:

'foobar;.sska';
"akjshd;asjkdhkj..,";
asdkjhakjhajsd.jhdf;

Code:

mm = re.compile('''((?P<quote>'|")?.*?(?(quote)\\2|);)''')
res = mm.findall('''part 1;"this is ; part 2;";'this is ; part 3';part 4''')

You may have to do some post-processing on res, but it contains what you want.
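One way the post-processing could look, as a rough sketch: findall returns a (match, quote) tuple per field here, and each match still carries its trailing semicolon. The sketch appends a semicolon to the input so the last field is picked up, which is my assumption about the intended usage; the names text and fields are illustrative only.

```python
import re

# Same pattern as above, written as a raw string.
mm = re.compile(r'''((?P<quote>'|")?.*?(?(quote)\2|);)''')

text = '''part 1;"this is ; part 2;";'this is ; part 3';part 4'''

# Append ';' so the final field also ends in a semicolon, then keep only
# the full match from each tuple and drop its trailing semicolon.
fields = [whole[:-1] for whole, _quote in mm.findall(text + ';')]
print(fields)
```

This only works when a quoted section starts at the beginning of a field. A field such as this "is ; part" 5 from the question would still be cut at the inner semicolon.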


成全新的幸福

Reply #7 · 2020-01-23 04:19

Even though I'm certain there is a clean regex solution (so far I like @noiflection's answer), here is a quick-and-dirty non-regex answer.

s = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""

inQuotes = False
current = ""
results = []
currentQuote = ""
for c in s:
    if not inQuotes and c == ";":
        results.append(current)
        current = ""
    elif not inQuotes and (c == '"' or c == "'"):
        currentQuote = c
        inQuotes = True
    elif inQuotes and c == currentQuote:
        currentQuote = ""
        inQuotes = False
    else:
        current += c

results.append(current)

print(results)
# ['part 1', 'this is ; part 2;', 'this is ; part 3', 'part 4', 'this is ; part 5']

(I've never put together something of this sort, feel free to critique my form!)
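The loop above drops the quote characters, while the expected result in the question keeps them. Here is a sketch of a variant that keeps them, assuming that is what's wanted. The function name split_outside_quotes and its sep parameter are made up for this example.

```python
def split_outside_quotes(text, sep=";"):
    """Split text on sep, ignoring separators inside '...' or "..."."""
    fields = []
    current = []
    open_quote = None  # the quote character we are currently inside, if any
    for ch in text:
        if open_quote is None and ch == sep:
            # separator outside quotes: finish the current field
            fields.append("".join(current))
            current = []
            continue
        if open_quote is None and ch in "'\"":
            open_quote = ch    # entering a quoted section
        elif ch == open_quote:
            open_quote = None  # leaving the quoted section
        current.append(ch)     # quote characters are kept in the output
    fields.append("".join(current))
    return fields


s = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""
result = split_outside_quotes(s)
print(result)
```

It makes a single pass over the string and does not handle backslash-escaped quotes. An unclosed quote simply swallows the rest of the string into the last field.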

