How to split but ignore separators in quoted strin

2020-01-23 03:43发布

I need to split a string like this, on semicolons. But I don't want to split on semicolons that are inside of a string (' or "). I'm not parsing a file; just a simple string with no line breaks.

part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5

Result should be:

  • part 1
  • "this is ; part 2;"
  • 'this is ; part 3'
  • part 4
  • this "is ; part" 5

I suppose this can be done with a regex but if not; I'm open to another approach.

标签: python regex
16条回答
看我几分像从前
2楼-- · 2020-01-23 04:31

Instead of splitting on a separator pattern, just capture whatever you need:

>>> import re
>>> data = '''part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5'''
>>> re.findall(r';([\'"][^\'"]+[\'"]|[^;]+)', ';' + data)
['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 'this "is ', ' part" 5']
查看更多
唯我独甜
3楼-- · 2020-01-23 04:34

Although the topic is old and previous answers are working well, I propose my own implementation of the split function in python.

This works fine if you don't need to process large number of strings and is easily customizable.

Here's my function:

# l is string to parse; 
# splitchar is the separator
# ignore char is the char between which you don't want to split

def splitstring(l, splitchar, ignorechar): 
    result = []
    string = ""
    ignore = False
    for c in l:
        if c == ignorechar:
            ignore = True if ignore == False else False
        elif c == splitchar and not ignore:
            result.append(string)
            string = ""
        else:
            string += c
    return result

So you can run:

line= """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""
splitted_data = splitstring(line, ';', '"')

result:

['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 'this "is ; part" 5']

The advantage is that this function works with empty fields and with any number of separators in the string.

Hope this helps!

查看更多
唯我独甜
4楼-- · 2020-01-23 04:35

My approach is to replace all non-quoted occurrences of the semi-colon with another character which will never appear in the text, then split on that character. The following code uses the re.sub function with a function argument to search and replace all occurrences of a srch string, not enclosed in single or double quotes or parens, brackets or braces, with a repl string:

def srchrepl(srch, repl, string):
    """
    Replace non-bracketed/quoted occurrences of srch with repl in string.
    """
    resrchrepl = re.compile(r"""(?P<lbrkt>[([{])|(?P<quote>['"])|(?P<sep>["""
                          + srch + """])|(?P<rbrkt>[)\]}])""")
    return resrchrepl.sub(_subfact(repl), string)


def _subfact(repl):
    """
    Replacement function factory for regex sub method in srchrepl.
    """
    level = 0
    qtflags = 0
    def subf(mo):
        nonlocal level, qtflags
        sepfound = mo.group('sep')
        if  sepfound:
            if level == 0 and qtflags == 0:
                return repl
            else:
                return mo.group(0)
        elif mo.group('lbrkt'):
            if qtflags == 0:
                level += 1
            return mo.group(0)
        elif mo.group('quote') == "'":
            qtflags ^= 1            # toggle bit 1
            return "'"
        elif mo.group('quote') == '"':
            qtflags ^= 2            # toggle bit 2
            return '"'
        elif mo.group('rbrkt'):
            if qtflags == 0:
                level -= 1
            return mo.group(0)
    return subf

If you don't care about the bracketed characters, you can simplify this code a lot.
Say you wanted to use a pipe or vertical bar as the substitute character, you would do:

mylist = srchrepl(';', '|', mytext).split('|')

BTW, this uses nonlocal from Python 3.1, change it to global if you need to.

查看更多
Melony?
5楼-- · 2020-01-23 04:37

You appears to have a semi-colon seperated string. Why not use the csv module to do all the hard work?

Off the top of my head, this should work

import csv 
from StringIO import StringIO 

line = '''part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5'''

data = StringIO(line) 
reader = csv.reader(data, delimiter=';') 
for row in reader: 
    print row 

This should give you something like
("part 1", "this is ; part 2;", 'this is ; part 3', "part 4", "this \"is ; part\" 5")

Edit:
Unfortunately, this doesn't quite work, (even if you do use StringIO, as I intended), due to the mixed string quotes (both single and double). What you actually get is

['part 1', 'this is ; part 2;', "'this is ", " part 3'", 'part 4', 'this "is ', ' part" 5'].

If you can change the data to only contain single or double quotes at the appropriate places, it should work fine, but that sort of negates the question a bit.

查看更多
登录 后发表回答