Regular expression troubles, escaped quotes

Posted 2019-06-21 20:55

Basically, I'm being passed a string and I need to tokenise it in much the same manner as command line options are tokenised by a *nix shell.

Say I have the following string:

"Hello\" World" "Hello Universe" Hi

How could I turn it into a 3-element list?

  • Hello" World
  • Hello Universe
  • Hi

The following is my first attempt, but it has a number of problems:

  • It leaves the quote characters
  • It doesn't catch the escaped quote

Code:

public void test() {
    String str = "\"Hello\\\" World\" \"Hello Universe\" Hi";
    List<String> list = split(str);
}

public static List<String> split(String str) {
    Pattern pattern = Pattern.compile(
        "\"[^\"]*\"" + /* double quoted token*/
        "|'[^']*'" + /*single quoted token*/
        "|[A-Za-z']+" /*everything else*/
    );

    List<String> opts = new ArrayList<String>();
    Scanner scanner = new Scanner(str).useDelimiter(pattern);

    String token;
    while ((token = scanner.findInLine(pattern)) != null) {
        opts.add(token);
    }
    return opts;
}

The incorrect output of the code above is:

  • "Hello\"
  • World
  • " "
  • Hello
  • Universe
  • Hi

EDIT: I'm totally open to a non-regex solution. Regex was just the first solution that came to mind.

5 Answers
等我变得足够好
#2 · 2019-06-21 21:16

If you decide to forgo regex and do parsing instead, there are a couple of options. If you only need one kind of quote (a double quote or a single quote), you can use StreamTokenizer to solve this easily:

public static List<String> tokenize(String s) throws IOException {
    List<String> opts = new ArrayList<String>();
    StreamTokenizer st = new StreamTokenizer(new StringReader(s));
    st.quoteChar('\"');
    while (st.nextToken() != StreamTokenizer.TT_EOF) {
        opts.add(st.sval);
    }

    return opts;
}
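
One thing to watch with the version above: with its default syntax table, StreamTokenizer also parses numbers (leaving sval null for a numeric token) and treats '/' as a comment character, so arguments such as 42 or a/b will not come through as expected. A minimal sketch of a stricter setup follows, assuming you want every non-blank character to count as part of a word. The class name ShellLikeTokenizer is hypothetical, and registering both quote characters is my choice, not part of the answer above.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not from the original answer.
class ShellLikeTokenizer {
    static List<String> tokenize(String s) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(s));
        st.resetSyntax();                // drop defaults: number parsing, '/' comments
        st.wordChars(0x21, 0xFF);        // every non-blank character is a word character
        st.whitespaceChars(0x00, 0x20);  // blanks and control characters separate tokens
        st.quoteChar('"');               // registered after wordChars so they win
        st.quoteChar('\'');
        List<String> opts = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                opts.add(st.sval);       // set for both words and quoted strings
            }
        } catch (IOException e) {
            // A StringReader should not fail here; rethrow unchecked just in case.
            throw new UncheckedIOException(e);
        }
        return opts;
    }

    public static void main(String[] args) {
        for (String t : tokenize("\"Hello\\\" World\" 'Hello Universe' Hi 42 a/b")) {
            System.out.println(t);
        }
    }
}
```

Note that StreamTokenizer only interprets backslash escapes inside quoted strings, and there it also translates sequences such as \n and \t, which may or may not be what you want for shell-like input.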

If you must support both quotes, here is a naive implementation that should work. One caveat: a string like '"blah \" blah"blah' will yield something like 'blah " blahblah'. If that isn't OK, you will need to make some changes:

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public static List<String> splitSSV(String in) throws IOException {
    List<String> out = new ArrayList<String>();

    StringReader r = new StringReader(in);
    StringBuilder b = new StringBuilder();
    int inQuote = -1;        // the quote char that opened the current quote, or -1
    boolean escape = false;  // true if the previous char was a backslash
    int c;
    // read each character
    while ((c = r.read()) != -1) {
        if (escape) {  // the previous char was an escape: take this char literally
            b.append((char) c);
            escape = false;
            continue;
        }
        switch (c) {
        case '\\':   // deal with escape char
            escape = true;
            break;
        case '\"':
        case '\'':   // deal with quote chars
            if (inQuote == -1) {
                inQuote = c;          // not in a quote: now we are
            } else if (inQuote == c) {
                inQuote = -1;         // matching quote char closes the quote
            } else {
                b.append((char) c);   // the other quote char is literal inside a quote
            }
            break;
        case ' ':
            if (inQuote != -1) {
                b.append((char) c);   // inside a quote: the space belongs to the token
            } else if (b.length() > 0) {
                out.add(b.toString());  // outside a quote: the space ends the token
                b.setLength(0);         // (runs of spaces produce no empty tokens)
            }
            break;
        default:
            b.append((char) c);  // append all other chars to current token
        }
    }
    if (b.length() > 0) {
        out.add(b.toString()); // add final token to list
    }
    return out;
}
做个烂人
#3 · 2019-06-21 21:18

I'm pretty sure you can't do this by just tokenising on a regex. If you need to deal with nested and escaped delimiters, you need to write a parser. See e.g. http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html

There will be open source parsers which can do what you want, although I don't know any. You should also check out the StreamTokenizer class.

祖国的老花朵
#4 · 2019-06-21 21:20

The first thing you need to do is stop thinking of the job in terms of split(). split() is meant for breaking down simple strings like this/that/the other, where / is always a delimiter. But you're trying to split on whitespace, unless the whitespace is within quotes, except if the quotes are escaped with backslashes (and if backslashes escape quotes, they probably escape other things, like other backslashes).

With all those exceptions-to-exceptions, it's just not possible to create a regex to match all possible delimiters, not even with fancy gimmicks like lookarounds, conditionals, reluctant and possessive quantifiers. What you want to do is match the tokens, not the delimiters.

In the following code, a token that's enclosed in double-quotes or single-quotes may contain whitespace as well as the quote character if it's preceded by a backslash. Everything except the enclosing quotes is captured in group #1 (for double-quoted tokens) or group #2 (single-quoted). Any character may be escaped with a backslash, even in non-quoted tokens; the "escaping" backslashes are removed in a separate step.

public static void test()
{
  String str = "\"Hello\\\" World\" 'Hello Universe' Hi";
  List<String> commands = parseCommands(str);
  for (String s : commands)
  {
    System.out.println(s);
  }
}

public static List<String> parseCommands(String s)
{
  String rgx = "\"((?:[^\"\\\\]++|\\\\.)*+)\""  // double-quoted
             + "|'((?:[^'\\\\]++|\\\\.)*+)'"    // single-quoted
             + "|\\S+";                         // not quoted
  Pattern p = Pattern.compile(rgx);
  Matcher m = p.matcher(s);
  List<String> commands = new ArrayList<String>();
  while (m.find())
  {
    String cmd = m.start(1) != -1 ? m.group(1) // strip double-quotes
               : m.start(2) != -1 ? m.group(2) // strip single-quotes
               : m.group();
    cmd = cmd.replaceAll("\\\\(.)", "$1");  // remove escape characters
    commands.add(cmd);
  }
  return commands;
}

output:

Hello" World
Hello Universe
Hi

This is about as simple as it gets for a regex-based solution--and it doesn't really deal with malformed input, like unbalanced quotes. If you're not fluent in regexes, you might be better off with a purely hand-coded solution or, even better, a dedicated command-line interpreter (CLI) library.
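
On the malformed-input point, here is one possible variation (a sketch, not the code above): anchor every match with \G so tokens must follow one another with only whitespace in between, and treat anything left over as an error. The class name StrictCommands, the stricter unquoted alternative, and the choice of IllegalArgumentException are all my assumptions. Unlike the \S+ version, this one splits foo"bar" into two tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical variation on parseCommands that rejects malformed input.
class StrictCommands {
    private static final Pattern TOKEN = Pattern.compile(
          "\\G\\s*(?:\"((?:[^\"\\\\]++|\\\\.)*+)\""   // group 1: double-quoted
        + "|'((?:[^'\\\\]++|\\\\.)*+)'"               // group 2: single-quoted
        + "|((?:[^\\s\"'\\\\]++|\\\\.)++))");         // group 3: not quoted

    static List<String> parse(String s) {
        List<String> commands = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        int end = 0;  // end of the last successfully matched token
        while (m.find()) {
            String raw = m.group(1) != null ? m.group(1)
                       : m.group(2) != null ? m.group(2)
                       : m.group(3);
            commands.add(raw.replaceAll("\\\\(.)", "$1"));  // remove escape characters
            end = m.end();
        }
        // \G stops the scan at the first position where no token fits, so any
        // non-blank remainder means an unbalanced quote or a dangling backslash.
        if (!s.substring(end).trim().isEmpty()) {
            throw new IllegalArgumentException("Malformed input at index " + end);
        }
        return commands;
    }

    public static void main(String[] args) {
        for (String c : parse("\"Hello\\\" World\" 'Hello Universe' Hi")) {
            System.out.println(c);
        }
    }
}
```

The possessive quantifiers matter here: once a quoted body has consumed the rest of the input without finding its closing quote, the regex cannot backtrack into a shorter match, so the failure surfaces instead of being silently reinterpreted.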

倾城 Initia
#5 · 2019-06-21 21:27

To recap, you want to split on whitespace, except when surrounded by double quotes, which are not preceded by a backslash.

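Step 1: tokenize the input: /([ \t]+)|(\\")|(")|([^ \t"\\]+|\\)/

(The TEXT alternative must exclude the backslash; otherwise it would swallow the backslash of an escaped quote before the ESCAPED_QUOTE alternative gets a chance to match.)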

This gives you a sequence of SPACE, ESCAPED_QUOTE, QUOTE and TEXT tokens.

Step 2: build a finite state machine matching and reacting to the tokens:

State: START

  • SPACE -> return empty string
  • ESCAPED_QUOTE -> Error (?)
  • QUOTE -> State := WITHIN_QUOTES
  • TEXT -> return text

State: WITHIN_QUOTES

  • SPACE -> add value to accumulator
  • ESCAPED_QUOTE -> add quote to accumulator
  • QUOTE -> return and clear accumulator; State := START
  • TEXT -> add text to accumulator

Step 3: Profit!!
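
The steps above can be sketched in Java roughly as follows. This is one reading of the state machine, under stated assumptions: "return empty string" for SPACE in START is taken to mean "emit nothing", the TEXT alternative excludes the backslash so that \" is always seen as ESCAPED_QUOTE, and reaching the end of input inside quotes is treated as an error (the description does not say). The class name QuoteStateMachine is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the two-step lexer + state machine.
class QuoteStateMachine {
    private enum State { START, WITHIN_QUOTES }

    // Step 1 lexer. Groups: 1 = SPACE, 2 = ESCAPED_QUOTE, 3 = QUOTE, 4 = TEXT.
    // Assumption: TEXT excludes the backslash, with a lone backslash as fallback.
    private static final Pattern LEXER = Pattern.compile(
        "([ \\t]+)|(\\\\\")|(\")|([^ \\t\"\\\\]+|\\\\)");

    static List<String> parse(String input) {
        List<String> result = new ArrayList<>();
        StringBuilder acc = new StringBuilder();
        State state = State.START;
        Matcher m = LEXER.matcher(input);
        while (m.find()) {
            boolean space = m.group(1) != null;
            boolean escapedQuote = m.group(2) != null;
            boolean quote = m.group(3) != null;
            if (state == State.START) {
                if (space) {
                    continue;                          // assumption: emit nothing
                } else if (escapedQuote) {
                    throw new IllegalArgumentException(
                        "Escaped quote outside quotes at index " + m.start());
                } else if (quote) {
                    state = State.WITHIN_QUOTES;
                } else {
                    result.add(m.group(4));            // TEXT is a complete token
                }
            } else {
                if (quote) {
                    result.add(acc.toString());        // return and clear accumulator
                    acc.setLength(0);
                    state = State.START;
                } else if (escapedQuote) {
                    acc.append('"');                   // add a bare quote
                } else {
                    acc.append(m.group());             // SPACE or TEXT, verbatim
                }
            }
        }
        if (state == State.WITHIN_QUOTES) {
            // Assumption: an unclosed quote is an error rather than an implicit close.
            throw new IllegalArgumentException("Unbalanced quote");
        }
        return result;
    }

    public static void main(String[] args) {
        for (String t : parse("\"Hello\\\" World\" \"Hello Universe\" Hi")) {
            System.out.println(t);
        }
    }
}
```

Because the four alternatives between them cover every possible character, the lexer never skips input, and the state machine sees the string as one unbroken token sequence.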

欢心
#6 · 2019-06-21 21:28

I think if you use a pattern like this:

Pattern pattern = Pattern.compile("\".*?(?<!\\\\)\"|'.*?(?<!\\\\)'|[A-Za-z']+");

then it will give you the desired output. When I ran it with your input data I got this list:

["Hello\" World", "Hello Universe", Hi]


I used [A-Za-z']+ from your own question, but shouldn't it be just [A-Za-z]+?

EDIT

Change your opts.add(token); line to:

opts.add(token.replaceAll("^\"|\"$|^'|'$", ""));
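
Putting the pattern and the quote-stripping together, a self-contained sketch might look like this. It uses Matcher.find() rather than the Scanner from the question, and adds one step that is not in the answer above: removing the backslash in front of an escaped quote, since stripping the enclosing quotes alone leaves Hello\" World. The class name LookbehindSplit is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the lookbehind pattern.
class LookbehindSplit {
    private static final Pattern TOKEN = Pattern.compile(
        "\".*?(?<!\\\\)\"|'.*?(?<!\\\\)'|[A-Za-z']+");

    static List<String> split(String str) {
        List<String> opts = new ArrayList<>();
        Matcher m = TOKEN.matcher(str);
        while (m.find()) {
            String token = m.group()
                .replaceAll("^\"|\"$|^'|'$", "")     // strip the enclosing quotes
                .replaceAll("\\\\([\"'])", "$1");    // added step: unescape \" and \'
            opts.add(token);
        }
        return opts;
    }

    public static void main(String[] args) {
        for (String t : split("\"Hello\\\" World\" \"Hello Universe\" Hi")) {
            System.out.println(t);
        }
    }
}
```

Be aware of the limits of the lookbehind: it cannot tell an escaped quote from a quote that follows an escaped backslash, so a token ending in \\ before its closing quote will be read past. Unquoted tokens are also restricted to letters and apostrophes, as in the question.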