Regex tokenize issue

Posted 2019-08-12 22:17

Question:

I have strings entered by the user and want to tokenize them. I want to use a regex for that, but I have a problem with one special case. An example string is

Test + "Hello" + "Good\"more" + "Escape\"This\"Test"

or the C# equivalent

@"Test + ""Hello"" + ""Good\""more"" + ""Escape\""This\""Test"""

I am able to match the Test and + tokens, but not the ones enclosed in ". I use " to let the user mark a token as a literal string rather than a special token. If the user wants a " character inside the string, I thought of letting them escape it with a \.

So the rule would be: give me everything between two " characters, where the closing " must not be preceded by a \.

The results I expect are:

"Hello" "Good\"more" "Escape\"This\"Test"

I need the surrounding " characters to be part of the match so that I know the token is a string.

I currently have the regex @"""([\w]*)(?<!\\"")""" which gives me the following results:

"Hello" "more" "Test"

So the lookbehind isn't working the way I want. Does anyone know the correct way to match the strings as described?
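
For reference, the behaviour described above can be reproduced with the short sketch below. The pattern and the input are the ones from the question; the class name LookbehindRepro and the method name Find are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class LookbehindRepro
{
    // Pattern from the question, i.e. "([\w]*)(?<!\\")" once the verbatim string is unescaped.
    const string Pattern = @"""([\w]*)(?<!\\"")""";

    // Returns every full match, including the surrounding quotes.
    public static string[] Find(string input) =>
        Regex.Matches(input, Pattern).Cast<Match>().Select(m => m.Value).ToArray();

    static void Main()
    {
        var input = @"Test + ""Hello"" + ""Good\""more"" + ""Escape\""This\""Test""";
        Console.WriteLine(string.Join(" ", Find(input)));
        // Prints: "Hello" "more" "Test"
    }
}
```

Tracing the pattern: [\w]* cannot consume a backslash, so an attempt that starts at the opening quote of "Good\"more" fails, and the next attempt starts at the escaped quote and yields "more". The lookbehind is evaluated just before the closing quote, so it can only reject a match whose content is empty and whose opening quote is itself preceded by a backslash.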

Answer 1:

Here's an adaptation of a regex I use to parse command lines:

(?!\+)((?:"(?:\\"|[^"])*"?|\S)+)

Example here at regex101

(The adaptations are the negative lookahead, which skips +, and checking for \" instead of "" as the escaped quote.)
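
A minimal sketch of how this pattern could be applied to the string from the question. The class name CommandLineStyleTokenizer and the method name Tokenize are hypothetical; only the pattern comes from this answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class CommandLineStyleTokenizer
{
    // The answer's pattern, written as a C# verbatim string (each " is doubled).
    const string Pattern = @"(?!\+)((?:""(?:\\""|[^""])*""?|\S)+)";

    // Group 1 spans the whole token, quotes included.
    public static string[] Tokenize(string input) =>
        Regex.Matches(input, Pattern).Cast<Match>().Select(m => m.Groups[1].Value).ToArray();

    static void Main()
    {
        var input = @"Test + ""Hello"" + ""Good\""more"" + ""Escape\""This\""Test""";
        foreach (var token in Tokenize(input))
            Console.WriteLine(token);
        // Prints:
        // Test
        // "Hello"
        // "Good\"more"
        // "Escape\"This\"Test"
    }
}
```

Note that the negative lookahead drops the + operators from the result, so they would have to be matched separately if they are needed as tokens.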

Hope this helps you.

Regards.

Edit:

If you aren't interested in surrounding quotes:

(?!\+)(?:"((?:\\"|[^"])*)"?|(\S+))
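
With this variant the quotes are no longer part of the captured text, so the group that participated in the match is what tells a string from a plain word. A sketch under that assumption (the class name GroupedTokenizer and the tuple field names are illustrative, not part of the answer):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class GroupedTokenizer
{
    // Group 1: content of a quoted string (quotes stripped). Group 2: an unquoted word.
    const string Pattern = @"(?!\+)(?:""((?:\\""|[^""])*)""?|(\S+))";

    public static List<(bool IsString, string Text)> Tokenize(string input) =>
        Regex.Matches(input, Pattern)
             .Cast<Match>()
             .Select(m => m.Groups[1].Success
                 ? (true, m.Groups[1].Value)    // quoted alternative matched
                 : (false, m.Groups[2].Value))  // unquoted alternative matched
             .ToList();

    static void Main()
    {
        var input = @"Test + ""Hello"" + ""Good\""more"" + ""Escape\""This\""Test""";
        foreach (var t in Tokenize(input))
            Console.WriteLine((t.IsString ? "string: " : "word:   ") + t.Text);
        // Prints:
        // word:   Test
        // string: Hello
        // string: Good\"more
        // string: Escape\"This\"Test
    }
}
```

Checking Groups[1].Success rather than the captured value keeps an empty string literal ("") distinguishable from an unquoted token.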


Answer 2:

To make it safer, I'd suggest getting all the substrings within unescaped pairs of "..." with the following regex:

^(?:[^"\\]*(?:\\.[^"\\]*)*("[^"\\]*(?:\\.[^"\\]*)*"))+

It matches

  • ^ - start of string (so that every " and every escape sequence is checked from the beginning)
  • (?: - start of a non-capturing group that serves as a container for the subpatterns below
    • [^"\\]*(?:\\.[^"\\]*)* - matches 0+ characters other than " and \, followed by 0+ sequences of \\. (any escape sequence), each followed by 0+ characters other than " and \ (this way an escaped " is never taken as an opening quote, and the opening quote can be preceded by any number of escape sequences)
    • ("[^"\\]*(?:\\.[^"\\]*)*") - capture group 1, matching "..." substrings that may contain any escape sequences inside
  • )+ - end of the non-capturing group, which is repeated 1 or more times
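
The breakdown above can also be written directly into the pattern. The sketch below assumes RegexOptions.IgnorePatternWhitespace, which lets the pattern carry whitespace and # comments, and runs it on the original string from the question. The class name QuotedSubstrings and the method name Extract are illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class QuotedSubstrings
{
    // Same pattern as above, spread over several lines (each "" is one literal quote).
    const string Pattern = @"
        ^                                  # start of string
        (?:                                # container, repeated once per quoted substring
          [^""\\]*(?:\\.[^""\\]*)*         # text before the quote, escapes consumed as pairs
          (                                # capture group 1: one quoted substring
            ""[^""\\]*(?:\\.[^""\\]*)*""
          )
        )+";

    // Group 1 is inside a repeated group, so all of its captures are collected.
    public static string[] Extract(string input) =>
        Regex.Match(input, Pattern, RegexOptions.IgnorePatternWhitespace)
             .Groups[1].Captures.Cast<Capture>().Select(c => c.Value).ToArray();

    static void Main()
    {
        var input = @"Test + ""Hello"" + ""Good\""more"" + ""Escape\""This\""Test""";
        foreach (var s in Extract(input))
            Console.WriteLine(s);
        // Prints:
        // "Hello"
        // "Good\"more"
        // "Escape\"This\"Test"
    }
}
```

Because the pattern is anchored with ^, there is at most one match per input, and the individual strings come from the capture collection of group 1, a feature specific to the .NET regex engine.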

See the regex demo. Here is a C# demo:

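using System;
using System.Linq;
using System.Text.RegularExpressions;

var rx = "^(?:[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"))+";
// In this input, \"Escape\"This\"Test\" sits outside any string, so it must not be returned.
var s = @"Test + ""Hello"" + ""Good\""more"" + \""Escape\""This\""Test\"" + ""f""";
// Group 1 is inside a repeated group, so read all of its captures, not only the last one.
var matches = Regex.Matches(s, rx)
        .Cast<Match>()
        .SelectMany(m => m.Groups[1].Captures.Cast<Capture>().Select(p => p.Value))
        .ToList();
Console.WriteLine(string.Join("\n", matches));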

UPDATE

If you need to remove the quoted tokens instead, match and capture everything outside of them with this code:

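// Text that contains no unescaped quote; escape sequences are consumed as pairs.
var keep = "[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*";
// The named group "keep" captures the text before each quoted token and the text after the last one.
var rx = string.Format("^(?:(?<keep>{0})\"{0}\")+(?<keep>{0})$", keep);
var s = @"Test + ""Hello"" + ""Good\""more"" + \""Escape\""This\""Test\"" + ""f""";
var matches = Regex.Matches(s, rx)
        .Cast<Match>()
        .SelectMany(m => m.Groups["keep"].Captures.Cast<Capture>().Select(p => p.Value))
        .ToList();
// Concatenate the kept pieces; the quoted tokens are gone.
Console.WriteLine(string.Join("", matches));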

See another demo

Output for @"Test + ""Hello"" + ""Good\""more"" + \""Escape\""This\""Test\"" + ""f""":

Test +  +  + \"Escape\"This\"Test\" + 