I have strings input by the user and want to tokenize them. For that, I want to use regex and now have a problem with a special case.
An example string is
Test + "Hello" + "Good\"more" + "Escape\"This\"Test"
or the C# equivalent
@"Test + ""Hello"" + ""Good\""more"" + ""Escape\""This\""Test"""
I am able to match the Test
and +
tokens, but not the ones contained by the ". I use the " to let the user specify that this is literally a string and not a special token. Now if the user wants to use the " character in the string, I thought of allowing him to escape it with a \.
So the rule would be: Give me everything between two " ", but the character in front of the last " can not be a \.
The results I expect are: "Hello"
"Good\"more"
"Escape\"This\"Test"
I need the " " characters to be in the final match so I know that this is a string.
I currently have the regex @"""([\w]*)(?<!\\"")"""
which gives me the following results: "Hello"
"more"
"Test"
So the look behind isn't working as I want it to be. Does anyone know the correct way to get the string like I want?
Here's an adaption of a regex I use to parse command lines:
(?!\+)((?:"(?:\\"|[^"])*"?|\S)+)
Example here at regex101
(adaption is the negative look-ahead to ignore +
and checking for \"
instead of ""
)
Hope this helps you.
Regards.
Edit:
If you aren't interested in surrounding quotes:
(?!\+)(?:"((?:\\"|[^"])*)"?|(\S+))
To make it safer, I'd suggest getting all the substrings within unescaped pairs of "..."
with the following regex:
^(?:[^"\\]*(?:\\.[^"\\]*)*("[^"\\]*(?:\\.[^"\\]*)*"))+
It matches
^
- start of string (so that we could check each "
and escape sequence)
(?:
- Non-capturing group 1 serving as a container for the subsequent subpatterns
[^"\\]*(?:\\.[^"\\]*)*
- matches 0+ characters other than "
and \
followed with 0+ sequences of \\.
(any escape sequence) followed with 0+ characters other than "
and \
(thus, we avoid matching the first "
that is escaped, and it can be preceded with any number of escape sequences)
("[^"\\]*(?:\\.[^"\\]*)*")
- Capture group 1 matching "..."
substrings that may contain any escape sequences inside
)+
- end of the first non-capturing group that is repeated 1 or more times
See the regex demo and here is a C# demo:
var rx = "^(?:[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"))+";
var s = @"Test + ""Hello"" + ""Good\""more"" + \""Escape\""This\""Test\"" + ""f""";
var matches = Regex.Matches(s, rx)
.Cast<Match>()
.SelectMany(m => m.Groups[1].Captures.Cast<Capture>().Select(p => p.Value).ToArray())
.ToList();
Console.WriteLine(string.Join("\n", matches));
UPDATE
If you need to remove the tokens, just match and capture all outside of them with this code:
var keep = "[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*";
var rx = string.Format("^(?:(?<keep>{0})\"{0}\")+(?<keep>{0})$", keep);
var s = @"Test + ""Hello"" + ""Good\""more"" + \""Escape\""This\""Test\"" + ""f""";
var matches = Regex.Matches(s, rx)
.Cast<Match>()
.SelectMany(m => m.Groups["keep"].Captures.Cast<Capture>().Select(p => p.Value).ToArray())
.ToList();
Console.WriteLine(string.Join("", matches));
See another demo
Output: Test + + + \"Escape\"This\"Test\" +
for @"Test + ""Hello"" + ""Good\""more"" + \""Escape\""This\""Test\"" + ""f""";
.