I'm trying to write a regex that will match everything BUT an apostrophe that has not been escaped. Consider the following:
<?php $s = 'Hi everyone, we\'re ready now.'; ?>
My goal is to write a regular expression that will essentially match the string portion of that. I'm thinking of something such as
/.*'([^']).*/
in order to match a simple string, but I've been trying to figure out how to get a negative lookbehind working on that apostrophe to ensure that it is not preceded by a backslash...
Any ideas?
- JMT
<?php
$backslash = '\\';
$pattern = <<< PATTERN
#(["'])(?:{$backslash}{$backslash}?+.)*?{$backslash}1#
PATTERN;
foreach(array(
"<?php \$s = 'Hi everyone, we\\'re ready now.'; ?>",
'<?php $s = "Hi everyone, we\\"re ready now."; ?>',
"xyz'a\\'bc\\d'123",
"x = 'My string ends with with a backslash\\\\';"
) as $subject) {
preg_match($pattern, $subject, $matches);
echo $subject , ' => ', $matches[0], "\n\n";
}
prints
<?php $s = 'Hi everyone, we\'re ready now.'; ?> => 'Hi everyone, we\'re ready now.'
<?php $s = "Hi everyone, we\"re ready now."; ?> => "Hi everyone, we\"re ready now."
xyz'a\'bc\d'123 => 'a\'bc\d'
x = 'My string ends with with a backslash\\'; => 'My string ends with with a backslash\\'
Here's my solution with test cases:
/.*?'((?:\\\\|\\'|[^'])*+)'/
And my proof (in Perl, though I don't think I use any Perl-specific features):
use strict;
use warnings;
my %tests = ();
$tests{'Case 1'} = <<'EOF';
$var = 'My string';
EOF
$tests{'Case 2'} = <<'EOF';
$var = 'My string has it\'s challenges';
EOF
$tests{'Case 3'} = <<'EOF';
$var = 'My string ends with a backslash\\';
EOF
foreach my $key (sort (keys %tests)) {
print "$key...\n";
if ($tests{$key} =~ m/.*?'((?:\\\\|\\'|[^'])*+)'/) {
print " ... '$1'\n";
} else {
print " ... NO MATCH\n";
}
}
Running this shows:
$ perl a.pl
Case 1...
... 'My string'
Case 2...
... 'My string has it\'s challenges'
Case 3...
... 'My string ends with a backslash\\'
Note that the initial wildcard at the start needs to be non-greedy. Then I use a possessive (non-backtracking) quantifier to gobble up \\ and \', and then anything else that is not a standalone quote character.
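To see why the non-backtracking part matters, here is a minimal JavaScript sketch. JavaScript regular expressions have no possessive quantifiers, so this sketch assumes the common workaround of capturing inside a lookahead and consuming the capture with \1. The names backtracking, atomic and body are made up for illustration, and the leading .*? is dropped because exec already searches from the start of the subject.

```javascript
// Same alternation as the pattern above, but with an ordinary greedy *.
const backtracking = /'((?:\\\\|\\'|[^'])*)'/;

// Emulated possessive match: the lookahead captures the longest run once,
// and \1 then consumes exactly that text, so none of it can be given back.
const atomic = /'(?=((?:\\\\|\\'|[^'])*))\1'/;

// Hypothetical helper: return the captured string body, or null on no match.
function body(re, subject) {
  const m = re.exec(subject);
  return m === null ? null : m[1];
}

const closed = "$var = 'My string has it\\'s challenges';";
const unterminated = "$var = 'abc\\'"; // no closing quote

console.log(body(backtracking, closed));       // My string has it\'s challenges
console.log(body(atomic, closed));             // My string has it\'s challenges
console.log(body(backtracking, unterminated)); // abc\   (escaped quote re-read as the closing quote)
console.log(body(atomic, unterminated));       // null
```

On well-formed input both variants agree. The difference only shows on the unterminated string, where the backtracking variant falls back to treating the escaped quote as the closing one.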
I think this one probably mimics the compiler's built-in approach, which should make it pretty bullet-proof.
/.*'([^'\\]|\\.)*'.*/
The parenthesized portion looks for non-apostrophes/backslashes and backslash-escaped characters. If only certain characters can be escaped, change the \\. to \\['\\a-z], or whatever.
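As a rough illustration of the same alternation, here is a JavaScript sketch. It is a variation, not the exact pattern above: the surrounding .* parts are dropped and the capture group is moved around the whole repetition so the string body can be read from group 1. The names singleQuoted and extractSingleQuoted are made up for this example.

```javascript
// Body = any run of ordinary characters (not ' or \) or backslash-escaped pairs.
const singleQuoted = /'((?:[^'\\]|\\.)*)'/;

// Hypothetical helper: return the body of the first single-quoted string, or null.
function extractSingleQuoted(subject) {
  const m = singleQuoted.exec(subject);
  return m === null ? null : m[1];
}

console.log(extractSingleQuoted("$s = 'Hi everyone, we\\'re ready now.';"));
// Hi everyone, we\'re ready now.
console.log(extractSingleQuoted("x = 'ends with a backslash\\\\';"));
// ends with a backslash\\
console.log(extractSingleQuoted("xyz'a\\'bc\\d'123"));
// a\'bc\d
console.log(extractSingleQuoted("'abc\\'"));
// null (the only other apostrophe is escaped, so the string never closes)
```

Because the two alternatives can never match at the same position (one excludes the backslash, the other requires it), backtracking cannot reinterpret an escaped apostrophe as a closing one.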
Via negative look behind:
/
.*?' #Match until '
(
.*? #Lazy match & capture of everything after the first apostrophe
)
(?<!(?<!\\)\\)' #Match first apostrophe that isn't preceded by \, but accept \\
.* #Match remaining text
/
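For anyone who wants to try the lookbehind version, here is a small JavaScript sketch of it. It assumes an engine with lookbehind support (ES2018 or later), and since JavaScript has no free-spacing mode the pattern is written on one line. The names lookbehind and extractWithLookbehind are made up for illustration.

```javascript
// One-line form of the commented pattern above.
const lookbehind = /.*?'(.*?)(?<!(?<!\\)\\)'.*/;

// Hypothetical helper: return the captured string body, or null on no match.
function extractWithLookbehind(subject) {
  const m = lookbehind.exec(subject);
  return m === null ? null : m[1];
}

console.log(extractWithLookbehind("$s = 'Hi everyone, we\\'re ready now.';"));
// Hi everyone, we\'re ready now.
console.log(extractWithLookbehind("x = 'ends with a backslash\\\\';"));
// ends with a backslash\\

// Three backslashes (an escaped backslash followed by an escaped apostrophe):
// the lookbehind only inspects two characters, so the match stops early here.
console.log(extractWithLookbehind("'a\\\\\\'b'"));
// a\\\
```

The last case suggests the two-character lookbehind handles \' and \\ but not longer runs of backslashes, which is worth keeping in mind if such input can occur.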
Regex reg = new Regex("(?<!\\\\)'(?<string>.*?)(?<!\\\\)'");
This is for JavaScript:
/('|")(?:\\\\|\\\1|[\s\S])*?\1/
it...
- matches single or double quoted strings
- matches empty strings (length 0)
- matches strings with embedded whitespace (\n, \t, etc.)
- skips inner escaped quotes (single or double)
- skips single quotes within double quotes and vice versa
Only the first quote is captured. You can capture the unquoted string in $2 with:
/('|")((?:\\\\|\\\1|[\s\S])*?)\1/
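A short usage sketch for the capturing variant, in case it helps. The helper name unquote and the sample inputs are made up for illustration. Group 1 holds the quote character and group 2 the text between the quotes.

```javascript
const quoted = /('|")((?:\\\\|\\\1|[\s\S])*?)\1/;

// Hypothetical helper: return the quote character and the unquoted body
// of the first quoted string in the subject, or null if there is none.
function unquote(subject) {
  const m = quoted.exec(subject);
  return m === null ? null : { quote: m[1], body: m[2] };
}

console.log(unquote('x = "it\'s \\"ok\\"";'));
// { quote: '"', body: 'it\'s \\"ok\\"' }   i.e. the body is: it's \"ok\"
console.log(unquote("a = '';"));
// { quote: "'", body: '' }
console.log(unquote("no quotes here"));
// null
```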