How can I adapt my regex to allow for escaped quotes?

Posted 2019-04-13 01:31

Introduction

My general issue is that I want to replace question marks in a string, but only when they are not inside quotes. I found a similar answer on SO (link) and began testing the code. Unfortunately, the code does not take escaped quotes into account.

For example: $string = 'hello="is it me your are looking for\\"?" AND test=?';

I have adapted the regular expression and code from an answer to the question "How to replace words outside double and single quotes", which is reproduced here for ease of reading:

<?php
function str_replace_outside_quotes($replace,$with,$string){
    $result = "";
    $outside = preg_split('/("[^"]*"|\'[^\']*\')/',$string,-1,PREG_SPLIT_DELIM_CAPTURE);
    while ($outside)
        $result .= str_replace($replace,$with,array_shift($outside)).array_shift($outside);
    return $result;
}
?>

Actual issue

So I have attempted to adjust the pattern to allow for it to match anything that is not a quote " and quotes that are escaped \":

<?php
$pattern = '/("(\\"|[^"])*"' . '|' . "'[^']*')/";

// when parsed/echoed by PHP the pattern evaluates to
// /("(\"|[^"])*"|'[^']*')/
?>

But this does not work as I had hoped.

My test string is: hello="is it me your are looking for\"?" AND test=?

And I am getting the following matches:

array
  0 => string 'hello=' (length=6)
  1 => string '"is it me your are looking for\"?"' (length=34)
  2 => string '?' (length=1)
  3 => string ' AND test=?' (length=11)

Match index two should not be there. That question mark should be considered part of match index 1 only and not repeated separately.
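A possible explanation for the stray element, offered as a sketch rather than a confirmed diagnosis: PREG_SPLIT_DELIM_CAPTURE returns every capture group, so the inner group (\"|[^"]) contributes its last repetition (the ?) as a separate entry. In addition, \" reaches PCRE as a plain quote, so the inner group does not actually recognise a backslash escape. The $fixed pattern below is an assumption, not part of the original code: it makes the inner group non-capturing and doubles the backslashes.

```php
<?php
$string = 'hello="is it me your are looking for\\"?" AND test=?';

// The question's pattern: PCRE receives \" (a literal quote) and a capturing inner group.
$broken = '/("(\\"|[^"])*"' . '|' . "'[^']*')/";

// Assumed fix: (?:...) so only the outer group is captured,
// and four backslashes in the PHP literal so PCRE sees an escaped backslash.
$fixed = '/("(?:\\\\.|[^"\\\\])*"|\'(?:\\\\.|[^\'\\\\])*\')/s';

$before = preg_split($broken, $string, -1, PREG_SPLIT_DELIM_CAPTURE);
$after  = preg_split($fixed,  $string, -1, PREG_SPLIT_DELIM_CAPTURE);

print_r($before); // four entries, including a separate '?' from the inner group
print_r($after);  // three entries: outside, quoted, outside
```

With only one capturing group, the split result alternates strictly between unquoted and quoted parts, which is what the while loop in str_replace_outside_quotes() relies on.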

Once resolved, the same fix should also apply to the other side of the main alternation, for single quotes/apostrophes (').

After this is parsed by the complete function it should output:

echo str_replace_outside_quotes('?', '%s', 'hello="is it me your are looking for\\"?" AND test=?');
// hello="is it me your are looking for\"?" AND test=%s

I hope that this makes sense and I have provided enough information to answer the question. If not I will happily provide whatever you need.

Debug code

My current (complete) code sample is on codepad for forking as well:

function str_replace_outside_quotes($replace, $with, $string){
    $result = '';
    var_dump($string);
    $pattern = '/("(\\"|[^"])*"' . '|' . "'[^']*')/";
    var_dump($pattern);
    $outside = preg_split($pattern, $string, -1, PREG_SPLIT_DELIM_CAPTURE);
    var_dump($outside);
    while ($outside) {
        $result .= str_replace($replace, $with, array_shift($outside)) . array_shift($outside);
    }
    return $result;
}
echo str_replace_outside_quotes('?', '%s', 'hello="is it me your are looking for\\"?" AND test=?');

Sample input and expected output

In: hello="is it me your are looking for\\"?" AND test=? AND hello='is it me your are looking for\\'?' AND test=? hello="is it me your are looking for\\"?" AND test=?' AND hello='is it me your are looking for\\'?' AND test=?
Out: hello="is it me your are looking for\\"?" AND test=%s AND hello='is it me your are looking for\\'?' AND test=%s hello="is it me your are looking for\\"?" AND test=%s AND hello='is it me your are looking for\\'?' AND test=%s

In: my_var = ? AND var_test = "phoned?" AND story = 'he said \'where is it?!?\''
Out: my_var = %s AND var_test = "phoned?" AND story = 'he said \'where is it?!?\''

5 answers
forever°为你锁心
#2 · 2019-04-13 01:50

This regex matches valid quoted strings. This means it is aware of escaped quotes.

^("[^\"\\]*(?:\\.[^\"\\]*)*(?![^\\]\\)")|('[^\'\\]*(?:\\.[^\'\\]*)*(?![^\\]\\)')$

Ready for PHP use:

$pattern = '/^((?:"([^"\\\\]*(?:\\\\.[^"\\\\]*)*(?![^\\\\]\\\\))")|(?:\'([^\'\\\\]*(?:\\\\.[^\'\\\\]*)*(?![^\\\\]\\\\))\'))$/';

Adapted for str_replace_outside_quotes():

$pattern = '/((?:"(?:[^"\\\\]*(?:\\\\.[^"\\\\]*)*(?![^\\\\]\\\\))")|(?:\'(?:[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*(?![^\\\\]\\\\))\'))/';
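As a quick check of the anchored "Ready for PHP use" pattern, here is a minimal sketch that runs it through preg_match(). The pattern is copied verbatim from above; the sample strings and the $results variable are hypothetical and not part of the original answer.

```php
<?php
// Anchored pattern copied verbatim from the answer above.
$pattern = '/^((?:"([^"\\\\]*(?:\\\\.[^"\\\\]*)*(?![^\\\\]\\\\))")|(?:\'([^\'\\\\]*(?:\\\\.[^\'\\\\]*)*(?![^\\\\]\\\\))\'))$/';

// Hypothetical sample inputs.
$samples = array(
    '"say \\"hi\\""',  // "say \"hi\""  escaped quotes inside double quotes
    '"say "hi""',      // "say "hi""    unescaped inner quotes
    '\'it\\\'s\'',     // 'it\'s'       escaped apostrophe inside single quotes
);

$results = array();
foreach ($samples as $s) {
    // preg_match() returns 1 when the whole string is one valid quoted string.
    $results[] = (preg_match($pattern, $s) === 1);
    echo $s, ' => ', end($results) ? 'valid' : 'invalid', "\n";
}
```

The first and third samples are reported as valid and the second as invalid, because an unescaped quote in the middle ends the string before the $ anchor is reached.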
SAY GOODBYE
#3 · 2019-04-13 02:03

Edit: I changed my answer. This version does not rely on a regex to find the quotes (only the search term is now a regex; I thought it would be better to use preg_replace instead of str_replace, but you can change that):

function replace_special($what, $with, $str) {
   $res = '';
   $currPos = 0;
   $doWork = true;

   while (true) {
     $doWork = false; // pessimistic approach

     $pos = get_quote_pos($str, $currPos, $quoteType);
     if ($pos !== false) {
       $posEnd = get_specific_quote_pos($str, $quoteType, $pos + 1);
       if ($posEnd !== false) {
           $doWork = $posEnd !== strlen($str) - 1; //do not break if not end of string reached

           $res .= preg_replace($what, $with, 
                                substr($str, $currPos, $pos - $currPos));
           $res .= substr($str, $pos, $posEnd - $pos + 1);                      

           $currPos = $posEnd + 1;
       }
     }

     if (!$doWork) {
        $res .= preg_replace($what, $with, 
                             substr($str, $currPos, strlen($str) - $currPos + 1));
        break;
     }

   }   

   return $res;
}

function get_quote_pos($str, $currPos, &$type) {
   $pos1 = get_specific_quote_pos($str, '"', $currPos);
   $pos2 = get_specific_quote_pos($str, "'", $currPos);
   if ($pos1 !== false) {
      if ($pos2 !== false && $pos1 > $pos2) {
        $type = "'";
        return $pos2;
      }
      $type = '"';
      return $pos1;
   }
   else if ($pos2 !== false) {
      $type = "'";
      return $pos2;
   }

   return false;
}

function get_specific_quote_pos($str, $type, $currPos) {
   $pos = $currPos - 1; //because $fromPos = $pos + 1 and initial $fromPos must be currPos
   do {
     $fromPos = $pos + 1;
     $pos = strpos($str, $type, $fromPos);
   }
   //iterate again if quote is escaped!
   while ($pos !== false && $pos > $currPos && $str[$pos-1] == '\\');
   return $pos;
}

Example:

   $str = 'hello ? ="is it me your are looking for\\"?" AND mist="???" WHERE test=? AND dzo=?';
   echo replace_special('/\?/', '#', $str);

returns

hello # ="is it me your are looking for\"?" AND mist="???" WHERE test=# AND dzo=#

----

Old answer (I leave it here because it solves part of the problem, although not the full question):

<?php
function str_replace_outside_quotes($replace, $with, $string){
    $result = '';
    var_dump($string);
    $pattern = '/(?<!\\\\)"/';
    $outside = preg_split($pattern, $string, -1, PREG_SPLIT_DELIM_CAPTURE);
   var_dump($outside);
    for ($i = 0; $i < count($outside); ++$i) {
       $replaced = str_replace($replace, $with, $outside[$i]);
       if ($i != 0 && $i != count($outside) - 1) { //first and last are not inside quote
          $replaced = '"'.$replaced.'"';
       }
       $result .= $replaced;
    }
   return $result;
}
echo str_replace_outside_quotes('?', '%s', 'hello="is it me your are looking for\\"?" AND test=?');
够拽才男人
#4 · 2019-04-13 02:04

As @ridgerunner mentions in the comments on the question there is another possible regex solution:

function str_replace_outside_quotes($replace, $with, $string){
    $result = '';
    $pattern = '/("[^"\\\\]*(?:\\\\.[^"\\\\]*)*")' // hunt down unescaped double quotes
             . "|('[^'\\\\]*(?:\\\\.[^'\\\\]*)*')/s"; // or single quotes
    $outside = array_filter(preg_split($pattern, $string, -1, PREG_SPLIT_DELIM_CAPTURE));
    while ($outside) {
        $result .= str_replace($replace, $with, array_shift($outside)) // outside quotes
                .  array_shift($outside); // inside quotes
    }
    return $result;
}

Note the use of array_filter: it removes the empty matches that preg_split was returning, which would otherwise break the alternating outside/inside order this function relies on.
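One caveat, offered as an observation rather than a tested report on the code above: array_filter() without a callback drops every falsy element, so an empty leading segment (input that starts with a quote) or a segment equal to '0' would also be removed, which could shift the outside/inside alternation. A minimal sketch of an alternative, assuming a single capturing group around the whole alternation so that preg_split() returns strictly alternating parts; the function name is hypothetical:

```php
<?php
// Hypothetical name, chosen to avoid clashing with the functions above.
function replace_outside_quotes_single_group($replace, $with, $string) {
    // One capturing group wraps both alternatives, so every delimiter
    // yields exactly one captured element and no filtering is needed.
    $pattern = '/("[^"\\\\]*(?:\\\\.[^"\\\\]*)*"|\'[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*\')/s';
    $parts = preg_split($pattern, $string, -1, PREG_SPLIT_DELIM_CAPTURE);

    $result = '';
    foreach ($parts as $i => $part) {
        // Even indexes are outside quotes; odd indexes are the quoted chunks.
        $result .= ($i % 2 === 0) ? str_replace($replace, $with, $part) : $part;
    }
    return $result;
}

echo replace_outside_quotes_single_group('?', '%s', '"a?" b?'), "\n"; // "a?" b%s
```

Because the empty leading segment is kept, the quoted chunk stays at an odd index even when the input begins with a quote.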


Here is a non-regex approach that I knocked up quickly. It works, but I am sure there are some optimisations that could be made.

function str_replace_outside_quotes($replace, $with, $string){
    $string = str_split($string);
    $accumulation = '';
    $current_unquoted_string = null;
    $inside_quote = false;
    $quotes = array("'", '"');
    foreach($string as $char) {
        if ($char == $inside_quote && "\\" != substr($accumulation, -1)) {
            $inside_quote = false;
        } else if(false === $inside_quote && in_array($char, $quotes)) {
            $inside_quote = $char;
        }

        if(false === $inside_quote) {
            $current_unquoted_string .= $char;
        } else {
            if(null !== $current_unquoted_string) {
                $accumulation .= str_replace($replace, $with, $current_unquoted_string);
                $current_unquoted_string = null;
            }
            $accumulation .= $char;
        }
    }
    if(null !== $current_unquoted_string) {
        $accumulation .= str_replace($replace, $with, $current_unquoted_string);
        $current_unquoted_string = null;
    }
    return $accumulation;
}

In my benchmarking it takes double the time of the regex approach above. When the string length is increased, the regex option's resource use doesn't go up by much, whereas the character-by-character approach above increases linearly with the length of the text fed to it.

女痞
#5 · 2019-04-13 02:06

» Code has been updated to solve ALL issues brought up in the comments and is now working properly «


With $s as the input, $p as the phrase to search for and $v as the replacement, use preg_replace as follows:

$r = '/\G((?:(?:[^\x5C"\']|\x5C(?!["\'])|\x5C["\'])*?(?:\'(?:[^\x5C\']|\x5C(?!\')' .
     '|\x5C\')*\')*(?:"(?:[^\x5C"]|\x5C(?!")|\x5C")*")*)*?)' . preg_quote($p, '/') . '/';
$s = preg_match($r, $s) ? preg_replace($r, "$1" . $v, $s) : $s;

Check this demo.


Note: In regex, \x5C represents a \ character.

Emotional °昔
#6 · 2019-04-13 02:11

The following tested script first checks that a given string is valid, consisting solely of single quoted, double quoted and un-quoted chunks. The $re_valid regex performs this validation task. If the string is valid, it then parses the string one chunk at a time using preg_replace_callback() and the $re_parse regex. The callback function processes the unquoted chunks using preg_replace(), and returns all quoted chunks unaltered. The only tricky part of the logic is passing the $replace and $with argument values from the main function to the callback function. (Note that PHP procedural code makes this variable passing from the main function to the callback function a bit awkward.) Here is the script:

<?php // test.php Rev:20121113_1500
function str_replace_outside_quotes($replace, $with, $string){
    $re_valid = '/
        # Validate string having embedded quoted substrings.
        ^                           # Anchor to start of string.
        (?:                         # Zero or more string chunks.
          "[^"\\\\]*(?:\\\\.[^"\\\\]*)*"  # Either a double quoted chunk,
        | \'[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*\'  # or a single quoted chunk,
        | [^\'"\\\\]+               # or an unquoted chunk (no escapes).
        )*                          # Zero or more string chunks.
        \z                          # Anchor to end of string.
        /sx';
    if (!preg_match($re_valid, $string)) // Exit if string is invalid.
        exit("Error! String not valid.");
    $re_parse = '/
        # Match one chunk of a valid string having embedded quoted substrings.
          (                         # Either $1: Quoted chunk.
            "[^"\\\\]*(?:\\\\.[^"\\\\]*)*"  # Either a double quoted chunk,
          | \'[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*\'  # or a single quoted chunk.
          )                         # End $1: Quoted chunk.
        | ([^\'"\\\\]+)             # or $2: an unquoted chunk (no escapes).
        /sx';
    _cb(null, $replace, $with); // Pass args to callback func.
    return preg_replace_callback($re_parse, '_cb', $string);
}
function _cb($matches, $replace = null, $with = null) {
    // Only set local static vars on first call.
    static $_replace, $_with;
    if (!isset($matches)) { 
        $_replace = $replace;
        $_with = $with;
        return; // First call is done.
    }
    // Return quoted string chunks (in group $1) unaltered.
    if ($matches[1]) return $matches[1];
    // Process only unquoted chunks (in group $2).
    return preg_replace('/'. preg_quote($_replace, '/') .'/',
        $_with, $matches[2]);
}
$data = file_get_contents('testdata.txt');
$output = str_replace_outside_quotes('?', '%s', $data);
file_put_contents('testdata_out.txt', $output);
?>
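The static-variable hand-off to _cb() can be avoided on PHP 5.3 or later by using a closure. The following is a minimal sketch of that idea, not the answer author's code: the function name is hypothetical, it reuses the $re_parse pattern from the script above, it swaps preg_replace() for str_replace() on the unquoted chunks, and it skips the $re_valid check, so input with unbalanced quotes is not rejected.

```php
<?php
// Hypothetical variant name; not part of the original script.
function str_replace_outside_quotes_closure($replace, $with, $string) {
    $re_parse = '/
          (                                   # Group 1: a quoted chunk,
            "[^"\\\\]*(?:\\\\.[^"\\\\]*)*"          #   double quoted
          | \'[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*\'      #   or single quoted.
          )
        | ([^\'"\\\\]+)                       # Group 2: an unquoted chunk.
        /sx';

    return preg_replace_callback(
        $re_parse,
        // The closure captures $replace and $with directly via "use".
        function ($m) use ($replace, $with) {
            // Group 1 is non-empty only for quoted chunks: return them unaltered.
            if ($m[1] !== '') {
                return $m[1];
            }
            // Otherwise group 2 holds an unquoted chunk: do the replacement there.
            return str_replace($replace, $with, $m[2]);
        },
        $string
    );
}

echo str_replace_outside_quotes_closure(
    '?', '%s', 'hello="is it me your are looking for\\"?" AND test=?'
), "\n";
// hello="is it me your are looking for\"?" AND test=%s
```

The closure keeps the arguments local to each call, so nothing has to be primed with an extra call before preg_replace_callback() runs.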