Regular expression preg_quote symbols are not detected

Posted 2019-01-10 02:41

Question:

I have a dictionary of swear words in the database, and the following works great

preg_match_all("/\b".$f."(?:ing|er|es|s)?\b/si",$t,$m,PREG_SET_ORDER);

$t is the input text and $f = preg_quote("punk"); "punk" comes from the database dictionary, so at this point in the loop the expression is as follows:

preg_match_all("/\bpunk(?:ing|er|es|s)?\b/si",$t,$m,PREG_SET_ORDER);

preg_quote escapes symbols, e.g. # becomes \#, so that the expression stays valid. But when the dictionary word is e.g. "F@CK" or "A$$", it is not detected in the input string with the above expression. I have both a$$ and f@ck in the dictionary, and neither works. If I remove preg_quote() on the word, the regular expression is invalid because these symbols are not escaped.

Any suggestions on how I can detect "a$$" ???

Edit:

So I guess the expression that is not working as intended would be eg.

preg_match_all("/\bf\@ck(?:ing|er|es|s)?\b/si",$t,$m,PREG_SET_ORDER);

Which should find f@ck in $t

UPDATE:

This is my usage, simply put: if there are matches in $m, replace them with "*****". This whole block is inside a loop over each word in the dictionary; $f is the dictionary word and $t is the input.

$f = preg_quote($f);
preg_match_all("/\b$f(?:ing|er|es|s)?\b/si",$t,$m,PREG_SET_ORDER);
if (count($m) > 0) {
     $t = preg_replace("/(\b$f(?:ing|er|es|s)?\b)/si","*****",$t);
}

UPDATE: Behold, the var_dump:

preg_quote($f) = string(5) "a\$\$"
$t = string(18) "You're such an a$$"
expression = string(29) "/\ba\$\$(?:ing|er|es|s)?\b/si"

UPDATE: This is only happening when words end with a symbol. I tested "a$$hole" and it’s fine, but "a$$" doesn't work.
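The observation above is consistent with how `\b` is defined: it only matches between a word character (`\w`) and a non-word character, so it cannot match after a trailing `$` that is followed by whitespace or the end of the string. A minimal sketch that reproduces this, assuming a hypothetical helper `matchesWithWordBoundary()` (not part of the original code) that wraps the same pattern:

```php
<?php
// Hypothetical helper (not from the original post): applies the question's
// \b-delimited pattern to $text and reports whether $word was found.
function matchesWithWordBoundary(string $word, string $text): bool
{
    $quoted = preg_quote($word, '/');
    return preg_match('/\b' . $quoted . '(?:ing|er|es|s)?\b/si', $text) === 1;
}

// Ends in a letter: "e" followed by end-of-string is a word boundary.
var_dump(matchesWithWordBoundary('a$$hole', 'such an a$$hole')); // bool(true)

// Ends in "$": "$" followed by end-of-string is NOT a word boundary.
var_dump(matchesWithWordBoundary('a$$', 'such an a$$'));         // bool(false)

// Side effect: "$" followed by a letter IS a boundary, so this matches.
var_dump(matchesWithWordBoundary('a$$', 'such an a$$x'));        // bool(true)
```

The third call shows that the trailing `\b` does more than miss matches here: for words ending in a symbol it only matches when a letter, digit or underscore follows.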

ANOTHER UPDATE: Try this simplified version, $words being a makeshift dictionary:

// Single quotes: in double quotes "a$$hole" would try to interpolate $hole
$words = array('a$$','asshole','a$$hole','f@ck','f#ck','f*ck');
$text = 'Input whatever you feel like here eg. a$$';

foreach ($words as $word) {
   $f = preg_quote($word,"/");
   $text = preg_replace("/\b".$f."(?:ing|er|es|s)?\b/si",
                         str_repeat("*",strlen($word)), // length of the unescaped word
                         $text);
}

I should expect to see "Input whatever you feel like here eg. ***" as a result.
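One possible way to get that result, offered as a sketch rather than a definitive fix, is to replace `\b` with the lookarounds `(?<!\w)` and `(?!\w)`, which only require that no word character sits directly next to the match. The function name `censor()` is an assumption introduced for illustration. It uses `preg_replace_callback` so that the number of asterisks follows the matched text rather than the escaped dictionary word.

```php
<?php
// Hypothetical helper (not from the original post): masks each dictionary
// word, plus the optional suffixes from the question, with asterisks.
function censor(string $text, array $words): string
{
    foreach ($words as $word) {
        $quoted = preg_quote($word, '/');
        // (?<!\w) : no word character immediately before the match
        // (?!\w)  : no word character immediately after the match
        $pattern = '/(?<!\w)' . $quoted . '(?:ing|er|es|s)?(?!\w)/si';
        $text = preg_replace_callback(
            $pattern,
            function (array $m): string {
                // One asterisk per matched byte (ASCII input assumed)
                return str_repeat('*', strlen($m[0]));
            },
            $text
        );
    }
    return $text;
}

// Single quotes keep PHP from treating "$" as the start of a variable.
$words = ['a$$', 'asshole', 'a$$hole', 'f@ck', 'f#ck', 'f*ck'];
echo censor('Input whatever you feel like here eg. a$$', $words), "\n";
// Input whatever you feel like here eg. ***
```

Unlike `\b`, the lookarounds behave the same whether the dictionary word ends in a letter or a symbol, so "a$$" is masked at the end of the input while "a$$hole" is still left to its own dictionary entry.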

Answer 1:

Cannot Be Done

I'm sorry, but this “problem” is truly impossible to solve. Consider these:

  • ꜰᴜᴄᴋ   is U+A730.1D1C.1D04.1D0B, "\N{LATIN LETTER SMALL CAPITAL F}\N{LATIN LETTER SMALL CAPITAL U}\N{LATIN LETTER SMALL CAPITAL C}\N{LATIN LETTER SMALL CAPITAL K}"
  • ᶠᵘᶜᵏ   is U+1DA0.1D58.1D9C.1D4F, "\N{MODIFIER LETTER SMALL F}\N{MODIFIER LETTER SMALL U}\N{MODIFIER LETTER SMALL C}\N{MODIFIER LETTER SMALL K}"