Regex to strip out everything but words and numbers

Posted 2019-09-17 10:33

Question:

I'm trying to clean a POST string used in an AJAX request (sanitizing it before a DB query) so that it allows only alphanumeric characters, spaces (one between words, not several), "-", and Latin characters such as "ç" and "é", but without success. Can anyone help or point me in the right direction?

This is the regex I'm using so far:

$string = preg_replace('/^[a-z0-9 àáâãäåçèéêëìíîïðñòóôõöøùúû-]+$/', '', mb_strtolower(utf8_encode($_POST['q'])));

Thank you.

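Answer 1:

// Lowercase the input; utf8_encode() assumes $_POST['q'] arrives as ISO-8859-1
$string = mb_strtolower(utf8_encode($_POST['q']), 'UTF-8');
// Remove everything except the allowed characters (u flag: treat pattern and subject as UTF-8)
$string = preg_replace('/[^a-z0-9 àáâãäåçèéêëìíîïðñòóôõöøùúû-]+/u', '', $string);
// Collapse runs of spaces into one (PHP has no g modifier; preg_replace replaces all matches)
$string = preg_replace('/ +/', ' ', $string);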

Why not just use mysql_real_escape_string?



Answer 2:

$regEx = '/^[^\w\p{L}-]+$/iu';

\w - matches word characters (letters, digits, and the underscore)

\p{L} - matches a single Unicode code point in the 'Letter' category.

- at the end of the character class matches a single hyphen.

^ at the start of the character class negates it, so the regex matches the opposite of the class (anything you do not list).

+ outside of the character class says to match one or more such characters.

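^ and $ outside of the character class cause the engine to accept only matches that start at the beginning of the string and run to its end. With these anchors the pattern matches only when the entire string consists of disallowed characters; drop them to strip such characters from anywhere in the string.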

After the pattern, the i modifier makes matching case-insensitive and the u modifier tells the engine to treat the pattern and the subject as UTF-8. The g modifier originally present has been removed because PHP does not have one; whether matching is global depends on which function is called (for example, preg_replace replaces every match, and preg_match_all finds every match).
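To see what the anchors do in practice, here is a minimal sketch comparing the anchored pattern with an unanchored variant passed to preg_replace. The sample input and the variable names are made up for illustration, and adding a space to the character class is an assumption about the intended behaviour (the question wants spaces kept). The file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical sample input; not taken from the original post.
$input = 'Crème brûlée!! déjà-vu #1';

// Anchored pattern from the answer: it can only match when the WHOLE
// string is made of disallowed characters, so this input is left as-is.
$anchored = preg_replace('/^[^\w\p{L}-]+$/iu', '', $input);

// Unanchored variant (assumption): every run of disallowed characters is
// removed wherever it occurs. A space is added to the class so that
// words stay separated.
$unanchored = preg_replace('/[^\w\p{L} -]+/u', '', $input);

echo $anchored, PHP_EOL;
echo $unanchored, PHP_EOL;
```

The anchored form is better suited to validation with preg_match (accept or reject the whole string) than to stripping characters out of it.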



Answer 3:

$string = preg_replace('/[^a-z0-9 àáâãäåçèéêëìíîïðñòóôõöøùúû\-]/u', '', mb_strtolower(utf8_encode($_POST['q']), 'UTF-8'));
$string = preg_replace( '/ +/', ' ', $string );

This should do the trick. Note that:

  • the character class is negated by putting ^ inside the character class
  • you need the u flag when dealing with Unicode (UTF-8) strings, whether the non-ASCII characters appear in the pattern or in the subject
  • it's better to specify the character set explicitly in mb_* functions because otherwise they will fall back on your system defaults, and that may not be UTF-8.
  • the hyphen is escaped here (\- instead of -); at the very end of a character class an unescaped - is also treated literally, so the escape is optional but makes the intent explicit
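The points above can be pulled together into a small helper. This is only a sketch: sanitize_query() is a hypothetical name, and it assumes the input is already valid UTF-8 (so utf8_encode(), which converts from ISO-8859-1 and is deprecated as of PHP 8.2, is left out) and that the mbstring extension is available.

```php
<?php
// sanitize_query() is a hypothetical helper, not part of the original answers.
// Assumes $raw is valid UTF-8 and this file is saved as UTF-8.
function sanitize_query(string $raw): string
{
    // Lowercase with an explicit encoding instead of the system default.
    $s = mb_strtolower($raw, 'UTF-8');

    // Drop everything except a-z, 0-9, space, hyphen and the listed
    // accented letters. preg_replace() returns null on error (for example
    // invalid UTF-8), so fall back to an empty string.
    $s = preg_replace('/[^a-z0-9 àáâãäåçèéêëìíîïðñòóôõöøùúû\-]/u', '', $s) ?? '';

    // Collapse runs of spaces into one, then trim the ends.
    $s = preg_replace('/ +/', ' ', $s) ?? '';

    return trim($s);
}

echo sanitize_query('  Café   DÉJÀ-vu <script>!  '), PHP_EOL;
```

Filtering like this narrows what the query string may contain, but it is not a replacement for escaping or parameter binding when the value reaches the database.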