Need regex for utf8 multilingual search query

Posted 2019-09-19 17:37

Question:

I need a regex to use with PHP's preg_replace function on search form input. The cleaned input is then used in an SQL full-text search against a multilingual utf8 MySQL database. I considered using PHP's filter_var with FILTER_SANITIZE_STRING, but ended up with preg_replace.

I want these features:

  1. keep spaces, but collapse several spaces in a row into one
  2. keep double quotes, but collapse several in a row into one (so that I can use them for phrases with IN BOOLEAN MODE)
  3. keep -, + and ~, but collapse several in a row into one
  4. since it has to be multilingual, it should accept Unicode (utf8) letters too
  5. accents do not need to be considered; I do not have any.

This is what I have done:

$q = addslashes($q);
$q = preg_replace('/[^\w\d\s\s+\p{L}]/u', "", $q);

But the output is not what I want: it does not handle quotes (") and minus signs (-) the way I need. How can I write a safe query string to use in my search box?

Are there any better practices than using preg_replace?

Answer 1:

You need two preg_replace calls.

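1- Remove invalid characters (the u modifier makes the pattern match UTF-8 characters rather than single bytes, which \p{L} needs for multilingual input):

$q = preg_replace('/[^\p{L}\d\s~+"-]+/u', '', $q);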

2- Collapse runs of the same character (whitespace, ~, +, ", -) into a single one:

$q = preg_replace('/([\s~+"-])\1+/u', '$1', $q);
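The two steps can be wrapped into one helper. This is only a minimal sketch: the function name sanitize_search_query is hypothetical, and it assumes the input is valid UTF-8 (preg_replace returns null otherwise when the u modifier is set, which the sketch turns into an empty string).

```php
<?php
// Hypothetical helper; the name is illustrative and not part of the original answer.
function sanitize_search_query(string $q): string
{
    // Step 1: drop everything except letters, digits, whitespace and ~ + " -
    // The u modifier treats pattern and subject as UTF-8.
    $q = preg_replace('/[^\p{L}\d\s~+"-]+/u', '', $q) ?? '';

    // Step 2: collapse a run of the same kept character into a single one.
    $q = preg_replace('/([\s~+"-])\1+/u', '$1', $q) ?? '';

    return $q;
}

echo sanitize_search_query('hello   "world"" --foo ++bar; DROP'), PHP_EOL;
// prints: hello "world" -foo +bar DROP
```

This only normalizes the operators for IN BOOLEAN MODE. On the question of better practices: rather than relying on addslashes, the cleaned string is probably best passed to MySQL as a bound parameter of a prepared statement, for example with a placeholder inside AGAINST (? IN BOOLEAN MODE).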