Regular expressions for a range of Unicode code points

Published 2019-01-14 23:11

Question:

I'm trying to strip all characters from a string except:

  • Alphanumeric characters
  • Dollar sign ($)
  • Underscore (_)
  • Unicode characters between code points U+0080 and U+FFFF

I've got the first three conditions by doing this:

preg_replace('/[^a-zA-Z\d$_]+/', '', $foo);

How do I go about matching the fourth condition? I looked at using \X but there has to be a better way than listing out 65000+ characters.

Answer 1:

You can use:

$foo = preg_replace('/[^\w$\x{0080}-\x{FFFF}]+/u', '', $foo);
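  • \w is equivalent to [a-zA-Z0-9_] for ASCII input; on PHP builds where /u also enables PCRE's Unicode properties, it additionally matches other Unicode letters and digits
  • \x{0080}-\x{FFFF} matches characters between code points U+0080 and U+FFFF
  • /u enables Unicode (UTF-8) support in the regex, which the \x{...} escapes above U+00FF require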
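
As a rough sketch of how this could be wrapped and tried out, the snippet below spells out a-zA-Z0-9 instead of \w so that the allowed set does not depend on how \w behaves under /u. The function name strip_disallowed and the sample strings are made up for illustration and are not part of the original answer.

```php
<?php
// Hypothetical helper: keeps ASCII letters, digits, "$", "_" and any
// character in U+0080..U+FFFF, and removes everything else.
function strip_disallowed(string $input): string
{
    // /u makes PCRE treat both the pattern and the subject as UTF-8.
    $result = preg_replace('/[^a-zA-Z0-9$_\x{0080}-\x{FFFF}]+/u', '', $input);

    // preg_replace() returns null on failure, e.g. for invalid UTF-8 input.
    if ($result === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $result;
}

echo strip_disallowed('foo-bar baz!'), PHP_EOL;   // ASCII punctuation and spaces are removed
echo strip_disallowed('$price_é = 10€'), PHP_EOL; // "é" (U+00E9) and "€" (U+20AC) are kept
```

Characters above U+FFFF, such as most emoji, fall outside the range and are removed as well.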