I'm using the following regex to remove all invisible characters from a UTF-8 string:
$string = preg_replace('/\p{C}+/u', '', $string);
This works fine, but how do I alter it so that it removes all invisible characters EXCEPT newlines? I tried some stuff using [^\n] etc. but it doesn't work.
Thanks for helping out!
Edit: newline character is '\n'
Use a "double negation":
$string = preg_replace('/[^\P{C}\n]+/u', '', $string);
Explanation:
- \P{C} is the same as [^\p{C}].
- Therefore [^\P{C}] is the same as \p{C}.
- Since we now have a negated character class, we can subtract other characters like \n from it.
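To see the double negation at work, here is a minimal sketch. The helper name strip_invisible_keep_newlines and the sample string are made up for illustration, and the "\u{...}" escape assumes PHP 7 or later.

```php
<?php
// Hypothetical helper (not from the original answer): removes every character
// in Unicode category C except the newline.
function strip_invisible_keep_newlines(string $input): ?string
{
    // [^\P{C}\n] matches a character that is NOT "outside category C"
    // and NOT a newline, i.e. an invisible character other than \n.
    // preg_replace returns null if $input is not valid UTF-8.
    return preg_replace('/[^\P{C}\n]+/u', '', $input);
}

// "a", zero-width space (U+200B, Cf), "b", newline, "c", bell (U+0007, Cc), "d"
$sample = "a\u{200B}b\nc\x07d";

var_dump(strip_invisible_keep_newlines($sample) === "ab\ncd"); // bool(true)
```

More characters can be kept the same way by adding them to the class, for example [^\P{C}\n\t] to keep tabs as well.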
By using a negative lookahead assertion you can match a character class except for what the assertion matches, so:
$res = preg_replace('/(?!\n)\p{C}/u', '', $input);
(PHP's dialect of regular expressions doesn't support character class subtraction, which would otherwise be another approach: [\p{C}-[\n]].)
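A small sketch of the lookahead approach follows. The helper name strip_with_lookahead and the sample string are hypothetical, and the u modifier is used on the assumption that the input is UTF-8, as in the question.

```php
<?php
// Hypothetical helper (for illustration only): strips category-C characters
// unless the character is a newline.
function strip_with_lookahead(string $input): ?string
{
    // (?!\n) fails wherever the next character is a newline, so \p{C}
    // only consumes the other invisible characters.
    // The u modifier makes PCRE treat pattern and subject as UTF-8;
    // preg_replace returns null if $input is not valid UTF-8.
    return preg_replace('/(?!\n)\p{C}/u', '', $input);
}

// "line1", BOM (U+FEFF, Cf), carriage return, newline, "line2", NUL byte
$sample = "line1\u{FEFF}\r\nline2\x00";

var_dump(strip_with_lookahead($sample) === "line1\nline2"); // bool(true)
```

Note that the carriage return is removed too, since only \n is excluded by the assertion.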
Before you do it, replace newlines (I suppose you are using something like \n) with a random string like ++++++++ (any string that will not be removed by your regular expression and does not naturally occur in your string in the first place), then run your preg_replace, then replace ++++++++ with \n again.
$string = str_replace("\n", '++++++++', $string); // Replace \n (double quotes, so "\n" is a real newline)
$string = preg_replace('/\p{C}+/u', '', $string); // Use your regexp
$string = str_replace('++++++++', "\n", $string); // Insert \n again
That should do it. If you are using <br/> instead of \n, simply use nl2br to preserve line breaks and replace <br/> instead of \n.
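The placeholder round trip only works if the placeholder really does not occur in the input. A fixed run of + characters can also merge with a + that sits right next to a newline. The sketch below is one possible way around that, not part of the original answer: the helper name strip_via_placeholder and the bracketed random token are assumptions for illustration, and random_bytes requires PHP 7 or later.

```php
<?php
// Hypothetical helper illustrating the placeholder approach.
function strip_via_placeholder(string $input): ?string
{
    // Build a visible placeholder that does not occur in the input.
    // The distinct opening and closing brackets keep it from blending
    // into neighbouring text when it is inserted.
    do {
        $placeholder = '[[nl:' . bin2hex(random_bytes(8)) . ']]';
    } while (strpos($input, $placeholder) !== false);

    // Hide the newlines. Note the double quotes: '\n' in single quotes
    // would be a backslash followed by the letter n.
    $protected = str_replace("\n", $placeholder, $input);

    // Remove all invisible characters (null if $input is not valid UTF-8).
    $stripped = preg_replace('/\p{C}+/u', '', $protected);
    if ($stripped === null) {
        return null;
    }

    // Put the newlines back.
    return str_replace($placeholder, "\n", $stripped);
}

var_dump(strip_via_placeholder("a\u{200B}\nb\x07") === "a\nb"); // bool(true)
```

This takes three passes over the string, so the single-regex variants are usually the simpler choice when only newlines need to survive.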