Hi, I would like to remove all invalid XML characters from a string, using a regular expression with the string.replace method, like
line.replace(regExp,"");
What is the right regExp to use?
An invalid XML character is anything that does not match:
[#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Thanks.
Jun's solution, simplified. Using StringBuffer#appendCodePoint(int), I need no char current or String#charAt(int). I can tell a surrogate pair by checking if codePoint is greater than 0xFFFF.
(It is not necessary to do the i++, since a low surrogate wouldn't pass the filter. But then one would re-use the code for different code points and it would fail. I prefer programming to hacking.)
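A minimal sketch of the code-point loop described above, not the answer's original code. The class and method names are hypothetical, the filter assumes the XML 1.0 Char production, and StringBuilder is swapped in for StringBuffer (both provide appendCodePoint).

```java
final class XmlCleaner {
    // Hypothetical helper: keeps only code points allowed by the XML 1.0 Char production.
    static String stripInvalidXmlChars(String in) {
        if (in == null || in.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            int codePoint = in.codePointAt(i);
            if (codePoint > 0xFFFF) {
                i++; // a supplementary code point occupies two chars; skip the low surrogate
            }
            if (codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                    || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                    || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                    || (codePoint >= 0x10000 && codePoint <= 0x10FFFF)) {
                out.appendCodePoint(codePoint);
            }
            // Anything else (control characters, lone surrogates, 0xFFFE, 0xFFFF) is dropped.
        }
        return out.toString();
    }
}
```

Usage would look like XmlCleaner.stripInvalidXmlChars(line). An unpaired surrogate is returned by codePointAt as its own value in the 0xD800-0xDFFF range, so it fails the filter and is removed.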
From Mark McLaren's Weblog
All the answers so far only replace the characters themselves. But sometimes an XML document contains invalid XML entity sequences, which also cause errors. For example, if your XML contains a character reference that expands to code 0x2, a Java XML parser will throw:
Illegal character entity: expansion character (code 0x2 at ...
Here is a simple Java program that can replace those invalid entity sequences.
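One way such a program could look, as a sketch only: it assumes the problem is limited to numeric character references (decimal or hexadecimal) and that references to code points outside the XML 1.0 Char production should simply be deleted. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmlEntityCleaner {
    // Matches numeric character references such as a decimal or &#x...; hexadecimal form.
    private static final Pattern CHAR_REF = Pattern.compile("&#(x[0-9A-Fa-f]+|[0-9]+);");

    // True if the code point is allowed by the XML 1.0 Char production.
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Hypothetical helper: removes references that expand to invalid characters.
    static String stripInvalidCharRefs(String xml) {
        Matcher m = CHAR_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            int cp;
            try {
                cp = body.charAt(0) == 'x'
                        ? Integer.parseInt(body.substring(1), 16)
                        : Integer.parseInt(body);
            } catch (NumberFormatException e) {
                cp = -1; // value does not fit in an int, so it cannot be a valid code point
            }
            String replacement = isValidXml10(cp) ? m.group() : "";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Named entities such as &amp;amp; are left untouched because the pattern only matches references that start with &amp;#.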
Should we consider surrogate characters? Otherwise '(current >= 0x10000) && (current <= 0x10FFFF)' will never be true, because a Java char is 16 bits wide and cannot hold a value above 0xFFFF.
I also tested this, and the regex approach seems slower than the following loop.
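A possible surrogate-aware version of a char-based loop, sketched here under the assumption that the XML 1.0 Char ranges are the intended filter. The class and method names are hypothetical, and no performance claim is attached to this sketch.

```java
final class SurrogateAwareFilter {
    // Hypothetical helper: iterates by char and treats a valid surrogate pair as one character.
    static String strip(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char current = in.charAt(i);
            if (Character.isHighSurrogate(current)
                    && i + 1 < in.length()
                    && Character.isLowSurrogate(in.charAt(i + 1))) {
                // A well-formed pair encodes a code point in 0x10000-0x10FFFF: keep both chars.
                out.append(current).append(in.charAt(++i));
            } else if (current == 0x9 || current == 0xA || current == 0xD
                    || (current >= 0x20 && current <= 0xD7FF)
                    || (current >= 0xE000 && current <= 0xFFFD)) {
                out.append(current);
            }
            // Everything else, including unpaired surrogates, is dropped.
        }
        return out.toString();
    }
}
```

The pair check replaces the comparison against 0x10000, which a single char can never satisfy.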
From Best way to encode text data for XML in Java?
Java's regex supports supplementary characters, so you can specify those high ranges with two UTF-16 encoded chars.
Here is the pattern for removing characters that are illegal in XML 1.0:
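A sketch of such a pattern, assuming the XML 1.0 Char production (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). The class, constant, and method names are hypothetical. The supplementary range is written as surrogate pairs, which Java's regex engine reads as whole code points.

```java
final class Xml10Filter {
    // Negated character class: matches any code point that is NOT a legal XML 1.0 character.
    static final String XML10_INVALID = "[^"
            + "\u0009\r\n"                 // tab, carriage return, line feed
            + "\u0020-\uD7FF"
            + "\uE000-\uFFFD"
            + "\uD800\uDC00-\uDBFF\uDFFF"  // U+10000 to U+10FFFF as surrogate pairs
            + "]";

    static String clean(String s) {
        return s.replaceAll(XML10_INVALID, "");
    }
}
```

For repeated use, compiling the pattern once with java.util.regex.Pattern.compile and reusing it avoids recompiling on every call.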
Most people will want the XML 1.0 version.
Here is the pattern for removing characters that are illegal in XML 1.1:
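A sketch of the XML 1.1 counterpart, assuming the XML 1.1 Char production (#x1-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). Names are again hypothetical.

```java
final class Xml11Filter {
    // Negated character class: matches any code point that is NOT a legal XML 1.1 character.
    static final String XML11_INVALID = "[^"
            + "\u0001-\uD7FF"
            + "\uE000-\uFFFD"
            + "\uD800\uDC00-\uDBFF\uDFFF"  // U+10000 to U+10FFFF as surrogate pairs
            + "]+";

    static String clean(String s) {
        return s.replaceAll(XML11_INVALID, "");
    }
}
```

Compared with the XML 1.0 version, this one keeps most control characters and removes only NUL, unpaired surrogates, 0xFFFE, and 0xFFFF.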
You will need to use String.replaceAll(...) and not String.replace(...), because replace treats its first argument as a literal string rather than a regular expression.