The escapeXml function is converting ѭ Ѯ to &#1133; &#1134;, which I guess it should not. From what I read, it supports only the five basic XML entities (gt, lt, quot, amp, apos).
Is there a function that converts only these five basic XML entities?
public String escapeXml(String s) {
    // Escape '&' first so the entities written by the later steps are not escaped again.
    return s.replace("&", "&amp;").replace(">", "&gt;").replace("<", "&lt;").replace("\"", "&quot;").replace("'", "&apos;");
}
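In a chained replacement like this, the order matters: the ampersand has to be handled first. Here is a minimal sketch of what goes wrong otherwise. The class and method names are made up for illustration, and only three of the five characters are handled to keep it short.

```java
// Hypothetical demo class; names are illustrative and not part of any library.
final class EscapeOrderDemo {

    // Ampersand first: entities produced by the later steps are left intact.
    static String ampersandFirst(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    // Ampersand last: the '&' of a freshly written "&lt;" gets escaped a second time.
    static String ampersandLast(String s) {
        return s.replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("&", "&amp;");
    }

    public static void main(String[] args) {
        String input = "a < b && c";
        System.out.println(ampersandFirst(input)); // a &lt; b &amp;&amp; c
        System.out.println(ampersandLast(input));  // a &amp;lt; b &amp;&amp; c
    }
}
```

The second result is double-escaped, so an XML parser would read it back as the literal text `&lt;` instead of `<`.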
The javadoc for the 3.1 version of the library says:
Note that Unicode characters greater than 0x7f are as of 3.0, no longer escaped. If you still wish this functionality, you can achieve it via the following: StringEscapeUtils.ESCAPE_XML.with( NumericEntityEscaper.between(0x7f, Integer.MAX_VALUE) );
So you are probably using an older version of the library. Update your dependencies (or reimplement the escaping yourself: it's not rocket science).
The javadoc of StringEscapeUtils.escapeXml says that we have to use
StringEscapeUtils.ESCAPE_XML.with( new UnicodeEscaper(Range.between(0x7f, Integer.MAX_VALUE)) );
But NumericEntityEscaper has to be used instead of UnicodeEscaper. UnicodeEscaper changes everything to \u1234-style sequences, whereas NumericEntityEscaper escapes as &#123;, which is the expected result.
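To make the difference between the two output styles concrete, here is a hand-rolled sketch using only the JDK. It is not how commons-lang implements its escapers; the class and method names are hypothetical, and the 0x7f lower bound is simply copied from the snippet above.

```java
// Hand-rolled illustration only; this does not use or reproduce commons-lang code.
final class EscapeStyleDemo {

    // Writes every code point from 0x7f upwards as a decimal XML character reference.
    static String asNumericEntities(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp >= 0x7f) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    // Writes every UTF-16 unit from 0x7f upwards as a Java-style backslash-u sequence.
    static String asJavaEscapes(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c >= 0x7f) {
                out.append(String.format("\\u%04X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The two Cyrillic letters from the question, written as source escapes.
        String input = "\u046D \u046E";
        System.out.println(asNumericEntities(input)); // &#1133; &#1134;
        System.out.println(asJavaEscapes(input));
    }
}
```

Only the first form is a character reference that an XML parser understands. The second is Java source syntax and would show up as literal text in the document.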
package mypackage;
import org.apache.commons.lang3.StringEscapeUtils;
import org.apache.commons.lang3.text.translate.CharSequenceTranslator;
import org.apache.commons.lang3.text.translate.NumericEntityEscaper;
public class XmlEscaper {
public static void main(final String[] args) {
final String xmlToEscape = "<hello>Hi</hello>" + "_ _" + "__ __" + "___ ___" + "after "; // the line cont
// no Unicode escape
final String escapedXml = StringEscapeUtils.escapeXml(xmlToEscape);
// escape Unicode as numeric codes. For instance, escape non-breaking space as &#160;
final CharSequenceTranslator translator = StringEscapeUtils.ESCAPE_XML.with( NumericEntityEscaper.between(0x7f, Integer.MAX_VALUE) );
final String escapedXmlWithUnicode = translator.translate(xmlToEscape);
System.out.println("xmlToEscape: " + xmlToEscape);
System.out.println("escapedXml: " + escapedXml); // does not escape Unicode characters like non-breaking space
System.out.println("escapedXml with unicode: " + escapedXmlWithUnicode); // escapes Unicode characters
}
}
Now that XML documents are commonly encoded in UTF-8, keeping non-ASCII characters readable is sometimes preferred. This should work, and the String is recomposed only once.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Matches only the five characters that have predefined XML entities.
private static final Pattern ESCAPE_XML_CHARS = Pattern.compile("[\"&'<>]");

public static String escapeXml(String s) {
    Matcher m = ESCAPE_XML_CHARS.matcher(s);
    StringBuffer buf = new StringBuffer();
    while (m.find()) {
        // Each match is a single character; pick its entity.
        switch (m.group().codePointAt(0)) {
            case '"':
                m.appendReplacement(buf, "&quot;");
                break;
            case '&':
                m.appendReplacement(buf, "&amp;");
                break;
            case '\'':
                m.appendReplacement(buf, "&apos;");
                break;
            case '<':
                m.appendReplacement(buf, "&lt;");
                break;
            case '>':
                m.appendReplacement(buf, "&gt;");
                break;
        }
    }
    // Copy whatever follows the last match.
    m.appendTail(buf);
    return buf.toString();
}
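For comparison, the same single-pass idea can be sketched without a regex, as a plain loop over the characters. This is only one possible variant, not a library method, and BasicXmlEscaper is a made-up name. Everything outside the five special characters, including non-ASCII text, is copied through unchanged.

```java
// Hypothetical helper; a sketch of a single-pass escaper for the five basic XML entities.
final class BasicXmlEscaper {

    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  out.append("&quot;"); break;
                case '&':  out.append("&amp;");  break;
                case '\'': out.append("&apos;"); break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                default:   out.append(c); // all other characters pass through as-is
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("<a href=\"x\">Tom & 'Jerry'</a>"));
        // The Cyrillic letters from the question stay readable.
        System.out.println(escape("\u046D \u046E"));
    }
}
```

Because each input character is examined exactly once, there is no ordering problem with the ampersand here.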