I have a string of XML data. I need to escape the values within the nodes, but not the nodes themselves.
Ex:
<node1>R&R</node1>
should escape to:
<node1>R&amp;R</node1>
should not escape to:
&lt;node1&gt;R&amp;R&lt;/node1&gt;
I have been working on this for the last couple of days, but haven't had much success. I'm not an expert with Java, but the following are things that I have tried that will not work:
- Parsing string xml into a document. Does not work since the data within the nodes contains invalid xml data.
- Escaping all of the characters. Does not work since the program receiving this data will not accept it in this format.
- Escaping all characters then parsing into document. Throws all sorts of errors.
Any help would be much appreciated.
You could use a regular expression to find the text between each opening and closing tag, then loop through the matches and escape each one. In this example I've used Apache Commons Lang to do the XML escaping.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.lang3.StringEscapeUtils;

public String sanitiseXml(String xml)
{
    // Match the pattern <something>text</something>
    Pattern xmlCleanerPattern = Pattern.compile("(<[^/<>]*>)([^<>]*)(</[^<>]*>)");
    StringBuilder xmlStringBuilder = new StringBuilder();
    Matcher matcher = xmlCleanerPattern.matcher(xml);
    int lastEnd = 0;
    while (matcher.find())
    {
        // Include any non-matching text between this result and the previous result
        if (matcher.start() > lastEnd) {
            xmlStringBuilder.append(xml.substring(lastEnd, matcher.start()));
        }
        lastEnd = matcher.end();
        // Sanitise the characters inside the tags and append the sanitised version
        String cleanText = StringEscapeUtils.escapeXml10(matcher.group(2));
        xmlStringBuilder.append(matcher.group(1)).append(cleanText).append(matcher.group(3));
    }
    // Include any leftover text after the last result
    xmlStringBuilder.append(xml.substring(lastEnd));
    return xmlStringBuilder.toString();
}
This looks for matches of <something>text</something>, captures the opening tag, the contained text, and the closing tag, sanitises the contained text, and then puts the three pieces back together.
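If pulling in Commons Lang is not an option, the same idea can be sketched with the JDK alone. The class name NodeTextEscaper and the helper escapeText below are made up for illustration; escapeText is only a rough stand-in for StringEscapeUtils.escapeXml10 that handles the five predefined XML entities and nothing else.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NodeTextEscaper {
    // Same shape as the pattern above: opening tag, text, closing tag.
    private static final Pattern ELEMENT =
            Pattern.compile("(<[^/<>]*>)([^<>]*)(</[^<>]*>)");

    // Hypothetical helper: escapes only the five predefined XML entities.
    static String escapeText(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    static String sanitiseXml(String xml) {
        Matcher m = ELEMENT.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = m.group(1) + escapeText(m.group(2)) + m.group(3);
            // quoteReplacement stops '$' and '\' in the data from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        // Copy whatever follows the last match unchanged
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints <root><node1>R&amp;R</node1><node2>a &quot;b&quot;</node2></root>
        System.out.println(sanitiseXml("<root><node1>R&R</node1><node2>a \"b\"</node2></root>"));
    }
}
```

Note that text which is already escaped gets escaped again (R&amp;R would become R&amp;amp;R), so this only suits input where the values are known to be raw.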
The issue is that <node1>R&R</node1> is not well-formed XML.
But I think the best solution would be to get correct XML in the first place:
- Fix the XML source by using an XML library to create the data. (And never use String concatenation to create XML.)
- If the data is provided for you, create an XML Schema and insist on the validity of your input data.
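As a rough illustration of the first point, here is a minimal sketch using the JDK's StAX writer (javax.xml.stream). The class name XmlWriterSketch and the method writeNode are hypothetical; the point is only that writeCharacters escapes the text for you, so the producer never has to think about it.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class XmlWriterSketch {
    // Writes <name>value</name>, letting the writer escape the value.
    static String writeNode(String name, String value) {
        try {
            StringWriter buffer = new StringWriter();
            XMLStreamWriter writer =
                    XMLOutputFactory.newInstance().createXMLStreamWriter(buffer);
            writer.writeStartElement(name);
            writer.writeCharacters(value); // '&', '<' and '>' are escaped here
            writer.writeEndElement();
            writer.close();
            return buffer.toString();
        } catch (XMLStreamException e) {
            // Wrapped so callers of this sketch need no checked-exception handling
            throw new IllegalStateException("Could not write XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeNode("node1", "R&R"));
    }
}
```

With the JDK's default implementation this should produce a node whose text reads R&amp;R, which a parser will accept.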
What you've presented isn't XML. It's XPL. XPL is structured just like XML but allows XML's "special characters" in text fields. You can easily do the XPL to XML conversions with the XPL utilities. http://hll.nu
I've used Nameless Voice's answer, but with this regex:
Pattern xmlCleanerPattern = Pattern.compile("(<[^<>]*>)(.*)(<\\/[^<>]*>)");
I find this captures the values within the nodes a bit better.
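Before choosing between the two patterns, it may help to compare what each one captures as the value on a couple of small inputs. The class GreedyGroupDemo and the helper firstValue are made up for this comparison; the two regexes are the ones from the answers above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyGroupDemo {
    // Returns group 2 (the "value") of the first match, or null when nothing matches.
    static String firstValue(String regex, String xml) {
        Matcher m = Pattern.compile(regex).matcher(xml);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String strict = "(<[^/<>]*>)([^<>]*)(</[^<>]*>)";
        String greedy = "(<[^<>]*>)(.*)(<\\/[^<>]*>)";

        // A value containing '<' is only matched by the greedy pattern
        System.out.println(firstValue(strict, "<a>1 < 2</a>"));     // null
        System.out.println(firstValue(greedy, "<a>1 < 2</a>"));     // 1 < 2

        // Two nodes on one line: the greedy pattern swallows the tags in between
        System.out.println(firstValue(strict, "<a>x</a><b>y</b>")); // x
        System.out.println(firstValue(greedy, "<a>x</a><b>y</b>")); // x</a><b>y
    }
}
```

So the `.*` version does pick up values that contain angle brackets, but when several nodes share a line it treats the tags between them as text, and escaping that text would escape those tags too. Which trade-off is acceptable depends on how the incoming data is laid out.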