Escape xml characters within nodes of string xml i

Posted 2019-08-04 18:50

Question:

I have a string of XML data. I need to escape the values within the nodes, but not the nodes themselves.

Ex:
<node1>R&R</node1>
should escape to:
<node1>R&amp;R</node1>
should not escape to:
&lt;node1&gt;R&amp;R&lt;/node1&gt;

I have been working on this for the last couple of days, but haven't had much success. I'm not an expert with Java, but the following are things that I have tried that will not work:

  1. Parsing the XML string into a document. This does not work because the data within the nodes contains invalid XML.
  2. Escaping all of the characters. This does not work because the program receiving the data will not accept it in that format.
  3. Escaping all characters and then parsing the result into a document. This throws all sorts of errors.

Any help would be much appreciated.

Answer 1:

You could use regular expression matching to find the text between each opening and closing tag, and process each match in a loop. In this example I've used Apache Commons Lang to do the XML escaping.

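// Imports needed by this method; StringEscapeUtils is from Apache Commons Lang 3.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.lang3.StringEscapeUtils;

public String sanitiseXml(String xml)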
{
    // Match the pattern <something>text</something>
    Pattern xmlCleanerPattern = Pattern.compile("(<[^/<>]*>)([^<>]*)(</[^<>]*>)");

    StringBuilder xmlStringBuilder = new StringBuilder();

    Matcher matcher = xmlCleanerPattern.matcher(xml);
    int lastEnd = 0;
    while (matcher.find())
    {
        // Include any non-matching text between this result and the previous result
        if (matcher.start() > lastEnd) {
            xmlStringBuilder.append(xml.substring(lastEnd, matcher.start()));
        }
        lastEnd = matcher.end();

        // Sanitise the characters inside the tags and append the sanitised version
        String cleanText = StringEscapeUtils.escapeXml10(matcher.group(2));
        xmlStringBuilder.append(matcher.group(1)).append(cleanText).append(matcher.group(3));
    }
    // Include any leftover text after the last result
    xmlStringBuilder.append(xml.substring(lastEnd));

    return xmlStringBuilder.toString();
}

This looks for matches of <something>text</something>, captures the opening tag, the contained text, and the closing tag, sanitises the contained text, and then puts it all back together.
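One case the method above does not cover is input that is already partly escaped: running it over R&amp;R would produce R&amp;amp;R. The following is a minimal sketch of the same matching technique using only the standard library, assuming the only character that needs fixing is a bare ampersand (the element pattern already excludes < and > from the text). The class name XmlTextEscaper, the method escapeText, and the BARE_AMP pattern are made up for illustration and are not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlTextEscaper {
    // Same element pattern as in the answer: <tag>text</tag> with no nested markup.
    private static final Pattern ELEMENT =
            Pattern.compile("(<[^/<>]*>)([^<>]*)(</[^<>]*>)");

    // Assumption: an ampersand is "bare" unless it already starts one of the
    // five predefined entities or a numeric character reference.
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9a-fA-F]+);)");

    public static String escapeText(String xml) {
        Matcher m = ELEMENT.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Escape only the bare ampersands in the text between the tags.
            String text = BARE_AMP.matcher(m.group(2)).replaceAll("&amp;");
            // quoteReplacement keeps '$' and '\' in the data from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(m.group(1) + text + m.group(3)));
        }
        // Copy whatever follows the last match unchanged.
        m.appendTail(out);
        return out.toString();
    }
}
```

With this sketch, XmlTextEscaper.escapeText("<node1>R&R</node1>") gives <node1>R&amp;R</node1>, and text that already contains &amp; is left alone.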



Answer 2:

The issue is that <node1>R&R</node1> is not XML.

  • Using an XML parser will not help. A conforming XML parser is required to reject input like this rather than repair it.

  • You can try a different parser that was made to parse "dirty" HTML.
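The first point can be checked directly. Here is a small sketch, assuming the JDK's built-in DOM parser; the class name WellFormedCheck and the method isWellFormed are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {
    // Returns true if the parser accepts the string, false if it rejects it.
    public static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // A bare '&' in element text ends up here as a parse error.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Here isWellFormed("<node1>R&R</node1>") returns false, while isWellFormed("<node1>R&amp;R</node1>") returns true.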

But I think the best solution would be to get correct XML in the first place:

  • Fix the XML source by using an XML library to create the data (and never build XML by string concatenation).

  • If the data is provided for you, create an XML Schema and insist on the validity of your input data.
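As an illustration of the "use an XML library" advice, here is a minimal sketch using the JDK's StAX writer, which escapes special characters in text for you. The class name NodeWriter and the method element are hypothetical; a real producer would write a whole document rather than a single element.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class NodeWriter {
    // Writes <name>text</name>, letting the writer escape the text content.
    public static String element(String name, String text) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement(name);
            writer.writeCharacters(text); // '&' and '<' in the text are escaped here
            writer.writeEndElement();
            writer.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling NodeWriter.element("node1", "R&R") produces <node1>R&amp;R</node1>, so the source never emits a bare ampersand in the first place.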



Answer 3:

What you've presented isn't XML. It's XPL. XPL is structured just like XML but allows XML's "special characters" in text fields. You can easily do the XPL to XML conversions with the XPL utilities. http://hll.nu



Answer 4:

I've used Nameless Voice's answer, but with a regex of:

Pattern xmlCleanerPattern = Pattern.compile("(<[^<>]*>)(.*)(<\\/[^<>]*>)");

I find this captures all the values within the nodes themselves a bit better.
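One thing to watch with this pattern is that (.*) is greedy, so when several elements sit on the same line it runs to the last closing tag on that line. The sketch below compares it with a reluctant (.*?) variant; the class GreedyDemo, the RELUCTANT pattern, and the firstText helper are assumptions made for this comparison, not part of the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {
    // The pattern from this answer: greedy (.*) for the text.
    public static final Pattern GREEDY =
            Pattern.compile("(<[^<>]*>)(.*)(<\\/[^<>]*>)");

    // Hypothetical variant: reluctant (.*?) stops at the first closing tag.
    public static final Pattern RELUCTANT =
            Pattern.compile("(<[^<>]*>)(.*?)(<\\/[^<>]*>)");

    // Returns what the pattern treats as the text of the first match, or null if nothing matches.
    public static String firstText(Pattern pattern, String xml) {
        Matcher m = pattern.matcher(xml);
        return m.find() ? m.group(2) : null;
    }
}
```

For the input <a>R&R</a><b>S&S</b>, firstText(GreedyDemo.GREEDY, ...) returns R&R</a><b>S&S, so the inner tags would be escaped too, whereas the reluctant variant returns R&R. Neither variant copes with nested elements on one line, so this only works if each line holds a single leaf element.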