
Reading XML document nodes containing special char

Published 2019-08-12 07:00

Question:

My code does not retrieve the entirety of element nodes that contain special characters. For example, for this node:

<theaterName>P&G Greenbelt</theaterName>

It would only retrieve "P" due to the ampersand. I need to retrieve the entire string.

Here's my code:

public List<String> findTheaters() {

    //Clear theaters application global
    FilmhopperActivity.tData.clearTheaters();

    ArrayList<String> theaters = new ArrayList<String>();

    NodeList theaterNodes = doc.getElementsByTagName("theaterName");

    for (int i = 0; i < theaterNodes.getLength(); i++) {

        Node node = theaterNodes.item(i);
        if (node.getNodeType() == Node.ELEMENT_NODE) {

            //Found theater, add to return array
            Element element = (Element) node;
            NodeList children = element.getChildNodes();
            String name = children.item(0).getNodeValue();
            theaters.add(name);

            //Logging
            android.util.Log.i("MoviefoneFetcher", "Theater found: " + name);

            //Add theater to application global
            Theater t = new Theater(name);
            FilmhopperActivity.tData.addTheater(t);
        }
    }

    return theaters;
}

I tried extending the code to concatenate the values of the remaining child items onto the name string, but it didn't work: I only got "P&".

...
String name = children.item(0).getNodeValue();
for (int j = 1; j < children.getLength() - 1; j++) {
    name += children.item(j).getNodeValue();
}

Thanks for your time.


UPDATE: Found a method called normalize() that you can call on a Node. It merges adjacent text child nodes, so children.item(0) then contains the text of all of them, including the ampersand!
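For anyone who wants to try this out, here is a minimal sketch of the effect. It assumes the element really does hold several adjacent text nodes, so it builds such an element by hand; the class and method names (NormalizeDemo, buildSplitElement, firstChildAfterNormalize) are made up for illustration. It also shows getTextContent(), which returns the concatenated text without modifying the tree.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class NormalizeDemo {

    // Builds <theaterName> with three separate text children, imitating a
    // tree in which the text was split around the ampersand (an assumption
    // about what the parser delivered, not taken from the real feed).
    static Element buildSplitElement() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element element = doc.createElement("theaterName");
            element.appendChild(doc.createTextNode("P"));
            element.appendChild(doc.createTextNode("&"));
            element.appendChild(doc.createTextNode("G Greenbelt"));
            doc.appendChild(element);
            return element;
        } catch (Exception e) {
            throw new IllegalStateException("could not create document", e);
        }
    }

    // Merges adjacent text nodes, then reads the (now single) first child.
    static String firstChildAfterNormalize(Element element) {
        element.normalize();
        return element.getFirstChild().getNodeValue();
    }

    public static void main(String[] args) {
        Element element = buildSplitElement();
        // Before normalize(): the first child is only the first fragment.
        System.out.println(element.getChildNodes().item(0).getNodeValue());
        // getTextContent() concatenates all descendant text nodes.
        System.out.println(element.getTextContent());
        // After normalize(): the first child holds the whole string.
        System.out.println(firstChildAfterNormalize(element));
    }
}
```

If you only need the string, getTextContent() is probably the simpler choice, since it does not depend on how many text nodes the parser produced.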

Answer 1:

The & is an escape character in XML. XML that looks like this:

<theaterName>P&G Greenbelt</theaterName>

should actually be rejected by the parser. Instead, it should look like this:

<theaterName>P&amp;G Greenbelt</theaterName>

There are a few such characters, such as < (&lt;), > (&gt;), " (&quot;) and ' (&apos;). There are also other ways to escape characters, such as via their Unicode value, as in &#x2022; or &#12345;.
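A quick way to check this yourself is to feed both forms to the standard JAXP DOM parser. This is only a sketch: EscapeDemo and rootText are hypothetical names, and returning null for rejected input is a shortcut for the demo rather than a recommended error-handling style.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EscapeDemo {

    // Returns the text of the root element, or null if the parser
    // rejects the input (for example because of a bare "&").
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Malformed input surfaces as a SAXParseException here.
            return null;
        }
    }

    public static void main(String[] args) {
        // Bare ampersand: not well-formed, so the parser refuses it.
        System.out.println(rootText("<theaterName>P&G Greenbelt</theaterName>"));
        // Escaped ampersand: parsed and decoded back to "&".
        System.out.println(rootText("<theaterName>P&amp;G Greenbelt</theaterName>"));
        // Numeric character reference: decoded to the bullet character.
        System.out.println(rootText("<t>&#x2022;</t>"));
    }
}
```

Note that the decoded value never contains the entity itself: the parser hands back "&", not "&amp;".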

For more information, the XML specification is fairly clear.

The other possibility, depending on how your tree was constructed, is that the character is escaped properly in the source, and the sample you showed is not the literal file content but how the data is represented in the tree.

For example, when using SAX to build a tree, entities (the &-thingies) are broken apart and delivered separately. This is because the SAX parser tries to return contiguous chunks of data, and when it gets to the escape character, it sends what it has, and starts a new chunk with the translated &-value. So you might need to combine consecutive text nodes in your tree to get the whole value.
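If you handle the SAX events yourself, the usual way to cope with this is to buffer every characters() call until the element ends. A minimal sketch, assuming the element is called theaterName as in the question; the handler class and its parse helper are hypothetical names:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class TheaterNameHandler extends DefaultHandler {
    final List<String> names = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inName;

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attrs) {
        if ("theaterName".equals(qName)) {
            inName = true;
            text.setLength(0); // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may call this several times for one element,
        // e.g. once before, once for, and once after "&amp;".
        if (inName) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("theaterName".equals(qName)) {
            names.add(text.toString()); // all chunks joined
            inName = false;
        }
    }

    // Convenience wrapper: parses a string and returns the collected names.
    static List<String> parse(String xml) {
        try {
            TheaterNameHandler handler = new TheaterNameHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.names;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

Whether a given parser actually splits the text at the entity is implementation-dependent, so the buffering is there to be safe either way.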



Answer 2:

The file you are trying to read is not valid XML. No self-respecting XML parser will accept it.

"I'm retrieving my XML dynamically from the web. What's the best way to replace all my escape characters after fetching the Document object?"

You are taking the wrong approach. The correct approach is to inform the people responsible for creating that file that it is invalid, and request that they fix it. Simply writing hacks to (try to) fix broken XML is not in your (or other people's) long-term interest.

If you decide to ignore this advice, then one approach is to read the file into a String, use String.replaceAll(regex, replacement) with a suitable regex to turn these bogus "&" characters into proper character entities ("&amp;"), then give the fixed XML string to the XML parser. You need to carefully design the regex so that it doesn't break valid character entities as an unwanted side-effect. A second approach is to do the parsing and replacement by hand, using appropriate heuristics to distinguish the bogus "&" characters from well-formed character entities.
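For illustration only, the regex approach might look roughly like the sketch below. The pattern is one possible heuristic, not a complete solution: it treats "&" as bogus unless it is followed by something shaped like a named entity, a decimal reference, or a hex reference ending in ";". The class and method names are hypothetical, and a precompiled Pattern is used in place of String.replaceAll, which behaves the same way.

```java
import java.util.regex.Pattern;

class AmpersandRepair {

    // Matches "&" that is NOT the start of something like "&amp;",
    // "&#38;" or "&#x26;". The negative lookahead leaves those intact.
    private static final Pattern BARE_AMP = Pattern.compile(
            "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Rewrites every bare "&" to "&amp;" before the text reaches the parser.
    static String escapeBareAmpersands(String xml) {
        return BARE_AMP.matcher(xml).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        String broken = "<theaterName>P&G Greenbelt</theaterName>";
        System.out.println(escapeBareAmpersands(broken));
    }
}
```

The heuristic has known blind spots. Text such as "Q&A;" looks like an entity reference and is left alone, entity names containing characters like "-" or "." are not recognized, and an ampersand inside a CDATA section would be rewritten even though it was fine. That is the kind of fragility the next paragraph warns about.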

But this all costs you development and test time, and slows down your software. Worse, there is a significant risk that your code will be fragile as a result of your efforts to compensate for the bad input files. (And guess who will get the blame ...)



Answer 3:

You need to either encode it properly or wrap it in a CDATA section. I'd recommend the former.
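This applies to whoever produces the XML. As a rough sketch of both options, assuming the producer builds the document with the standard DOM and serializes it with a JAXP Transformer (TheaterXmlWriter and write are hypothetical names), the serializer takes care of the encoding:

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class TheaterXmlWriter {

    // Serializes <theaterName> with the given name, either as an ordinary
    // text node (escaped on output) or as a CDATA section.
    static String write(String name, boolean useCdata) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element element = doc.createElement("theaterName");
            element.appendChild(useCdata
                    ? doc.createCDATASection(name)
                    : doc.createTextNode(name));
            doc.appendChild(element);

            Transformer transformer =
                    TransformerFactory.newInstance().newTransformer();
            // Leave out the <?xml ...?> header to keep the output short.
            transformer.setOutputProperty(
                    OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(write("P&G Greenbelt", false)); // escaped form
        System.out.println(write("P&G Greenbelt", true));  // CDATA form
    }
}
```

Either output parses back to the same string, so the consumer does not need to know which form the producer chose.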



Answer 4:

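From the XML specification (section 4.6, Predefined Entities): a set of general entities (amp, lt, gt, apos, quot) is specified for escaping delimiters such as < and &. The numeric character references "&#60;" and "&#38;" may also be used to escape < and & when they occur in character data.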
All XML processors MUST recognize these entities whether they are declared or not. For interoperability, valid XML documents SHOULD declare these entities, like any others, before using them. If the entities lt or amp are declared, they MUST be declared as internal entities whose replacement text is a character reference to the respective character (less-than sign or ampersand) being escaped; the double escaping is REQUIRED for these entities so that references to them produce a well-formed result. If the entities gt, apos, or quot are declared, they MUST be declared as internal entities whose replacement text is the single character being escaped (or a character reference to that character; the double escaping here is OPTIONAL but harmless). For example:

<!ENTITY lt     "&#38;#60;">
<!ENTITY gt     "&#62;">
<!ENTITY amp    "&#38;#38;">
<!ENTITY apos   "&#39;">
<!ENTITY quot   "&#34;">