Parse XML multi line string in Java

Posted 2019-07-18 17:18

Question:

I'm trying to parse a multi-line XML attribute in Java using the classic DOM API. Parsing works fine, but it destroys the line breaks: when I render the parsed string, the line breaks have been replaced by single spaces.

<string key="help_text" value="This is a multi line long
                               text. This should be parsed
                               and rendered in multiple lines" />

To get the attribute I'm using:

attributes.getNamedItem("value").getTextContent()

If I pass a manually typed string containing "\n" to the render method, the text is drawn as intended.

Any ideas?

Answer 1:

I've used JDOM for this in the past. It saves you a lot of trouble when decoding multi-line attributes and makes XML parsing and writing in Java much easier. JDOM is also compatible with Android development and is really small (a single jar file).

https://github.com/hunterhacker/jdom



Answer 2:

According to the XML specification, the XML parser MUST normalize attribute whitespace, which includes replacing each line break character with a space. In other words, if you need line breaks to be preserved, you cannot store them as literal line breaks in an attribute value.
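The normalization described above can be reproduced with the JDK's built-in DOM parser. The following is a minimal sketch, not the asker's actual code: the class name `AttributeNormalizationDemo` and the helper `readValueAttribute` are made up for illustration, and the XML is a shortened version of the element from the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class AttributeNormalizationDemo {

    // Hypothetical helper: parses the XML string and returns the "value"
    // attribute of the root element, as the DOM parser reports it.
    public static String readValueAttribute(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute("value");
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // The attribute value contains a literal line break (\n in the Java
        // string becomes a real newline character inside the XML text).
        String xml = "<string key=\"help_text\" value=\"first line\nsecond line\" />";

        String value = readValueAttribute(xml);

        // The parser has replaced the literal line break with a space.
        System.out.println(value.contains("\n")); // false
        System.out.println(value);                // first line second line
    }
}
```

Any conforming parser should behave the same way, so switching parser libraries alone is unlikely to bring the literal line breaks back.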

In general, whitespace handling in XML is a lot of trouble. In particular, the difference between CR, LF, and CRLF isn't preserved anywhere.

You might find it better to encode newlines in attributes as &lt;br /&gt; (that is, the encoded version of <br />) and then decode them later.
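A minimal sketch of that encode-then-decode idea follows, assuming the attribute is written with `&lt;br /&gt;` wherever a line break is wanted. The class name `BrMarkerDemo` and the helpers `readValueAttribute` and `decodeLineBreaks` are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class BrMarkerDemo {

    // Hypothetical helper: returns the "value" attribute of the root element.
    public static String readValueAttribute(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute("value");
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Hypothetical helper: turns the "<br />" marker back into a line break.
    // The parser has already decoded &lt; and &gt;, so the marker appears
    // in the attribute value as the literal text "<br />".
    public static String decodeLineBreaks(String attributeValue) {
        return attributeValue.replace("<br />", "\n");
    }

    public static void main(String[] args) {
        String xml = "<string key=\"help_text\" "
                + "value=\"This is a multi line long&lt;br /&gt;text.\" />";

        String rendered = decodeLineBreaks(readValueAttribute(xml));

        // Prints the text on two lines.
        System.out.println(rendered);
    }
}
```

The marker has to match exactly what was written into the file; a file that mixes `<br/>` and `<br />` would need a more tolerant replacement.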



Answer 3:

From the XML specification, section 3.3.3 Attribute-Value Normalization. You will see that every white space character is normalized to a single space:

Before the value of an attribute is passed to the application or checked for validity, the XML processor MUST normalize the attribute value by applying the algorithm below, or by using some other method such that the value passed to the application is the same as that produced by the algorithm.

1. All line breaks MUST have been normalized on input to #xA as described in 2.11 End-of-Line Handling, so the rest of this algorithm operates on text normalized in this way.

2. Begin with a normalized value consisting of the empty string.

3. For each character, entity reference, or character reference in the unnormalized attribute value, beginning with the first and continuing to the last, do the following:

   - For a character reference, append the referenced character to the normalized value.

   - For an entity reference, recursively apply step 3 of this algorithm to the replacement text of the entity.

   - For a white space character (#x20, #xD, #xA, #x9), append a space character (#x20) to the normalized value.

   - For another character, append the character to the normalized value.
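One consequence of the quoted algorithm is worth spelling out: a character reference is appended as-is, so a line break written as `&#10;` is not turned into a space the way a literal line break is. The sketch below illustrates this with the JDK's DOM parser. It assumes you are able to change how the XML file is written; the class name `CharacterReferenceDemo` and the helper `readValueAttribute` are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CharacterReferenceDemo {

    // Hypothetical helper: returns the "value" attribute of the root element.
    public static String readValueAttribute(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute("value");
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // The line break is written as the character reference &#10;
        // instead of a literal newline inside the attribute value.
        String xml = "<string key=\"help_text\" "
                + "value=\"first line&#10;second line\" />";

        String value = readValueAttribute(xml);

        // The referenced character (#xA) is kept in the attribute value.
        System.out.println(value.contains("\n")); // true
    }
}
```

This only helps if the file is produced by a tool, or by hand, that writes `&#10;` explicitly; an editor that inserts real line breaks into the attribute will still have them normalized to spaces.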