Keep XML entities in output (jSoup)

Posted 2019-07-11 03:31

Question:

I'm using jsoup to do some XML processing. The problem is that it replaces numeric XML entities, e.g. &#187;, with named HTML entities: &raquo;

How can I keep the original (XML) entities?

Groovy script:

import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.nodes.Entities
import org.jsoup.parser.Parser

String HTML_STRING = '''
    <html>
    <div></div>
    <div>Some text &#187;</div>
    </html>
  '''

Document doc = Jsoup.parse(new ByteArrayInputStream(HTML_STRING.getBytes("UTF-8")), "UTF-8", "", Parser.xmlParser())
doc.outputSettings().charset("UTF-8")
doc.outputSettings().escapeMode(Entities.EscapeMode.base)


println doc.toString()

Result:

<html> 
 <div></div> 
 <div>
  Some text &raquo;
 </div> 
</html>

If I use Entities.EscapeMode.xhtml the result is:

<html> 
 <div></div> 
 <div>
  Some text »
 </div> 
</html>

Thanks.

Answer 1:

You want to use a combination of EscapeMode.xhtml (which is the default if you use the XML parser, not the HTML parser), and ascii as the output character set.

The default output charset is UTF-8, and jsoup will prefer to not use entity escapes if the output charset supports the character directly (because why waste CPU and bandwidth with unnecessary escapes).
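The rule described above can be sketched with the standard library alone. This is a simplified illustration of the idea, not jsoup's actual implementation: the class name EntityEscapeSketch and its escape method are hypothetical, and only &, < and > get named escapes here. A character is written as-is when the output charset can encode it, and as a numeric character reference otherwise.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class EntityEscapeSketch {
    // Hypothetical helper: escapes text for a given output charset.
    // Characters the charset can encode are copied through unchanged;
    // anything else becomes a hexadecimal numeric character reference.
    static String escape(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            String s = new String(Character.toChars(cp));
            if (cp == '&') {
                out.append("&amp;");
            } else if (cp == '<') {
                out.append("&lt;");
            } else if (cp == '>') {
                out.append("&gt;");
            } else if (encoder.canEncode(s)) {
                out.append(s); // charset supports it, so no escape is needed
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Some text \u00BB"; // \u00BB is the » character
        System.out.println(escape(text, "UTF-8"));    // » stays a literal character
        System.out.println(escape(text, "US-ASCII")); // » becomes &#xbb;
    }
}
```

With UTF-8 the » survives as a literal character, which matches the EscapeMode.xhtml result shown in the question; with US-ASCII it cannot be encoded and is escaped numerically.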

If you change the output charset to ascii using Document.OutputSettings.charset("ascii") you'll get the output you want.

You also probably want to set the output syntax to XML if you are working with XML, as otherwise the HTML parser will try to make the output conform to HTML and can munge your XML DOM tree.
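A minimal sketch of that combination, assuming jsoup is on the classpath. The class name KeepNumericEntities and the method serializeAsAscii are made up for illustration; the jsoup calls are the ones named above plus Document.OutputSettings.syntax. Note that in the jsoup versions I am aware of, the reference is written in hexadecimal (&#xbb;) rather than decimal (&#187;); the two are equivalent to any XML parser, but treat the exact form as an assumption to check against your version.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;
import org.jsoup.parser.Parser;

class KeepNumericEntities {
    // Hypothetical helper: parses the input with the XML parser and
    // serializes it with an ASCII output charset, so characters outside
    // ASCII are emitted as numeric references instead of literals.
    static String serializeAsAscii(String xml) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        doc.outputSettings()
           .syntax(Document.OutputSettings.Syntax.xml)  // XML output rules
           .escapeMode(Entities.EscapeMode.xhtml)       // no HTML named entities such as &raquo;
           .charset("ascii");                           // forces escapes for non-ASCII characters
        return doc.toString();
    }

    public static void main(String[] args) {
        String xml = "<html><div></div><div>Some text &#187;</div></html>";
        System.out.println(serializeAsAscii(xml));
    }
}
```

The escape mode and syntax are set explicitly here so the sketch does not depend on which defaults the XML parser applies.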

(Source: author of jsoup)