Flying Saucer not recognizing HTML entities

Posted 2019-08-27 20:17

I'm trying to use an HTML file as a template for a PDF, but Flying Saucer isn't recognizing HTML named entities (&trade;, &nbsp;, etc.). If I replace them with their hexadecimal character references (for example &#x2122; for &trade;), the program runs fine.

My code is as follows:

    public static InputStream create(String content) throws PDFUtilException {
        try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
            ITextRenderer iTextRenderer = new ITextRenderer();

            // Wrap the default replaced-element factory with the custom media factory.
            iTextRenderer.getSharedContext().setReplacedElementFactory(
                    new MediaReplacedElementFactory(
                            iTextRenderer.getSharedContext().getReplacedElementFactory()));

            // The content is parsed as XML here, so undeclared entities fail at this step.
            iTextRenderer.setDocumentFromString(closeOutTags(content), null);
            iTextRenderer.layout();
            iTextRenderer.createPDF(baos);

            // Hand the rendered PDF bytes back as a stream.
            return new ByteArrayInputStream(baos.toByteArray());
        } catch (IOException | DocumentException e) {
            throw new PDFUtilException("Unable to create PDF", e);
        }
    }

Thanks,

Oliver

2 Answers
forever°为你锁心
#2 · 2019-08-27 20:58

Michael is correct in saying that Flying Saucer needs well-formed XML, but if your only problem is the predefined HTML entities (which aren't part of XML), then you can declare them yourself at the beginning of your document like so:

<!DOCTYPE html [
  <!ENTITY % htmlentities SYSTEM "https://www.w3.org/2003/entities/2007/htmlmathml-f.ent">
  %htmlentities;
]>
<!-- your XHTML text following here -->

This pulls the entity declarations in from their official URL via the htmlentities parameter entity; the %htmlentities; reference then includes (in effect "executes") those declarations. If you only need trade and nbsp, or if Flying Saucer won't allow you to fetch URLs from the net, you can declare them manually instead:

<!DOCTYPE html [
  <!ENTITY trade "&#x02122;">
  <!ENTITY nbsp "&#x000A0;">
]>
<!-- your XHTML text following here -->
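If you'd rather not edit the template itself, the same declarations can be prepended in code. Below is a minimal sketch using only the JDK's XML parser, so it shows that the entities resolve without involving Flying Saucer at all. The class and method names (EntityPrologSketch, withEntities, textOf) are made up for illustration, and the sketch assumes the template has no DOCTYPE or XML declaration of its own.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of Flying Saucer.
class EntityPrologSketch {

    // Internal DTD subset declaring only the entities the template uses.
    static final String PROLOG =
            "<!DOCTYPE html [\n"
          + "  <!ENTITY trade \"&#x2122;\">\n"
          + "  <!ENTITY nbsp \"&#xA0;\">\n"
          + "]>\n";

    // Prepends the declarations. Assumption: xhtml has no DOCTYPE or
    // XML declaration of its own, otherwise the result is not well-formed.
    static String withEntities(String xhtml) {
        return PROLOG + xhtml;
    }

    // Parses the string with the JDK parser and returns the root element's text.
    static String textOf(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (SAXException e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><p>Acme&trade;&nbsp;Inc.</p></body></html>";
        System.out.println(textOf(withEntities(xhtml)));
    }
}
```

In the code from the question, the equivalent change would presumably be to pass withEntities(closeOutTags(content)) to setDocumentFromString instead of the bare content.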

Now, if you actually have a proper HTML (not XHTML) file, you won't be able to use an XML processor directly on it, because HTML uses markup features not supported by XML (for example, empty elements such as the img element, omitted tags, and attribute short forms). But you can use an SGML processor to first convert the HTML to XHTML (XML), and then run Flying Saucer on the resulting XML file (SGML is the superset of both HTML and XML, and the original markup language on which both are based). The process involves an HTML DTD grammar, such as the original W3C HTML4 DTD (from 1999) or my HTML5 DTD on sgmljs.net, plus an SGML processor. Before going into details, though, first check whether merely adding the entity declarations as described above solves your problem.

做自己的国王
#3 · 2019-08-27 21:00

I had never heard of Flying Saucer until today, but the first sentence of the documentation says "Flying Saucer is a pure-Java library for rendering arbitrary well-formed XML (or XHTML)", which suggests rather strongly that it expects well-formed XML input rather than HTML.
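To see whether a given template meets that requirement before handing it to the renderer, you can run it through a plain XML parser first. This is only a rough sketch using the JDK's built-in parser; the class and method names (WellFormedCheckSketch, firstError) are invented for the example, and the exact wording of the error message depends on the parser implementation.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Flying Saucer.
class WellFormedCheckSketch {

    // Returns the parser's message for the first fatal error,
    // or an empty Optional if the input is well-formed XML.
    static Optional<String> firstError(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return Optional.empty();
        } catch (SAXException e) {
            return Optional.of(String.valueOf(e.getMessage()));
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Undeclared named entity: rejected by an XML parser.
        System.out.println(firstError("<p>Acme&trade;</p>"));
        // Numeric character reference: accepted.
        System.out.println(firstError("<p>Acme&#x2122;</p>"));
        // HTML-style unclosed tag: rejected.
        System.out.println(firstError("<p>line one<br>line two</p>"));
    }
}
```

The first and third inputs are ordinary HTML but not well-formed XML, which matches what the question describes: the named entity fails while the hex reference goes through.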
