PDF report with embedded HTML

2019-02-12 15:34发布

We have a Java-based system that reads data from a database, merges individual data fields with preset XSL-FO tags and converts the result to PDF with Apache FOP.

In XSL-FO format it looks like this:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE Html [
<!ENTITY nbsp  "&#160;"> 
    <!-- all other entities -->
]>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:fo="http://www.w3.org/1999/XSL/Format">
    <xsl:output method="xml" indent="yes" />
    <xsl:template match="/">

        <fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format" xmlns:svg="http://www.w3.org/2000/svg" font-family="..." font-size="...">
            <fo:layout-master-set>          
                <fo:simple-page-master master-name="Letter Page" page-width="8.500in" page-height="11.000in">

                    <!-- appropriate settings -->

                </fo:simple-page-master>
            </fo:layout-master-set>
            <fo:page-sequence master-reference="Letter Page">

                <!-- some static content -->

            <fo:flow flow-name="xsl-region-body">
                    <fo:block>
                        <fo:table ...>
                            <fo:table-column ... />
                            <fo:table-body>
                                <fo:table-row>
                                    <fo:table-cell ...>
                                        <fo:block text-align="...">
                                            <fo:inline font-size="..." font-weight="...">
                                                <!-- Header / Title -->
                                            </fo:inline>
                                        </fo:block>
                                    </fo:table-cell>
                                </fo:table-row>
                            </fo:table-body>
                        </fo:table>
                    </fo:block>

                    <fo:block>

                        <fo:table ...>
                            <fo:table-column ... />
                            <fo:table-body> 
                                <fo:table-row>
                                    <fo:table-cell>
                                        <fo:block ...>
                                            <!-- Field A -->                                
                                        </fo:block>
                                    </fo:table-cell>
                                </fo:table-row>
                            </fo:table-body>
                        </fo:table>

                        <!-- Other fields in a very similar fashion as the above "Field A" -->

                    </fo:block>

                </fo:flow>      

            </fo:page-sequence>

        </fo:root>              

    </xsl:template>

</xsl:stylesheet>

Now I am looking for a way to allow some of the fields to contain static HTML-formatted content. This content will be generated by our HTML-enabled editor (something along the lines of CLEditor, CKEditor, etc.) or pasted from outside.

My plan is to follow the recipe from this JavaWorld article:

  • use JTidy to convert HTML-formatted string to proper XHTML
  • further modify xhtml2fo.xsl from Antenna House to remove all document-wide and page-wide transformations
  • apply this modified XSLT to my XHTML string (javax.xml.transform)
  • extract all the nodes under the root with XPath (javax.xml.xpath)
  • feed the result directly into existing XSL-FO document

I have a bare-bone version of such code and got the following error:

(Location of error unknown)org.apache.fop1.fo.ValidationException: "{http://www.w3.org/1999/XSL/Format}table-body" is not a valid child of "fo:block"! (No context info available)

My questions:

  1. What would be the way to troubleshoot this issue?
  2. Can <fo:block> serve as a generic container with other objects (including tables) nested inside?
  3. Is this an overall reasonable approach to solving the task?

If someone already "been there done that", please share your experience.

2条回答
霸刀☆藐视天下
2楼-- · 2019-02-12 15:54

The best way to troubleshoot is to use a validating viewer/editor to examine the XSL FO. Many (such as oXygen) will show you errors in XSL FO structure as you open them and they will describe the issue (just as the error reported).

In your case, you obviously have an fo:table-body as a child of fo:block. It cannot be. An fo:table-body have but one valid parent, fo:table. You are either missing the fo:table tag or you have erroneously inserted an fo:block in this position.

In my opinion, I might do things slightly different. I would put the XHTML content inline into the XSL FO right where you want it. Then I would create an identity transform that copies over all the content that is fo-based, but converts the XHTML parts using XSL. This way, you can actually step that transform in an XSL editor like oXygen and see where errors occur and exactly why. Like any other degugger.

Note: You may wish to look at other XSLs also, especially if your HTML may have any style="" CSS attributes. If this is the case it is not simple HTML, then you will need a better method for processing the HTML with CSS to FO.

http://www.cloudformatter.com/css2pdf is based on this complete transform. That general stylesheet is available here: http://xep.cloudformatter.com/doc/XSL/xeponline-fo-translate-2.xsl

I am the author of that stylesheet. It does much more than you ask, but has a fairly complex parsing recursion for converting CSS styling into XSL FO attributes.

查看更多
虎瘦雄心在
3楼-- · 2019-02-12 16:05
  1. If you use an XSLT debugger such as in oXygen or XML Spy, then you can step through the transformation. With oXygen -- not sure about XML Spy or other editors -- if you click on the markup in the debugger output, oXygen highlights the markup from both the source and the stylesheet that produced that node.

    Once you have the FO, the focheck framework (https://github.com/AntennaHouse/focheck) has the most complete validation of FO currently available.

  2. fo:block can contain tables, etc. In the XSL 1.1 spec, the definition of every FO includes a 'Contents' subsection that lists its allowed content. See, e.g., http://www.w3.org/TR/xsl11/#fo_block. The definitions of the 'parameter entities' in the content models are at http://www.w3.org/TR/xsl11/#d0e6532, but some FOs have additional restrictions in the text of their definitions.
  3. The article that you cite doesn't seem to have the 'extract all the nodes under the root with XPath' step, and I'm not sure why you need it. Other than that, it looks like a reasonable approach for doing the job using Java.

Instead of inserting the FO transformed from your JTidy-ed HTML into the static FO, you could replace your <!-- Field A --> with non-FO markup that provides enough information to make a reference to the field to insert. You can then make an XSLT stylesheet that transforms the template+references document into straight FO by doing an identity transform on the FO parts -- as in the answer from @kevin-brown -- and using the information in the reference markup to construct the URI to use with the document() function (http://www.w3.org/TR/xslt#document) to find the markup to insert.

If the FO for the field content is sitting on the disk, then using document() is straightforward. If it's not, then you'd have to do something like overriding the URIResolver used by the XSLT processor so that, rather than looking on the disk, it does the right thing to retrieve the content. You may even be able to have the JTidying happen as part of the URIResolver retrieving the HTML. You could also do the transformation to FO 'inside' the URIResolver or, also as @kevin-brown suggested, do it as a separate mode. If the transformation is done before or during the URIResolver retrieving the FO, then the 'main' transformation of template+references to FO just needs to extract the right part of the FO sub-document, e.g. document('constructed-URI')/fo:root/fo:page-sequence/*. However, if you're modifying the stylesheet from Antenna House, then you should be able to modify it to not produce an outer fo:root, etc., anyway.

I did something similar years ago with overriding the URI resolver for the libxslt XSLT processor for an XSLT-based server: the context for successive runs of the inner XSLT processor was saved as documents at special URIs and weren't necessarily written to the file system at all.

You could, instead, possibly write an extension function that does the lookup of the references to the fields. The Print and Page Layout Community Group @ W3C, for example, has produced extension functions for multiple XSLT processors that runs an FO processor in the middle of the XSLT transformation to get back the XML for an area tree for the formatted result. See http://www.w3.org/community/ppl/wiki/XSLTExtensions

查看更多
登录 后发表回答