I have a database full of small HTML documents and I need to programmatically insert several into, say, a PDF document with iText or a Word document with Aspose.Words. I need to preserve any formatting within the HTML documents (within reason: honouring <b> tags is a must, CSS like <span style="blah"> is a nice-to-have).
Both iText and Aspose work (roughly) along these lines:
Document document = new Document( Size.A4, Aspect.PORTRAIT );
document.setFont( "Helvetica", 20, Font.BOLD );
document.insert( "some string" );
document.setBold( true );
document.insert( "A bold string" );
Therefore (I think) I need some kind of HTML parser which I can inspect for strings and styles to insert into my document.
Can anybody suggest a good library or sensible approach to this problem? The platform is Java.
You'd probably be better off getting a component that goes directly from HTML to PDF or Word than trying to parse the HTML document and duplicate the formatting yourself. If you want to convert HTML to PDF and you use .NET, Winnovative provides a good solution.
Adobe Acrobat Pro allows you to grab sites via HTTP and does an excellent job of preserving the style and layout. I haven't used it from an API aspect, but it may be worth looking into.
If the HTML is "well-formed XML" (XHTML), why not use an XML parser (such as Xerces) and then inspect the DOM tree programmatically?
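As a rough illustration of that idea, here is a minimal sketch that uses the JDK's built-in JAXP parser (rather than Xerces directly) to walk the DOM and collect text runs together with a bold flag. The names XhtmlRuns, Run and parse are made up for this example, only <b> and <strong> are handled, and it assumes the stored fragments really are well-formed XML (HTML-only entities such as &nbsp; would make the parser fail).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XhtmlRuns {

    /** One stretch of text plus the formatting in effect for it (hypothetical type). */
    public static final class Run {
        public final String text;
        public final boolean bold;

        Run(String text, boolean bold) {
            this.text = text;
            this.bold = bold;
        }

        @Override
        public String toString() {
            return (bold ? "[bold]  " : "[plain] ") + text;
        }
    }

    /** Parses a well-formed XHTML fragment into a flat list of styled runs. */
    public static List<Run> parse(String xhtmlFragment) throws Exception {
        // Wrap the fragment so that several top-level nodes still form one XML document.
        String wrapped = "<root>" + xhtmlFragment + "</root>";
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        org.w3c.dom.Document dom = builder.parse(new InputSource(new StringReader(wrapped)));

        List<Run> runs = new ArrayList<>();
        walk(dom.getDocumentElement(), false, runs);
        return runs;
    }

    /** Depth-first walk that carries the inherited bold state down the tree. */
    private static void walk(Node node, boolean bold, List<Run> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue();
            if (!text.isEmpty()) {
                out.add(new Run(text, bold));
            }
            return;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String name = node.getNodeName().toLowerCase();
            boolean childBold = bold || name.equals("b") || name.equals("strong");
            NodeList children = node.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                walk(children.item(i), childBold, out);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        for (Run run : parse("some string <b>A bold string</b>")) {
            System.out.println(run);
        }
    }
}
```

Each Run could then be handed to whatever the target library offers in place of the setBold / insert calls from the pseudocode in the question. Italics, font sizes or inline style attributes would need extra state carried through walk in the same way.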
HTMLparser is a good HTML parser.
I have used this to parse HTML on one of my projects.
You can write your own filters to parse the HTML for what you want, so the <br> tag shouldn't be difficult to parse out. You can parse out CSS using the CssSelectorNodeFilter.
Check out the Flying Saucer XHTML renderer. It renders well-formed XHTML files to PDF and lets you control the output using CSS.