How do I programmatically inspect an HTML document?

Posted 2019-03-20 16:02

I have a database full of small HTML documents and I need to programmatically insert several of them into, say, a PDF document with iText or a Word document with Aspose.Words. I need to preserve any formatting within the HTML documents (within reason: honouring <b> tags is a must, CSS like <span style="blah"> is a nice-to-have).

Both iText and Aspose work (roughly) along these lines:

Document document = new Document( Size.A4, Aspect.PORTRAIT );

document.setFont( "Helvetica", 20, Font.BOLD );
document.insert( "some string" );
document.setBold( true );
document.insert( "A bold string" );

Therefore (I think) I need some kind of HTML parser whose output I can inspect for strings and styles to insert into my document.

Can anybody suggest a good library or a sensible approach to this problem? The platform is Java.

5 answers
Anthone
#2 · 2019-03-20 16:05

You'd probably be better off getting a component that goes directly from HTML to PDF or Word than trying to parse the HTML document and duplicate the formatting yourself. If you want to convert HTML to PDF and you use .NET, Winnovative provides a good solution.

我想做一个坏孩纸
#3 · 2019-03-20 16:07

Adobe Acrobat Pro allows you to grab sites via HTTP and does an excellent job of preserving the style and layout. I haven't used it from an API aspect, but it may be worth looking into.

甜甜的少女心
#4 · 2019-03-20 16:21

If the HTML is well-formed XML (XHTML), why not use an XML parser (such as Xerces) and then inspect the DOM tree programmatically?
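
A rough sketch of that idea, using only the JDK's built-in XML parser rather than Xerces directly. The XhtmlRuns class, the Run type and the extractRuns method are hypothetical names made up for this illustration. It assumes each fragment is well-formed, has no DOCTYPE or XML declaration, and uses no HTML-only entities such as &nbsp; (a plain XML parser rejects those). It walks the DOM and records, for each piece of text, whether it sits inside a <b> or <strong> element:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XhtmlRuns {

    /** One piece of text plus the formatting in effect (hypothetical helper type). */
    static final class Run {
        final String text;
        final boolean bold;

        Run(String text, boolean bold) {
            this.text = text;
            this.bold = bold;
        }
    }

    /** Parses a well-formed XHTML fragment and returns its text runs in document order. */
    static List<Run> extractRuns(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Assumption: wrap the fragment in a single root element so that
        // fragments with several top-level nodes still parse as XML.
        String wrapped = "<root>" + xhtml + "</root>";
        Node root = builder.parse(new InputSource(new StringReader(wrapped)))
                .getDocumentElement();
        List<Run> runs = new ArrayList<>();
        walk(root, false, runs);
        return runs;
    }

    /** Depth-first walk that carries the "bold" state down from parent to child. */
    private static void walk(Node node, boolean bold, List<Run> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue();
            // Skip whitespace-only text such as indentation between tags.
            if (!text.trim().isEmpty()) {
                out.add(new Run(text, bold));
            }
            return;
        }
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return; // comments, processing instructions, etc. are ignored
        }
        String name = node.getNodeName();
        boolean nowBold = bold
                || name.equalsIgnoreCase("b")
                || name.equalsIgnoreCase("strong");
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), nowBold, out);
        }
    }

    public static void main(String[] args) throws Exception {
        for (Run run : extractRuns("some string <b>A bold string</b>")) {
            System.out.println((run.bold ? "[bold] " : "[plain] ") + run.text);
        }
    }
}
```

Each Run could then be mapped onto the setBold / insert style of calls from the question. Handling <span style="..."> would mean reading the style attribute in walk and carrying more state than a single boolean.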

Rolldiameter
#5 · 2019-03-20 16:26

HTMLParser is a good HTML parser.

I have used this to parse HTML on one of my projects.

You can write your own filters to parse the HTML for what you want, so the <br> tag shouldn't be difficult to parse out.

You can parse out CSS using the CssSelectorNodeFilter.

Bombasti
#6 · 2019-03-20 16:29

Check out the Flying Saucer XHTML renderer: it renders well-formed XHTML files to PDF and lets you control the output using CSS.
