How do I programatically inspect a HTML document

I have a database full of small HTML documents and I need to programatically insert several into, say, a PDF document with iText or a Word document with Aspose.Words. I need to preserve any formatting within the HTML documents (within reason, honouring <b> tags is a must, CSS like <span style="blah"> is a nice-to-have).

Both iText and Aspose work (roughly) along the lines:

Document document = new Document( Size.A4, Aspect.PORTRAIT );

document.setFont( "Helvetica", 20, Font.BOLD );
document.insert( "some string" )
document.setBold( true );
document.insert( "A bold string" );

Therefore (I think) I need some kind of HTML parser which will I can inspect for strings and styles to insert into my document.

Can anybody suggest a good library or sensible approach to this problem? Platform is Java

标签： java html parsing

5条回答

Anthone

2楼-- · 2019-03-20 16:05

You'd probably be better off getting a component that goes directly from HTML to PDF, or Word, then to try to parse the HTML document and duplicate the formatting yourself based on the HTML. If you want to convert HTML to PDF, and you use .Net, Winnovative provides a good solution.

0人赞添加讨论(0) 举报

我想做一个坏孩纸

3楼-- · 2019-03-20 16:07

Adobe Acrobat Pro allows you to grab sites via HTTP and does an excellent job of preserving the style and layout. I haven't used it from an API aspect, but it may be worth looking into.

0人赞添加讨论(0) 举报

甜甜的少女心

4楼-- · 2019-03-20 16:21

If the HTML is "well-formed XML" (XHTML) why not use an XML parser (such as Xerces) and then inspect programatically the DOM tree.

0人赞添加讨论(0) 举报

Rolldiameter

5楼-- · 2019-03-20 16:26

HTMLparser is a good HTML parser.

I have used this to parse HTML on one of my projects.

You can write your own filters to parse the HTML for what you want, so the <br> tag shouldn't be difficult to parse out

Yo can parse out CSS usin the CssSelectorNodeFilter

0人赞添加讨论(0) 举报

Bombasti

6楼-- · 2019-03-20 16:29

Check out the flying saucer xhtml renderer- they render well-formed XHTML files to PDF, and let you control the output using CSS.

0人赞添加讨论(0) 举报

How do I programatically inspect a HTML document

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间