How do I use the java library “HTML Parser” to rem

Posted 2019-08-12 07:16

I need to perform several actions on an HTML file, such as removing a specific tag or deleting attributes. I decided to use HTML Parser, a Java library: http://htmlparser.sourceforge.net/

First of all, I want to remove all the style tags. I managed to get a NodeList containing all the style tags by doing this:

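import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
Parser parser = new Parser(url);      // url: the page to parse
NodeList list = parser.parse(null);   // null filter: return every top-level node
NodeList styles = list.extractAllNodesThatMatch(new TagNameFilter("STYLE"), true);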

Now I don't know how to delete these style tags from the whole list of nodes. Do I have to walk through the whole list?

After that, I want to be able to delete all the attributes inside the tags, or delete only the alt attributes, for example. Is there a method that does this automatically?

Tags: java html parsing tags

1 answer

Fickle 薄情

#2 · 2019-08-12 08:12

According to the documentation, the Parser returns a list of trees that contains all of your HTML's nodes (think of the parser as the root node of a big tree of Node objects, where each "level" of that tree is a NodeList).

You can iterate through the tree recursively, test each node's type against StyleTag and delete it from the appropriate NodeList when applicable. Keep descending into the tree recursively until you visit all its nodes.

NodeTreeWalker is your friend and can help you with the recursive tree traversal.
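The recursive removal described above can be sketched without the library on the classpath. The SimpleNode class below is a hypothetical stand-in, not part of HTML Parser: its children list plays the role of a NodeList and its attributes map plays the role of a tag's attributes. With the real library, the corresponding calls would presumably be Node.getChildren(), NodeList.elementAt(int) and NodeList.remove(int), plus Tag.removeAttribute(String) for the attribute part; treat those names as assumptions and check them against the Javadoc of your version.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a parsed tag node; NOT an HTML Parser class.
class SimpleNode {
    final String tagName;
    final Map<String, String> attributes = new LinkedHashMap<>();
    final List<SimpleNode> children = new ArrayList<>();

    SimpleNode(String tagName) { this.tagName = tagName; }

    SimpleNode attr(String key, String value) { attributes.put(key, value); return this; }
    SimpleNode child(SimpleNode node) { children.add(node); return this; }

    // Simplified serialisation (always emits a closing tag) to make the result visible.
    String toHtml() {
        String name = tagName.toLowerCase();
        StringBuilder sb = new StringBuilder("<" + name);
        attributes.forEach((k, v) -> sb.append(' ').append(k).append("=\"").append(v).append('"'));
        sb.append('>');
        for (SimpleNode c : children) sb.append(c.toHtml());
        return sb.append("</").append(name).append('>').toString();
    }
}

class StyleStripDemo {
    // Removes every node called tagName from this list, then descends into the survivors.
    static void removeTags(List<SimpleNode> nodes, String tagName) {
        // Walk backwards so a removal does not shift the indices still to be visited.
        for (int i = nodes.size() - 1; i >= 0; i--) {
            SimpleNode node = nodes.get(i);
            if (node.tagName.equalsIgnoreCase(tagName)) {
                nodes.remove(i);
            } else {
                removeTags(node.children, tagName);
            }
        }
    }

    // Deletes one attribute (for example "alt") from every node in the tree.
    static void removeAttribute(List<SimpleNode> nodes, String key) {
        for (SimpleNode node : nodes) {
            node.attributes.remove(key);
            removeAttribute(node.children, key);
        }
    }

    // Small sample tree: <html><head><style/></head><body><img/><div><style/></div></body></html>
    static SimpleNode sample() {
        return new SimpleNode("HTML")
            .child(new SimpleNode("HEAD").child(new SimpleNode("STYLE")))
            .child(new SimpleNode("BODY")
                .child(new SimpleNode("IMG").attr("src", "a.png").attr("alt", "logo"))
                .child(new SimpleNode("DIV").child(new SimpleNode("STYLE"))));
    }

    public static void main(String[] args) {
        SimpleNode root = sample();
        removeTags(root.children, "STYLE");
        removeAttribute(root.children, "alt");
        // prints <html><head></head><body><img src="a.png"></img><div></div></body></html>
        System.out.println(root.toHtml());
    }
}
```

The point to carry over is that a match is deleted from the list that directly contains it (its parent's children), which is why collecting the style nodes into a separate list does not by itself change the tree.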

jsoup is another nice alternative that has a simpler interface (see this other question).
