How do I use the java library “HTML Parser” to rem

Posted 2019-08-12 07:16

I need to perform several actions on an HTML file, such as removing specific tags or deleting attributes. I decided to use HTML Parser, a Java library: http://htmlparser.sourceforge.net/

First of all, I want to remove all the style tags. I managed to get a NodeList containing all the style tags by doing this:

Parser parser = new Parser(url);
NodeList list = parser.parse(null);
NodeList styles = list.extractAllNodesThatMatch(new TagNameFilter("STYLE"), true);

Now I don't know how to delete these style tags from the whole list of nodes. Do I have to go through the whole list?

After that, I want to be able to delete all the attributes inside the tags, or delete only the alt attributes, for example. Is there a method that does this automatically?

1 answer
Fickle 薄情
#2 · 2019-08-12 08:12

According to the documentation, the Parser returns a list of trees containing all of your HTML's nodes. Think of the parser as the root of a big tree of Node objects, where each "level" of that tree is a NodeList.

You can walk the tree recursively, test whether each node is a StyleTag, and remove it from the NodeList that contains it when it is. Keep descending until you have visited every node.
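
To make the recursion concrete, here is a minimal, library-independent sketch. SimpleNode and TreeCleaner are hypothetical names invented for this illustration; they are not part of HTML Parser. In HTML Parser the child list would presumably come from Node.getChildren(), which returns a NodeList, but check the Javadoc before relying on that mapping. The sketch also shows the same walk applied to attribute removal, which the question asks about.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a parsed HTML node; NOT HTML Parser's Node class.
final class SimpleNode {
    final String tagName;                                          // e.g. "style", "img"
    final Map<String, String> attributes = new LinkedHashMap<>();  // attribute name -> value
    final List<SimpleNode> children = new ArrayList<>();           // one "level" of the tree

    SimpleNode(String tagName) { this.tagName = tagName; }

    // Both helpers return this node so that trees can be built fluently.
    SimpleNode add(SimpleNode child) { children.add(child); return this; }
    SimpleNode attr(String name, String value) { attributes.put(name, value); return this; }
}

final class TreeCleaner {
    // Removes every descendant of parent whose tag name matches (case-insensitive),
    // at any depth. Returns the number of matching nodes removed; a removed node
    // takes its whole subtree with it, so the subtree is not searched further.
    static int removeTags(SimpleNode parent, String tagName) {
        int removed = 0;
        Iterator<SimpleNode> it = parent.children.iterator();
        while (it.hasNext()) {
            SimpleNode child = it.next();
            if (child.tagName.equalsIgnoreCase(tagName)) {
                it.remove();                           // delete from the list that holds it
                removed++;
            } else {
                removed += removeTags(child, tagName); // keep descending
            }
        }
        return removed;
    }

    // Removes the named attribute (exact, case-sensitive match) from node and all descendants.
    static void removeAttribute(SimpleNode node, String name) {
        node.attributes.remove(name);
        for (SimpleNode child : node.children) {
            removeAttribute(child, name);
        }
    }

    public static void main(String[] args) {
        SimpleNode html = new SimpleNode("html")
            .add(new SimpleNode("head").add(new SimpleNode("style")))
            .add(new SimpleNode("body")
                .add(new SimpleNode("img").attr("alt", "logo").attr("src", "a.png"))
                .add(new SimpleNode("style")));

        System.out.println(removeTags(html, "STYLE")); // prints 2
        removeAttribute(html, "alt");
        System.out.println(html.children.get(1).children.get(0).attributes); // prints {src=a.png}
    }
}
```

Removing through the iterator avoids the index-shifting and concurrent-modification problems that come from deleting elements of a list while looping over it, which is the main pitfall of this approach in any tree library.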

NodeTreeWalker is your friend and can help you with the recursive tree traversal.

jsoup is another nice alternative that has a simpler interface (see this other question).
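
For comparison, a minimal sketch of the same two tasks with jsoup, assuming the jsoup jar is on the classpath. The class name JsoupCleanup and the sample markup are made up for this illustration; the selectors are ordinary CSS selectors.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

final class JsoupCleanup {
    // Returns the document's HTML with all <style> elements and all alt attributes removed.
    static String clean(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("style").remove();          // drop every <style> element, wherever it sits
        doc.select("[alt]").removeAttr("alt"); // drop the alt attribute from every element that has one
        return doc.html();
    }

    public static void main(String[] args) {
        String input = "<html><head><style>p { color: red; }</style></head>"
                     + "<body><img src=\"a.png\" alt=\"logo\"><p>Hello</p></body></html>";
        System.out.println(clean(input));
    }
}
```

jsoup re-serializes the document, so whitespace and formatting in the output can differ from the input even where nothing was removed.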
