I need to perform several actions on an HTML file, such as removing a specific tag or deleting attributes. I decided to use HTML Parser, a Java library: http://htmlparser.sourceforge.net/
First of all, I want to remove all the style tags. I managed to get a NodeList containing all the style tags by doing this:
Parser parser = new Parser (url);
NodeList list = parser.parse (null);
NodeList styles = list.extractAllNodesThatMatch (new TagNameFilter ("STYLE"), true);
Now I don't know how to delete these style tags from the whole list of nodes. Do I have to iterate over the whole list?
After that, I want to be able to delete all the attributes inside the tags or delete only the alt attributes for example. Is there a method which does that automatically?
From the documentation, the Parser returns a list of trees that contains all of your HTML's nodes (think of the parser as the root node of a big tree of Node, where each "level" of that tree is a NodeList).
You can iterate through the tree recursively, test each node's type against StyleTag, and delete it from the appropriate NodeList when applicable. Keep descending into the tree recursively until you have visited all its nodes. NodeTreeWalker is your friend and can help you with the recursive tree traversal.
jsoup is another nice alternative that has a simpler interface (see this other question).
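The recursive walk-and-delete idea described above can be sketched as follows. To keep the example runnable without any extra jar, it uses the JDK's built-in org.w3c.dom API as a stand-in for HTML Parser's Node and NodeList, so it is not the HTML Parser API itself and it only accepts well-formed (XHTML-like) markup. The names StyleStripper, strip and clean are made up for this illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class: JDK DOM stand-in for the HTML Parser node tree.
class StyleStripper {

    // Visits every descendant of 'parent': <style> elements are removed from
    // their parent's child list, all other elements lose 'attributeToDrop'.
    static void strip(Node parent, String attributeToDrop) {
        NodeList children = parent.getChildNodes();
        // Walk backwards: the list is live, so removing a child would
        // otherwise shift the indexes that are still to be visited.
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // text, comments, etc. are left untouched
            }
            Element element = (Element) child;
            if ("style".equalsIgnoreCase(element.getTagName())) {
                parent.removeChild(element);
            } else {
                element.removeAttribute(attributeToDrop); // no-op if absent
                strip(element, attributeToDrop);          // descend one level
            }
        }
    }

    // Parses well-formed markup, strips it, and serializes it back to a string.
    static String clean(String xhtml, String attributeToDrop) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            strip(doc, attributeToDrop);

            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.METHOD, "xml");
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process markup", e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p{color:red}</style></head>"
                + "<body><img src=\"a.png\" alt=\"logo\"/><p>Hi</p></body></html>";
        System.out.println(clean(html, "alt"));
    }
}
```

The same shape should carry over to HTML Parser: loop over a NodeList, remove the nodes you do not want, and recurse into the children of the ones you keep. Iterating from the end of the list is a deliberate choice so that removals do not disturb the positions still to be checked.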