I want to extract all the plain text from an HTML/XHTML document, analyse or amend it, and then write the changed text back where needed. Can I do this with HTML::Parser, or should I use XML::Parser?
Does anyone know of any good examples?
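Say you want to replace all instances of PERL with Perl in someone's StackOverflow user page. You could do so with HTML::Parser by rewriting only the text events and passing the rest of the markup through unchanged.
The output is what we expect, with one exception that sticks out like a sore thumb: a remaining "PERL:". It is part of an element attribute, not a text section, so it is left alone.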
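A minimal sketch of that idea, assuming HTML::Parser's version 3 API. The helper name replace_in_text and the sample markup are hypothetical, and it works on an in-memory string rather than a fetched page, so treat it as an illustration, not the original code:

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: rewrite PERL -> Perl in text segments only.
sub replace_in_text {
    my ($html) = @_;
    my $out = '';

    my $parser = HTML::Parser->new(
        api_version   => 3,
        unbroken_text => 1,    # deliver each run of text in one piece
        # Text events: edit the raw source text, but leave the content
        # of <script>/<style> (flagged by is_cdata) untouched.
        text_h => [
            sub {
                my ($text, $is_cdata) = @_;
                $text =~ s/\bPERL\b/Perl/g unless $is_cdata;
                $out .= $text;
            },
            'text, is_cdata',
        ],
        # Every other event (tags, comments, declarations): copy as-is.
        default_h => [ sub { $out .= shift }, 'text' ],
    );

    $parser->parse($html);
    $parser->eof;
    return $out;
}

print replace_in_text('<p title="PERL: notes">I like PERL.</p>'), "\n";
```

The title attribute keeps its "PERL:" because attributes arrive as part of the start-tag event, which this sketch copies through without modification.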
You should also look at Web::Scraper.
I find this module easier than the HTML::Parser modules, but it helps if you are familiar with XPath.
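As a rough illustration of the Web::Scraper style, here is a small sketch that uses XPath expressions. The sample page and the result keys (heading, modules) are made up for the example:

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical sample page; scrape() also accepts a URI object.
my $html = <<'HTML';
<html><body>
  <h1>Perl parsers</h1>
  <ul>
    <li class="module">HTML::Parser</li>
    <li class="module">XML::Twig</li>
    <li class="note">not a module</li>
  </ul>
</body></html>
HTML

my $page = scraper {
    # A selector starting with "/" is treated as XPath, others as CSS.
    process '//h1',                  'heading'   => 'TEXT';
    process '//li[@class="module"]', 'modules[]' => 'TEXT';
};

my $result = $page->scrape($html);

print "$result->{heading}\n";
print "  $_\n" for @{ $result->{modules} };
```

The result is a plain hash reference, with 'modules[]' collecting every match into an array, so it is well suited to extraction but less so to editing text in place.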
How well HTML parses is unpredictable and depends on the actual pages: like PDF, HTML is display-oriented rather than data-oriented.
The approach of HTML::Parser is based on tokens and callbacks. I find it very convenient when you have particularly complex conditions on the context in which the data you wish to extract or change occurs.
Otherwise I prefer a tree-based approach. HTML::TreeBuilder::XPath (ultimately based on HTML::Parser) allows you to find nodes with XPath. It returns HTML::Element objects. The documentation is a little sparse (well, spread over a couple of modules), but it is still the quick way to dig into HTML.
If you deal with pure XML, XML::Twig is an outstanding parser: it has very good memory management and lets you combine the tree and stream approaches. And the documentation is very good.
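To give a feel for the tree-based approach, here is a short sketch with HTML::TreeBuilder::XPath. The helper name extract_links and the sample markup are hypothetical:

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper: return [text, href] pairs for every link.
sub extract_links {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($html);
    $tree->eof;

    my @links;
    # findnodes() returns HTML::Element objects in list context.
    for my $link ($tree->findnodes('//a[@href]')) {
        push @links, [ $link->as_text, $link->attr('href') ];
    }

    $tree->delete;    # release the tree when done
    return @links;
}

my @links = extract_links('<p>See <a href="/docs">the <b>docs</b></a>.</p>');
printf "%s -> %s\n", @$_ for @links;
```

Because each match is an HTML::Element, the methods documented in that module (as_text, attr and so on) apply to it, which is why the documentation ends up spread over several modules.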
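A small sketch of how XML::Twig mixes the two styles: a handler fires once each element is fully parsed (tree-like access to that element), and purging afterwards keeps memory use low (stream-like processing). The element names and the helper book_titles are invented for the example:

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: collect "id: title" for each <book> element.
sub book_titles {
    my ($xml) = @_;
    my @titles;

    my $twig = XML::Twig->new(
        twig_handlers => {
            # Called when a complete <book> element has been parsed.
            book => sub {
                my ($t, $book) = @_;
                push @titles,
                    $book->att('id') . ': ' . $book->first_child_text('title');
                $t->purge;    # discard what has been parsed so far
            },
        },
    );

    $twig->parse($xml);
    return @titles;
}

my $xml = '<library>'
        . '<book id="1"><title>Learning Perl</title></book>'
        . '<book id="2"><title>Programming Perl</title></book>'
        . '</library>';

print "$_\n" for book_titles($xml);
```

For small documents you can skip the purge call and work on the whole tree after parsing instead.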
Which module you should use depends on what you are trying to do. For starters, HTML::Parser comes with great examples which also include a script that extracts plain text from an HTML document.
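The following is not the script shipped with HTML::Parser, just a minimal sketch of plain-text extraction along the same lines. The helper name plain_text and the whitespace handling are my own assumptions:

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return the visible text of an HTML string.
sub plain_text {
    my ($html) = @_;
    my @chunks;

    my $parser = HTML::Parser->new(
        api_version   => 3,
        unbroken_text => 1,    # deliver each run of text in one piece
        # "dtext" passes text with entities such as &amp; decoded.
        text_h => [ sub { push @chunks, shift }, 'dtext' ],
    );
    # Drop <script> and <style> elements, including their content.
    $parser->ignore_elements(qw(script style));

    $parser->parse($html);
    $parser->eof;

    # Join the chunks and collapse whitespace. Joining with a space
    # splits a word whose letters are spread across inline tags.
    my $text = join ' ', @chunks;
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return $text;
}

print plain_text(
    '<h1>Tom &amp; Jerry</h1><script>var x = 1;</script><p>Some <b>bold</b> text.</p>'
), "\n";
```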
Do not try to parse HTML documents using an XML parser: You will find yourself in a world of pain as a lot of valid HTML constructs are not valid XML.
Do not try to parse XML documents using an HTML parser: You will lose all the advantages of the stricter requirement that an XML document be well formed before it can be parsed.