Should I use HTML::Parser or XML::Parser to extrac

Posted 2019-02-14 18:34

I want to extract all the plain text from an HTML/XHTML document, analyse or amend it, and then write it back if needed. Can I do this with HTML::Parser, or should I use XML::Parser?

Are there any good demonstrations that anyone knows of?

4 answers
戒情不戒烟
#2 · 2019-02-14 18:54

Say you want to replace all instances of PERL with Perl in someone's StackOverflow user page. You could do so with:

#! /usr/bin/perl

use warnings;
use strict;

use HTML::Parser;
use LWP::Simple;

my $html = get "http://stackoverflow.com/users/201469/phil-jackson";
die "$0: get failed" unless defined $html;

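# Called for every start and end tag. $skipped is everything skipped since
# the previous reported event (here, the text between tags, because no text
# handler is registered); $markup is the tag's original source text.
sub replace_text {
  my($skipped,$markup) = @_;
  $skipped =~ s/\bPERL\b/Perl/g;  # amend the text only, never the tag
  print $skipped, $markup;
}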

my $p = HTML::Parser->new(
  api_version => 3,
  marked_sections => 1,
  case_sensitive => 1,
  unbroken_text => 1,
  xml_mode => 1,
  start_h => [ \&replace_text => "skipped_text, text" ],
  end_h => [ \&replace_text => "skipped_text, text" ],
);

# your page may use a different encoding
binmode STDOUT, ":utf8" or die "$0: binmode: $!";
$p->parse($html);
$p->eof;  # signal end of input so the parser processes anything still buffered

The output is what we expect:

$ wget -O phil-jackson.html http://stackoverflow.com/users/201469
$ ./replace-text >out.html
$ diff -ub phil-jackson.html out.html
--- phil-jackson.html
+++ out.html
@@ -327,7 +327,7 @@

 PERL:  

-#$linkTrue =  &hellip; ">comparing PERL md5() and PHP md5()</a></h3>
+#$linkTrue =  &hellip; ">comparing Perl md5() and PHP md5()</a></h3>

         <div class="tags t-php t-perl t-md5">
             <a href="/questions/tagged/php" class="post-tag" title="show questions tagged 'php'" rel="tag">php</a> <a href="/questions/tagged/perl" class="post-tag" title="show questions tagged 'perl'" rel="tag">perl</a> <a href="/questions/tagged/md5" class="post-tag" title="show questions tagged 'md5'" rel="tag">md5</a> 

The "PERL:" that still sticks out like a sore thumb is part of an element attribute, not a text section, so it is left unchanged.

看我几分像从前
#3 · 2019-02-14 18:56

You should also look at Web::Scraper.
I find this module easier than the HTML::Parser modules, but it helps if you are familiar with XPath.
How well HTML parsing works depends heavily on the actual pages: HTML, like PDF, is oriented toward display rather than data.
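
To give an idea of what that looks like, here is a minimal sketch that pulls text out of an inline HTML string. The sample markup and the result keys `titles` and `tags` are made up for illustration; a real page will need its own selectors.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical sample markup, loosely modelled on a question listing.
my $html = <<'HTML';
<html><body>
<h3><a href="/questions/1">comparing PERL md5() and PHP md5()</a></h3>
<div class="tags"><a href="/t/php">php</a> <a href="/t/perl">perl</a></div>
</body></html>
HTML

my $scraper = scraper {
    # An expression starting with "/" is treated as XPath ...
    process '//h3/a', 'titles[]' => 'TEXT';
    # ... anything else is treated as a CSS selector.
    process 'div.tags a', 'tags[]' => 'TEXT';
};

# scrape() also accepts a URI object or an HTTP::Response.
my $res = $scraper->scrape($html);

print "title: $_\n" for @{ $res->{titles} };
print "tag:   $_\n" for @{ $res->{tags} };
```

This only reads data out of the page; for amending text and writing the document back, one of the parser-based approaches is probably a better fit.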

爱情/是我丢掉的垃圾
#4 · 2019-02-14 19:01

The approach of HTML::Parser is based on tokens and callbacks. I find it very convenient when you have particularly complex conditions on the context in which the data you wish to extract or change occurs.
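
As a rough illustration of such a context condition, the sketch below replaces PERL with Perl in text, except inside `<code>` elements, and passes everything else through untouched. The sample input and the `$in_code` counter are assumptions made for the example.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical input: one PERL in an attribute, one in text, one inside <code>.
my $html = '<p title="PERL">I like PERL, e.g. <code>$ENV{PERL}</code>.</p>';

my $in_code = 0;   # nesting depth of <code> elements
my $out     = '';

my $p = HTML::Parser->new(
    api_version   => 3,
    unbroken_text => 1,   # deliver each run of text in a single event
    start_h => [ sub {
        my ($tag, $src) = @_;
        $in_code++ if $tag eq 'code';
        $out .= $src;     # copy the tag through unchanged
    }, "tagname, text" ],
    end_h => [ sub {
        my ($tag, $src) = @_;
        $in_code-- if $tag eq 'code' && $in_code;
        $out .= $src;
    }, "tagname, text" ],
    text_h => [ sub {
        my ($text) = @_;
        $text =~ s/\bPERL\b/Perl/g unless $in_code;
        $out .= $text;
    }, "text" ],
    # Comments, declarations and so on are copied through as-is.
    default_h => [ sub { $out .= $_[0] }, "text" ],
);

$p->parse($html);
$p->eof;

print $out, "\n";
```

Because the handlers see the original source text of every tag, attributes such as `title="PERL"` are not touched.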

Otherwise I prefer a tree-based approach. HTML::TreeBuilder::XPath (ultimately based on HTML::Parser) lets you find nodes with XPath. It returns HTML::Element objects. The documentation is a little scarce (well, spread over a couple of modules), but it is still the quick way to dig into HTML.
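
A minimal sketch of the tree approach, assuming a small inline document: it finds nodes with XPath, then amends every text node using the `objectify_text` and `deobjectify_text` methods that the tree inherits from HTML::Element. The sample markup is invented for the example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample markup.
my $html = '<html><body><h3><a href="/q/1" title="PERL">'
         . 'comparing PERL md5() and PHP md5()</a></h3></body></html>';

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# Extract: findnodes returns element objects; as_text gives their text.
my @titles = map { $_->as_text } $tree->findnodes('//h3/a');

# Amend: temporarily turn plain-string text nodes into "~text"
# pseudo-elements so they can be located and edited like elements.
$tree->objectify_text;
for my $node ($tree->look_down(_tag => '~text')) {
    (my $text = $node->attr('text')) =~ s/\bPERL\b/Perl/g;
    $node->attr(text => $text);
}
$tree->deobjectify_text;

my $out = $tree->as_HTML;
$tree->delete;   # release the tree explicitly

print "$_\n" for @titles;
print "$out\n";
```

Note that `as_HTML` re-serialises the tree, so the output is not guaranteed to be byte-for-byte identical to the input outside the amended text.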

If you deal with pure XML, XML::Twig is an outstanding parser: it has very good memory management and lets you combine the tree and stream approaches. And the documentation is very good.
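
Here is a small sketch of that tree-plus-stream combination, assuming a made-up `<feed>` document: the handler receives each `<item>` as a small tree while the rest of the document is streamed, and `purge` discards what has already been handled.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical well-formed XML input.
my $xml = <<'XML';
<feed>
  <item><title>comparing PERL md5() and PHP md5()</title></item>
  <item><title>PERL one-liners</title></item>
</feed>
XML

my @titles;

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once each <item> has been completely parsed.
        item => sub {
            my ($t, $item) = @_;
            my $title = $item->first_child('title');
            (my $text = $title->text) =~ s/\bPERL\b/Perl/g;
            $title->set_text($text);
            push @titles, $title->text;
            $t->purge;   # free the memory used by what was parsed so far
        },
    },
);
$twig->parse($xml);

print "$_\n" for @titles;
```

To write the amended document out as it is processed, `$t->flush` can be used in place of `$t->purge`.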

走好不送
#5 · 2019-02-14 19:08

Which module you should use depends on what you are trying to do. For starters, HTML::Parser comes with great examples, including a script that extracts plain text from an HTML document.
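
The following is not that bundled script, just a minimal sketch along the same lines: it collects decoded text and skips the contents of `<script>` and `<style>`. The sample input and the `$skip` counter are assumptions for the example.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample document.
my $html = '<html><head><style>p { color: red }</style></head>'
         . '<body><p>Hello &amp; <b>world</b></p>'
         . '<script>var x = 1;</script></body></html>';

my $skip = 0;    # greater than zero while inside <script> or <style>
my $text = '';

my $p = HTML::Parser->new(
    api_version   => 3,
    unbroken_text => 1,
    start_h => [ sub { $skip++ if $_[0] =~ /^(?:script|style)$/ }, "tagname" ],
    end_h   => [ sub { $skip-- if $skip && $_[0] =~ /^(?:script|style)$/ }, "tagname" ],
    # "dtext" is the text with entities such as &amp; already decoded.
    text_h  => [ sub { $text .= $_[0] unless $skip }, "dtext" ],
);

$p->parse($html);
$p->eof;

print $text, "\n";
```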

Do not try to parse HTML documents using an XML parser: You will find yourself in a world of pain as a lot of valid HTML constructs are not valid XML.

Do not try to parse XML documents using an HTML parser: You will lose all the advantages of the stricter requirement that an XML document be well formed before it can be parsed.
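
Both points can be seen in a short sketch, using two made-up inputs: a fragment that is ordinary HTML but not well-formed XML, and a fragment that is malformed as XML. XML::Parser refuses the first, while HTML::Parser reports events for the second without complaint.

```perl
use strict;
use warnings;
use XML::Parser;
use HTML::Parser;

# Ordinary HTML, but not well-formed XML: <br> is never closed.
my $html = '<p>one<br>two</p>';

# XML::Parser dies on the first well-formedness error, so trap it.
my $ok        = eval { XML::Parser->new->parse($html); 1 };
my $xml_error = $ok ? '' : $@;

# Malformed XML: <b> is never closed.
my $malformed = '<a><b>text</a>';

my @events;
my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub { push @events, "start $_[0]" }, "tagname" ],
    end_h   => [ sub { push @events, "end $_[0]"   }, "tagname" ],
);
$p->parse($malformed);
$p->eof;

print "XML::Parser on HTML: ", ($xml_error ? "failed" : "ok"), "\n";
print "HTML::Parser events on malformed XML: @events\n";
```

The HTML parser never notices the missing `</b>`, which is exactly the kind of error a strict XML parser would have caught for you.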
