I want to parse a website into a Perl data structure. First I load the page with:
use LWP::Simple;
my $html = get("http://f.oo");
Now I know two ways to deal with it: regular expressions, or modules.
I started by reading about HTML::Parser and found some examples, but I'm not that sure about my Perl knowledge.
My code example continues:
my @links;
my $p = HTML::Parser->new();
$p->handler(start => \&start_handler,"tagname,attr,self");
$p->parse($html);
foreach my $link (@links) {
    print "Linktext: ", $link->[1], "\tURL: ", $link->[0], "\n";
}
sub start_handler {
    return if(shift ne 'a');
    my ($class) = shift->{href};
    my $self = shift;
    my $text;
    $self->handler(text => sub { $text = shift; }, "dtext");
    $self->handler(end => sub { push(@links, [$class, $text]) if(shift eq 'a') }, "tagname");
}
I don't understand why shift is called twice. The second shift should be the self pointer. But the first makes me think that the self reference has already been shifted, used as a hash, and the value for href stored in $class. Could someone explain this line (my ($class) = shift->{href};)?
Besides this gap in my understanding, I do not want to parse all the URLs. I want to put all the code between <div class="foo"> and </div> into a string. There is a lot of code in between, especially other <div></div> tags, so either I or a module has to find the matching end tag.
After that I planned to scan the string again to find specific elements, like <h1>, <h2>, <p class="foo2"></p>, etc.
I hope this information helps you to give me some useful advice. Please keep in mind that, first of all, I want an approach that is easy to understand; it does not need to perform well at this stage.
No need to get so complicated. You can retrieve and find elements in the DOM using CSS selectors with Mojo::UserAgent:
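A minimal sketch of this idea, assuming the page contains the <div class="foo"> from the question. To keep it runnable offline it parses a literal string with Mojo::DOM; with a live page the same kind of object would come from Mojo::UserAgent->new->get($url)->res->dom. The sample markup is invented for illustration.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Sample markup (an assumption); a real page would be fetched with
#   my $dom = Mojo::UserAgent->new->get('http://f.oo')->res->dom;
my $dom = Mojo::DOM->new(
    '<div class="foo"><h1>Title</h1><div><p class="foo2">text</p></div></div>'
);

# at() returns the first element matching the CSS selector, or undef
my $div = $dom->at('div.foo') or die "no div.foo found";

# content() is everything between <div class="foo"> and its matching
# </div>, nested divs included
my $inner = $div->content;
print "$inner\n";

# Selectors can be applied again, relative to the element just found
my $para = $div->at('p.foo2')->text;
print "$para\n";
```

Because the document is parsed into a tree, the matching closing tag is found by the parser, so nested <div> elements need no special handling.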
or, loop through the elements found:
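A sketch of the loop form, again on an invented sample string rather than a fetched page. find() returns a Mojo::Collection, and calling each() without arguments returns its elements as a list.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Sample markup (an assumption standing in for the fetched page)
my $dom = Mojo::DOM->new(
    '<div class="foo"><h1>Title</h1><h2>Sub</h2><p class="foo2">text</p></div>'
);

my @found;
# A comma-separated selector list matches any of the alternatives
for my $el ($dom->find('div.foo h1, div.foo h2, div.foo p.foo2')->each) {
    push @found, $el->tag . ': ' . $el->text;
}
print "$_\n" for @found;
```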
or, loop using a callback:
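A sketch of the callback form, under the same assumptions (sample string, invented markup). The callback receives each element and its position, counting from 1.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Sample markup (an assumption standing in for the fetched page)
my $dom = Mojo::DOM->new(
    '<div class="foo"><p class="foo2">first</p><p class="foo2">second</p></div>'
);

my @seen;
$dom->find('div.foo p.foo2')->each(sub {
    my ($el, $num) = @_;          # element and its 1-based position
    push @seen, "$num: " . $el->text;
});
print "$_\n" for @seen;
```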
HTML::Parser is more of a tokenizer than a parser. It leaves a lot of hard work up to you. Have you considered using HTML::TreeBuilder (which uses HTML::Parser) or XML::LibXML (a great library which has support for HTML)?
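As a rough illustration of the HTML::TreeBuilder route, the sketch below finds the <div class="foo"> from the question and then searches inside it. The sample markup is an assumption; in practice $html would be the string returned by LWP::Simple's get().

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup (an assumption standing in for the fetched page)
my $html = '<html><body><div class="foo"><h1>Title</h1>'
         . '<div><p class="foo2">text</p></div></div></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);

my (@headings, @paragraphs);
# In scalar context look_down() returns the first matching element
my $div = $tree->look_down(_tag => 'div', class => 'foo');
if ($div) {
    # Search only below the div; a regex can be used as the criterion value
    @headings   = map { $_->as_text } $div->look_down(_tag => qr/^h[12]$/);
    @paragraphs = map { $_->as_text } $div->look_down(_tag => 'p', class => 'foo2');
}
print "Heading: $_\n"   for @headings;
print "Paragraph: $_\n" for @paragraphs;

$tree->delete;    # free the tree explicitly when done
```

If the markup itself is needed as a string, $div->as_HTML serializes the element including its own start and end tags.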
Use HTML::TokeParser::Simple.
Untested code based on your description:
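The answer's original code is not reproduced here; the following is one possible sketch of the approach, assuming the goal is to collect everything between <div class="foo"> and its matching </div>. It counts nested <div> tags to find the right end. The sample markup and variable names are invented for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Sample markup (an assumption standing in for the fetched page)
my $html = '<body><div class="foo"><h1>Title</h1>'
         . '<div><p class="foo2">inner</p></div></div><div>other</div></body>';

my $parser = HTML::TokeParser::Simple->new(string => $html);

my $depth  = 0;     # 0 = outside div.foo, >0 = nesting level inside it
my $inside = '';    # collects the markup between the two tags

while (my $token = $parser->get_token) {
    if ($depth == 0) {
        # Skip everything until the opening <div class="foo">
        next unless $token->is_start_tag('div');
        my $class = $token->get_attr('class');
        next unless defined $class && $class eq 'foo';
        $depth = 1;
        next;
    }
    # Track nested divs so the matching </div> is the one that ends capture
    $depth++ if $token->is_start_tag('div');
    $depth-- if $token->is_end_tag('div');
    last if $depth == 0;
    $inside .= $token->as_is;    # original text of the token
}
print "$inside\n";
```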
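According to the docs, the argument specification string passed to handler() determines what the handler receives. With "tagname,attr,self", the arguments are ($tagname, \%attr, $self). There are three shifts, one for each argument: the first (shift ne 'a') consumes the tag name, the second (shift->{href}) consumes the attribute hash reference and looks up its href key, and the third is the parser object. So $class holds the URL from the href attribute, not a class.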
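The equivalence can be spelled out with two versions of the handler side by side. This is a sketch in plain Perl: the sub names are hypothetical, and a plain string stands in for the parser object so that it runs without HTML::Parser.

```perl
use strict;
use warnings;

# Compact form, as in the question: each shift removes one argument from @_
sub start_handler_short {
    return if (shift ne 'a');        # 1st argument: tag name
    my ($class) = shift->{href};     # 2nd argument: attribute hash reference
    my $self = shift;                # 3rd argument: the parser object
    return ($class, $self);
}

# Same logic with the arguments unpacked and named explicitly
sub start_handler_long {
    my ($tagname, $attr, $self) = @_;
    return if $tagname ne 'a';
    my $class = $attr->{href};
    return ($class, $self);
}

# Arguments in the order given by the argspec "tagname,attr,self"
my @args  = ('a', { href => 'http://f.oo' }, 'parser-object');
my @short = start_handler_short(@args);
my @long  = start_handler_long(@args);
print "@short\n@long\n";
```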