I'm an old-newbie in Perl, and Im trying to create a subroutine in perl using HTML::TokeParser and URI.
I need to extract ALL valid links enclosed within on div called "zone-extract"
This is my code:
#More perl above here... use strict and other subs
use HTML::TokeParser;
use URI;
sub extract_links_from_response {
my $response = $_[0];
my $base = URI->new( $response->base )->canonical;
# "canonical" returns it in the one "official" tidy form
my $stream = HTML::TokeParser->new( $response->content_ref );
my $page_url = URI->new( $response->request->uri );
print "Extracting links from: $page_url\n";
my($tag, $link_url);
while ( my $div = $stream->get_tag('div') ) {
my $id = $div->get_attr('id');
next unless defined($id) and $id eq 'zone-extract';
while( $tag = $stream->get_tag('a') ) {
next unless defined($link_url = $tag->[1]{'href'});
next if $link_url =~ m/\s/; # If it's got whitespace, it's a bad URL.
next unless length $link_url; # sanity check!
$link_url = URI->new_abs($link_url, $base)->canonical;
next unless $link_url->scheme eq 'http'; # sanity
$link_url->fragment(undef); # chop off any "#foo" part
print $link_url unless $link_url->eq($page_url); # Don't note links to itself!
}
}
return;
}
As you can see, I have 2 loops, first using get_tag 'div' and then look for id = 'zone-extract'. The second loop looks inside this div and retrieve all links (or that was my intention)...
The inner loop works, it extracts all links correctly working standalone, but I think there is some issues inside the first loop, looking for my desired div 'zone-extract'... Im using this post as a reference: How can I find the contents of a div using Perl's HTML modules, if I know a tag inside of it?
But all I have by the moment is this error:
Can't call method "get_attr" on unblessed reference
Some ideas? Help! My HTML (Note URL_TO_EXTRACT_1 & 2):
<more html above here>
<div class="span-48 last">
<div class="span-37">
<div id="zone-extract" class="...">
<h2 class="genres"><img alt="extracting" class="png"></h2>
<li><a title="Extr 2" href="**URL_TO_EXTRACT_1**">2</a></li>
<li><a title="Con 1" class="sel" href="**URL_TO_EXTRACT_2**">1</a></li>
<li class="first">Pàg</li>
</div>
</div>
</div>
<more stuff from here>
I find that TokeParser is a very crude tool requiring too much code, its fault is that only supports the procedural style of programming.
A better alternatives which require less code due to declarative programming is Web::Query:
Code is untested because there is no example HTML in the question.