I am developing a web crawler in Perl. It extracts the content of a page and then runs a pattern match against a range of Unicode code points to check which language the content is in.
Sometimes the extracted content contains text in several languages. The pattern match I use below prints all of the text, but I want to print only the text that falls within the Unicode range specified in the pattern.
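use strict;
use warnings;
use LWP::UserAgent;
use HTML::ContentExtractor;

# $url is set elsewhere in the crawler
my $uu = LWP::UserAgent->new(agent => 'Mozilla 1.3'); # constructor takes key/value pairs
my $extractorr = HTML::ContentExtractor->new();
# fetch the page and decode the body into Perl characters
my $responsee = $uu->get($url);
my $contentss = $responsee->decoded_content();
my $range = qr/([\x{0C00}-\x{0C7F}]+)/; # match a particular language (Telugu block)
binmode(STDOUT, ":utf8"); # set the output layer before printing anything
if ($contentss =~ $range) {
    # this still prints the whole extracted text, not just the matching part
    $extractorr->extract($url, $contentss);
    print "$url\n";
    print $extractorr->as_text;
}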
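One possible way to print only the matching text is to run the same pattern with the /g modifier in list context, which returns every captured run instead of a single true/false result. The following is a minimal sketch, not a tested part of the crawler: the helper name matching_runs and the sample string are hypothetical, and the sample uses \x{...} escapes so the snippet does not depend on the source file's encoding.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of characters in $text
# that falls inside the range captured by $range.
sub matching_runs {
    my ($text, $range) = @_;
    my @runs = $text =~ /$range/g;   # list context + /g collects each capture
    return @runs;
}

my $range = qr/([\x{0C00}-\x{0C7F}]+)/;   # same Unicode range as in the question

# Made-up sample: English text mixed with two words from the target range.
my $sample = "Hello \x{0C24}\x{0C46}\x{0C32}\x{0C41}\x{0C17}\x{0C41} world "
           . "\x{0C2D}\x{0C3E}\x{0C37} 123";

my @runs = matching_runs($sample, $range);

binmode(STDOUT, ":encoding(UTF-8)");
print "$_\n" for @runs;   # one matching run per line
```

In the crawler, the same helper could presumably be applied to the string returned by $extractorr->as_text instead of $sample, assuming that string holds decoded characters rather than raw bytes. Spaces and punctuation lie outside the range, so each word comes back as a separate run; widening the character class (for example by adding \s) would keep phrases together.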