Perl print matched content only

Posted 2019-09-14 05:50

Question:

I am developing a web crawler in Perl. It extracts contents from the page and then a pattern match is done to check the language of the content. Unicode values are used to match the content.

Sometimes the extracted content contains text in multiple languages. The pattern match I used here prints all the text, but I want to print only the text that matches the Unicode values specified in the pattern.

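use LWP::UserAgent;
use HTML::ContentExtractor;

# $url is set elsewhere in the crawler
my $uu         = LWP::UserAgent->new(agent => 'Mozilla 1.3');
my $extractorr = HTML::ContentExtractor->new();

# fetch the page and decode the response body into Perl characters
my $responsee = $uu->get($url);
my $contentss = $responsee->decoded_content();

my $range = qr/([\x{0C00}-\x{0C7F}]+)/;    # match particular language

if ($contentss =~ m/$range/) {
  $extractorr->extract($url, $contentss);
  binmode(STDOUT, ":utf8");    # set the output layer before printing anything
  print "$url\n";
  print $extractorr->as_text;  # prints all extracted text, not just the matched part
}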

Answer 1:

It would be better to match characters with a particular Unicode property, rather than trying to formulate an appropriate character class.

The code points in the range 0x0C00...0x0C7F correspond to Telugu characters (Telugu is one of the languages of India), which you can match using the regex /\p{Telugu}/.
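To print only the matching text rather than the whole page, one option is to capture the Telugu runs with a global match in list context. This is a minimal sketch, not the asker's crawler: the sample string in $text is made up for illustration and stands in for the text returned by as_text.

```perl
use strict;
use warnings;

binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical sample: English text with two Telugu words mixed in,
# written with \x{...} escapes so the file itself stays ASCII.
my $text = "Hello \x{0C24}\x{0C46}\x{0C32}\x{0C41}\x{0C17}\x{0C41} world "
         . "\x{0C2D}\x{0C3E}\x{0C37} 123";

# In list context, m//g returns every non-overlapping capture,
# so @telugu holds only the runs of Telugu characters.
my @telugu = $text =~ /(\p{Telugu}+)/g;

# Print each Telugu run on its own line; the English text is skipped.
print "$_\n" for @telugu;
```

In the crawler, the same match could be applied to the string returned by $extractorr->as_text instead of printing that string directly.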

The other properties you will probably need are /\p{Kannada}/, /\p{Malayalam}/, /\p{Devanagari}/, and /\p{Tamil}/.
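If the crawler needs to handle several of these scripts at once, the properties can be kept in a lookup table. The sketch below is only one way to arrange it: %script_re, runs_by_script, and the sample string are hypothetical names and data introduced for illustration.

```perl
use strict;
use warnings;

# Map each script name to a precompiled pattern for runs of that script.
my %script_re = (
    Telugu     => qr/\p{Telugu}+/,
    Kannada    => qr/\p{Kannada}+/,
    Malayalam  => qr/\p{Malayalam}+/,
    Devanagari => qr/\p{Devanagari}+/,
    Tamil      => qr/\p{Tamil}+/,
);

# Hypothetical helper: returns a hash reference mapping each script
# found in $text to an array reference of the matching runs.
sub runs_by_script {
    my ($text) = @_;
    my %found;
    for my $script (sort keys %script_re) {
        my $re   = $script_re{$script};
        my @runs = $text =~ /($re)/g;
        $found{$script} = \@runs if @runs;
    }
    return \%found;
}

binmode(STDOUT, ':encoding(UTF-8)');

# Made-up sample: a Tamil word, some Latin text, and a Devanagari word.
my $sample = "\x{0BA4}\x{0BAE}\x{0BBF}\x{0BB4}\x{0BCD} and "
           . "\x{0939}\x{093F}\x{0928}\x{094D}\x{0926}\x{0940}";

my $found = runs_by_script($sample);
for my $script (sort keys %$found) {
    print "$script: @{ $found->{$script} }\n";
}
```

Scripts with no match are left out of the result, so the keys of the returned hash also tell you which scripts the page contains.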