How to detect Arabic chars using perl regex?

Posted 2019-05-19 02:51

Question:

I'm parsing some HTML pages and need to detect any Arabic characters in them. I've tried various regexes, but no luck.

Does anyone know a working way to do that?

Thanks


Here is the page I'm processing: http://pastie.org/2509936

And my code is:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# set inheritance
@MyAgent::ISA = qw(LWP::UserAgent);

my $ua = LWP::UserAgent->new;
my $q = 'http://pastie.org/2509936';
my $request = HTTP::Request->new('GET', $q);
my $response = $ua->request($request);
if ($response->is_success) {
    # this is the check that never matches
    if ($response->content =~ /[\p{Script=Arabic}]/g) {
        print "found arabic";
    } else {
        print "not found";
    }
}

Answer 1:

EDIT (as I have obviously wandered into tchrist's area of expertise). Skip using $response->content, which always returns a raw byte string, and use $response->decoded_content, which applies any decoding hints it gets from the response headers.
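
To make the difference between the two accessors concrete, here is a minimal sketch that needs no network access. It builds an HTTP::Response by hand as a stand-in for the fetched page; the header value and the two Arabic letters in the body are assumptions chosen for illustration, not taken from the actual page.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical stand-in for a fetched page: UTF-8 bytes for the Arabic
# letters ALEF (U+0627) and BEH (U+0628), plus a charset hint in the header.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=utf-8' ],
    "<p>\xD8\xA7\xD8\xA8</p>",
);

# content() hands back the raw bytes; decoded_content() uses the header hint.
my $raw_hit     = $response->content         =~ /\p{Arabic}/ ? 1 : 0;
my $decoded_hit = $response->decoded_content =~ /\p{Arabic}/ ? 1 : 0;

print "raw bytes match: $raw_hit\n";
print "decoded text match: $decoded_hit\n";
```

With a charset in the header, the raw check should report 0 and the decoded check 1.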


The page you are downloading is UTF-8 encoded, but you are not reading it as UTF-8 (in fairness, there are no hints on the page about what the encoding is [update: the server does return the header Content-Type: text/html; charset=utf-8, though]).

You can see this if you examine $response->content:

use List::Util 'max';
my $max_ord = max map{ord}split //, $response->content;
print "max ord of response content is $max_ord\n";

If you get a value less than 256, then you are reading this content in as raw bytes, and your strings will never match /\p{Arabic}/. You must decode the input as UTF-8 before you apply the regex:

use Encode;
my $content = decode('utf-8', $response->content);
# now check  $content =~ /\p{Arabic}/
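
As a self-contained sketch of the same idea, the snippet below uses a hard-coded byte string in place of $response->content (an assumption for illustration: the bytes are the UTF-8 encoding of two Arabic letters) and runs the check before and after decoding.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for $response->content: the UTF-8 bytes of ALEF (U+0627)
# and BEH (U+0628). Every element here has an ordinal below 256.
my $bytes = "\xD8\xA7\xD8\xA8";

# Matching against the raw bytes: each byte is treated as its own character.
my $raw_hit = $bytes =~ /\p{Arabic}/ ? 1 : 0;

# Decode the bytes into a character string, then match again.
my $content = decode('UTF-8', $bytes);
my $decoded_hit = $content =~ /\p{Arabic}/ ? 1 : 0;

print "length before decode: ", length($bytes), "\n";
print "length after decode: ", length($content), "\n";
print "raw match: $raw_hit, decoded match: $decoded_hit\n";
```

The four bytes collapse into two characters after decoding, and only the decoded string should match.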

Sometimes (and now I am wading well outside my area of expertise) the response you are loading carries hints about how it is encoded; as noted in the edit above, $response->decoded_content acts on the ones in the headers. If your string has already been decoded, the decode call above is unnecessary and may be harmful. See other SO posts on detecting the encoding of an arbitrary string.



Answer 2:

If you're using Perl, you should be able to match on the Unicode script property: /\p{Arabic}/

If that doesn't work, you'll have to look up the range of Unicode characters for Arabic and test for them with something like /[\x{0600}\x{0601}...\x{06FF}]/.
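
A character-class range is a more compact way to write that fallback. This is a rough sketch that assumes only the basic Arabic block (U+0600 through U+06FF) matters; Arabic-script characters also live in other blocks (for example Arabic Supplement and the presentation forms), so the range may need widening. The sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Assumed range: the basic Arabic block, U+0600 through U+06FF.
my $arabic_block = qr/[\x{0600}-\x{06FF}]/;

my $sample = "abc \x{0627}\x{0628} xyz";   # contains ALEF and BEH
my $plain  = "abc xyz";                    # ASCII only

my $sample_hit = $sample =~ $arabic_block ? 1 : 0;
my $plain_hit  = $plain  =~ $arabic_block ? 1 : 0;

print "sample: ", ($sample_hit ? "found" : "not found"), "\n";
print "plain: ",  ($plain_hit  ? "found" : "not found"), "\n";
```

As with \p{Arabic}, this only works on a decoded character string, not on raw UTF-8 bytes.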



Answer 3:

Just for the record, at least in .NET regexps, you need to use \p{IsArabic}.