Why does Perl's LWP give me a different encod

Posted 2019-05-13 17:43

Question:

Let's say I have this code:

use strict;
use LWP qw ( get );

my $content = get ( "http://www.msn.co.il" );

print STDERR $content;

The error log shows something like "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94", which I'm guessing is UTF-16?

The website declares its encoding with

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1255">

so why do these characters appear instead of the Windows-1255 characters?

Another weird thing is that I have two servers:

the first server returns CP1255 characters, which I can simply convert to UTF-8, while the current server gives me these characters and I can't do anything with them.

Is there any configuration file in Apache, Perl, or some module that is messing up the encoding, forcing something?

On the second server, the Perl file and the headers are all UTF-8. The result on my website is that the content from the example above displays fine (even though it consists of these weird UTF characters), but my own static non-English text looks like "×ס'××ר××:"

One more thing that I tested:

Through Perl:

my $content = `curl "http://www.anglo-saxon.co.il"`;

I get UTF-8 encoding.

Through Bash:

curl "http://www.anglo-saxon.co.il"

and here I get CP1255 (Windows-1255) encoding.

Also, when I run the script in Bash it gives CP1255, and when I run it through the web it's UTF-8 again.

I fixed the problem by converting the content from UTF-8 to what it is supposed to be, and then back to UTF-8:

use Text::Iconv;

# Convert from UTF-8 to CP1255 ...
my $to_cp1255 = Text::Iconv->new("utf8", "CP1255");
$content = $to_cp1255->convert($content);

# ... and then back from CP1255 to UTF-8.
my $to_utf8 = Text::Iconv->new("CP1255", "utf8");
$content = $to_utf8->convert($content);

Answer 1:

The hex string that you gave appears to be UTF-8 encoded. You are getting this because Perl ‘likes to’ use UTF-8 when it deals with strings. The get() function from LWP::Simple automatically decodes the content from the server, which includes undoing any Content-Encoding as well as converting to UTF-8.

You could dig into the internals and get a version that does not change the character encoding (see HTTP::Message's decoded_content, which is used by HTTP::Response's decoded_content, which you can get from LWP::UserAgent's get). But it may be easier to re-encode the data in your desired encoding with something like

use Encode qw(encode decode);
# ...
my $cp1255_bytes = encode('CP1255', decode('UTF-8', $utf8_bytes));

The mixed readable/garbage characters you see are due to mixing multiple, incompatible encodings in the same stream. Probably the stream is labeled as UTF-8 but you are putting CP1255 encoded characters into it. You either need to label the stream as CP1255 and put only CP1255-encoded data into it, or label it as UTF-8 and put only UTF-8-encoded data into it. Remind yourself that bytes are not characters and convert between them appropriately.
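
To make the bytes-versus-characters point concrete, here is a minimal sketch using only the core Encode module. The byte string is the one quoted in the question; the variable names are just illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The bytes from the question's error log (Hebrew text encoded as UTF-8).
my $utf8_bytes = "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94";

# Bytes -> characters: each pair of bytes becomes one Hebrew letter.
my $chars = decode('UTF-8', $utf8_bytes);

# Characters -> bytes in the target encoding: CP1255 uses one byte per letter.
my $cp1255_bytes = encode('cp1255', $chars);

printf "UTF-8 bytes:  %d\n", length($utf8_bytes);
printf "characters:   %d\n", length($chars);
printf "CP1255 bytes: %s\n",
    join(' ', map { sprintf('%02x', ord($_)) } split(//, $cp1255_bytes));
```

The same six letters occupy 12 bytes in UTF-8 and 6 bytes in CP1255, which is why the two byte strings can never be mixed in one stream.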



Answer 2:

All of this manual encoding and decoding is unnecessary. The HTML is lying to you when it says that the page is encoded in windows-1255; the server says it's serving UTF-8, and it is. Blame Microsoft HTML-generation tools.

Anyway, since the server does return the correct encoding, this works:

use LWP::UserAgent;

my $response = LWP::UserAgent->new->get("http://www.msn.co.il/");
my $content = $response->decoded_content;

$content is now a Perl character string, ready to do whatever you need. If you want to convert it to some other encoding, then calling Encode::encode on it is appropriate; do not use Encode::decode as it's already been decoded once.
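
A rough offline sketch of that behaviour, assuming HTTP::Response (installed alongside LWP) is available. The response is built by hand instead of being fetched, so the body and header here are made-up stand-ins for what the real server sends.

```perl
use strict;
use warnings;
use HTTP::Response;
use Encode qw(encode);

# Hypothetical body: the <meta> tag claims windows-1255, but the bytes are UTF-8.
my $body = '<meta http-equiv="Content-Type" content="text/html; charset=windows-1255">'
         . "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94";

# Hand-built response standing in for what LWP::UserAgent->new->get would return.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    $body,
);

# Decoded according to the Content-Type header: a Perl character string.
my $content = $response->decoded_content;

# The body bytes with charset decoding suppressed.
my $raw = $response->decoded_content(charset => 'none');

# Encode exactly once, at the point where bytes are needed again.
my $cp1255 = encode('cp1255', $content);

printf "raw bytes: %d, characters: %d, CP1255 bytes: %d\n",
    length($raw), length($content), length($cp1255);
```

The charset in the HTTP header is what decoded_content honours here, so the misleading meta tag in the body does no harm.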



Answer 3:

http://www.msn.co.il is in UTF-8, and indicates that properly. The string "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94" is also proper UTF-8 (להדפסה). I don't see the problem.

I think your second problem is due to you mixing different encodings (UTF-8 and Windows-1252). You might want to encode/decode your strings properly.
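
As an illustration of how that kind of garbage arises, this small sketch decodes the same UTF-8 bytes once correctly and once as Windows-1252. It only uses the core Encode module, and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes of the Hebrew word from the question.
my $utf8_bytes = "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94";

my $right = decode('UTF-8',  $utf8_bytes);   # Hebrew letters
my $wrong = decode('cp1252', $utf8_bytes);   # unrelated Latin-range characters

# In UTF-8 each of these Hebrew letters is 0xD7 plus one more byte, and 0xD7
# is the multiplication sign in Windows-1252, hence the repeated "x"-like
# character in the garbled text.
my $times_signs = () = $wrong =~ /\x{D7}/g;

printf "decoded as UTF-8:        %d characters\n", length($right);
printf "decoded as Windows-1252: %d characters, %d of them U+00D7\n",
    length($wrong), $times_signs;
```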



Answer 4:

First, note that you should import get from LWP::Simple. Second, everything works fine with:

#!/usr/bin/perl
use strict; use warnings;
use LWP::Simple qw ( getstore );
getstore 'http://www.msn.co.il', 'test.html';

which indicates to me that the problem is the encoding of the filehandle to which you are sending the output.
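
A small sketch of what "the encoding of the filehandle" means in practice. In-memory filehandles stand in for STDERR or a real file, and the names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# A character string, as you would have after decoding the fetched page.
my $chars = decode('UTF-8', "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94");

# Buffers that receive whatever bytes each filehandle actually writes.
my ($as_utf8, $as_cp1255) = ('', '');

# The same characters written through a UTF-8 layer ...
open(my $utf8_fh, '>:encoding(UTF-8)', \$as_utf8) or die "open: $!";
print {$utf8_fh} $chars;
close($utf8_fh);

# ... and through a CP1255 layer produce different bytes.
open(my $cp_fh, '>:encoding(cp1255)', \$as_cp1255) or die "open: $!";
print {$cp_fh} $chars;
close($cp_fh);

printf "UTF-8 layer wrote %d bytes, CP1255 layer wrote %d bytes\n",
    length($as_utf8), length($as_cp1255);
```

For an already-open handle such as STDERR, the equivalent would be something like binmode(STDERR, ':encoding(UTF-8)') before printing.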