How can I decode HTML entities?

2019-03-09 14:16发布

问题:

Here's a quick Perl question:

How can I convert HTML special characters like ü or ' to normal ASCII text?

I started with something like this:

s/\&#(\d+);/chr($1)/eg;

and could write it for all HTML characters, but some function like this probably already exists?

Note that I don't need a full HTML->Text converter. I already parse the HTML with the HTML::Parser. I just need to convert the text with the special chars I'm getting.

回答1:

Take a look at HTML::Entities:

use HTML::Entities;

my $html = "Snoopy & Charlie Brown";

print decode_entities($html), "\n";

You can guess the output.



回答2:

The above answers tell you how to decode the entities into Perl strings, but you also asked how to change those into ASCII.

Assuming that this is really what you want and you don't want all the unicode characters you can look at the Text::Unidecode module from CPAN to Zap all those odd characters back into a roughly similar collection of ASCII characters:

use Text::Unidecode qw(unidecode);
use HTML::Entities qw(decode_entities);

my $source = '北亰';  
print unidecode(decode_entities($source));

# That prints: Bei Jing 


回答3:

Note that there are hex-specified characters too. They look like this: é (é).

Use HTML::Entities' decode_entities to translate the entities into actual characters. To convert that to ASCII requires more work. I've used iconv (perl interface: Text::Iconv) with the transliterate option on with some success in the past. But if you are dealing with a limited set of entities, or you don't actually need it reduced to ASCII equivalents, you may be better off limiting what decode_entities produces or providing it with custom conversion maps. See the HTML::Entities doc.



回答4:

There are a handful of predefined HTML entities - & " > and so on - that you could hard code.

However, the larger case of numberic entities - { - is going to be much harder, as those values are Unicode, and conversion to ASCII is going to range from difficult to impossible.