Here's a quick Perl question:
How can I convert HTML special characters like &#252; or &#39; to normal ASCII text?
I started with something like this:
s/\&#(\d+);/chr($1)/eg;
and could write it for all HTML characters, but some function like this probably already exists?
Note that I don't need a full HTML->Text converter. I already parse the HTML with HTML::Parser. I just need to convert the text with the special chars I'm getting.
Take a look at HTML::Entities:
use HTML::Entities;
my $html = "Snoopy &amp; Charlie Brown";
print decode_entities($html), "\n";
You can guess the output.
The above answers tell you how to decode the entities into Perl strings, but you also asked how to change those into ASCII.
Assuming that this is really what you want and you don't need to keep the Unicode characters, look at the Text::Unidecode module from CPAN, which zaps all those odd characters back into a roughly similar collection of ASCII characters:
use Text::Unidecode qw(unidecode);
use HTML::Entities qw(decode_entities);
my $source = '&#21271;&#20144;';
print unidecode(decode_entities($source));
# That prints: Bei Jing
Note that there are hex-specified characters too. They look like this: &#xE9; (é).
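To illustrate, the substitution from the question could be extended to cover the hex form as well. This is only a minimal sketch using core Perl; the helper name decode_numeric_entities is made up for this example, and HTML::Entities already handles both forms for you.

```perl
use strict;
use warnings;

# Hypothetical helper: decodes only numeric character references,
# decimal (&#233;) and hexadecimal (&#xE9;), in a single pass.
# Named entities such as &amp; are left untouched.
sub decode_numeric_entities {
    my ($text) = @_;
    $text =~ s/&#(?:[xX]([0-9a-fA-F]+)|(\d+));/chr(defined $1 ? hex($1) : $2)/eg;
    return $text;
}

my $decoded = decode_numeric_entities('caf&#xE9; and caf&#233; &amp; more');
```

Doing both forms in one pass avoids decoding text twice, so something like &#x26;#65; turns into the literal text &#65; rather than the letter A.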
Use decode_entities from HTML::Entities to translate the entities into actual characters. Converting those to ASCII requires more work. I've used iconv (Perl interface: Text::Iconv) with the transliterate option on with some success in the past. But if you are dealing with a limited set of entities, or you don't actually need the text reduced to ASCII equivalents, you may be better off limiting what decode_entities produces or providing it with custom conversion maps. See the HTML::Entities documentation.
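As a rough sketch of the custom-map idea: HTML::Entities documents a lower-level _decode_entities($string, \%entity2char) routine that decodes in place using a map you supply. The map contents and the sample string below are assumptions chosen for illustration, not a recommended transliteration table.

```perl
use strict;
use warnings;
use HTML::Entities ();

# Illustrative map: entity name (with trailing ";") => ASCII replacement.
# Values do not have to be single characters.
my %ascii_map = (
    'uuml;'   => 'ue',
    'eacute;' => 'e',
    'amp;'    => '&',
);

my $text = 'M&uuml;ller &amp; caf&eacute; &mdash; ok';

# Modifies $text in place. Named entities missing from the map
# (here &mdash;) are left alone.
HTML::Entities::_decode_entities($text, \%ascii_map);
```

Note that, per the module documentation, numeric entities are still expanded by this routine regardless of the map, so it only limits the named ones.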
There are a handful of predefined HTML entities - &amp;, &quot;, &gt; and so on - that you could hard-code.
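Hard-coding that handful might look like the following minimal sketch. The helper name decode_basic_entities and the exact set of names in the table are assumptions for this example.

```perl
use strict;
use warnings;

# Small hard-coded table of predefined entities (illustrative, not exhaustive).
my %named = (
    amp  => '&',
    quot => '"',
    lt   => '<',
    gt   => '>',
    apos => "'",
);

# Hypothetical helper: replaces only the names listed in %named.
sub decode_basic_entities {
    my ($text) = @_;
    my $names = join '|', keys %named;
    $text =~ s/&($names);/$named{$1}/g;
    return $text;
}

my $plain = decode_basic_entities('Snoopy &amp; &quot;Charlie&quot; &lt;Brown&gt;');
```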
However, the larger case of numeric entities - &#123; - is going to be much harder, as those values are Unicode, and conversion to ASCII is going to range from difficult to impossible.
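To show where "difficult" turns into "impossible", here is a rough sketch that uses the core Unicode::Normalize module to strip accents. This is a different technique from the Text::Unidecode and iconv suggestions in this thread, and the helper name rough_ascii is made up for the example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# Hypothetical helper: a crude ASCII fold for already-decoded text.
sub rough_ascii {
    my ($text) = @_;
    my $folded = NFKD($text);          # split "é" into "e" + combining accent
    $folded =~ s/\p{Mn}+//g;           # drop the combining marks
    $folded =~ s/[^\x00-\x7F]/?/g;     # whatever is left has no simple equivalent
    return $folded;
}

my $easy = rough_ascii("caf\x{E9}");         # accented Latin folds cleanly
my $hard = rough_ascii("\x{5317}\x{4EB0}");  # CJK characters do not
```

Accented Latin letters reduce to their base letters, while characters with no decomposition just become placeholders, which is why a transliteration module is needed for anything beyond that.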