I have the following situation:
There is a tool that gets an XSLT from a web interface and embeds the XSLT in an XML file (someone should have been fired). "Unfortunately" I work in a French-speaking country, and therefore the XSLT contains a number of words with accents. When the XSLT is embedded in the XML, the tool converts all the accented characters to their HTML entity codes (Iacute, igrave, etc.).
My Perl code retrieves the XSLT from the XML and executes it against another XML file using the Xalan command-line tool. Every time there is an accent in the XSLT, the Xalan tool throws an exception.
I initially thought of using a regexp to change all the accents in the XSLT, such as:
# the & is omitted in the codes because it would be rendered in the page
$xslt =~s/Aacute;/Á/gso;
$xslt =~s/aacute;/á/gso;
$xslt =~s/Agrave;/À/gso;
$xslt =~s/Acirc;/Â/gso;
$xslt =~s/agrave;/à/gso;
but doing so means that I have to write a regexp for each of the accent codes...
My question is: is there any way to do this without writing a regexp per code? (Thinking that this is the only solution makes me want to vomit.)
By the way the tool is TeamSite, and it sucks.....
Edited: I forgot to mention that I need a Perl-only solution; security does not let me install any libs that they have not checked for a week or so :(
You can try something like HTML::Entities. From the POD:
use HTML::Entities;
$a = "Våre norske tegn bør æres";
decode_entities($a);
#encode_entities($a, "\200-\377"); ## not needed for what you are doing
In response to your edit, HTML::Entities is not in the perl core. It might still be installed on your system because it is used by a lot of other libraries. You can check by running this command:
perl -MHTML::Entities -le 'print "If this prints, then it is installed"'
For your purpose HTML::Entities is by far the best solution, but if you cannot find an existing package that fits your needs, the following approach is more efficient than multiple s/// statements:
# do this part once: in module-level code (outside any function), which runs
# when the module is loaded, or in a BEGIN block, or before the first s/// statement that uses it
my %trans = (
'Aacute;' => 'Á',
'aacute;' => 'á',
'Agrave;' => 'À',
'Acirc;' => 'Â',
'agrave;' => 'à',
); # remember that you can generate parts of this hash, for example with map
my $re = qr/${ \(join'|', map quotemeta, keys %trans)}/;
# place this code in your functions or methods
s/($re)/$trans{$1}/g; # 'o' is almost useless here because $re is already compiled
Edit: There is no need for the e regexp modifier, as mentioned by Chas. Owens.
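Building on the hash lookup above, here is a minimal core-Perl sketch of a variation that might suit the XSLT case. It is only an illustration under a few assumptions: the %codepoint table and the fix_entities name are made up for this example, the table lists just a handful of entities, and it assumes the entities arrive with their leading & intact. Instead of inserting literal accented characters, it rewrites each known named entity as a numeric character reference, which an XML parser accepts without a DTD, and it deliberately leaves unknown names (such as amp and lt) untouched.

```perl
use strict;
use warnings;

# Hypothetical lookup table: entity name => Unicode code point.
# Only a few entries are shown; extend it with the names your XSLT uses.
my %codepoint = (
    Aacute => 0xC1,
    aacute => 0xE1,
    Agrave => 0xC0,
    agrave => 0xE0,
    Acirc  => 0xC2,
    eacute => 0xE9,
);

# Rewrite each known named entity as a numeric character reference.
# Names that are not in the table (including amp, lt, gt) are put back unchanged.
sub fix_entities {
    my ($text) = @_;
    $text =~ s{&([A-Za-z][A-Za-z0-9]*);}
              { exists $codepoint{$1} ? sprintf('&#%d;', $codepoint{$1}) : "&$1;" }ge;
    return $text;
}

my $xslt = '<xsl:text>D&eacute;j&agrave; vu &amp; &Aacute;</xsl:text>';
print fix_entities($xslt), "\n";
# prints: <xsl:text>D&#233;j&#224; vu &amp; &#193;</xsl:text>
```

Here the e modifier is needed because the replacement is an expression rather than a plain hash lookup. Keeping the output pure ASCII also sidesteps any question about which encoding the temporary XSLT file is written in.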
I don't suppose it's possible to make TeamSite leave it as utf-8/convert it to utf-8?
CGI.pm has an (undocumented) unescapeHTML function. Since it IS undocumented (and I haven't looked through the source), I don't know whether it handles only the basic HTML entities (<, >, &) or more, but I'd GUESS that it only does the basic ones.
Why should someone be fired for putting XSL, which is XML, into an XML file?