I'm in the process of converting files generated by the ancient DOS-based library program of our university's Chinese Studies Department into something more useful and accessible.
Among the problems I'm dealing with is that the exported text files (about 80 MB in size) are in mixed encoding. I'm on Windows.
German umlauts and other higher-ASCII characters are encoded in cp1252, I think, and CJK characters in GB18030. Because the two encodings overlap, I can't just drag the whole file into Word or the like and let it do the conversion, since I get something like this:
orig:
+Autor:
-Yan, Lianke / ÑÖÁ¬¿Æ # encoded Chinese characters
+Co-Autor:
-Min, Jie / (šbers.) # encoded German U-umlaut (Ü)
result:
+Autor:
-Yan, Lianke / 阎连科 # good
+Co-Autor:
-Min, Jie / (歜ers.) # bad... (should be: "Übers.")
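To make the overlap concrete: running just the byte pair that should become "Üb" through a GB18030 decode swallows both bytes as a single CJK character. A minimal sketch (Encode::HanExtra provides the GB18030 mapping, as in my script further down):
use strict;
use warnings;
use Encode qw(decode);
use Encode::HanExtra;                    # provides the GB18030 encoding

my $bytes   = "\x9A\x62";                # the "Üb" of "(Übers.)" in the export
my $decoded = decode('GB18030', $bytes); # 0x9A is taken as a lead byte, the "b" is swallowed
printf "%d character(s) instead of 2\n", length $decoded;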
So I wrote a script with several subroutines that converts non-ASCII characters in several steps. It does the following things (among others):
1. Replace some higher-order ASCII characters (š, á, etc.) with alphanumeric codes that are unlikely to appear naturally anywhere else in the file. Example:
   -Min, Jie / (šbers.)
   -> -Min, Jie / (uumlautgrossbers.)
   Note: I made the "conversion table" by hand, so I only took the special characters actually appearing in my document into consideration. The conversion is therefore not fully complete, but it yields adequate results in my case, since our books are mostly in German, English and Chinese, with only very few in languages such as Italian, Spanish or French, and almost none in Czech etc.
2. Replace á, £, ¢, ¡ and í with alphanumeric codes only if they are not preceded or followed by another character in the high ASCII range \x80-\xFF. (These are the cp1252-encoded appearances of ß, ú, ó, í and the "small Nordic o with cross-stroke" (ø), and they occur both in cp1252- and in GB18030-encoded strings.)
3. Read the whole file in and convert it from GB18030 to UTF-8, thereby turning the encoded Chinese characters into real Chinese characters.
4. Convert the alphanumeric codes back to their Unicode equivalents.
Although the script mostly works, the following problem arises:
- After converting the original 80MB file, Notepad++ still thinks it is an ANSI file and displays it as such. I need to press "Encoding->Encode in UTF-8" in order to display it correctly.
What I'd like to know is:
1. Generally, is there a better approach to converting a mixed-encoding file into UTF-8?
2. If not, should I use "use utf8" so that I can type the characters directly in the codes2char subroutine instead of their hex representations? (See the sketch after this list.)
3. Would a BOM at the beginning of the file solve the problem of NP++ initially displaying it as an ANSI file? If so, how should I modify my script so that the output file gets a BOM?
4. After the conversion I may want to call some more subroutines (e.g. to convert the whole file to CSV or ODS format). Do I need to keep using the opening statement from the codes2char subroutine?
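This is roughly what I have in mind for 2.-4., as an untested sketch (the BOM handling is just my guess, and the script file itself would have to be saved as UTF-8 for "use utf8" to work):
#!perl
use strict;
use warnings;
use utf8;   # lets the source contain literal ö, ä, ü, ß, ...

open(my $in,  "< :encoding(UTF-8)", "export.txt")  or die "$!\n";
open(my $out, "> :encoding(UTF-8)", "export2.txt") or die "$!\n";

print $out "\x{FEFF}";   # U+FEFF through the UTF-8 layer = EF BB BF, the BOM NP++ looks for

while (<$in>) {
    s#oumlautklein#ö#g;   # literal characters instead of \xF6 etc.
    s#aumlautklein#ä#g;
    s#uumlautklein#ü#g;
    ## ... and so on
    print $out $_;
}
close $in;
close $out;
Any later subroutine (CSV/ODS export) would then presumably also have to open its files with the same "< :encoding(UTF-8)" / "> :encoding(UTF-8)" layers, as codes2char does now.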
The code is composed of several subroutines which are called at the end:
#!perl -w
use strict;
use warnings;
use Encode qw(decode encode);
use Encode::HanExtra;
our $input = "export.txt";
our $output = "export2.txt";
sub switch_var { # switch Input and Output file between steps
($input, $output) = ($output, $input);
}
sub specialchars2codes {
open my $in,  '<', $input  or die "$!\n";
open my $out, '>', $output or die "$!\n";
while( <$in> ) {
## replace higher ASCII characters such as a-umlaut etc. with codes.
s#\x94#oumlautklein#g;
s#\x84#aumlautklein#g;
s#\x81#uumlautklein#g;
## ... and some more. (ö, Ö, ä, Ä, Ü, ü, ê, è, é, É, â, á, à, ì, î,
## û, ù, ô, ò, ç, ï, a°, e-umlaut and ñ in total.)
## replace problematic special characters (ß, ú, ó, í, ø, ') with codes.
s#(?<![\x80-\xFF])\xE1(?![\x80-\xFF])#eszett#g;
s#(?<![\x80-\xFF])\xA3(?![\x80-\xFF])#uaccentaiguklein#g;
s#(?<![\x80-\xFF])\xA2(?![\x80-\xFF])#oaccentaiguklein#g;
s#(?<![\x80-\xFF])\xA1(?![\x80-\xFF])#iaccentaiguklein#g;
s#(?<![\x80-\xFF])\xED(?![\x80-\xFF])#nordischesoklein#g;
print $out $_;
}
close $out;
close $in;
}
sub convert2unicode {
open(my $in,  "< :encoding(GB18030)", $input)  or die "$!\n";
open(my $out, "> :encoding(UTF-8)",   $output) or die "$!\n";
print "Convert GB18030 to UTF-8\n\n";
while (<$in>) {
print $out $_;
}
close $in;
close $out;
}
sub codes2char {
open(my $in,  "< :encoding(UTF-8)", $input)  or die "$!\n";
open(my $out, "> :encoding(UTF-8)", $output) or die "$!\n";
print "Replace codes with original characters.\n";
while (<$in>) {
s#oumlautklein#\xF6#g;
s#aumlautklein#\xE4#g;
s#uumlautklein#\xFC#g;
## ... and some more.
s#eszett#\xDF#g;
s#uaccentaiguklein#\xFA#g;
s#oaccentaiguklein#\xF3#g;
s#iaccentaiguklein#\xED#g;
s#nordischesoklein#\xF8#g;
print $out $_;
}
close($in) or die "can't close $input: $!";
close($out) or die "can't close $output: $!";
}
##################
## Main program ##
##################
specialchars2codes();
switch_var();
convert2unicode();
switch_var();
codes2char();
Wow, this was long. I hope it's not too convoluted.
EDIT:
This is a hexdump of the example string above:
01A36596 2B 41 +A
01A365A9 75 74 6F 72 3A 0D 0A 2D 59 61 6E 2C 20 4C 69 61 6E 6B 65 utor: -Yan, Lianke
01A365BC 20 2F 20 D1 D6 C1 AC BF C6 0D 0A 2B 43 6F 2D 41 75 74 6F / ÑÖÁ¬¿Æ +Co-Auto
01A365CF 72 3A 0D 0A 2D 4D 69 6E 2C 20 4A 69 65 20 2F 20 28 9A 62 r: -Min, Jie / (šb
01A365E2 65 72 73 2E 29 0D 0A ers.)
and another two to illustrate:
1.
000036B3 2D 52 75 -Ru
000036C6 E1 6C 61 6E 64 0D 0A áland
2.
015FE030 2B 54 69 74 65 6C 3A 0D 0A 2D 57 65 6E 72 6F 75 +Titel: -Wenrou
015FE043 64 75 6E 68 6F 75 20 20 CE C2 C8 E1 B6 D8 BA F1 20 28 47 dunhou ÎÂÈá¶Øºñ (G
015FE056 65 6E 74 6C 65 6E 65 73 73 20 61 6E 64 20 4B 69 6E 64 6E entleness and Kindn
015FE069 65 73 73 29 2E 0D 0A ess).
In both cases the hex value E1 appears. In the first it stands for a German sharp s (ß, "Rußland" = "Russia"); in the second it is part of the multi-byte CJK character 柔 (reading: "rou").
In the library program, the Chinese characters are entered and displayed with an additional program that has to be loaded first and that, as far as I can tell, hooks into the graphics driver at a low level, catching encoded Chinese characters and displaying them as characters while leaving everything else alone. The German umlauts etc. are handled by the library program itself.
I don't fully understand how this works, i.e. how the programs know whether hex E1 is to be treated as the single character á (and thus converted according to codepage X) or as part of a multi-byte character (and thus converted according to codepage Y).
The closest approximation I have found is that a special character is likely to be part of a Chinese string if there are other special characters directly before or after it (e.g. ÎÂÈá¶Øºñ).
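If that heuristic holds, I imagine a one-pass variant would look roughly like this (untested sketch; "export3.txt" is just a placeholder name, and it will of course misfire on CJK characters whose second GB byte happens to be below \x80):
use strict;
use warnings;
use Encode qw(decode);
use Encode::HanExtra;   # provides the GB18030 encoding

open(my $in,  "<",                  "export.txt")  or die "$!\n";
open(my $out, "> :encoding(UTF-8)", "export3.txt") or die "$!\n";

while (my $line = <$in>) {
    # runs of two or more high bytes -> GB18030 (CJK),
    # isolated high bytes -> cp1252 (or whatever the single-byte codepage really is)
    $line =~ s{([\x80-\xFF]{2,})|([\x80-\xFF])}{
        defined $1 ? decode('GB18030', $1) : decode('cp1252', $2)
    }ge;
    print $out $line;
}
close $in;
close $out;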