I'm developing a Java program that processes the XML content of docx files and converts it to a specific format. It works quite well, but I run into problems when the Word file contains Symbol characters, e.g. Greek letters. In that case I see only little squares.
I checked the source and see something like this:
<w:r w:rsidRPr="008E65F6"><w:rPr><w:rFonts w:ascii="Symbol" w:hAnsi="Symbol"/></w:rPr><w:t>ďˇ</w:t></w:r>
Or if I set the encoding to UTF-8:
<w:r w:rsidRPr="008E65F6"><w:rPr><w:rFonts w:ascii="Symbol" w:hAnsi="Symbol"/></w:rPr><w:t></w:t></w:r>
When I view the file in a hex editor, the Greek characters appear to be encoded as EF 81 A1 for alpha, EF 81 A2 for beta, and so on.
I also tried val.getBytes(Charset.forName("utf8")), where val is the value of the <w:t> tag. The result is e.g. [-17, -127, -95]. The negative values surprised me.
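For reference, the negative numbers are the same three bytes seen in the hex view: Java's byte type is signed, so 0xEF prints as -17. A minimal sketch (the class name SymbolBytes and the helper toHex are made up for illustration) that shows the bytes as unsigned hex and decodes them to a code point:

```java
import java.nio.charset.StandardCharsets;

class SymbolBytes {
    // Formats signed Java bytes as unsigned two-digit hex values.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF)); // mask to the range 0..255
        }
        return sb.toString();
    }

    // Decodes UTF-8 bytes and returns the first code point of the result.
    static int firstCodePoint(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8).codePointAt(0);
    }

    public static void main(String[] args) {
        byte[] raw = {-17, -127, -95};
        System.out.println(toHex(raw)); // EF 81 A1
        System.out.printf("U+%04X%n", firstCodePoint(raw)); // U+F061
    }
}
```

So the three bytes are the UTF-8 encoding of the single code point U+F061, which lies in the Unicode Private Use Area (U+E000 to U+F8FF).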
So my question is: what is a safe and reliable way to convert these symbols to regular UTF-8 characters?
Meanwhile, I have found the solution, so I am adding it as an answer for future reference.
I checked the Symbol font with a glyph viewer and realized that it uses the Private Use Area of Unicode for its characters. Other fonts, like Times New Roman, store these characters (e.g. Greek letters) in the normal Unicode range.
So the solution is to map the Symbol glyphs to standard Unicode characters. I created a conversion table by hand for the Greek letters (upper and lower case), punctuation, numbers, and mathematical symbols available in the Symbol font. Note that even the order of the characters differs in various ranges, e.g. the Greek alphabet is not in the same order in Symbol and Unicode, so I had to check the character codes one by one.
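To illustrate what such a table looks like, here is a small sketch with a handful of entries. The first two (alpha, beta) follow from the bytes shown in the question; the others are my reading of the standard Symbol layout, where the Greek letters sit on the positions of the Latin letters they resemble (c is chi, d is delta, g is gamma), so please verify them against your own font. The class and method names are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

class SymbolTableSample {
    // A few sample entries: Symbol Private Use Area code point -> Unicode code point.
    // Assumption: the font follows the usual Symbol layout; check with a glyph viewer.
    static Map<Integer, Integer> sampleTable() {
        Map<Integer, Integer> table = new HashMap<>();
        table.put(0xF061, 0x03B1); // alpha
        table.put(0xF062, 0x03B2); // beta
        table.put(0xF063, 0x03C7); // chi (not gamma: the order differs from Unicode)
        table.put(0xF064, 0x03B4); // delta
        table.put(0xF067, 0x03B3); // gamma
        return table;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> table = sampleTable();
        for (int pua = 0xF061; pua <= 0xF064; pua++) {
            String unicodeChar = new String(Character.toChars(table.get(pua)));
            System.out.printf("U+%04X -> U+%04X %s%n", pua, table.get(pua), unicodeChar);
        }
    }
}
```

The chi entry shows why a simple offset does not work: consecutive Symbol codes do not map to consecutive Unicode code points.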
Once I had the conversion table, I stored it in a txt file. When my application finds a segment (run) in the Word file that is formatted with the Symbol font (the <w:rFonts> tag in the example), it calls the conversion method. In this method, I parse the txt file into a HashMap and change the characters one by one from Symbol code to Unicode:
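// Replaces each Symbol (Private Use Area) code point with its Unicode equivalent.
public String convert(String symbolString) {
    StringBuilder sb = new StringBuilder();
    // Advance by whole code points so a surrogate pair is never split.
    for (int k = 0; k < symbolString.length(); ) {
        int origCode = symbolString.codePointAt(k);
        Integer replaceCode = conversionTable.get(origCode);
        if (replaceCode != null) {
            sb.appendCodePoint(replaceCode);
        } else {
            sb.append('?'); // no mapping known for this code point
        }
        k += Character.charCount(origCode);
    }
    return sb.toString();
}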
where conversionTable is the HashMap object that maps each Symbol code to its replacement code (stored as hex values).
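Loading the table from the txt file could look roughly like the sketch below. The file format is an assumption for the example (one `symbolHex=unicodeHex` pair per line, with `#` starting a comment line), and ConversionTableLoader and load are illustrative names, so adapt the parsing to however you lay out your own file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

class ConversionTableLoader {
    // Parses lines such as "F061=03B1" (assumed format) into a code point map.
    static Map<Integer, Integer> load(Reader source) {
        Map<Integer, Integer> table = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("#")) {
                    continue; // skip blank lines and comments
                }
                String[] parts = line.split("=", 2);
                if (parts.length != 2) {
                    continue; // skip lines that are not key=value pairs
                }
                int symbolCode = Integer.parseInt(parts[0].trim(), 16);
                int unicodeCode = Integer.parseInt(parts[1].trim(), 16);
                table.put(symbolCode, unicodeCode);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return table;
    }

    public static void main(String[] args) {
        String sample = "# Symbol=Unicode\nF061=03B1\nF062=03B2\n";
        Map<Integer, Integer> table = load(new StringReader(sample));
        System.out.println(table.size()); // 2
    }
}
```

For a real file you would pass something like Files.newBufferedReader(path, StandardCharsets.UTF_8) instead of the StringReader. It is probably worth loading the table once and reusing it, rather than parsing the file on every call to convert.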