I have an XML file. When I open it with Emacs, it displays Chinese characters (see attachment). This happens on my Windows 7 PC with both Emacs and Notepad, and also on my Windows XP machine (see figure A). Figure B shows A in hexl-mode.
If I open the file with Notepad on a colleague's Windows XP PC, there are no Chinese characters, but there is one strange character. I saved it as a txt file and sent it by email to my Windows 7 PC (see figure C). The strange character was replaced with "?". (Due to restrictions, I could not use my colleague's PC again to reproduce the Notepad file with the strange character.)
My questions: it seems that there are characters in the XML file which create problems, and I don't know how to cope with that. Does anybody have an idea how I can manage this problem? Does it have something to do with encoding? Thanks for any hints.
By figure B, it looks like this file is encoded with a mixture of big-endian and little-endian UTF-16. It starts with fe ff, which is the byte order mark for big-endian UTF-16, and the XML declaration (<?xml version=...) is also big-endian, but the part starting with <report is little-endian. You can tell because the letters appear on even positions in the first part of the hexl display, but on odd positions further down.
Also, there is a null character (encoded as two bytes, 00 00) right before <report. Null characters are not allowed in XML documents.
However, since some of the XML elements appear correctly in figure A, it seems that the confusion goes on through the file. The file is corrupt, and this probably needs to be resolved manually.
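If you want to check this diagnosis without reading the hexl dump by eye, here is a rough Python sketch. It is not part of the original analysis: the function names scan_utf16 and first_switch are made up for illustration, and it assumes the text is mostly ASCII and that the 2-byte units stay aligned after the byte order mark. It labels each unit as big-endian ASCII, little-endian ASCII, or a null character, and reports where the byte order first flips.

```python
def scan_utf16(data: bytes):
    """Return (bom, units); units is a list of (offset, kind) per 2-byte unit.

    kind is "be" (00 xx), "le" (xx 00), "nul" (00 00) or "other".
    Assumption: mostly ASCII content, units aligned right after the BOM.
    """
    bom, start = None, 0
    if data[:2] == b"\xfe\xff":
        bom, start = "utf-16-be", 2
    elif data[:2] == b"\xff\xfe":
        bom, start = "utf-16-le", 2

    units = []
    for off in range(start, len(data) - 1, 2):
        first, second = data[off], data[off + 1]
        if first == 0 and second == 0:
            kind = "nul"
        elif first == 0:
            kind = "be"
        elif second == 0:
            kind = "le"
        else:
            kind = "other"
        units.append((off, kind))
    return bom, units


def first_switch(units):
    """Return the byte offset where the byte order first changes, or None."""
    seen = None
    for off, kind in units:
        if kind not in ("be", "le"):
            continue  # skip null characters and non-ASCII units
        if seen is None:
            seen = kind
        elif kind != seen:
            return off
    return None


# In-memory sample imitating the structure described for figure B (not the real file).
sample = (
    b"\xfe\xff"
    + "<?xml?>".encode("utf-16-be")
    + b"\x00\x00"
    + "<report".encode("utf-16-le")
)
bom, units = scan_utf16(sample)
print("BOM:", bom)
print("byte order switches at offset:", first_switch(units))
print("null characters at offsets:", [off for off, kind in units if kind == "nul"])
```

For a real file you would read the bytes with open(path, "rb").read() and pass them to scan_utf16. If the file contains many non-ASCII characters, most units will come out as "other" and this heuristic will not tell you much.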
If there are no non-ASCII characters in the file, I would try to open the file in Emacs as binary (M-x revert-buffer-with-coding-system and specify binary), remove all null bytes (M-% C-q C-@ RET RET), save the file and hope for the best.
Another possible solution is to mark each region appearing with Chinese characters and recode it with M-x recode-region, giving "Text was really in" as utf-16-le and "But was interpreted as" as utf-16-be.
The solution from legoscia, using Emacs's ability to change the encoding within a file, solved my problem. Another possibility is:
In my case it worked with Atom, but not with Notepad++.
PS: The reason I went this way is that Emacs could no longer open this kind of corrupted file. I don't know why, but that is a separate issue.
Edit 1: Since copying, pasting and merging is cumbersome, I found a way to open corrupted files with Emacs: emacs -q xmlfile.xml. Using Emacs as legoscia suggested is the best way to repair such files.
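For anyone who cannot use Emacs, the two repairs from legoscia's answer (dropping null bytes from an ASCII-only file, and recoding a region that was read with the wrong byte order) can be approximated in Python. This is only a sketch under assumptions: strip_nul_bytes and recode_region are hypothetical helper names, the first one only makes sense if the file really contains nothing but ASCII, and removing the leading byte order mark is my own addition that the answer does not mention.

```python
def strip_nul_bytes(data: bytes) -> bytes:
    """Remove every 00 byte, like replacing C-@ with nothing in a binary buffer.

    Assumption: the content is pure ASCII, so each character is one
    non-zero byte padded with a 00 byte, in either byte order.
    """
    # Also drop a leading UTF-16 byte order mark (an addition of this sketch),
    # since it is meaningless once the padding bytes are gone.
    if data[:2] in (b"\xfe\xff", b"\xff\xfe"):
        data = data[2:]
    return data.replace(b"\x00", b"")


def recode_region(text: str,
                  really: str = "utf-16-le",
                  interpreted_as: str = "utf-16-be") -> str:
    """Undo a wrong decoding: turn the text back into bytes, then decode correctly."""
    return text.encode(interpreted_as).decode(really)


# In-memory sample with the same kind of damage (not the real file).
mixed = (
    b"\xfe\xff"
    + "<?xml?>".encode("utf-16-be")
    + b"\x00\x00"
    + "<report>".encode("utf-16-le")
)
print(strip_nul_bytes(mixed))

# Little-endian text read as big-endian shows up as CJK characters.
garbled = "<report>".encode("utf-16-le").decode("utf-16-be")
print(recode_region(garbled))
```

After stripping the null bytes, an encoding="UTF-16" attribute in the XML declaration (if present) no longer matches the file and would have to be edited by hand.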