Character encoding of Microsoft Word DOC and DOCX

Posted 2019-04-22 17:36

Question:

I'm not too familiar with the encoding that Microsoft Word uses. If someone were to save a .doc or .docx file from Word, what is the standard encoding that is used?

I'm guessing it's not UTF-8, as the resulting text (pasted into a UTF-8 encoded text file) does not honour certain punctuation (e.g. quotes).

For example, an opening Word 'smart quote', when pasted into a UTF-8 text file, results in an ì symbol. If Word does indeed encode in UTF-8, then how does Word attempt to render the actual UTF-8 character?

Edit

After doing a little digging, I can see that a Microsoft Word .docx file is actually a compressed format. Unzipping it produces a number of .xml files.

However, the inability of a UTF-8 encoded text file to honour these 'smart' quotes is still perplexing. Any enlightening information would be helpful.

Answer 1:

These days a .docx file is really a zip archive of XML files. One of these files is document.xml, which starts with the following line (its XML declaration):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

As you can see, it is UTF-8 encoded.
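A quick way to check this yourself is to open the .docx as a zip archive and read the first line of word/document.xml. The following is only a sketch using Python's standard zipfile module. The helper names read_prolog and make_sample_docx are made up for illustration, and the tiny in-memory archive is a stand-in for a real file saved from Word, not a complete Word document.

```python
import io
import zipfile


def read_prolog(docx_bytes: bytes, member: str = "word/document.xml") -> str:
    """Return the first line (the XML declaration) of a part inside a .docx archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        raw = archive.read(member)
    # The declaration itself is plain ASCII, so decoding only the first line is safe.
    return raw.split(b"\n", 1)[0].decode("ascii").strip()


def make_sample_docx() -> bytes:
    """Build a minimal stand-in archive (hypothetical content, not a full Word file)."""
    xml = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
        '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
        "<w:body><w:p><w:r><w:t>\u201c this is another test \u201d</w:t></w:r></w:p></w:body>"
        "</w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/document.xml", xml.encode("utf-8"))
    return buffer.getvalue()


print(read_prolog(make_sample_docx()))
```

For a real document, the bytes would come from reading the .docx file in binary mode instead of from make_sample_docx.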

EDIT

UTF-8 can encode the full set of Unicode characters. Just for the sake of completeness, that does not mean that every Unicode character can actually be used in an XML file: XML 1.0 excludes most control characters, and even a CDATA block has its limitations. But having said all that, storing a ` or an ì isn't a problem.
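To make that restriction concrete, here is a small sketch of the character ranges that XML 1.0 allows (its Char production). The function name is_xml10_char is invented for this example. It only illustrates the rule and is not how Word itself validates content.

```python
def is_xml10_char(ch: str) -> bool:
    """Return True if the single character ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x09, 0x0A, 0x0D)        # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF         # most of the Basic Multilingual Plane
        or 0xE000 <= cp <= 0xFFFD       # skips surrogates, U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF    # supplementary planes
    )


# Backtick, i with grave, opening smart quote, NUL, vertical tab
for ch in ["`", "\u00ec", "\u201c", "\x00", "\x0b"]:
    print(f"U+{ord(ch):04X}", is_xml10_char(ch))
```

The backtick, the ì and the smart quote all fall inside the allowed ranges, while control characters such as NUL do not.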

And more importantly, the file format does not really have anything to do with copy-paste behavior of the application itself.
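As a possible illustration of how such a substitution can happen outside the file format: byte 0x93 is the opening curly double quote in Windows-1252, and the same byte is ì in Mac OS Roman. Whether that is what happened here is purely an assumption, since the question does not say which applications or platforms were involved. The sketch below only shows the mechanism.

```python
# Hypothetical scenario: text leaves one program as Windows-1252 bytes
# and another program interprets those bytes as Mac OS Roman.
smart_quote = "\u201c"                      # LEFT DOUBLE QUOTATION MARK
as_cp1252 = smart_quote.encode("cp1252")    # a single byte in Windows-1252
misread = as_cp1252.decode("mac_roman")     # same byte, different code page

print(as_cp1252.hex(), misread)

# When both sides agree on UTF-8, the character survives the round trip.
round_trip = smart_quote.encode("utf-8").decode("utf-8")
print(round_trip == smart_quote)
```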

Nevertheless, here's how Word would store a ` and an ì symbol.

CORRECTION

A bit confusing, but I just realized that by "smart quote" you probably mean the mechanism Word uses to produce curly quotes. In my previous answer I thought you meant "backticks", which are a different thing. Sorry for the confusion.

Well, anyway, here are the Unicode code points for these smart quotes: U+2018, U+2019, U+201C and U+201D.

Let's put them in a simple UTF-8 encoded text file. The result is not that spectacular:

  • U+2018 is encoded in UTF-8 as E2 80 98
  • U+2019 is encoded in UTF-8 as E2 80 99
  • U+201C is encoded in UTF-8 as E2 80 9C
  • U+201D is encoded in UTF-8 as E2 80 9D
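The byte sequences in this list can be reproduced with a few lines of Python. This is just a sketch for checking the values. The helper name utf8_hex is made up for the example.

```python
smart_quotes = {
    "U+2018": "\u2018",  # left single quotation mark
    "U+2019": "\u2019",  # right single quotation mark
    "U+201C": "\u201c",  # left double quotation mark
    "U+201D": "\u201d",  # right double quotation mark
}


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of ch as space-separated uppercase hex pairs."""
    return " ".join(f"{byte:02X}" for byte in ch.encode("utf-8"))


for label, ch in smart_quotes.items():
    print(f"{label} is encoded in UTF-8 as {utf8_hex(ch)}")
```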

So, I went one step further and put them in a Word file. I entered one line with regular quotes and one with smart quotes.

" this is a test "
“ this is another test ”

Then I saved the file and looked at how it was stored in Word's XML structure. It is stored exactly as expected.
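If you want to repeat the check, something along these lines should work. It is a minimal sketch: the functions build_sample and paragraphs are hypothetical helpers, and the in-memory archive only imitates the relevant part of what Word writes (paragraphs as w:p elements with text in w:t elements). For a real file you would read the .docx bytes from disk instead.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def build_sample() -> bytes:
    """Create a stand-in .docx archive with the two test lines (not a full Word file)."""
    lines = ['" this is a test "', "\u201c this is another test \u201d"]
    body = "".join(f"<w:p><w:r><w:t>{line}</w:t></w:r></w:p>" for line in lines)
    xml = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml.encode("utf-8"))
    return buffer.getvalue()


def paragraphs(docx_bytes: bytes):
    """Return the raw document.xml bytes and the text of each paragraph."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        raw = archive.read("word/document.xml")
    root = ET.fromstring(raw)
    texts = [
        "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        for p in root.iter(f"{{{W_NS}}}p")
    ]
    return raw, texts


raw, texts = paragraphs(build_sample())
print(texts)
# The opening smart quote appears in the stored XML as the bytes E2 80 9C.
print(b"\xe2\x80\x9c" in raw)
```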