I'm not too familiar with the encoding that Microsoft Word uses. If someone were to save a .doc or .docx file from Word, what is the standard encoding that is used?
I'm guessing it's not UTF-8, as the resulting text (pasted into a UTF-8 encoded text file) does not honour certain punctuation (e.g. quotes).
For example, an opening Word 'smart quote', when pasted into a UTF-8 text file, results in an ì symbol. If Word does indeed encode in UTF-8, then how does Word attempt to render the actual UTF-8 character?
Edit
After doing a little digging, I can see that a Microsoft Word .docx file is actually a compressed format. Unzipping it results in a number of .xml files to be unpacked.
However, the inability of a UTF-8 encoded text file to honour these 'smart' quotes is still perplexing. Any enlightening information would be helpful.
These days a docx file is really a bunch of compressed XML files. One of these files is document.xml, which starts with the following line (i.e. an XML prolog):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
As you can see, the file is encoded in UTF-8.
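If you want to check the prolog yourself without unzipping by hand, something along these lines should work. It is only a sketch: `read_xml_prolog` and `make_demo_docx` are names I made up, and the demo archive is a tiny in-memory stand-in, not a valid Word document. Point `read_xml_prolog` at a real .docx path to inspect an actual file.

```python
import io
import zipfile

PROLOG = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'


def read_xml_prolog(docx, member="word/document.xml"):
    """Return the XML declaration of one part of a .docx, or None if absent.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        head = archive.read(member)[:200]
    # Skip a UTF-8 byte order mark if one is present.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    if not head.startswith(b"<?xml"):
        return None
    end = head.find(b"?>")
    if end == -1:
        return None
    return head[: end + 2].decode("ascii")


def make_demo_docx():
    """Build a tiny in-memory stand-in for a .docx (not a valid Word file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", PROLOG + b"\r\n<document/>")
    buffer.seek(0)
    return buffer


print(read_xml_prolog(make_demo_docx()))
```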
EDIT
UTF-8 supports the full set of Unicode characters. Just for the sake of completeness, that does not mean that every Unicode character can actually be used in an XML file. Even a CDATA block has its limitations. But having said all that, storing a ` or an ì isn't a problem.
And more importantly, the file format does not really have anything to do with copy-paste behavior of the application itself.
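One possible explanation for the ì, and this is an assumption on my part since the question does not say which platform or editor was involved: in Windows-1252 the opening double smart quote is the single byte 0x93, and in Mac OS Roman that same byte is ì. So if the clipboard text ends up as Windows-1252 bytes and is then read as Mac OS Roman, you would see exactly that symbol. The sketch below just demonstrates the byte-level mismatch.

```python
opening_quote = "\u201c"  # LEFT DOUBLE QUOTATION MARK

# Windows-1252 stores this character as the single byte 0x93.
cp1252_bytes = opening_quote.encode("cp1252")

# Reading that byte as Mac OS Roman yields a different character.
misread = cp1252_bytes.decode("mac_roman")

# On its own, the byte 0x93 is not valid UTF-8 either.
try:
    cp1252_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(cp1252_bytes.hex(), misread == "\u00ec", valid_utf8)
```

The point is that the mismatch happens between the application and the text file, not inside the .docx itself.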
Nevertheless, here's how Word would store a ` and an ì symbol.
CORRECTION
A bit confusing, but I just realized that by "smart quote" you probably mean the mechanism that Word uses to produce curly quotes. In my previous answer I thought you meant "backticks", which is a different thing. Sorry for the confusion.
Well, anyway, here are the Unicode code points for these smart quotes:
Let's put them in a simple UTF-8 encoded text file.
The result is not that spectacular:
U+2018 is encoded in UTF-8 as E2 80 98
U+2019 is encoded in UTF-8 as E2 80 99
U+201C is encoded in UTF-8 as E2 80 9C
U+201D is encoded in UTF-8 as E2 80 9D
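These byte sequences are easy to reproduce. A minimal sketch (the helper name `utf8_hex` is mine):

```python
SMART_QUOTES = {
    "U+2018": "\u2018",  # left single quotation mark
    "U+2019": "\u2019",  # right single quotation mark
    "U+201C": "\u201c",  # left double quotation mark
    "U+201D": "\u201d",  # right double quotation mark
}


def utf8_hex(character):
    """Return the UTF-8 bytes of a character as space-separated hex pairs."""
    return " ".join(f"{byte:02X}" for byte in character.encode("utf-8"))


for code_point, character in SMART_QUOTES.items():
    print(f"{code_point} is encoded in UTF-8 as {utf8_hex(character)}")
```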
So, I went one step further and put them in a Word file.
I entered one line with regular quotes and one with smart quotes.
" this is a test "
“ this is another test ”
Then I saved the file and looked at how it was stored in Word's XML structure. It is stored exactly as expected.
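To illustrate what "as expected" means at the byte level, here is a small sketch. The XML fragment is hypothetical and heavily simplified (plain `p` and `t` elements, no namespaces); it is not copied from a real document.xml. It only shows that smart quotes written as UTF-8 bytes survive an XML round trip.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment; real Word markup is namespaced and richer.
fragment = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    "<p><t>\u201c this is another test \u201d</t></p>"
).encode("utf-8")

# The smart quotes appear in the raw bytes as their UTF-8 sequences.
has_opening = b"\xe2\x80\x9c" in fragment
has_closing = b"\xe2\x80\x9d" in fragment

# Parsing the bytes gives back the original characters.
text = ET.fromstring(fragment).findtext("t")

print(has_opening, has_closing, text == "\u201c this is another test \u201d")
```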