I have a script, VBS or Ruby, that saves a Word document as 'Filtered HTML', but the encoding parameter is ignored. The HTML file is always encoded in Windows-1252. I'm using Word 2007 SP3 on Windows 7 SP1.
Ruby Example:
require 'win32ole'
word = WIN32OLE.new('Word.Application')
word.visible = false
word_document = word.documents.open('C:\whatever.doc')
word_document.saveas({'FileName' => 'C:\whatever.html', 'FileFormat' => 10, 'Encoding' => 65001})
word_document.close()
word.quit
VBS Example:
Option Explicit
Dim MyWord
Dim MyDoc
Set MyWord = CreateObject("Word.Application")
MyWord.Visible = False
Set MyDoc = MyWord.Documents.Open("C:\whatever.doc")
MyDoc.SaveAs "C:\whatever2.html", 10, , , , , , , , , , 65001
MyDoc.Close
MyWord.Quit
Set MyDoc = Nothing
Set MyWord = Nothing
Documentation:
Document.SaveAs: http://msdn.microsoft.com/en-us/library/bb221597.aspx
msoEncoding values: http://msdn.microsoft.com/en-us/library/office/aa432511(v=office.12).aspx
Any suggestions on how to make Word save the HTML file in UTF-8?
Word can't do this as far as I know.
However, you could re-encode the file yourself by adding a few lines to the end of your Ruby script.
If you have an older version of Ruby, you may need to use Iconv. If you have special characters in 'C:\whatever.html', you'll want to look into your invalid/undefined replacement options. You'll also probably want to update the charset in the HTML meta tag before you write to the file.
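A minimal sketch of that re-encoding step, assuming Ruby 1.9 or later (where String#encode is available) and that Word wrote the file as Windows-1252. The method name transcode_html_to_utf8 and the '?' replacement character are illustrative choices, not part of the original answer.

```ruby
# Re-encode a Windows-1252 HTML file (as written by Word) to UTF-8.
# Hypothetical helper: the name and arguments are illustrative only.
def transcode_html_to_utf8(source_path, target_path)
  # Read the raw bytes and label them with the encoding Word used.
  text = File.binread(source_path).force_encoding('Windows-1252')

  # Convert to UTF-8. Bytes that have no mapping are replaced with '?'
  # instead of raising an exception.
  utf8 = text.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')

  # Update the charset declared in the HTML meta tag to match.
  utf8 = utf8.sub(/charset=windows-1252/i, 'charset=utf-8')

  # Write the UTF-8 bytes unchanged.
  File.binwrite(target_path, utf8)
end
```

It could be called after word.quit, for example as transcode_html_to_utf8('C:\whatever.html', 'C:\whatever-utf8.html'); the second path is made up for the example.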
My solution was to open the HTML file using the same character set that Word used to save it. I also added a whitelist filter (Sanitize) to clean up the HTML. Further cleaning is done using Nokogiri, which Sanitize also relies on.
HTML Sanitizer: https://github.com/rgrove/sanitize/
HTML parser and modifier: http://nokogiri.org/
In Word 2010 there is a new method, SaveAs2: http://msdn.microsoft.com/en-us/library/ff836084(v=office.14).aspx
I haven't tested SaveAs2, since I don't have Word 2010.
Hi Bo Frederiksen and kardeiz,
I also ran into the "Word Document.SaveAs ignores encoding" problem today, in Word 2003 (11.8411.8202) SP3.
Luckily, I managed to make msoEncodingUTF8 (namely, 65001) work in VBA code. However, I had to change the Word document's settings first. The steps are:
1) From Word's 'Tools' menu, choose 'Options'.
2) Then click 'General'.
3) Press the 'Web Options' button.
4) In the 'Web Options' dialog that pops up, click 'Encoding'.
5) There you will find a combo box where you can change the encoding, for example, from 'GB2312' to 'Unicode (UTF-8)'.
6) Save the changes and try to rerun the VBA code.
I hope my answer can help you. Below is my code.
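As a rough Ruby counterpart to the manual steps above (this is not the VBA code mentioned in the answer), the sketch below sets the encoding through the Word object model before saving. It assumes that the document's WebOptions.Encoding property corresponds to the 'Web Options' dialog setting, and it has not been tested against Word 2003 or 2007. The helper name save_as_utf8_filtered_html and the two constants are illustrative.

```ruby
MSO_ENCODING_UTF8 = 65001       # msoEncodingUTF8
WD_FORMAT_FILTERED_HTML = 10    # wdFormatFilteredHTML

# Hypothetical helper: word_document is expected to be a WIN32OLE
# Word document object, as returned by word.documents.open.
def save_as_utf8_filtered_html(word_document, html_path)
  # Assumption: this is the programmatic equivalent of choosing
  # 'Unicode (UTF-8)' in the 'Web Options' dialog for this document.
  word_document.weboptions.encoding = MSO_ENCODING_UTF8

  # Same SaveAs call as in the question.
  word_document.saveas({ 'FileName'   => html_path,
                         'FileFormat' => WD_FORMAT_FILTERED_HTML,
                         'Encoding'   => MSO_ENCODING_UTF8 })
end

# Usage on Windows with Word installed:
#   require 'win32ole'
#   word = WIN32OLE.new('Word.Application')
#   word.visible = false
#   word_document = word.documents.open('C:\whatever.doc')
#   save_as_utf8_filtered_html(word_document, 'C:\whatever.html')
#   word_document.close()
#   word.quit
```

Setting the property on the document keeps the change local to that file; whether Word then honors the encoding may still depend on the version.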