Word Document.SaveAs ignores encoding, when callin

Posted 2019-02-18 23:41

Question:

I have a script, VBS or Ruby, that saves a Word document as 'Filtered HTML', but the encoding parameter is ignored. The HTML file is always encoded in Windows-1252. I'm using Word 2007 SP3 on Windows 7 SP1.

Ruby Example:

require 'win32ole'
word = WIN32OLE.new('Word.Application')
word.visible = false
word_document = word.documents.open('C:\whatever.doc')
word_document.saveas({'FileName' => 'C:\whatever.html', 'FileFormat' => 10, 'Encoding' => 65001})
word_document.close()
word.quit

VBS Example:

Option Explicit
Dim MyWord
Dim MyDoc
Set MyWord = CreateObject("Word.Application")
MyWord.Visible = False
Set MyDoc = MyWord.Documents.Open("C:\whatever.doc")
MyDoc.SaveAs "C:\whatever2.html", 10, , , , , , , , , , 65001
MyDoc.Close
MyWord.Quit
Set MyDoc = Nothing
Set MyWord = Nothing

Documentation:

Document.SaveAs: http://msdn.microsoft.com/en-us/library/bb221597.aspx

msoEncoding values: http://msdn.microsoft.com/en-us/library/office/aa432511(v=office.12).aspx

Any suggestions on how to make Word save the HTML file in UTF-8?

Answer 1:

Word can't do this as far as I know.

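However, you could add the following lines to the end of your Ruby script:

# Read the file as Windows-1252 (what Word wrote), then transcode to UTF-8.
text_as_utf8 = File.read('C:\whatever.html', encoding: 'Windows-1252').encode('UTF-8')
File.open('C:\whatever.html','wb') {|f| f.print text_as_utf8}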

If you have an older version of Ruby (before 1.9, where String#encode was introduced), you may need to use Iconv. If you have special characters in 'C:\whatever.html', you'll want to look into the :invalid and :undef replacement options of String#encode.

You'll also probably want to update the charset in the HTML meta tag:

text_as_utf8.gsub!('charset=windows-1252', 'charset=UTF-8')

before you write to the file.
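
Putting the pieces of this answer together, here is a minimal sketch of the conversion that runs without Word. The helper name `cp1252_html_to_utf8` and the sample markup are made up for illustration; the replacement character `?` is just one possible choice.

```ruby
# Hypothetical helper: takes the raw bytes Word wrote (assumed to be
# Windows-1252) and returns a UTF-8 string with the charset declaration updated.
def cp1252_html_to_utf8(raw_bytes)
  # Tag the bytes with their real encoding without changing them.
  text = raw_bytes.dup.force_encoding(Encoding::Windows_1252)
  # Transcode; bytes with no Unicode mapping are replaced instead of raising.
  utf8 = text.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: '?')
  # Update the charset declared in the meta tag to match the new encoding.
  utf8.sub(/charset=windows-1252/i, 'charset=UTF-8')
end

# Sample input standing in for Word's output: 0xE9 is "e acute" in Windows-1252.
sample = "<meta http-equiv=Content-Type content=\"text/html; charset=windows-1252\"><p>caf\xE9</p>".b
puts cp1252_html_to_utf8(sample)
```

In a real script you would pass in `File.binread('C:\whatever.html')` and write the result back with `File.binwrite`.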



Answer 2:

My solution was to open the HTML file using the same character set that Word used to save it. I also added a whitelist filter (Sanitize) to clean up the HTML. Further cleaning is done using Nokogiri, which Sanitize also relies on.

require 'nokogiri'
require 'sanitize'

# ... add some code converting a Word file to HTML.

# Post export cleanup.
# Read as Windows-1252 and transcode to UTF-8; File.read also closes the file.
html = '<!DOCTYPE html>' + File.read(html_file_name, mode: 'r:windows-1252:utf-8')
html_document = Nokogiri::HTML::Document.parse(html)
Sanitize.new(Sanitize::Config::RESTRICTED).clean_node!(html_document)
html_document.css('html').first['lang'] = 'en-US'
html_document.css('meta[name="Generator"]').first.remove()

# ... add more cleaning up of Word's HTML noise.

sanitized_html = html_document.to_html({:encoding => 'utf-8', :indent => 0})
# writing output to (new) file
sanitized_html_file_name = word_file_name.sub(/(.*)\..*$/, '\1.html')
File.open(sanitized_html_file_name, 'w:UTF-8') do |f|
    f.write sanitized_html
end
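
The key step above is the "r:windows-1252:utf-8" mode string, which tells Ruby the file's external encoding and the internal encoding to transcode to while reading. A small standalone sketch of just that step, using a temporary file with made-up content in place of Word's export:

```ruby
require 'tempfile'

# Write a few Windows-1252 bytes to a temporary file to stand in for Word's output.
# 0xE6 0xF8 0xE5 are the letters ae-ligature, o-slash and a-ring in Windows-1252.
tmp = Tempfile.new(['word_export', '.html'])
tmp.binmode
tmp.write("<p>\xE6\xF8\xE5</p>".b)
tmp.close

# External encoding windows-1252, internal encoding utf-8:
# the bytes are transcoded to UTF-8 as they are read.
html = File.open(tmp.path, 'r:windows-1252:utf-8') { |f| f.read }

puts html.encoding
puts html

tmp.unlink
```

The string returned by `read` is already UTF-8, so everything after this point (Nokogiri, Sanitize, writing the file) can work in UTF-8 only.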

HTML Sanitizer: https://github.com/rgrove/sanitize/

HTML parser and modifier: http://nokogiri.org/

In Word 2010 there is a new method, SaveAs2: http://msdn.microsoft.com/en-us/library/ff836084(v=office.14).aspx

I haven't tested SaveAs2, since I don't have Word 2010.



Answer 3:

Hi Bo Frederiksen and kardeiz,

I also encountered the problem of "Word Document.SaveAs ignores encoding" today in my "Word 2003 (11.8411.8202) SP3" version.

Luckily, I managed to make msoEncodingUTF8 (namely, 65001) work in VBA code. However, I had to change Word's settings first. The steps are:

1) From Word's 'Tools' menu, choose 'Options'.

2) Then click 'General'.

3) Press the 'Web Options' button.

4) In the 'Web Options' dialog that pops up, click 'Encoding'.

5) There you will find a combo box where you can change the encoding, for example, from 'GB2312' to 'Unicode (UTF-8)'.

6) Save the changes and try to rerun the VBA code.

I hope my answer can help you. Below is my code.

Public Sub convert2html()
    With ActiveDocument.WebOptions
        .Encoding = msoEncodingUTF8
    End With

    ActiveDocument.SaveAs FileName:=ActiveDocument.Path & "\" & "file_name.html", FileFormat:=wdFormatFilteredHTML, Encoding:=msoEncodingUTF8

End Sub
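
For the Ruby script from the question, the same idea (set WebOptions.Encoding before calling SaveAs) might look like the sketch below. This is an untested assumption: it was reported to work in VBA on Word 2003, not verified through win32ole or on Word 2007. The helper name `save_word_doc_as_utf8_html` is made up for illustration.

```ruby
# Constants from the Word/Office object model, as used in the question.
WD_FORMAT_FILTERED_HTML = 10     # wdFormatFilteredHTML
MSO_ENCODING_UTF8       = 65001  # msoEncodingUTF8

# Hypothetical helper: opens doc_path in Word and saves it as filtered HTML,
# setting WebOptions.Encoding first, as this answer does in VBA.
# Requires Windows with Word installed.
def save_word_doc_as_utf8_html(doc_path, html_path)
  require 'win32ole' # Windows-only, so it is loaded only when the helper runs
  word = WIN32OLE.new('Word.Application')
  word.Visible = false
  begin
    doc = word.Documents.Open(doc_path)
    # Set the document-level web encoding before saving.
    doc.WebOptions.Encoding = MSO_ENCODING_UTF8
    doc.SaveAs({ 'FileName'   => html_path,
                 'FileFormat' => WD_FORMAT_FILTERED_HTML,
                 'Encoding'   => MSO_ENCODING_UTF8 })
    doc.Close
  ensure
    # Always shut Word down, even if opening or saving failed.
    word.Quit
  end
end

# Example call (placeholder paths from the question):
# save_word_doc_as_utf8_html('C:\whatever.doc', 'C:\whatever.html')
```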