Can I export excel data with UTF-8 without BOM?

2020-01-28 02:26发布

问题:

I export Microsoft Excel data by Excel Macro(VBScript). Because file is lua script, I export it as UTF-8. The only way I can make UTF-8 in Excel is using adodb.stream like this

set fileLua = CreateObject("adodb.stream")
fileLua.Type = 2
fileLua.Mode = 3
fileLua.Charset = "UTF-8"
fileLua.Open
fileLua.WriteText("test")
fileLua.SaveToFile("Test.lua")
fileLua.flush
fileLua.Close

I want to make eliminate BOM from Test.lua but I don't know how. (Because Test.lua has some unicode text, I have to use UTF-8 format.)

Do you know how to make UTF-8 file without BOM in excel file? Thanks in advance.

回答1:

I have also the same issue: have to export data from Excel (Office 2003, VBA6.5) to UTF-8 encoded file. Found the answer from your question ! Below my example where I also strip the BOM using trick #2 from boost's (thanks!) answer. I didn't get #1 working and never tried #3.

Sub WriteUTF8WithoutBOM()
    Dim UTFStream As Object
    Set UTFStream = CreateObject("adodb.stream")
    UTFStream.Type = adTypeText
    UTFStream.Mode = adModeReadWrite
    UTFStream.Charset = "UTF-8"
    UTFStream.LineSeparator = adLF
    UTFStream.Open
    UTFStream.WriteText "This is an unicode/UTF-8 test.", adWriteLine
    UTFStream.WriteText "First set of special characters: öäåñüûú€", adWriteLine
    UTFStream.WriteText "Second set of special characters: qwertzuiopõúasdfghjkléáûyxcvbnm\|Ä€Í÷×äðÐ[]í³£;?¤>#&@{}<;>*~¡^¢°²`ÿ´½¨¸0", adWriteLine

    UTFStream.Position = 3 'skip BOM

    Dim BinaryStream As Object
    Set BinaryStream = CreateObject("adodb.stream")
    BinaryStream.Type = adTypeBinary
    BinaryStream.Mode = adModeReadWrite
    BinaryStream.Open

    'Strips BOM (first 3 bytes)
    UTFStream.CopyTo BinaryStream

    'UTFStream.SaveToFile "d:\adodb-stream1.txt", adSaveCreateOverWrite
    UTFStream.Flush
    UTFStream.Close

    BinaryStream.SaveToFile "d:\adodb-stream2.txt", adSaveCreateOverWrite
    BinaryStream.Flush
    BinaryStream.Close
End Sub

The ADO Stream Object reference I used.



回答2:

If anyone else is struggling with the adTypeText constant, you need to include "Microsoft ActiveX Data Objects 2.5 Object Library" under Tools->References.



回答3:

A few possibilities:

  1. Put the text into the buffer as UTF-8, Type=2, but then set Type=1 (as binary) and write that out. That might convince ADODB.Stream to skip adding the BOM.

  2. Create another buffer, as type binary, and use the CopyTo to copy the data to that buffer from a point after the BOM.

  3. Read the file in again using Scripting.FileSystemObject, trim off the BOM, write out again



回答4:

Edit

A comment from rellampec alerted me to a better of dropping the LF I had discovered was added to the end of the file by user272735's method. I have added a new version of my routine at the end.

Original post

I had been using user272735's method successfully for a year when I discovered it added a LF at the end of the file. I failed to notice this extra LF until I did some very detailed testing so this is not an important error. However, my latest version discards that LF just in case it ever became important.

Public Sub PutTextFileUtf8(ByVal PathFileName As String, ByVal FileBody As String)

  ' Outputs FileBody as a text file (UTF-8 encoding without leading BOM)
  ' named PathFileName

  ' Needs reference to "Microsoft ActiveX Data Objects n.n Library"
  ' Addition to original code says version 2.5. Tested with version 6.1.

  '  1Nov16  Copied from http://stackoverflow.com/a/4461250/973283
  '          but replaced literals with parameters.
  ' 15Aug17  Discovered routine was adding an LF to the end of the file.
  '          Added code to discard that LF.

  ' References: http://stackoverflow.com/a/4461250/973283
  '             https://www.w3schools.com/asp/ado_ref_stream.asp

  Dim BinaryStream As Object
  Dim UTFStream As Object

  Set UTFStream = CreateObject("adodb.stream")

  UTFStream.Type = adTypeText
  UTFStream.Mode = adModeReadWrite
  UTFStream.Charset = "UTF-8"
  ' The LineSeparator will be added to the end of FileBody. It is possible
  ' to select a different value for LineSeparator but I can find nothing to
  ' suggest it is possible to not add anything to the end of FileBody
  UTFStream.LineSeparator = adLF
  UTFStream.Open
  UTFStream.WriteText FileBody, adWriteLine

  UTFStream.Position = 3 'skip BOM

  Set BinaryStream = CreateObject("adodb.stream")
  BinaryStream.Type = adTypeBinary
  BinaryStream.Mode = adModeReadWrite
  BinaryStream.Open

  UTFStream.CopyTo BinaryStream

  ' Oriinally I planned to use "CopyTo Dest, NumChars" to not copy the last
  ' byte.  However, NumChars is described as an integer whereas Position is
  ' described as Long. I was concerned by "integer" they mean 16 bits.
  'Debug.Print BinaryStream.Position
  BinaryStream.Position = BinaryStream.Position - 1
  BinaryStream.SetEOS
  'Debug.Print BinaryStream.Position

  UTFStream.Flush
  UTFStream.Close
  Set UTFStream = Nothing

  BinaryStream.SaveToFile PathFileName, adSaveCreateOverWrite
  BinaryStream.Flush
  BinaryStream.Close
  Set BinaryStream = Nothing

End Sub

New version of routine

This version omits the code to discard the unwanted LF added at the end because it avoids adding the LF in the first place. I have retained the original version in case anyone is interested in the technique for deleting trailing characters.

Public Sub PutTextFileUtf8NoBOM(ByVal PathFileName As String, ByVal FileBody As String)

  ' Outputs FileBody as a text file named PathFileName using
  ' UTF-8 encoding without leading BOM

  ' Needs reference to "Microsoft ActiveX Data Objects n.n Library"
  ' Addition to original code says version 2.5. Tested with version 6.1.

  '  1Nov16  Copied from http://stackoverflow.com/a/4461250/973283
  '          but replaced literals with parameters.
  ' 15Aug17  Discovered routine was adding an LF to the end of the file.
  '          Added code to discard that LF.
  ' 11Oct17  Posted to StackOverflow
  '  9Aug18  Comment from rellampec suggested removal of adWriteLine from
  '          WriteTest statement would avoid adding LF.
  ' 30Sep18  Amended routine to remove adWriteLine from WriteTest statement
  '          and code to remove LF from file. Successfully tested new version.

  ' References: http://stackoverflow.com/a/4461250/973283
  '             https://www.w3schools.com/asp/ado_ref_stream.asp

  Dim BinaryStream As Object
  Dim UTFStream As Object

  Set UTFStream = CreateObject("adodb.stream")

  UTFStream.Type = adTypeText
  UTFStream.Mode = adModeReadWrite
  UTFStream.Charset = "UTF-8"
  UTFStream.Open
  UTFStream.WriteText FileBody

  UTFStream.Position = 3 'skip BOM

  Set BinaryStream = CreateObject("adodb.stream")
  BinaryStream.Type = adTypeBinary
  BinaryStream.Mode = adModeReadWrite
  BinaryStream.Open

  UTFStream.CopyTo BinaryStream

  UTFStream.Flush
  UTFStream.Close
  Set UTFStream = Nothing

  BinaryStream.SaveToFile PathFileName, adSaveCreateOverWrite
  BinaryStream.Flush
  BinaryStream.Close
  Set BinaryStream = Nothing

End Sub


回答5:

Uf you prefer native T-SQL instead of external code

DECLARE @FILE_NAME              VARCHAR(255)    = 'd:\utils\test.xml'       --drive:\path\filename\
DECLARE @FILE_DATA              VARCHAR(MAX)    = '<?xml version="1.0" encoding="UTF-8"?>test</xml>'            --binary as varchar(max)

DECLARE @FILE_NAME_TO           VARCHAR(255)                        --Temp name for text stream
DECLARE @FSO_ID_TXTSTRM         INT                                 --Text Stream
DECLARE @FSO_ID_BINSTRM         INT                                 --Binary Stream
DECLARE @RC                     INT 

EXEC @RC = sp_OACreate 'ADODB.Stream',  @FSO_ID_TXTSTRM OUTPUT
EXEC @RC = sp_OASetProperty             @FSO_ID_TXTSTRM,    'Type',             2                           --1 = binary, 2 = text
EXEC @RC = sp_OASetProperty             @FSO_ID_TXTSTRM,    'Mode',             3                           --0 = not set, 1 read, 2 write, 3 read/write
EXEC @RC = sp_OASetProperty             @FSO_ID_TXTSTRM,    'Charset',          'UTF-8'                     --'ISO-8859-1'
EXEC @RC = sp_OASetProperty             @FSO_ID_TXTSTRM,    'LineSeparator',    'adLF'
EXEC @RC = sp_OAMethod                  @FSO_ID_TXTSTRM,    'Open'  
EXEC @RC = sp_OAMethod                  @FSO_ID_TXTSTRM,    'WriteText',        NULL,       @FILE_DATA      --text method

--Create binary stream
EXEC @RC = sp_OACreate 'ADODB.Stream',  @FSO_ID_BINSTRM OUTPUT
EXEC @RC = sp_OASetProperty             @FSO_ID_BINSTRM,    'Type',             1                           --1 = binary, 2 = text
EXEC @RC = sp_OAMethod                  @FSO_ID_BINSTRM,    'Open'
EXEC @RC = sp_OASetProperty             @FSO_ID_BINSTRM,    'Mode',             3                           --0 = not set, 1 read, 2 write, 3 read/write    

--Move 3 positions forward in text stream (BOM is first 3 positions)
EXEC @RC = sp_OASetProperty             @FSO_ID_TXTSTRM,    'Position',         3

--Copy text stream to binary stream
EXEC @RC = sp_OAMethod                  @FSO_ID_TXTSTRM,    'CopyTo',           NULL,       @FSO_ID_BINSTRM

--Commit data and close text stream
EXEC @RC = sp_OAMethod                  @FSO_ID_TXTSTRM,    'Flush'
EXEC @RC = sp_OAMethod                  @FSO_ID_TXTSTRM,    'Close'
EXEC @RC = sp_OADestroy                 @FSO_ID_TXTSTRM

--Save binary stream to file and close
EXEC @RC = sp_OAMethod                  @FSO_ID_BINSTRM,    'SaveToFile',       NULL,       @FILE_NAME, 2   --1 = notexist 2 = overwrite
EXEC @RC = sp_OAMethod                  @FSO_ID_BINSTRM,    'Close'
EXEC @RC = sp_OADestroy                 @FSO_ID_BINSTRM


回答6:

Here's another BOM-disposal hack, from an answer that overlaps your question.

Apologies for the late answer - this is more for other people who are encountering Byte Order Markers - and the page views on this question tell me that your question is relevant to several related problems: it's surprisingly difficult to write a BOM-Free file in VBA - even some of the common streams libraries deposit a BOM in your output, whether you asked for it or not.

I say my answer 'overlaps' because the code below is solving a slightly different problem - the primary purpose is writing a Schema file for a folder with a heterogeneous collection of files - but it's a working example of BOM-removal and BOM-free file writing in use, and the relevant segment is clearly marked.

The key functionality is that we iterate through all the '.csv' files in a folder, and we test each file with a quick nibble of the first four bytes: and we only only undertake the onerous task of stripping out a the marker if we see one.

We're working with low-level file-handling code from the primordial C. We have to, all the way down to using byte arrays, because everything else that you do in VBA will deposit the Byte Order Markers embedded in the structure of a string variable.

So, without further adodb, here's the code:

BOM-Disposal code for text files in a schema.ini file:

Public Sub SetSchema(strFolder As String)
On Error Resume Next 
' Write a Schema.ini file to the data folder.
' This is necessary if we do not have the registry privileges to set the ' correct 'ImportMixedTypes=Text' registry value, which overrides IMEX=1
' The code also checks for ANSI or UTF-8 and UTF-16 files, and applies a ' usable setting for CharacterSet ( UNICODE|ANSI ) with a horrible hack.
' OEM codepage-defined text is not supported: further coding is required
' ...And we strip out Byte Order Markers, if we see them - the OLEDB SQL ' provider for textfiles can't deal with a BOM in a UTF-16 or UTF-8 file
' Not implemented: handling tab-delimited files or other delimiters. The ' code assumes a header row with columns, specifies 'scan all rows', and ' imposes 'read the column as text' if the data types are mixed.
Dim strSchema As String Dim strFile As String Dim hndFile As Long Dim arrFile() As Byte Dim arrBytes(0 To 4) As Byte
If Right(strFolder, 1) <> "\" Then strFolder = strFolder & "\"
' Dir() is an iterator function when you call it with a wildcard:
strFile = VBA.FileSystem.Dir(strFolder & "*.csv")
Do While Len(strFile) > 0
hndFile = FreeFile Open strFolder & strFile For Binary As #hndFile Get #hndFile, , arrBytes Close #hndFile
strSchema = strSchema & "[" & strFile & "]" & vbCrLf strSchema = strSchema & "Format=CSVDelimited" & vbCrLf strSchema = strSchema & "ImportMixedTypes=Text" & vbCrLf strSchema = strSchema & "MaxScanRows=0" & vbCrLf
If arrBytes(2) = 0 Or arrBytes(3) = 0 Then ' this is a hack strSchema = strSchema & "CharacterSet=UNICODE" & vbCrLf Else strSchema = strSchema & "CharacterSet=ANSI" & vbCrLf End If
strSchema = strSchema & "ColNameHeader = True" & vbCrLf strSchema = strSchema & vbCrLf
' ***********************************************************
' BOM disposal - Byte order marks break the Access OLEDB text provider:
If arrBytes(0) = &HFE And arrBytes(1) = &HFF _ Or arrBytes(0) = &HFF And arrBytes(1) = &HFE Then
hndFile = FreeFile Open strFolder & strFile For Binary As #hndFile ReDim arrFile(0 To LOF(hndFile) - 1) Get #hndFile, , arrFile Close #hndFile
BigReplace arrFile, arrBytes(0) & arrBytes(1), ""
hndFile = FreeFile Open strFolder & strFile For Binary As #hndFile Put #hndFile, , arrFile Close #hndFile Erase arrFile
ElseIf arrBytes(0) = &HEF And arrBytes(1) = &HBB And arrBytes(2) = &HBF Then
hndFile = FreeFile Open strFolder & strFile For Binary As #hndFile ReDim arrFile(0 To LOF(hndFile) - 1) Get #hndFile, , arrFile Close #hndFile BigReplace arrFile, arrBytes(0) & arrBytes(1) & arrBytes(2), ""
hndFile = FreeFile Open strFolder & strFile For Binary As #hndFile Put #hndFile, , arrFile Close #hndFile Erase arrFile
End If
' ***********************************************************

strFile = "" strFile = Dir
Loop
If Len(strSchema) > 0 Then
strFile = strFolder & "Schema.ini"
hndFile = FreeFile Open strFile For Binary As #hndFile Put #hndFile, , strSchema Close #hndFile
End If

End Sub

Public Sub BigReplace(ByRef arrBytes() As Byte, _ ByRef SearchFor As String, _ ByRef ReplaceWith As String) On Error Resume Next
Dim varSplit As Variant
varSplit = Split(arrBytes, SearchFor) arrBytes = Join$(varSplit, ReplaceWith)
Erase varSplit
End Sub

The code's easier to understand if you know that a Byte Array can be assigned to a VBA.String, and vice versa. The BigReplace() function is a hack that sidesteps some of VBA's inefficient string-handling, especially allocation: you'll find that large files cause serious memory and performance problems if you do it any other way.