Convert CSV file from any type to UTF-8

Posted 2019-08-30 08:46

Hello, I am creating a simple console application in VB.NET to convert a file from any encoding to UTF-8, but I can't figure out how encoding works. I know that the source file is in Unicode, but when I convert it to the new format I get junk. Any suggestions? I am not sure if my code is correct.

This is my code:

Imports System.IO
Imports System.Text

Module Module1
    Sub Main()
        Console.Write("Please give the filepath (example:c:/tesfile.csv):")
        Dim filepath As String = Console.ReadLine()
        Dim sEncoding As String = DetermineFileType(filepath)
        Dim strContents As String
        Dim strEncodedContents As String
        Dim objReader As StreamReader
        Dim ErrInfo As String
        Dim bString As Byte()
        Try

            'Read the file
            objReader = New StreamReader(filepath)
            'Read until the end
            strContents = objReader.ReadToEnd()
            'Close The file
            objReader.Close()
            'Write Contents on DOS
            Console.WriteLine(strContents)
            Console.WriteLine("")

            bString = EncodeString(strContents, "UTF-8")
            strEncodedContents = System.Text.Encoding.UTF8.GetString(bString)
            Dim objWriter As New System.IO.StreamWriter(filepath.Replace(".csv", "_encoded.csv"))
            objWriter.WriteLine(strEncodedContents)
            objWriter.Close()
            Console.WriteLine("Encoding Finished")

        Catch Ex As Exception
            ErrInfo = Ex.Message
            Console.WriteLine(ErrInfo)
        End Try        
        Console.ReadKey()
    End Sub

    Public Function DetermineFileType(ByVal aFileName As String) As String
        Dim sEncoding As String = String.Empty

        Dim oSR As New StreamReader(aFileName, True)
        oSR.ReadToEnd()
        ' Add this line to read the file.
        sEncoding = oSR.CurrentEncoding.EncodingName

        Return sEncoding
    End Function

    Function EncodeString(ByRef SourceData As String, ByRef CharSet As String) As Byte()
        'get a byte pointer To the source data
        Dim bSourceData As Byte() = System.Text.Encoding.Unicode.GetBytes(SourceData)

        'get destination encoding 
        Dim OutEncoding As System.Text.Encoding = System.Text.Encoding.GetEncoding(CharSet)

        'Encode the data To destination code page/charset
        Return System.Text.Encoding.Convert(OutEncoding, System.Text.Encoding.UTF8, bSourceData)
    End Function
End Module

2 Answers
爱情/是我丢掉的垃圾
Reply #2 · 2019-08-30 09:21

StreamReader has a constructor that takes an Encoding. If you know the encoding of the file, you should pass it to the StreamReader constructor:

objReader = New StreamReader(filepath, Encoding.UTF32)

EDIT

You say in a comment that the file is encoded as UCS-2. From Wikipedia:

The older UCS-2 (2-byte Universal Character Set) is a similar character encoding that was superseded by UTF-16 in version 2.0 of the Unicode standard in July 1996. It produces a fixed-length format by simply using the code point as the 16-bit code unit and produces exactly the same result as UTF-16 for 96.9% of all the code points in the range 0-0xFFFF, including all characters that had been assigned a value at that time.

In that case you can try to decode using UTF-16, which is called Unicode within System.Text.Encoding, so try:

objReader = New StreamReader(filepath, Encoding.Unicode)
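
Putting that together, a minimal sketch of the whole conversion could look like the following. It assumes the source really is UTF-16 little-endian, which is what Encoding.Unicode means in .NET. The module name and both file paths are placeholders, not taken from the question.

```vb
Imports System.IO
Imports System.Text

Module ConvertSketch
    Sub Main()
        ' Hypothetical paths, for illustration only.
        Dim sourcePath As String = "input.csv"
        Dim targetPath As String = "input_encoded.csv"

        ' Decode the bytes on disk as UTF-16 LE into a .NET String.
        Dim contents As String = File.ReadAllText(sourcePath, Encoding.Unicode)

        ' Encode the String as UTF-8 on the way out.
        ' UTF8Encoding(False) writes no byte order mark; pass True to emit one.
        File.WriteAllText(targetPath, contents, New UTF8Encoding(False))
    End Sub
End Module
```

The encoding only matters at two points: where bytes become a String and where a String becomes bytes. If the read uses the right encoding, no separate re-encoding step such as the question's EncodeString should be needed.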

FYI, Unicode is a standard that has a variety of encodings, including:

  • UTF-8
  • UTF-16 (BigEndian)
  • UTF-16 (LittleEndian)
  • UTF-32 (BigEndian)
  • UTF-32 (LittleEndian)

For Microsoft to call UTF-16 "Unicode" is a little misleading but not inaccurate; UTF-16 is one possible encoding of Unicode.
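
For reference, this is how the encodings listed above map onto System.Text, as far as I know. The snippet is only a sketch (the module name is a placeholder); it prints each encoding's WebName and the byte order mark it would write.

```vb
Imports System
Imports System.Text

Module EncodingList
    Sub Main()
        ' In list order: UTF-8, UTF-16 BE, UTF-16 LE, UTF-32 BE, UTF-32 LE.
        ' UTF32Encoding(True, True) means bigEndian = True, byteOrderMark = True.
        Dim encodings As Encoding() = {
            Encoding.UTF8,
            Encoding.BigEndianUnicode,
            Encoding.Unicode,
            New UTF32Encoding(True, True),
            Encoding.UTF32
        }

        For Each enc As Encoding In encodings
            ' GetPreamble returns the BOM bytes for this encoding instance.
            Console.WriteLine("{0,-10} BOM: {1}", enc.WebName, BitConverter.ToString(enc.GetPreamble()))
        Next
    End Sub
End Module
```

Note that Encoding.Unicode is specifically the little-endian flavour of UTF-16; the big-endian one is Encoding.BigEndianUnicode.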

Anthone
Reply #3 · 2019-08-30 09:23

StreamReader already assumes UTF-8 encoding if you don't specify one in the constructor call, so re-encoding the result to UTF-8 cannot solve your problem. Use the StreamReader(String, Encoding) overload and specify the encoding that was used when the file was created. If you have no clue what it might be, then Encoding.Default is usually the best guess. Talk to the programmer who wrote the code for the .csv file creator to be sure. When you get it right, you don't need this code anymore either.
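
A sketch of that approach, under the assumption that the source encoding is passed in by the caller. ConvertToUtf8, its parameters, and the file paths are made up for illustration. It uses the three-argument StreamReader constructor so that a byte order mark, if the file has one, can override the guess.

```vb
Imports System.IO
Imports System.Text

Module ReaderSketch
    ' Hypothetical helper: reads sourcePath using sourceEncoding and
    ' rewrites the text to targetPath as UTF-8.
    Sub ConvertToUtf8(sourcePath As String, targetPath As String, sourceEncoding As Encoding)
        ' True = let a byte order mark, if present, override sourceEncoding.
        Using reader As New StreamReader(sourcePath, sourceEncoding, True)
            ' StreamWriter(path, append, encoding); Encoding.UTF8 writes a BOM.
            Using writer As New StreamWriter(targetPath, False, Encoding.UTF8)
                writer.Write(reader.ReadToEnd())
            End Using
        End Using
    End Sub

    Sub Main()
        ' Encoding.Default is only a guess here; replace it once the
        ' real source encoding is known.
        ConvertToUtf8("input.csv", "input_encoded.csv", Encoding.Default)
    End Sub
End Module
```

The Using blocks close both files even if reading or writing throws, which the explicit Close calls in the question would not do.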
