Hello, I am creating a simple console application in VB.NET to convert a file from any encoding to UTF-8, but I can't figure out how the encoding works. I know that the source file is in Unicode, but when I convert it to the new format I get junk. Any suggestions? I am not sure if my code is correct.
This is my code:
Imports System.IO
Imports System.Text

Module Module1
    Sub Main()
        Console.Write("Please give the filepath (example:c:/tesfile.csv):")
        Dim filepath As String = Console.ReadLine()
        Dim sEncoding As String = DetermineFileType(filepath)
        Dim strContents As String
        Dim strEncodedContents As String
        Dim objReader As StreamReader
        Dim ErrInfo As String
        Dim bString As Byte()
        Try
            'Read the file
            objReader = New StreamReader(filepath)
            'Read until the end
            strContents = objReader.ReadToEnd()
            'Close the file
            objReader.Close()
            'Write the contents to the console
            Console.WriteLine(strContents)
            Console.WriteLine("")
            bString = EncodeString(strContents, "UTF-8")
            strEncodedContents = System.Text.Encoding.UTF8.GetString(bString)
            Dim objWriter As New System.IO.StreamWriter(filepath.Replace(".csv", "_encoded.csv"))
            objWriter.WriteLine(strEncodedContents)
            objWriter.Close()
            Console.WriteLine("Encoding Finished")
        Catch Ex As Exception
            ErrInfo = Ex.Message
            Console.WriteLine(ErrInfo)
        End Try
        Console.ReadKey()
    End Sub

    Public Function DetermineFileType(ByVal aFileName As String) As String
        Dim sEncoding As String = String.Empty
        Dim oSR As New StreamReader(aFileName, True)
        oSR.ReadToEnd()
        ' Add this line to read the file.
        sEncoding = oSR.CurrentEncoding.EncodingName
        Return sEncoding
    End Function

    Function EncodeString(ByRef SourceData As String, ByRef CharSet As String) As Byte()
        'Get the bytes of the source data
        Dim bSourceData As Byte() = System.Text.Encoding.Unicode.GetBytes(SourceData)
        'Get the destination encoding
        Dim OutEncoding As System.Text.Encoding = System.Text.Encoding.GetEncoding(CharSet)
        'Encode the data to the destination code page/charset
        Return System.Text.Encoding.Convert(OutEncoding, System.Text.Encoding.UTF8, bSourceData)
    End Function
End Module
StreamReader has a constructor that takes an Encoding. If you know the encoding of the file, you should pass it to that constructor.
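A minimal sketch of that constructor in use might look like the following. The helper name ReadAllWithEncoding, the file path, and the choice of Encoding.Unicode are assumptions for illustration only, not something taken from the question:

```vb
Imports System.IO
Imports System.Text

Module ReadWithEncodingSketch
    ' Hypothetical helper: reads a whole file, decoding its bytes with the given encoding.
    Function ReadAllWithEncoding(ByVal fileName As String, ByVal sourceEncoding As Encoding) As String
        ' StreamReader(String, Encoding) decodes the file using sourceEncoding.
        Using reader As New StreamReader(fileName, sourceEncoding)
            Return reader.ReadToEnd()
        End Using
    End Function

    Sub Main()
        ' Assumed path and encoding; substitute the real ones.
        Dim contents As String = ReadAllWithEncoding("c:\testfile.csv", Encoding.Unicode)
        Console.WriteLine(contents)
    End Sub
End Module
```

The Using block closes the reader even if ReadToEnd throws, so no explicit Close call is needed.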
EDIT
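You say in a comment that the file is encoded as UCS-2. According to Wikipedia, UCS-2 is the older fixed-width 16-bit encoding that UTF-16 grew out of.
In that case you can try to decode the file as UTF-16, which System.Text.Encoding calls Unicode: pass Encoding.Unicode to the StreamReader constructor.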
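To see why the choice of decoder matters, here is a small in-memory sketch. The byte values are a made-up sample (the text "Hi" in little-endian UTF-16), not data from the actual file:

```vb
Imports System.Text

Module Utf16DecodeSketch
    Sub Main()
        ' The two characters "Hi" as UTF-16 little-endian bytes (two bytes per character).
        Dim utf16Bytes As Byte() = {&H48, &H0, &H69, &H0}

        ' Decoded as UTF-16 (Encoding.Unicode), the text comes back intact.
        Console.WriteLine(Encoding.Unicode.GetString(utf16Bytes))

        ' Decoded as UTF-8, every zero byte becomes a NUL character,
        ' so the result has four characters instead of two.
        Console.WriteLine(Encoding.UTF8.GetString(utf16Bytes).Length)
    End Sub
End Module
```

The same mismatch happens when a UTF-16 file is read by a StreamReader that assumes UTF-8, which is the likely source of the junk.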
FYI, Unicode is a standard that has a variety of encodings, including UTF-8, UTF-16, and UTF-32.
For Microsoft to call UTF-16 "Unicode" is a little misleading but not inaccurate: UTF-16 is one possible encoding of Unicode.
StreamReader already assumes UTF-8 encoding if you don't specify one in the constructor call, so re-encoding the text to UTF-8 cannot solve your problem. Use the StreamReader(String, Encoding) overload and specify the encoding that was used when the file was created. If you have no clue what it might be, then Encoding.Default is usually the best guess. To be sure, talk to the programmer who wrote the program that creates the .csv file. When you get it right, you don't need the re-encoding code anymore either.
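Put together, the whole conversion could be sketched roughly as below. This assumes the source encoding is known or guessed up front. Encoding.Default here is only the guess suggested above: on .NET Framework it is the system ANSI code page, while on .NET Core and later it is UTF-8. For a UTF-16/UCS-2 file, Encoding.Unicode would be used instead. The module name and output naming are illustrative:

```vb
Imports System.IO
Imports System.Text

Module ConvertToUtf8Sketch
    Sub Main()
        Console.Write("Please give the filepath (example: c:/testfile.csv): ")
        Dim filepath As String = Console.ReadLine()

        ' Assumption: the source encoding is a guess. Replace it with
        ' Encoding.Unicode if the file is UTF-16/UCS-2.
        Dim sourceEncoding As Encoding = Encoding.Default

        ' Decode the file's bytes into a .NET string using the source encoding.
        Dim contents As String
        Using reader As New StreamReader(filepath, sourceEncoding)
            contents = reader.ReadToEnd()
        End Using

        ' Assumes the name ends in ".csv"; otherwise outputPath equals filepath
        ' and the source file is overwritten.
        Dim outputPath As String = filepath.Replace(".csv", "_encoded.csv")

        ' Write the string out as UTF-8. UTF8Encoding(False) omits the byte order mark.
        Using writer As New StreamWriter(outputPath, False, New UTF8Encoding(False))
            writer.Write(contents)
        End Using

        Console.WriteLine("Encoding finished")
    End Sub
End Module
```

No separate byte-level conversion step is needed: once the reader has decoded the file correctly, the writer's encoding alone determines the output format.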