I can read the file into a byte slice,
but when I convert it to a string,
the UTF-16 bytes are treated as ASCII.
How do I convert it correctly?
package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	// Open the file and read its first line.
	f, err := os.Open("test.txt")
	if err != nil {
		fmt.Printf("error opening file: %v\n", err)
		os.Exit(1)
	}
	defer f.Close()

	r := bufio.NewReader(f)
	// ReadLine returns the raw bytes of the line and whether it was truncated.
	line, isPrefix, err := r.ReadLine()
	if err == nil {
		fmt.Println(isPrefix)
		fmt.Println(line)
		fmt.Println(string(line))
	}
}
output:
false
[255 254 91 0 83 0 99 0 114 0 105 0 112 0 116 0 32 0 73 0 110 0 102 0 111 0 93 0 13 0]
S c r i p t I n f o ]
Update:
After testing the two examples, I now understand the exact problem.
On Windows, if there is a line break (CR+LF) at the end of the line, the CR is read as part of the line, because the ReadLine function cannot handle Unicode correctly ([0D 0A] = ok, [0D 00 0A 00] = not ok).
If the ReadLine function could recognize Unicode, it would understand [0D 00 0A 00] and return []uint16 rather than []byte.
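The misalignment can be reproduced with a short sketch. It assumes UTF-16LE input without a BOM; the sample bytes and the helper name rawLines are illustrative and not taken from test.txt.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
)

// rawLines returns what bufio.Reader.ReadLine yields for raw. Each line is
// copied because ReadLine reuses its internal buffer between calls.
func rawLines(raw []byte) [][]byte {
	r := bufio.NewReader(bytes.NewReader(raw))
	var lines [][]byte
	for {
		line, _, err := r.ReadLine()
		if err != nil {
			return lines
		}
		lines = append(lines, append([]byte(nil), line...))
	}
}

func main() {
	// "a", CR, LF, "b" encoded as UTF-16LE (hypothetical sample data).
	raw := []byte{'a', 0, 0x0D, 0, 0x0A, 0, 'b', 0}
	for _, line := range rawLines(raw) {
		fmt.Printf("% X\n", line)
	}
}
```

ReadLine cuts at the single byte 0A, so the first line keeps 0D 00 at its end, and the 00 that belongs to the line feed becomes the first byte of the next line, which shifts every following code unit by one byte.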
So I think I should not use bufio.NewReader, as it cannot read UTF-16. I don't see any parameter on bufio.Reader.ReadLine that indicates whether the text being read is UTF-8, UTF-16LE/BE or UTF-32. Is there a readline function for Unicode text in the Go library?
The latest version of golang.org/x/text/encoding/unicode makes it easier to do this because it includes unicode.BOMOverride, which will intelligently interpret the BOM.
Here is ReadFileUTF16(), which is like os.ReadFile() but decodes UTF-16.
Here is NewScannerUTF16 which is like os.Open() but returns a scanner.
FYI: I have put these functions into an open source module and have made further improvements. See https://github.com/TomOnTime/utfutil/
If you want anything to print as a string you could use fmt.Sprint.
For example:
Output:
UTF16, UTF8, and Byte Order Marks are defined by the Unicode Consortium: UTF-16 FAQ, UTF-8 FAQ, and Byte Order Mark (BOM) FAQ.
Here's a program which uses the Unicode rules to convert UTF-16 text file lines to Go UTF-8 encoded strings. The code has been revised to take advantage of the new bufio.Scanner interface in Go 1.1.
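A standard-library-only sketch of that idea, built on a custom bufio.SplitFunc and unicode/utf16. It is not the program referred to above: it assumes little-endian input, and the names splitUTF16LE, decodeLE and linesUTF16LE are hypothetical.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
	"unicode/utf16"
)

// splitUTF16LE is a bufio.SplitFunc that cuts the input at the UTF-16LE
// line feed (bytes 0A 00). Every advance is even, so scanning stays
// aligned on 2-byte code units.
func splitUTF16LE(data []byte, atEOF bool) (int, []byte, error) {
	for i := 0; i+1 < len(data); i += 2 {
		if data[i] == 0x0A && data[i+1] == 0x00 {
			return i + 2, data[:i], nil
		}
	}
	if atEOF && len(data) > 0 {
		return len(data), data, nil // final line without a terminator
	}
	return 0, nil, nil // ask the scanner for more data
}

// decodeLE converts UTF-16LE bytes to a Go string, dropping a leading
// U+FEFF (the BOM on the first line) and a trailing carriage return.
func decodeLE(b []byte) string {
	u := make([]uint16, 0, len(b)/2)
	for i := 0; i+1 < len(b); i += 2 {
		u = append(u, uint16(b[i])|uint16(b[i+1])<<8)
	}
	s := strings.TrimPrefix(string(utf16.Decode(u)), "\uFEFF")
	return strings.TrimSuffix(s, "\r")
}

// linesUTF16LE returns the decoded lines of a UTF-16LE byte stream.
func linesUTF16LE(raw []byte) []string {
	sc := bufio.NewScanner(bytes.NewReader(raw))
	sc.Split(splitUTF16LE)
	var lines []string
	for sc.Scan() {
		lines = append(lines, decodeLE(sc.Bytes()))
	}
	return lines
}

func main() {
	// BOM, "[S]", CR, LF, "ok" encoded as UTF-16LE (illustrative sample).
	raw := []byte{0xFF, 0xFE, '[', 0, 'S', 0, ']', 0, '\r', 0, '\n', 0, 'o', 0, 'k', 0}
	for _, line := range linesUTF16LE(raw) {
		fmt.Printf("%q\n", line)
	}
}
```

To read a file, pass the *os.File from os.Open to bufio.NewScanner in place of the bytes.Reader. Big-endian input would need the two bytes of each code unit swapped in both the split function and the decoder.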
Here is the simplest way to read it:
Since Windows uses little-endian order by default, we use the unicode.UseBOM policy to take the byte order from the BOM in the text, with unicode.LittleEndian as the fallback.