Writing UTF-8 without BOM

Posted 2019-06-24 06:37

This code,

OutputStream out = new FileOutputStream(new File("C:/file/test.txt"));
out.write("A".getBytes());

And this,

OutputStream out = new FileOutputStream(new File("C:/file/test.txt"));
out.write("A".getBytes(StandardCharsets.UTF_8));

produce the same result (in my opinion), which is UTF-8 without BOM. However, Notepad++ is not showing any information about the encoding. I expected Notepad++ to show "Encode in UTF-8 without BOM", but nothing is selected in the "Encoding" menu.

Now, this code writes the file as UTF-8 with BOM.

 OutputStream out = new FileOutputStream(new File("C:/file/test.txt"));
 byte[] bom = { (byte) 239, (byte) 187, (byte) 191 };
 out.write(bom);
 out.write("A".getBytes()); 

In this case Notepad++ displays the encoding as "Encode in UTF-8".

Question: What is wrong with the first two snippets, which are supposed to write the file as UTF-8 without BOM? Is my Java code doing the right thing? If so, is there a problem with how Notepad++ detects the encoding?

Is Notepad++ only guessing?

2 answers
孤傲高冷的网名
Reply #2 · 2019-06-24 07:26

I do not know if my answer is correct, but let me share my understanding.

As explained above, if you simply write "A", Notepad++ has no way to tell which encoding was used. If you want Notepad++ to show "Encode in UTF-8 without BOM", as in the figure below,

[image]

then you must fool Notepad++, which you can do with the following piece of code:

[image]

If you want Notepad++ to show "Encode in UTF-8", remove the substring part from the osw.write("\uFEFF") call, because \uFEFF is the BOM character you are inserting. When this character is written, the file is reported as "Encode in UTF-8"; when you strip it programmatically, the file is reported as "Encode in UTF-8 without BOM", because the BOM is no longer there.
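The code in the screenshot is not preserved, so the following is only a minimal sketch of writing UTF-8 with or without the BOM through an OutputStreamWriter. The class name BomSketch, the method encode, and the in-memory buffer are assumptions for illustration, not the answer's original code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not taken from the original answer.
final class BomSketch {

    // Encodes text as UTF-8, optionally prefixed with the BOM character U+FEFF.
    static byte[] encode(String text, boolean withBom) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (Writer osw = new OutputStreamWriter(buffer, StandardCharsets.UTF_8)) {
            if (withBom) {
                // U+FEFF is encoded as the three bytes EF BB BF in UTF-8.
                osw.write("\uFEFF");
            }
            osw.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(encode("A", true).length);  // prints 4
        System.out.println(encode("A", false).length); // prints 1
    }
}
```

To write to a real file, the ByteArrayOutputStream could be swapped for a FileOutputStream; the bytes produced are the same.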

You also have to change a setting in the Notepad++ preferences, as shown below. Only then will Notepad++ be able to recognize the encoding you want.

[image]

However, if you simply write plain text, Notepad++ treats it as "ANSI".

I hope my explanation is clear and that this analysis helps someone. Note that this approach is a workaround and is not recommended, but it works when nothing else does.

If you do not want to change your Notepad++ preferences and still want the encoding to show as "Encode in UTF-8 without BOM", you must do something like this:

[image]

I have explained the same thing, probably in a better way, in my blog here

(This account has been banned)
Reply #3 · 2019-06-24 07:34

"A" written using UTF-8 without a BOM produces exactly the same file as "A" written using ASCII, ISO-8859-*, or any other ASCII-compatible encoding. That file contains a single byte with the decimal value 65.

Think of it this way:

  • "A".getBytes("UTF-8") returns a new byte[] { 65 }
  • "A".getBytes("ISO-8859-1") returns a new byte[] { 65 }
  • You write the results of those calls into a file
  • How is the consumer of the file supposed to distinguish the two?

There's nothing in that file that suggests that UTF-8 needs to be used to decode it.
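A small sketch makes the point concrete by comparing the two byte arrays directly. The class and method names here are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration.
final class SameBytes {

    // True if the text encodes to identical bytes in UTF-8 and ISO-8859-1.
    static boolean sameInUtf8AndLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return Arrays.equals(utf8, latin1);
    }

    public static void main(String[] args) {
        // "A" is the single byte 65 in both encodings.
        System.out.println(sameInUtf8AndLatin1("A")); // prints true
    }
}
```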

Try writing "Käsekuchen" or something else that's not encodable in ASCII and see if Notepad++ guesses the encoding correctly (because that's exactly what it does: it makes an educated guess, there's no metadata that tells it which encoding to use).
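To illustrate what such an educated guess could look like, here is a rough sketch: treat pure-ASCII input as ambiguous, and otherwise check whether the bytes are well-formed UTF-8. This is only an assumed heuristic for illustration, not Notepad++'s actual detection algorithm, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical heuristic; real editors use more elaborate detection.
final class EncodingGuess {

    // True if the bytes decode as well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // True if any byte is outside the 7-bit ASCII range.
    static boolean hasNonAscii(byte[] bytes) {
        for (byte b : bytes) {
            if (b < 0) {
                return true;
            }
        }
        return false;
    }

    static String guess(byte[] bytes) {
        if (!hasNonAscii(bytes)) {
            return "ambiguous (pure ASCII)";
        }
        return isValidUtf8(bytes) ? "probably UTF-8" : "not UTF-8";
    }

    public static void main(String[] args) {
        String word = "K\u00e4sekuchen"; // "Käsekuchen", escaped to avoid source-encoding issues
        System.out.println(guess("A".getBytes(StandardCharsets.UTF_8)));
        System.out.println(guess(word.getBytes(StandardCharsets.UTF_8)));
        System.out.println(guess(word.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

With "A" there is nothing to go on, while the "ä" in "Käsekuchen" becomes two bytes (C3 A4) in UTF-8 and one byte (E4) in ISO-8859-1, which gives a detector something to work with.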
