This code,
OutputStream out = new FileOutputStream(new File("C:/file/test.txt"));
out.write("A".getBytes());
And this,
OutputStream out = new FileOutputStream(new File("C:/file/test.txt"));
out.write("A".getBytes(StandardCharsets.UTF_8));
produce the same result (in my opinion), which is UTF-8 without BOM. However, Notepad++ is not showing any information about the encoding. I expected Notepad++ to show "Encode in UTF-8 without BOM", but no encoding is selected in the "Encoding" menu.
Now, this code writes the file in UTF-8 with BOM.
OutputStream out = new FileOutputStream(new File("C:/file/test.txt"));
byte[] bom = { (byte) 239, (byte) 187, (byte) 191 };
out.write(bom);
out.write("A".getBytes());
Here Notepad++ does display the encoding, as "Encode in UTF-8".
Question: What is wrong with the first two snippets, which are supposed to write the file in UTF-8 without BOM? Is my Java code doing the right thing? If so, is there a problem with how Notepad++ detects the encoding?
Is Notepad++ only guessing?
I do not know if my answer is correct, but here is my understanding.
As explained above, if you simply write "A", Notepad++ has no way to tell which encoding was used. If you want Notepad++ to show "Encode in UTF-8 without BOM",
then you must fool Notepad++, which you can do with the following piece of code.
If you want Notepad++ to show "Encode in UTF-8", remove the substring part from osw.write("\uFEFF"), because "\uFEFF" is the BOM character you are inserting. When this character is written, Notepad++ reports the file as "Encode in UTF-8"; when you strip it programmatically, Notepad++ reports "Encode in UTF-8 without BOM", because the BOM character is no longer there.
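The original snippet is not reproduced here, so the following is only a minimal sketch of the BOM mechanics described above, not the answer's actual code. It assumes an OutputStreamWriter named osw (as in the text) wrapped around an in-memory buffer; the class name BomWriterSketch and the method encodeUtf8 are hypothetical. It shows that writing "\uFEFF" through a UTF-8 writer yields the three bytes EF BB BF in front of the text, and that leaving it out yields only the text's bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BomWriterSketch {
    // Hypothetical helper: encodes text as UTF-8, optionally preceded by a BOM.
    static byte[] encodeUtf8(String text, boolean withBom) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (Writer osw = new OutputStreamWriter(buffer, StandardCharsets.UTF_8)) {
            if (withBom) {
                // U+FEFF is the BOM character; in UTF-8 it becomes EF BB BF.
                osw.write("\uFEFF");
            }
            osw.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory buffer; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // Java bytes are signed, so EF BB BF prints as -17, -69, -65.
        System.out.println(Arrays.toString(encodeUtf8("A", true)));  // [-17, -69, -65, 65]
        System.out.println(Arrays.toString(encodeUtf8("A", false))); // [65]
    }
}
```

To produce a real file instead of a byte array, the same writer could wrap a FileOutputStream. Without the BOM the file for "A" is the single byte 65, so an editor has nothing to detect.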
You also have to change the Notepad++ preferences; only then will Notepad++ be able to recognize the encoding you want.
However, if you simply write text, Notepad++ will treat it as "ANSI".
I hope my explanation is clear and my analysis helps someone. This approach is a workaround and is not recommended, but it works when nothing else does.
If you do not want to change your Notepad++ preferences and still want the encoding to be "Encode in UTF-8 without BOM", then you must do something like this,
I have explained the same thing, probably in a better way, in my blog.
"A" written using UTF-8 without a BOM produces exactly the same file as "A" written using ASCII or ISO-8859-* or any other ASCII-compatible encoding. That file contains a single byte with the decimal value 65.
Think of it this way:
"A".getBytes("UTF-8")
returns a new byte[] { 65 }
"A".getBytes("ISO-8859-1")
returns a new byte[] { 65 }
There's nothing in that file that suggests that UTF-8 needs to be used to decode it.
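This can be checked directly. The sketch below is only an illustration (the class name SameBytesDemo and the helper encode are made up for the example); it encodes "A" with three ASCII-compatible charsets and compares the resulting byte arrays.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SameBytesDemo {
    // Hypothetical helper: returns the raw bytes of text in the given charset.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        byte[] utf8 = encode("A", StandardCharsets.UTF_8);
        byte[] latin1 = encode("A", StandardCharsets.ISO_8859_1);
        byte[] ascii = encode("A", StandardCharsets.US_ASCII);

        System.out.println(Arrays.toString(utf8));   // [65]
        System.out.println(Arrays.toString(latin1)); // [65]
        System.out.println(Arrays.toString(ascii));  // [65]

        // All three encodings produce the identical one-byte file content.
        System.out.println(Arrays.equals(utf8, latin1) && Arrays.equals(utf8, ascii)); // true
    }
}
```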
Try writing "Käsekuchen" or something else that's not encodable in ASCII and see if Notepad++ guesses the encoding correctly (because that's exactly what it does: it makes an educated guess, there's no metadata that tells it which encoding to use).
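A minimal sketch of that experiment, assuming a temporary file is an acceptable place to write (the class name NonAsciiDemo and the temp-file prefix are made up for the example). It shows that once a non-ASCII character is involved, UTF-8 and ISO-8859-1 no longer produce the same bytes, which gives an editor something to base its guess on.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class NonAsciiDemo {
    // "Käsekuchen", with the umlaut written as a Unicode escape so the example
    // does not depend on the encoding of this source file.
    static final String TEXT = "K\u00e4sekuchen";

    public static void main(String[] args) throws IOException {
        byte[] utf8 = TEXT.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = TEXT.getBytes(StandardCharsets.ISO_8859_1);

        // The umlaut takes two bytes (C3 A4) in UTF-8 but one byte (E4) in ISO-8859-1.
        System.out.println(utf8.length);                 // 11
        System.out.println(latin1.length);               // 10
        System.out.println(Arrays.equals(utf8, latin1)); // false

        // Write the UTF-8 bytes without a BOM to a temp file that can be opened in an editor.
        Path file = Files.createTempFile("kaesekuchen", ".txt");
        Files.write(file, utf8);
        System.out.println("Wrote " + file);
    }
}
```

Whether Notepad++ then reports UTF-8 for this file still depends on its detection logic and settings; the code only guarantees what bytes end up on disk.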