I've been having an issue getting java.util.Scanner to read a text file I saved in Notepad, even though it works fine with others. Basically, when it tries to read the problem file, it comes up completely empty-handed -- hasNextLine() is false, the buffer is empty, etc. I narrowed it down to the fact that it won't read even the first line if there is a curly quote anywhere in the file. No exceptions are thrown. Note that a BufferedReader on the same file doesn't have a problem.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Scanner;

try {
    int count = 0;

    // Count lines with a Scanner (platform default encoding)
    Scanner scanner = new Scanner(new File("C:/myfile.txt"));
    while (scanner.hasNextLine()) {
        count++;
        scanner.nextLine();
    }
    scanner.close();
    System.out.print(count);

    // Count lines again with a BufferedReader on the same file
    count = 0;
    BufferedReader reader = new BufferedReader(new FileReader("C:/myfile.txt"));
    while (reader.readLine() != null) {
        count++;
    }
    reader.close();
    System.out.print(count);
}
catch (IOException e) {
    e.printStackTrace();
}
The above code, reading a file that contains nothing but a single curly quote, prints out "01". Searches on Google led me to try this:
Scanner scanner = new Scanner(new File("C:/myfile.txt"), "ISO-8859-1");
This makes it work (i.e., it prints out "11"). I also noticed that if I go into Notepad and do a Save As..., the default encoding at the bottom is "ANSI." If I change this to "UTF-8" and save the file, then the scanner (without an explicit encoding) also works. If I tell the scanner to use "UTF-8", then understandably it only works if I save the file as UTF-8, but "ISO-8859-1" seems to make it work even when I save as "ANSI".
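For reference, here is a minimal sketch of reading with an explicit encoding on both APIs (the path is just the placeholder from above). Note that FileReader itself can't take a charset before Java 11, so the usual workaround is wrapping a FileInputStream in an InputStreamReader:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Scanner;

public class ExplicitEncoding {
    public static void main(String[] args) throws IOException {
        File file = new File("C:/myfile.txt");

        // Scanner accepts a charset name directly
        Scanner scanner = new Scanner(file, "ISO-8859-1");
        while (scanner.hasNextLine()) {
            System.out.println(scanner.nextLine());
        }
        scanner.close();

        // The BufferedReader equivalent: pass the charset to an
        // InputStreamReader instead of using FileReader
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), "ISO-8859-1"));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
    }
}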
So, I know it has something to do with file encodings, but the problem is that I don't understand anything about them. My knowledge of what "ISO-8859-1" means is extremely vague; why does that encoding make it work no matter how I save the file? And why does BufferedReader work regardless?
EDIT:
The links/comments below really helped point me in the right direction! I think I've got it figured out.
First of all, in Notepad:
- "ANSI" is CP1252
- "Unicode" is UTF-16LE
- "UTF-8" is... well, UTF-8
In hexadecimal, a curly apostrophe (U+2019, RIGHT SINGLE QUOTATION MARK) is represented as follows (the snippet after this list prints these byte sequences):
- CP1252: 92
- UTF-16LE: 19 20
- UTF-8: E2 80 99
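Here's a little sketch I could have used to double-check those byte sequences; nothing here is specific to my file, it just encodes the character with each charset and prints the bytes, plus the JVM's default charset:

import java.nio.charset.Charset;

public class CurlyQuoteBytes {
    public static void main(String[] args) {
        char curly = '\u2019'; // RIGHT SINGLE QUOTATION MARK
        for (String name : new String[] { "windows-1252", "UTF-16LE", "UTF-8" }) {
            // Encode the single character and dump the resulting bytes in hex
            byte[] bytes = String.valueOf(curly).getBytes(Charset.forName(name));
            StringBuilder hex = new StringBuilder(name + ":");
            for (byte b : bytes) {
                hex.append(String.format(" %02X", b));
            }
            System.out.println(hex);
        }
        // What Scanner falls back to when no encoding is given
        System.out.println("default: " + Charset.defaultCharset());
    }
}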
The default encoding Java uses on my system, according to Charset.defaultCharset(), is UTF-8. So when I saved the file in UTF-8, the scanner knew what to expect. When I saved the file in CP1252, however, it choked once it hit that 92, because a lone 0x92 byte is not a valid sequence in UTF-8. It works fine as long as there aren't any such characters in the file -- the bytes for "hello world" happen to be the same in both CP1252 and UTF-8 (both match ASCII there) and don't cause a problem.
A scanner set to UTF-8 doesn't work with a UTF-16 file either, because it doesn't know what to do with the byte order mark (FF FE), regardless of what characters are in the file.
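If you want to see the BOM for yourself, a quick sketch like this (the path is again just a placeholder) peeks at the first two bytes of the file:

import java.io.FileInputStream;
import java.io.IOException;

public class BomCheck {
    public static void main(String[] args) throws IOException {
        try (FileInputStream in = new FileInputStream("C:/myfile.txt")) {
            int b1 = in.read(); // -1 if the file is empty
            int b2 = in.read();
            // Notepad's "Unicode" (UTF-16LE) files begin with FF FE
            if (b1 == 0xFF && b2 == 0xFE) {
                System.out.println("UTF-16LE byte order mark found");
            } else {
                System.out.println("no UTF-16LE BOM");
            }
        }
    }
}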
On the other hand, when I set the scanner to CP1252 or ISO-8859-1, it's much more tolerant, because those are single-byte encodings in which essentially any byte value decodes to some character. It doesn't necessarily interpret the characters correctly, mind you, but there's nothing that prevents it from recognizing lines in the file and looping through.
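A quick demonstration of that tolerance: decoding the CP1252 byte 0x92 as ISO-8859-1 doesn't fail, it just yields the (wrong) control character U+0092:

import java.nio.charset.StandardCharsets;

public class ToleranceDemo {
    public static void main(String[] args) {
        byte[] bytes = { (byte) 0x92 }; // CP1252's curly apostrophe
        // ISO-8859-1 decoding cannot fail: each byte maps straight to
        // the character with the same code point
        String decoded = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.printf("U+%04X%n", (int) decoded.charAt(0)); // prints U+0092
    }
}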
As for why Scanner has a problem but FileReader/BufferedReader does not: my first guess was tokenizing, but it appears to come down to how each one handles a malformed byte. The reader that FileReader builds silently replaces malformed input with the replacement character (U+FFFD), so readLine() never notices anything wrong. Scanner's decoder instead reports malformed input as an error, and Scanner swallows the resulting IOException rather than throwing it -- which is why hasNextLine() simply returns false with no exception. The swallowed exception can be retrieved with scanner.ioException().
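A sketch of how to surface the hidden error (same placeholder path as above); run against my CP1252 file, this should print a MalformedInputException stack trace:

import java.io.File;
import java.io.IOException;
import java.util.Scanner;

public class ScannerDiagnostics {
    public static void main(String[] args) throws IOException {
        Scanner scanner = new Scanner(new File("C:/myfile.txt")); // default charset
        int count = 0;
        while (scanner.hasNextLine()) {
            count++;
            scanner.nextLine();
        }
        System.out.println(count + " lines read");
        // Scanner swallows IOExceptions from its underlying source;
        // the last one (e.g. MalformedInputException) is available here
        IOException swallowed = scanner.ioException();
        if (swallowed != null) {
            swallowed.printStackTrace();
        }
        scanner.close();
    }
}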