Curly quotes causing Java Scanner hasNextLine() to

Posted 2020-04-02 06:51

Question:

I've been having an issue getting java.util.Scanner to read a text file I saved in Notepad, even though it works fine with other files. Basically, when it tries to read the problem file, it comes up completely empty-handed: hasNextLine() is false, the buffer is empty, etc. I narrowed it down to the fact that it won't even read the first line if there is a curly quote anywhere in the file. No exceptions are thrown. Note that a BufferedReader on the same file doesn't have a problem.

try {        
    int count = 0;
    Scanner scanner = new Scanner(new File("C:/myfile.txt"));

    while (scanner.hasNextLine()) {
        count++;
        scanner.nextLine();
    }

    scanner.close();
    System.out.print(count);

    count = 0;
    BufferedReader reader = new BufferedReader(new FileReader("C:/myfile.txt"));

    while (reader.readLine() != null) {
        count++;
    }

    reader.close();
    System.out.print(count);
}
catch(IOException e) {
    e.printStackTrace();
}

The above code, reading a file that contains nothing but a single curly quote, prints out "01". Searches on Google led me to try this:

Scanner scanner = new Scanner(new File("C:/myfile.txt"), "ISO-8859-1");

This makes it work (i.e. it prints out "11"). I also noticed that if I go into Notepad and do a Save As..., the default encoding at the bottom is "ANSI". If I change this to "UTF-8" and save the file, then the scanner (without an encoding) also works. If I tell the scanner "UTF-8", then understandably it only works if I save as UTF-8, but "ISO-8859-1" seems to make it work even if I save it as "ANSI".

So, I know it has something to do with file encoding, but the problem is I don't understand anything about file encoding. My knowledge of what "ISO-8859-1" means is extremely vague; why does that make it work no matter how I save the file? Why does BufferedReader work regardless?

EDIT:

The links/comments below really helped point me in the right direction! I think I've got it figured out.

First of all, in Notepad:

  • "ANSI" is CP1252
  • "Unicode" is UTF-16LE
  • "UTF-8" is... well, UTF-8

In hexadecimal, a curly apostrophe is represented as:

  • CP1252: 92
  • UTF-16LE: 19 20
  • UTF-8: E2 80 99
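
The byte values in the list above can be reproduced from Java itself. This is only a small sketch: the class name ApostropheBytes and the helper hex() are made up for illustration, and it assumes the JDK ships the "windows-1252" charset (the usual Java name for CP1252).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ApostropheBytes {
    // Encode the text in the given charset and format each byte as two hex digits.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String apostrophe = "\u2019"; // the curly apostrophe (right single quotation mark)
        System.out.println(hex(apostrophe, Charset.forName("windows-1252"))); // 92
        System.out.println(hex(apostrophe, StandardCharsets.UTF_16LE));       // 19 20
        System.out.println(hex(apostrophe, StandardCharsets.UTF_8));          // E2 80 99
    }
}
```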

The default encoding Java uses on my system, according to Charset.defaultCharset(), is UTF-8. So when I saved the file in UTF-8, the scanner knew what to expect. When I saved the file in CP1252, however, it choked once it hit that "92", because a lone 92 byte is not a valid way to represent a character in UTF-8. It works fine as long as there aren't any such characters in the file -- the hex for "hello world" happens to be the same in both CP1252 and UTF-8, so it doesn't cause a problem.
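
Both halves of that paragraph can be checked with a strict decoder. The sketch below is illustrative only: StrictUtf8Check and isValidUtf8() are hypothetical names, and the decoder is explicitly configured to report malformed input rather than replace it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class StrictUtf8Check {
    // True if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");

        // Plain ASCII text produces identical bytes in both encodings.
        byte[] ascii = "hello world".getBytes(cp1252);
        System.out.println(Arrays.equals(ascii, "hello world".getBytes(StandardCharsets.UTF_8))); // true
        System.out.println(isValidUtf8(ascii)); // true

        // A CP1252 curly apostrophe (byte 92) is not a valid UTF-8 sequence.
        System.out.println(isValidUtf8("it\u2019s".getBytes(cp1252))); // false
    }
}
```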

Reading a UTF-16 file as UTF-8 doesn't work, because the decoder doesn't know what to do with the byte order mark ("FF FE"), regardless of what characters are in the file.

On the other hand, when I set the scanner to CP1252 or ISO-8859-1, it's much more tolerant. It doesn't necessarily interpret the characters correctly, mind you, but there's nothing that prevents it from recognizing lines in the file and looping through.
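
The "tolerant but not necessarily correct" behaviour can be seen directly: ISO-8859-1 assigns a character to every one of the 256 byte values, so decoding never fails, but byte 92 comes out as a control character rather than a curly apostrophe. A minimal sketch, with the class and method names (Latin1Tolerance, roundTripsAllBytes, decodeByte) invented for illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1Tolerance {
    // Decode every possible byte value as ISO-8859-1, re-encode, and compare.
    static boolean roundTripsAllBytes() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String decoded = new String(all, StandardCharsets.ISO_8859_1);
        return Arrays.equals(all, decoded.getBytes(StandardCharsets.ISO_8859_1));
    }

    // Decode a single byte in the given charset and return the resulting code unit.
    static int decodeByte(int value, Charset charset) {
        return new String(new byte[] {(byte) value}, charset).charAt(0);
    }

    public static void main(String[] args) {
        System.out.println(roundTripsAllBytes()); // true
        // Same byte, two interpretations:
        System.out.println(Integer.toHexString(
                decodeByte(0x92, StandardCharsets.ISO_8859_1)));       // 92 (a C1 control character)
        System.out.println(Integer.toHexString(
                decodeByte(0x92, Charset.forName("windows-1252"))));   // 2019 (curly apostrophe)
    }
}
```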

As for why Scanner has a problem but the FileReader/BufferedReader does not, I am going to guess that it's because the scanner needs to tokenize the file, i.e. interpret the characters so it can identify whitespace and other patterns, so it chokes when there's something unrecognizable. The reader doesn't need to do that. All it needs to identify are the line breaks.
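
The difference can be reproduced without Notepad. The sketch below writes a temp file containing only byte 92 and reads it both ways as UTF-8. As far as I can tell, the underlying mechanics are that a Scanner built from a File decodes strictly and swallows the resulting IOException (it can be retrieved afterwards with ioException()), while InputStreamReader and FileReader quietly substitute a replacement character for the bad byte. The class and method names are hypothetical, and the exact exception type may vary by JDK (typically a MalformedInputException).

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Scanner;

class ScannerVsReader {
    // Write a one-byte temp file containing 0x92 (a CP1252 curly apostrophe).
    static Path writeSample() {
        try {
            Path file = Files.createTempFile("curly", ".txt");
            Files.write(file, new byte[] {(byte) 0x92});
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Count lines with a Scanner told to expect UTF-8, and name the swallowed error, if any.
    static String scanAsUtf8(Path file) {
        try (Scanner scanner = new Scanner(file.toFile(), "UTF-8")) {
            int count = 0;
            while (scanner.hasNextLine()) {
                count++;
                scanner.nextLine();
            }
            IOException swallowed = scanner.ioException();
            return count + " " + (swallowed == null ? "no error" : swallowed.getClass().getSimpleName());
        } catch (FileNotFoundException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Count lines with a BufferedReader that also decodes as UTF-8.
    static int readAsUtf8(Path file) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file.toFile()), StandardCharsets.UTF_8))) {
            int count = 0;
            while (reader.readLine() != null) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = writeSample();
        System.out.println(scanAsUtf8(file)); // line count is 0, followed by the swallowed exception's name
        System.out.println(readAsUtf8(file)); // 1
        file.toFile().delete();
    }
}
```

Checking scanner.ioException() after the loop is a cheap way to tell "the file was empty" apart from "the file could not be decoded".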

Answer 1:

If you don't specify an encoding when you create the scanner, it will try to divine the encoding based on a byte order mark (BOM), which is the first few bytes of a file. If the file doesn't have one, it will fall back to whatever default the OS uses. Since you're using Windows, the default is cp-1252. It seems that Notepad is saving your text file using ISO-8859-1, which is similar to, but not the same as, cp-1252. See this link for more details:

http://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html

When you save it as UTF-8, it probably places the UTF-8 BOM at the beginning of the file and the scanner can pick up on it.

If you want to look more into BOMs, look them up on Wikipedia; the article is quite good. You can also download PSPad and open the text file in hex mode to see the individual bytes. Hope that helps :)
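
If you'd rather check for a BOM from code than from a hex editor, a manual test on the first few bytes might look like the sketch below. BomSniffer and charsetFromBom() are hypothetical names, and the check only covers the UTF-8 and UTF-16 marks (it makes no attempt to recognize UTF-32).

```java
class BomSniffer {
    // Return the charset name implied by a leading BOM, or null if none is recognized.
    static String charsetFromBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8";
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return "UTF-16LE";
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return "UTF-16BE";
        }
        return null;
    }

    // Compare the first bytes of data against the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A UTF-16LE file holding only a curly apostrophe: BOM FF FE, then 19 20.
        byte[] utf16File = {(byte) 0xFF, (byte) 0xFE, 0x19, 0x20};
        System.out.println(charsetFromBom(utf16File));            // UTF-16LE
        // A CP1252 file holding only a curly apostrophe has no BOM.
        System.out.println(charsetFromBom(new byte[] {(byte) 0x92})); // null
    }
}
```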



Answer 2:

Scanner's hasNextLine() method simply returns false, without throwing any exception, if it encounters an encoding error in the input file. This is frustrating, and it is not documented anywhere, even in the JDK 8 documentation.

If you just want to read a file line-by-line, use this instead:

// try-with-resources closes the reader even if readLine() throws
try (BufferedReader input = new BufferedReader(
        new InputStreamReader(new FileInputStream("inputfile.txt"), "inputencoding"))) {
    String line;
    while ((line = input.readLine()) != null) {
        // process line
    }
}

Make sure to replace "inputencoding" above with the actual encoding of the file, most likely UTF-8 or ASCII. Even if the encoding does not match, this reader won't terminate prematurely the way Scanner does.



Answer 3:

Some time ago I had a similar problem with a configuration file that was edited by the user. Because I never know what kind of editor the user will use, I tried this:

org.mozilla.universalchardet.UniversalDetector

available from here:

https://code.google.com/p/juniversalchardet/

Detecting a character encoding is not a simple thing, so I can't be sure this library works under all conditions, but for me it was sufficient. Have a look; maybe it will help you detect your encoding so you can then pass it to Scanner.