Scanner's nextLine() only fetching a partial line

Posted 2019-07-07 02:12

Question:

So, using something like:

for (int i = 0; i < files.length; i++) {
    if (!files[i].isDirectory() && files[i].canRead()) {
        try {
            Scanner scan = new Scanner(files[i]);
            System.out.println("Generating Categories for " + files[i].toPath());
            while (scan.hasNextLine()) {
                count++;
                String line = scan.nextLine();
                System.out.println("  ->" + line);
                // Each record is "<key>\t<json>"; keep only the JSON part.
                line = line.split("\t", 2)[1];
                System.out.println("!- " + line);
                JsonParser parser = new JsonParser();
                JsonObject object = parser.parse(line).getAsJsonObject();
                Set<Entry<String, JsonElement>> entrySet = object.entrySet();
                exploreSet(entrySet);
            }
            scan.close();
            // System.out.println(keyset);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
    }
}

to go over a Hadoop output file, one of the JSON objects in the middle breaks, because scan.nextLine() does not return the whole line before it is passed to split(). That is, the output is:

  ->0   {"Flags":"0","transactions":{"totalTransactionAmount":"0","totalQuantitySold":"0"},"listingStatus":"NULL","conditionRollupId":"0","photoDisplayType":"0","title":"NULL","quantityAvailable":"0","viewItemCount":"0","visitCount":"0","itemCountryId":"0","itemAspects":{   ...  "sellerSiteId":"0","siteId":"0","pictureUrl":"http://somewhere.com/45/x/AlphaNumeric/$(KGrHqR,!rgF!6n5wJSTBQO-G4k(Ww~~
!- {"Flags":"0","transactions":{"totalTransactionAmount":"0","totalQuantitySold":"0"},"listingStatus":"NULL","conditionRollupId":"0","photoDisplayType":"0","title":"NULL","quantityAvailable":"0","viewItemCount":"0","visitCount":"0","itemCountryId":"0","itemAspects":{   ...  "sellerSiteId":"0","siteId":"0","pictureUrl":"http://somewhere.com/45/x/AlphaNumeric/$(KGrHqR,!rgF!6n5wJSTBQO-G4k(Ww~~

Most of the above data has been sanitized (though, for the most part, not the URL).

In the file, the URL continues as: $(KGrHqZHJCgFBsO4dC3MBQdC2)Y4Tg~~60_1.JPG?set_id=8800005007

So it's slightly miffing.

This is also entry #112, and I have had other files parse without errors, but this one is messing with my mind, mostly because I don't see how scan.nextLine() isn't working.

By debug output, the JSON error is caused by the string not being split properly.

And almost forgot, it also works JUST FINE if I attempt to put the offending line in its own file and parse just that.

EDIT: It also blows up at about the same place if I remove the offending line.

Attempted with JVM 1.6 and 1.7


Workaround: use BufferedReader scan = new BufferedReader(new FileReader(files[i])); instead of Scanner.
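
For illustration, here is a minimal sketch of that workaround. The class and method names (BufferedLinePayloads, readPayloads) are hypothetical, and the method takes any Reader so that it can be tried on an in-memory string; for a real file you would pass new FileReader(files[i]) as above. The relevant difference is that BufferedReader.readLine() ends a line only at \n, \r, or \r\n, whereas Scanner.nextLine() accepts more separators.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original code.
final class BufferedLinePayloads {

    // Reads "<key>\t<json>" records and returns the JSON part of each line.
    static List<String> readPayloads(Reader source) {
        List<String> payloads = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            // readLine() splits only on \n, \r, or \r\n.
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    payloads.add(parts[1]);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return payloads;
    }

    public static void main(String[] args) {
        // Assumed sample: the first record has U+2028 inside a value.
        String sample = "0\t{\"url\":\"a~~" + (char) 0x2028 + "b\"}\n"
                + "1\t{\"url\":\"c\"}\n";
        List<String> payloads = readPayloads(new StringReader(sample));
        System.out.println(payloads.size() + " records read");
    }
}
```

In this sample both records come back whole, because the U+2028 character inside the first value is not treated as a line break by readLine().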

Answer 1:

Based on your code, the best explanation I can come up with is that the line really does end after the "~~" according to the criteria used by Scanner.nextLine().

The criteria for an end-of-line are:

  • Something that matches this regex: "\r\n|[\n\r\u2028\u2029\u0085]" or
  • The end of the input stream

You say that the file continues after the "~~", so let's put EOF aside and look at the regex. That will match any of the following:

The usual line separators:

  • <CR>
  • <NL>
  • <CR><NL>

... and three unusual forms of line separator that Scanner also recognizes:

  • 0x0085 is the <NEL> or "next line" control code in the "ISO C1 Control" group
  • 0x2028 is the Unicode "line separator" character
  • 0x2029 is the Unicode "paragraph separator" character
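
The effect of the unusual separators can be reproduced with a small sketch. The class name ScannerSeparatorDemo and the sample record are made up for illustration; the only assumption is that a <NEL> character (U+0085) sits in the middle of what looks like a single line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo class, not part of the original code.
final class ScannerSeparatorDemo {

    // Returns the pieces that Scanner.nextLine() produces for the input.
    static List<String> scanLines(String input) {
        List<String> lines = new ArrayList<>();
        try (Scanner scan = new Scanner(input)) {
            while (scan.hasNextLine()) {
                lines.add(scan.nextLine());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        char nel = (char) 0x0085; // <NEL>, invisible in many viewers
        // Made-up record: one "\n"-terminated line with <NEL> in the middle.
        String record = "0\t{\"pictureUrl\":\"first~~" + nel + "second\"}\n";
        for (String piece : scanLines(record)) {
            System.out.println("[" + piece + "]");
        }
    }
}
```

Scanner returns two pieces for this input, cutting the record right after the "~~", which is the same symptom as in the question.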

My theory is that you've got one of the "unusual" forms in your input file, and that it is not showing up in whatever tool you are using to examine the files.


I suggest that you examine the input file using a tool that can show you the actual bytes of the file; e.g. the od utility on a Linux / Unix system. Also, check that this isn't caused by some kind of character encoding mismatch ... or trying to read or write binary data as text.
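
If od is not at hand, a similar check can be sketched in Java. The class name UnusualSeparatorFinder is hypothetical. The sketch decodes the file with the platform default charset, which is what new Scanner(File) uses, and reports the character offset of each U+0085, U+2028, or U+2029 it finds.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original code.
final class UnusualSeparatorFinder {

    // Lists "<offset>: U+XXXX" for each unusual line separator in the text.
    static List<String> findUnusualSeparators(String text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == 0x0085 || c == 0x2028 || c == 0x2029) {
                hits.add(String.format("%d: U+%04X", i, (int) c));
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java UnusualSeparatorFinder <file>");
            return;
        }
        // Decode the same way new Scanner(File) does: default charset.
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        String text = new String(bytes, Charset.defaultCharset());
        for (String hit : findUnusualSeparators(text)) {
            System.out.println(hit);
        }
    }
}
```

An empty result does not rule out an encoding problem; it only says that none of these three characters appear after decoding.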

If these don't help, then the next step should be to run your application using your IDE's Java debugger, and single-step it through the Scanner.hasNextLine() and nextLine() calls to find out what the code is actually doing.


You wrote: "And almost forgot, it also works JUST FINE if I attempt to put the offending line in its own file and parse just that."

That's interesting. But if the tool you are using to extract the line is the same one that is not showing the (hypothesized) unusual line separator, then this evidence is not reliable. The process of extraction may be altering the "stuff" that is causing the problems.