I'm trying to write a simple program to read a text file and store pairs of words in a Set. Here is the code I wrote for that:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import java.util.TreeSet;
public class Main {
public static void main(String[] args) {
TreeSet<String> phraseSet = new TreeSet<String>();
try {
Scanner readfile = new Scanner(new File("data.txt"));
while(readfile.hasNext("\\w{2}")) {
String phrase = readfile.next("\\w{2}");
phraseSet.add(phrase);
}
} catch (FileNotFoundException e) {
e.printStackTrace();
}
for(String p : phraseSet) {
System.out.println(p);
}
}
}
The code compiles but prints out a blank line (The while loop is never entered).
The data.txt file contents are:
There are seven words in this line.
And then there are few more words in this line.
I'm expecting the following Strings in my TreeSet (of course in sorted order):
There are
are seven
seven words
words in
in this
this line
line And
And then
then there
there are
....
this line
Your main problem is that Scanner by default parses tokens by whitespace.
According to the API:
A Scanner breaks its input into tokens using a delimiter pattern, which by default matches whitespace. The resulting tokens may then be converted into values of different types using the various next methods.
If you take a look at next(String pattern), you'll see that it
Returns the next token if it matches the pattern constructed from the specified string. If the match is successful, the scanner advances past the input that matched the pattern.
(emphasis mine)
That is, by the time you ask the Scanner to check for your token, it has already broken the input up by whitespace, so asking for a token with a space in the middle will always fail. Note also that "\\w{2}" matches exactly two word characters, not two words.
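To see the tokenizing behaviour in isolation, here is a minimal sketch that scans in-memory strings instead of data.txt. The class name TokenPatternDemo and the helper nextTokenMatches are made up for illustration; only Scanner.hasNext(String) comes from the JDK.

```java
import java.util.Scanner;

public class TokenPatternDemo {
    // Hypothetical helper: reports whether the first whitespace-delimited
    // token of the input matches the given regex in full.
    static boolean nextTokenMatches(String input, String regex) {
        try (Scanner scanner = new Scanner(input)) {
            return scanner.hasNext(regex);
        }
    }

    public static void main(String[] args) {
        // "There" has five word characters, so \w{2} (exactly two) does not match.
        System.out.println(nextTokenMatches("There are seven", "\\w{2}"));
        // "in" has exactly two word characters, so the same pattern matches.
        System.out.println(nextTokenMatches("in this line", "\\w{2}"));
        // A default token never contains whitespace, so a two-word pattern cannot match.
        System.out.println(nextTokenMatches("There are seven", "\\w+ \\w+"));
    }
}
```

With the question's data.txt the first token is "There", which is why the while loop there is never entered.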
A better way to do this would be to have the Scanner read in a line at a time, and then just split() the line and parse it yourself:
Scanner readfile = new Scanner(new File("data.txt"));
while (readfile.hasNextLine()) {
// Split on runs of whitespace so repeated spaces do not yield empty words
String[] words = readfile.nextLine().trim().split("\\s+");
for (int i=0; i<words.length-1; i++) {
phraseSet.add(words[i] + " " + words[i+1]);
}
}
Your question didn't explicitly mention it, but from your example output, it looks like you want to ignore line breaks in reading. This approach makes that slightly more complicated, but you can just store off the last word of each line and add it when parsing the next, like so:
String lastWord = null;
while (readfile.hasNextLine()) {
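String line = readfile.nextLine().trim();
if (line.isEmpty()) {
continue; // skip blank lines so they do not produce empty words
}
String[] words = line.split("\\s+");
if (lastWord != null) {
// pair the last word of the previous line with the first word of this one
phraseSet.add(lastWord + " " + words[0]);
}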
for (int i=0; i<words.length-1; i++) {
phraseSet.add(words[i] + " " + words[i+1]);
}
lastWord = words[words.length-1];
}
If this is actually what you're looking for, you're probably better off just using next() to pull each word one at a time, as other answers have shown.
To sum up
You cannot use Scanner to directly look for multi-word tokens; you'll have to do the parsing yourself.
The output you described and your code contradict the sample output you gave.
This produces the sample output you asked for, in reading order (the periods stay attached to the words, as in "line."):
// Requires java.util.ArrayList, java.util.List and java.util.Scanner
Scanner scanner = new Scanner("There are seven words in this line.\n" +
"And then there are few more words in this line.");
List<String> phraseSet = new ArrayList<>();
String prev = scanner.next();
while (scanner.hasNext()) {
String word = scanner.next();
String phrase = prev + " " + word;
phraseSet.add(phrase);
prev = word;
}
for (String phrase : phraseSet) {
System.out.println(phrase);
}
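The expected output in the question shows "this line" and "line And" without the periods. One way to get that, sketched here under the assumption that whitespace and the characters . , ; : ! ? are the only word separators, is to change the Scanner's delimiter with useDelimiter. The class name PunctuationFreePairs and the method wordPairs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class PunctuationFreePairs {
    // Hypothetical helper: returns adjacent word pairs in reading order,
    // treating whitespace and basic punctuation as separators.
    static List<String> wordPairs(String text) {
        List<String> pairs = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            // Assumption: these characters never occur inside a word.
            scanner.useDelimiter("[\\s.,;:!?]+");
            if (!scanner.hasNext()) {
                return pairs; // no words at all
            }
            String prev = scanner.next();
            while (scanner.hasNext()) {
                String word = scanner.next();
                pairs.add(prev + " " + word);
                prev = word;
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        String text = "There are seven words in this line.\n"
                + "And then there are few more words in this line.";
        for (String pair : wordPairs(text)) {
            System.out.println(pair);
        }
    }
}
```

The apostrophe is left out of the delimiter on purpose so that a word like "I'm" stays in one piece.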
I am not sure what exactly you are trying to learn. It could be Java itself, TreeSet, or maybe regex, but before I give you my solution, a few comments:
Please
- Do not call your class "Main" - ever
- Try to use appropriate camel case in code - easier for everybody else to read
The takeaway from this is that Scanner.next() and hasNext() can cross newline boundaries. As you realized already, TreeSet (like most Set implementations other than LinkedHashSet) will not maintain insertion order; it keeps its elements sorted. Now, for the data file:
There are seven words in this line.
And then there are few more words in this line.
Try this code (I called the file DoubleWord.java):
import java.io.*;
import java.util.*;
public class DoubleWord {
private String lastWord = null;
private TreeSet<String> phraseSet = new TreeSet<String>();
public DoubleWord (String fileName) throws FileNotFoundException {
Scanner readFile = new Scanner(new File(fileName));
// Assign the field rather than declaring a local that shadows it
lastWord = readFile.next();
while (readFile.hasNext()) {
String phrase = readFile.next();
phraseSet.add (lastWord + " " + phrase);
lastWord = phrase;
}
}
public void printSet () {
for(String p : phraseSet) {
System.out.println(p);
}
}
public static void main(String[] args) {
try {
new DoubleWord (args[0]).printSet();
} catch (Exception ex) {
ex.printStackTrace();
}
}
}
The output is
And then
There are
are few
are seven
few more
in this
line. And
more words
seven words
then there
there are
this line.
words in
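The listing above comes out sorted, with uppercase before lowercase, because TreeSet orders its elements. If reading order matters but duplicates should still be dropped, a LinkedHashSet is one option. The following is only a sketch: the class name OrderedPairs and the method pairsInReadingOrder are invented for illustration, and it reads from a string rather than a file.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Scanner;
import java.util.Set;

public class OrderedPairs {
    // Hypothetical helper: collects adjacent word pairs in reading order,
    // keeping only the first occurrence of each pair.
    static List<String> pairsInReadingOrder(String text) {
        Set<String> pairs = new LinkedHashSet<>();
        try (Scanner scanner = new Scanner(text)) {
            if (!scanner.hasNext()) {
                return new ArrayList<>(); // no words at all
            }
            String previous = scanner.next();
            while (scanner.hasNext()) {
                String word = scanner.next();
                pairs.add(previous + " " + word); // a repeated pair is ignored
                previous = word;
            }
        }
        return new ArrayList<>(pairs);
    }

    public static void main(String[] args) {
        String text = "There are seven words in this line.\n"
                + "And then there are few more words in this line.";
        for (String pair : pairsInReadingOrder(text)) {
            System.out.println(pair);
        }
    }
}
```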
Hope this helps, - M.
Here is a version with BufferedReader:
package com.java.se.stackoverflow;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
public class LoadTwoWordsToSetFromFile {
public static void main(String[] argv) throws IOException {
List<String> phraseSet = new ArrayList<>();
String[] lineWords;
String nextLine, lastLineWord = null;
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(LoadTwoWordsToSetFromFile.class.getResourceAsStream("data.txt")));
while ((nextLine = bufferedReader.readLine()) != null) {
lineWords = nextLine.split(" ");
// Pair the last word of the previous line with the first word of this line
if (lastLineWord != null) {
phraseSet.add(lastLineWord + " " + lineWords[0].replaceAll("\\W", ""));
}
// Pair each word with its right-hand neighbour on the same line
for (int i = 0; i + 1 < lineWords.length; i++) {
phraseSet.add(lineWords[i].replaceAll("\\W", "") + " " + lineWords[i + 1].replaceAll("\\W", ""));
}
lastLineWord = lineWords[lineWords.length - 1].replaceAll("\\W", "");
}
for (String p : phraseSet) {
System.out.println(p);
}
}
}