I'm trying to parse a text file that contains multiple lines of text. My task is to go through all the words and write them to a file.
I read all the lines, loop through them, and split every line on whitespace, like this:
line.split("\\s+");
The problem is that in some cases Java does not see the space between two words.
I also tried looping through a string that contains such a space: Java doesn't split on it, yet Character.isSpaceChar(char) returned true for it.
Now I'm completely confused.
Here is the code:
public void createMap(String inputPath, String outputPath)
        throws IOException {
    File f = new File(inputPath);
    FileWriter fw = new FileWriter(outputPath);
    List<String> lines = Files.readAllLines(f.toPath(),
            StandardCharsets.UTF_8);
    for (String l : lines) {
        for (String w : l.split("\\s+")) {
            if (isNotRubbish(w.trim())) {
                fw.write(w.trim() + "\n");
            }
        }
    }
    fw.close();
}

private boolean isNotRubbish(String w) {
    Pattern p = Pattern.compile("@?\\p{L}+",
            Pattern.UNICODE_CHARACTER_CLASS);
    Matcher m = p.matcher(w);
    return m.matches();
}
I suspect your text contains a character that looks like a space, such as a non-breaking space (U+00A0), which Java's \s does not treat as whitespace by default, so split("\\s+") cannot match it. In that case, try \p{Zs} (written "\\p{Zs}" in a Java string literal) instead of \s, as described at http://www.regular-expressions.info/unicode.html
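Here is a small sketch for checking that suspicion. It assumes the culprit is U+00A0; the sample string and the class name SpaceCheck are made up for illustration, so substitute a line from your own file:

```java
class SpaceCheck {
    // Describes one code point: its hex value and how Character classifies it.
    static String describe(int codePoint) {
        return String.format("U+%04X isSpaceChar=%b isWhitespace=%b",
                codePoint,
                Character.isSpaceChar(codePoint),
                Character.isWhitespace(codePoint));
    }

    public static void main(String[] args) {
        // Hypothetical input: two words joined by a no-break space (U+00A0).
        String sample = "foo\u00A0bar";

        // \s does not match U+00A0 by default, so the string stays in one piece.
        System.out.println(sample.split("\\s+").length);     // prints 1

        // \p{Zs} (Unicode space separators) does match it.
        System.out.println(sample.split("\\p{Zs}+").length); // prints 2

        // Dump every code point to see which "space" is really in the text.
        sample.codePoints().forEach(cp -> System.out.println(describe(cp)));
    }
}
```

For U+00A0 this reports isSpaceChar=true and isWhitespace=false, which would explain why Character.isSpaceChar returned true while the split on \s did nothing.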
BTW, if you also want to split on separators other than spaces, such as tabs (\t) or line breaks (\r, \n), you can combine \p{Zs} with \s in a single character class: [\p{Zs}\s]
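A minimal sketch of how the combined class could be used for splitting. The class name WordSplitter and the method words are hypothetical helpers, not part of the code in the question:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class WordSplitter {
    // One or more Unicode space separators (Zs) or characters matched by \s
    // (space, tab, line breaks). Compiled once and reused for every line.
    private static final Pattern SEPARATORS = Pattern.compile("[\\p{Zs}\\s]+");

    // Splits a line into words, dropping the empty token that split()
    // produces when the line starts with a separator.
    static List<String> words(String line) {
        List<String> result = new ArrayList<>();
        for (String token : SEPARATORS.split(line)) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical line mixing a no-break space, a tab and plain spaces.
        System.out.println(words("foo\u00A0bar\tbaz  qux")); // prints [foo, bar, baz, qux]
    }
}
```

In createMap, the inner loop could then iterate over words(l) instead of l.split("\\s+"). The empty-token filter is there because String.trim() only strips characters up to U+0020, so it would not remove a leading non-breaking space.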