I read this problem in Richard Bird's book: Find the top five most common words in War and Peace (or any other text for that matter).
Here's my current attempt:
public class WarAndPeace {
public static void main(String[] args) throws Exception {
Map<String, Integer> wc =
Files.lines(Paths.get("/tmp", "/war-and-peace.txt"))
.map(line -> line.replaceAll("\\p{Punct}", ""))
.flatMap(line -> Arrays.stream(line.split("\\s+")))
.filter(word -> word.matches("\\w+"))
.map(s -> s.toLowerCase())
.filter(s -> s.length() >= 2)
.collect(Collectors.toConcurrentMap(
w -> w, w -> 1, Integer::sum));
wc.entrySet()
.stream()
.sorted((e1, e2) -> Integer.compare(e2.getValue(), e1.getValue()))
.limit(5)
.forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
}
}
This definitely looks interesting and runs reasonably fast. On my laptop it prints the following:
$> time java -server -Xmx10g -cp target/classes tmp.WarAndPeace
the: 34566
and: 22152
to: 16716
of: 14987
a: 10521
java -server -Xmx10g -cp target/classes tmp.WarAndPeace 1.86s user 0.13s system 274% cpu 0.724 total
It usually runs in under 2 seconds. Can you suggest further improvements to this from an expressiveness and a performance standpoint?
PS: If you are interested in the rich history of this problem, see here.
You're recompiling all the regexps on every line and on every word. Instead of .flatMap(line -> Arrays.stream(line.split("\\s+")))
write .flatMap(Pattern.compile("\\s+")::splitAsStream)
. The same for .filter(word -> word.matches("\\w+"))
: use .filter(Pattern.compile("^\\w+$").asPredicate())
. The same for map
.
Probably it's better to swap .map(s -> s.toLowerCase())
and .filter(s -> s.length() >= 2)
in order not to call toLowerCase()
for one-letter words.
You should not use Collectors.toConcurrentMap(w -> w, w -> 1, Integer::sum)
. First, your stream is not parallel, so you may easily replace toConcurrentMap
with toMap
. Second, it would probably be more efficient (though testing is necessary) to use Collectors.groupingBy(w -> w, Collectors.summingInt(w -> 1))
as this would reduce boxing (but add a finisher step which will box all the values at once).
Instead of (e1, e2) -> Integer.compare(e2.getValue(), e1.getValue())
you may use ready comparator: Map.Entry.comparingByValue()
(though probably it's a matter of taste).
To summarize:
Map<String, Integer> wc =
Files.lines(Paths.get("/tmp", "/war-and-peace.txt"))
.map(Pattern.compile("\\p{Punct}")::matcher)
.map(matcher -> matcher.replaceAll(""))
.flatMap(Pattern.compile("\\s+")::splitAsStream)
.filter(Pattern.compile("^\\w+$").asPredicate())
.filter(s -> s.length() >= 2)
.map(s -> s.toLowerCase())
.collect(Collectors.groupingBy(w -> w,
Collectors.summingInt(w -> 1)));
wc.entrySet()
.stream()
.sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
.limit(5)
.forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
If you don't like method references (some people don't), you may store precompiled regexps in the variables instead.
You are performing several redundant and unnecessary operations.
- You first replace all punctuation characters with empty strings, creating new strings, then you perform a split operation using space characters as boundary. This even risks merging of words which are separated by punctuation without spacing. You could fix that by replacing punctuation by spaces, but in the end, you don’t need that replacement as you can change the split pattern to “punctuation or space” but
You are then filtering the split results by accepting strings solely consisting of word characters only. Since you have already removed all punctuation and spacing characters, this will sort out strings having characters that are neither word, space or punctuation characters and I'm not sure if this is the intended logic. After all, if you are interested in words only, why not search for words only in the first place? Since Java 8 does not support streams of matches, we can direct it to split using non-word characters as boundary.
Then you are doing a .map(s -> s.toLowerCase()).filter(s -> s.length() >= 2)
. Since for English texts, the string length won’t change when changing it to uppercase, the filtering condition is not affected, so we can filter first, skipping the toLowerCase
conversion for strings that are not accepted by the predicate: .filter(s -> s.length() >= 2).map(s -> s.toLowerCase())
. The net benefit might be small, but it doesn’t hurt.
Choosing the right Collector
. Tagir already explained it. In principle, there’s Collectors.counting()
which fits better than Collectors.summingInt(w->1)
, but unfortunately, Oracle’s current implementation is poor as it is based on reduce
, unboxing and reboxing Long
s for all elements.
Putting it all together, you’ll get:
Files.lines(Paths.get("/tmp", "/war-and-peace.txt"))
.flatMap(Pattern.compile("\\W+")::splitAsStream)
.filter(s -> s.length() >= 2)
.map(String::toLowerCase)
.collect(Collectors.groupingBy(w->w, Collectors.summingInt(w->1)))
.entrySet()
.stream()
.sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
.limit(5)
.forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
As explained, don’t be surprised if the word counts are slightly higher than in your approach.