In my software I need to split a string into words. I currently have more than 19,000,000 documents with more than 30 words each.
Which of the following two approaches is better in terms of performance?
StringTokenizer sTokenize = new StringTokenizer(s, " ");
while (sTokenize.hasMoreTokens()) {
    String token = sTokenize.nextToken();
}
or
String[] splitS = s.split(" ");
for (int i = 0; i < splitS.length; i++) {
    String token = splitS[i];
}
The Java API specification recommends using split. See the documentation of StringTokenizer.
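For reference, the StringTokenizer javadoc itself illustrates the recommended split-based replacement with an example along these lines (quoted from memory, so treat the exact snippet as approximate):

String[] result = "this is a test".split("\\s");
for (int x = 0; x < result.length; x++) {
    System.out.println(result[x]);
}

This prints each of the four words on its own line.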
Split in Java 7 just calls indexOf for this input (a single non-regex character); see the source. Split should be very fast, close to repeated calls of indexOf.
What do the 19,000,000 documents have to do with it? Do you have to split words in all the documents on a regular basis, or is it a one-shot problem?
If you display/request one document at a time, with only 30 words, this is such a tiny problem that any method would work.
If you have to process all the documents at once, with only 30 words each, this is still such a tiny problem that you are more likely to be I/O bound anyway.
Performance-wise, StringTokenizer is way better than split. Check the code below.
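(The original benchmark is not preserved here; the following is a minimal sketch of that kind of comparison, with the sample string, iteration count, and timing approach assumed.)

import java.util.StringTokenizer;

public class TokenizeBenchmark {
    public static void main(String[] args) {
        String s = "Hello World Hello World Hello World Hello World";
        int runs = 1000000;

        // Time StringTokenizer over many iterations
        long start = System.currentTimeMillis();
        for (int i = 0; i < runs; i++) {
            StringTokenizer st = new StringTokenizer(s, " ");
            while (st.hasMoreTokens()) {
                st.nextToken();
            }
        }
        System.out.println("StringTokenizer: " + (System.currentTimeMillis() - start) + " ms");

        // Time String.split over the same input
        start = System.currentTimeMillis();
        for (int i = 0; i < runs; i++) {
            String[] parts = s.split(" ");
        }
        System.out.println("String.split:    " + (System.currentTimeMillis() - start) + " ms");
    }
}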
But according to the Java docs its use is discouraged; see the note in the StringTokenizer documentation.
Another important thing, undocumented as far as I noticed, is that asking StringTokenizer to return the delimiters along with the tokens (by using the constructor
StringTokenizer(String str, String delim, boolean returnDelims)
) also reduces processing time. So, if you're looking for performance, I would recommend using something like the sketch below. Despite the overhead introduced by the getNext() method, which discards the delimiters for you, it's still 50% faster according to my benchmarks.
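The answer's original snippet is missing here; the following is a sketch reconstructed from the description above, so everything beyond the getNext() name and the returnDelims constructor is an assumption:

import java.util.StringTokenizer;

public class DelimTokenizer {
    private static final String DELIM = " ";

    // Returns the next real token, or null if the next token is a delimiter.
    // With returnDelims = true the tokenizer hands back delimiters too,
    // so this helper skips them for the caller.
    private static String getNext(StringTokenizer st) {
        String value = st.nextToken();
        if (DELIM.equals(value)) {
            return null;
        }
        // Consume the delimiter that follows this token, if any
        if (st.hasMoreTokens()) {
            st.nextToken();
        }
        return value;
    }

    public static void main(String[] args) {
        StringTokenizer st = new StringTokenizer("one two three", DELIM, true);
        while (st.hasMoreTokens()) {
            String token = getNext(st);
            if (token != null) {
                System.out.println(token);
            }
        }
    }
}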
This could be a reasonable benchmark using Java 1.6.0:
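The benchmark itself did not survive here; the sketch below is one plausible shape for it, with a precompiled java.util.regex.Pattern added as a third contender because on 1.6 split(" ") compiles a fresh Pattern on every call (the string, run count, and output format are assumptions):

import java.util.StringTokenizer;
import java.util.regex.Pattern;

public class SplitBenchmark16 {
    public static void main(String[] args) {
        String s = "stackoverflow is a good place to ask java questions";
        int runs = 1000000;
        Pattern p = Pattern.compile(" "); // compiled once, reused below

        long start = System.currentTimeMillis();
        for (int i = 0; i < runs; i++) {
            StringTokenizer st = new StringTokenizer(s, " ");
            while (st.hasMoreTokens()) {
                st.nextToken();
            }
        }
        System.out.println("StringTokenizer:     " + (System.currentTimeMillis() - start) + " ms");

        start = System.currentTimeMillis();
        for (int i = 0; i < runs; i++) {
            String[] parts = s.split(" "); // compiles a fresh Pattern each call on 1.6
        }
        System.out.println("String.split:        " + (System.currentTimeMillis() - start) + " ms");

        start = System.currentTimeMillis();
        for (int i = 0; i < runs; i++) {
            String[] parts = p.split(s);   // reuses the precompiled Pattern
        }
        System.out.println("Precompiled Pattern: " + (System.currentTimeMillis() - start) + " ms");
    }
}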