In my software I need to split a string into words. I currently have more than 19,000,000 documents with more than 30 words each.
Which of the following two ways is better in terms of performance?
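StringTokenizer sTokenize = new StringTokenizer(s, " ");
while (sTokenize.hasMoreTokens()) {
    String word = sTokenize.nextToken(); // process word here
}
or
String[] splitS = s.split(" ");
for (int i = 0; i < splitS.length; i++) {
    String word = splitS[i]; // process word here
}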
If your data is already in a database and you need to parse the string of words, I would suggest using indexOf repeatedly. It's many times faster than either solution.
However, getting the data from a database is still likely to be much more expensive.
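As a rough illustration of the "indexOf repeatedly" idea, here is a minimal sketch. The class and method names (IndexOfSplitter, splitOnSpace) are made up for this example, and it assumes words are separated by plain space characters only; empty tokens from repeated spaces are skipped.

```java
import java.util.ArrayList;
import java.util.List;

class IndexOfSplitter {

    // Hypothetical helper: splits s on ' ' by scanning with indexOf.
    // Empty tokens (from leading, trailing or repeated spaces) are skipped.
    static List<String> splitOnSpace(String s) {
        List<String> words = new ArrayList<>();
        int start = 0;
        int end;
        while ((end = s.indexOf(' ', start)) >= 0) {
            if (end > start) {
                words.add(s.substring(start, end));
            }
            start = end + 1; // continue scanning after the space
        }
        if (start < s.length()) {
            words.add(s.substring(start)); // last word has no trailing space
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [the, quick, brown, fox]
        System.out.println(splitOnSpace("the quick  brown fox"));
    }
}
```

Whether this is actually faster for your data is something to measure rather than assume; it simply avoids the regex machinery and the tokenizer object.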
The cost of opening a file will be about 8 ms. As the files are so small, your cache may improve performance by a factor of 2-5x. Even so, it's going to spend ~10 hours opening files. The cost of using split vs StringTokenizer is far less than 0.01 ms each. Parsing 19 million documents x 30 words x 8 letters per word should take about 10 seconds (at about 1 GB per 2 seconds).
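The estimates above are back-of-the-envelope arithmetic, and can be reproduced with a few lines. This sketch assumes one byte per letter and 1 GB = 10^9 bytes; the class and method names are made up for illustration, and the inputs (8 ms per open, 2-5x cache speedup, 0.5 GB/s scan rate) are just the rough figures quoted above.

```java
class CostEstimate {
    static final long DOCUMENTS = 19_000_000L;
    static final int WORDS_PER_DOC = 30;
    static final int BYTES_PER_WORD = 8; // assumption: one byte per letter

    // Hours spent opening one file per document at msPerOpen milliseconds each.
    static double openHours(double msPerOpen) {
        return DOCUMENTS * msPerOpen / 1000.0 / 3600.0;
    }

    // Seconds to scan all the text at the given throughput in bytes per second.
    static double parseSeconds(double bytesPerSecond) {
        return (double) DOCUMENTS * WORDS_PER_DOC * BYTES_PER_WORD / bytesPerSecond;
    }

    public static void main(String[] args) {
        System.out.printf("open, no cache:  %.1f h%n", openHours(8.0));       // about 42 h
        System.out.printf("open, 2x faster: %.1f h%n", openHours(8.0 / 2));   // about 21 h
        System.out.printf("open, 5x faster: %.1f h%n", openHours(8.0 / 5));   // about 8 h
        System.out.printf("parse all text:  %.1f s%n", parseSeconds(0.5e9));  // about 9 s
    }
}
```

With these inputs the file opens come to roughly 8 to 21 hours depending on how much the cache helps, against under 10 seconds of parsing, so the choice of tokenizer is lost in the noise.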
If you want to improve performance, I suggest you have far fewer files, e.g. use a database. If you don't want to use an SQL database, I suggest using one of these: http://nosql-database.org/
When running micro (and in this case, even nano) benchmarks, a lot can affect your results, JIT optimizations and garbage collection to name just two.
To get meaningful results out of micro benchmarks, check out the JMH library. It comes bundled with excellent samples showing how to run good benchmarks.
Use split.
StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method instead.
Regardless of its legacy status, I would expect StringTokenizer to be significantly quicker than String.split() for this task, because it doesn't use regular expressions: it just scans the input directly, much as you would yourself via indexOf(). In general, String.split() has to compile the regex every time you call it, so it isn't even as efficient as using a precompiled regular expression directly yourself (although recent JDKs take a fast path that skips the regex engine for simple single-character delimiters such as " ").
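For reference, "using a regular expression directly yourself" would look something like the sketch below: compile the Pattern once and reuse it for every document. The class and method names are made up for this example.

```java
import java.util.regex.Pattern;

class PrecompiledSplit {
    // Compiled once and reused for every call, instead of once per split.
    private static final Pattern SPACE = Pattern.compile(" ");

    // Hypothetical helper: same result as s.split(" "), using the shared Pattern.
    static String[] words(String s) {
        return SPACE.split(s);
    }

    public static void main(String[] args) {
        for (String w : words("split me into words")) {
            System.out.println(w);
        }
    }
}
```

Pattern.split(CharSequence) follows the same rules as String.split(String), so repeated spaces still produce empty strings in the result and trailing empty strings are dropped.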