Performance of StringTokenizer class vs. String.sp

Posted 2020-01-24 01:55

In my software I need to split strings into words. I currently have more than 19,000,000 documents with more than 30 words each.

Which of the following two approaches is better in terms of performance?

StringTokenizer sTokenize = new StringTokenizer(s, " ");
while (sTokenize.hasMoreTokens()) {
    String word = sTokenize.nextToken(); // process the word here
}

or

String[] splitS = s.split(" ");
for (int i = 0; i < splitS.length; i++) {
    String word = splitS[i]; // process the word here
}

10 answers
地球回转人心会变
#2 · 2020-01-24 02:38

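The Java API specification recommends using split. The documentation of StringTokenizer describes it as a legacy class retained for compatibility, discourages its use in new code, and recommends the split method of String or the java.util.regex package instead.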
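One practical difference is worth keeping in mind when switching between the two. The sketch below (class and method names are made up for illustration) shows that split keeps the empty string between two consecutive delimiters, while StringTokenizer silently skips it:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class, not part of any answer in this thread.
class SplitVsTokenizer {

    // Collects the pieces produced by String.split on a single space.
    static List<String> withSplit(String s) {
        List<String> out = new ArrayList<>();
        for (String w : s.split(" ")) {
            out.add(w);
        }
        return out;
    }

    // Collects the tokens produced by StringTokenizer on a single space.
    static List<String> withTokenizer(String s) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(s, " ");
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "a  b c"; // note the two spaces between "a" and "b"
        System.out.println(withSplit(s));     // prints [a, , b, c]
        System.out.println(withTokenizer(s)); // prints [a, b, c]
    }
}
```

If the documents can contain repeated spaces, the two loops in the question will therefore not see the same list of words.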

Lonely孤独者°
#3 · 2020-01-24 02:41

In Java 7, split just calls indexOf for this input (a single-character delimiter that is not a regex metacharacter); see the source. Split should therefore be very fast, close to repeated calls of indexOf.
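To make "repeated calls of indexOf" concrete, here is a minimal hand-written sketch of that idea. It is not the JDK source; the class and method names are hypothetical, and unlike String.split it keeps trailing empty strings:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, written for illustration only.
class IndexOfSplit {

    // Splits s on a single delimiter character using repeated indexOf calls.
    static List<String> split(String s, char delim) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        int end;
        while ((end = s.indexOf(delim, start)) != -1) {
            parts.add(s.substring(start, end)); // text before the delimiter
            start = end + 1;                    // continue after the delimiter
        }
        parts.add(s.substring(start)); // remainder after the last delimiter
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("the quick brown fox", ' ')); // prints [the, quick, brown, fox]
    }
}
```

No regular expression is compiled anywhere in this loop, which is why the single-character case can be expected to be cheap.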

闹够了就滚
#4 · 2020-01-24 02:49

What do the 19,000,000 documents have to do with it? Do you have to split the words in all the documents on a regular basis, or is it a one-shot problem?

If you display or request one document at a time, with only 30 words each, the problem is so tiny that any method would work.

If you have to process all the documents at once, with only 30 words each, the per-document work is so small that you are more likely to be I/O bound anyway.

我只想做你的唯一
#5 · 2020-01-24 02:49

Performance-wise, StringTokenizer is much faster than split. Check the code below:

[Image: benchmark code; not reproduced here]

However, according to the Java docs, its use is discouraged; see the StringTokenizer Javadoc.
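The benchmark code from this answer did not survive as text. As a stand-in, here is a rough sketch of how such a comparison could be timed. It is not the answerer's code: the class name, the 30-word test document, and the iteration count are all assumptions, and a hand-rolled System.nanoTime loop like this gives only a coarse indication (a harness such as JMH is more trustworthy).

```java
import java.util.StringTokenizer;

// Hypothetical micro-benchmark sketch; the numbers it prints vary by JVM and machine.
class SplitBenchmarkSketch {

    // Counts the words in s using StringTokenizer.
    static int countWithTokenizer(String s) {
        int n = 0;
        StringTokenizer st = new StringTokenizer(s, " ");
        while (st.hasMoreTokens()) {
            st.nextToken();
            n++;
        }
        return n;
    }

    // Counts the words in s using String.split.
    static int countWithSplit(String s) {
        return s.split(" ").length;
    }

    public static void main(String[] args) {
        // Build an assumed 30-word document: "word0 word1 ... word29".
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 30; i++) {
            if (i > 0) sb.append(' ');
            sb.append("word").append(i);
        }
        String doc = sb.toString();
        int iterations = 200_000; // assumed; raise it for steadier numbers

        long sink = 0; // accumulated so the JIT cannot discard the work

        // Warm-up pass so both code paths are compiled before timing.
        for (int i = 0; i < iterations; i++) {
            sink += countWithTokenizer(doc) + countWithSplit(doc);
        }

        long t0 = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += countWithTokenizer(doc);
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += countWithSplit(doc);
        }
        long t2 = System.nanoTime();

        System.out.println("StringTokenizer: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("String.split:    " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("(checksum " + sink + ")");
    }
}
```

Both methods return the same word count for single-space-separated input, so the two timed loops do equivalent work.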

聊天终结者
#6 · 2020-01-24 02:52

Another important thing, undocumented as far as I noticed, is that asking for the StringTokenizer to return the delimiters along with the tokenized string (by using the constructor StringTokenizer(String str, String delim, boolean returnDelims)) also reduces processing time. So, if you're looking for performance, I would recommend using something like:

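import java.util.StringTokenizer;

// The members below belong inside an enclosing class.
private static final String DELIM = "#";

public void splitIt(String input) {
    // returnDelims = true: delimiters come back as tokens of their own
    StringTokenizer st = new StringTokenizer(input, DELIM, true);
    while (st.hasMoreTokens()) {
        String next = getNext(st);
        System.out.println(next);
    }
}

// Returns the next field, or null where a delimiter appears in place of a field.
private String getNext(StringTokenizer st) {
    String value = st.nextToken();
    if (DELIM.equals(value)) {
        value = null; // empty field: a delimiter where a field was expected
    } else if (st.hasMoreTokens()) {
        st.nextToken(); // discard the delimiter that follows the field
    }
    return value;
}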

Despite the overhead introduced by the getNext() method, that discards the delimiters for you, it's still 50% faster according to my benchmarks.

女痞
#7 · 2020-01-24 02:52

This could be a reasonable benchmark, using Java 1.6.0:

http://www.javamex.com/tutorials/regular_expressions/splitting_tokenisation_performance.shtml#.V6-CZvnhCM8