I have a question about optimizing my code (it works, but it is too slow...). I am reading input in the form
X1 Y1
X2 Y2
etc.
where Xi, Yi are integers. I am using BufferedReader
to read lines and then StringTokenizer
to process the numbers, like this:
StringTokenizer st = new StringTokenizer(line, " ");
int x = Integer.parseInt(st.nextToken());
int y = Integer.parseInt(st.nextToken());
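For context, a self-contained version of this reading loop might look like the following (a sketch; the per-pair processing here is just a placeholder for whatever the real code does):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.StringTokenizer;

public class ReadPairs {
    // Read "X Y" pairs line by line; the summing stands in for real processing.
    static long sumPairs(BufferedReader br) throws IOException {
        long total = 0;
        String line;
        while ((line = br.readLine()) != null) {
            StringTokenizer st = new StringTokenizer(line, " ");
            int x = Integer.parseInt(st.nextToken());
            int y = Integer.parseInt(st.nextToken());
            total += x + y; // placeholder for real per-pair work
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader br = new BufferedReader(new StringReader("1 2\n3 4\n"));
        System.out.println(sumPairs(br)); // prints 10
    }
}
```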
The problem is that this approach seems time-inefficient when dealing with large data sets. Could you suggest some simple improvement (I have heard that Integer.parseInt or regex can be used) that would improve the performance? Thanks for any tips.
EDIT: Perhaps I misjudged this and some improvements have to be made elsewhere in the code...
You could use a regex to check whether the value is numeric and then convert it to an integer:
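A minimal sketch of that idea (the pattern, class, and helper names here are illustrative, not from the original answer):

```java
import java.util.regex.Pattern;

public class RegexParse {
    // Precompile the pattern once; compiling it per call would be slow.
    private static final Pattern NUMBER = Pattern.compile("-?\\d+");

    // Returns the parsed value, or null if the token is not numeric.
    static Integer tryParse(String token) {
        return NUMBER.matcher(token).matches() ? Integer.parseInt(token) : null;
    }

    public static void main(String[] args) {
        System.out.println(tryParse("42"));  // prints 42
        System.out.println(tryParse("abc")); // prints null
    }
}
```

Note that the regex check adds work on top of Integer.parseInt, so this helps with validation rather than raw speed.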
(updated answer)
I can say that whatever is slowing your program down, the choice of tokenizer is not it. After an initial run of each method to even out initialisation quirks, I can parse 1,000,000 rows of "12 34" in milliseconds. You could switch to using indexOf if you like, but I really think you need to look for the bottleneck in other parts of your code rather than in this micro-optimisation. String.split was a surprise for me: it's really, really slow compared to the other methods. I've also added a Guava Splitter test; it's faster than String.split but slightly slower than StringTokenizer.
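For reference, the indexOf variant mentioned above could look something like this (a sketch, not the exact code from the benchmark):

```java
public class IndexOfParse {
    // Parse "X Y" by locating the separating space manually and
    // calling Integer.parseInt on the two substrings.
    static int[] parsePair(String line) {
        int space = line.indexOf(' ');
        int x = Integer.parseInt(line.substring(0, space));
        int y = Integer.parseInt(line.substring(space + 1));
        return new int[]{x, y};
    }

    public static void main(String[] args) {
        int[] p = parsePair("12 34");
        System.out.println(p[0] + " " + p[1]); // prints 12 34
    }
}
```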
The difference here is pretty negligible even over millions of rows.
There's now a write-up of this on my blog: http://demeranville.com/battle-of-the-tokenizers-delimited-text-parser-performance/
Code I ran was:
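The original benchmark listing is not reproduced here; a simplified sketch of the kind of comparison described (structure and names are illustrative, not the author's exact code) might look like:

```java
import java.util.StringTokenizer;

public class TokenizerBench {
    static final int ROWS = 1_000_000;

    static long timeStringTokenizer() {
        long start = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < ROWS; i++) {
            StringTokenizer st = new StringTokenizer("12 34", " ");
            sum += Integer.parseInt(st.nextToken()) + Integer.parseInt(st.nextToken());
        }
        if (sum == 0) throw new AssertionError(); // keep the JIT from eliding the loop
        return System.nanoTime() - start;
    }

    static long timeSplit() {
        long start = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < ROWS; i++) {
            String[] parts = "12 34".split(" ");
            sum += Integer.parseInt(parts[0]) + Integer.parseInt(parts[1]);
        }
        if (sum == 0) throw new AssertionError();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Warm-up runs first, to even out initialisation as the answer describes.
        timeStringTokenizer();
        timeSplit();
        System.out.println("StringTokenizer: " + timeStringTokenizer() / 1_000_000 + " ms");
        System.out.println("String.split:    " + timeSplit() / 1_000_000 + " ms");
    }
}
```

A micro-benchmark like this is only indicative; a harness such as JMH would give more trustworthy numbers.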
ETA: here's the Guava code:
Update: I've added a CsvMapper test too: