I am working on an application that processes large CSV files (several hundred MB each). Recently I hit a problem that at first looked like a memory leak in the application, but after some investigation it turned out to be a combination of a badly formatted CSV file and CsvListReader trying to parse a never-ending line. As a result, I got the following exception:
at java.lang.OutOfMemoryError.<init>(<unknown string>)
at java.util.Arrays.copyOf(<unknown string>)
Local Variable: char[]#13624
at java.lang.AbstractStringBuilder.expandCapacity(<unknown string>)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(<unknown string>)
at java.lang.AbstractStringBuilder.append(<unknown string>)
at java.lang.StringBuilder.append(<unknown string>)
Local Variable: java.lang.StringBuilder#3
at org.supercsv.io.Tokenizer.readStringList(<unknown string>)
Local Variable: java.util.ArrayList#642
Local Variable: org.supercsv.io.Tokenizer#1
Local Variable: org.supercsv.io.PARSERSTATE#2
Local Variable: java.lang.String#14960
at org.supercsv.io.CsvListReader.read(<unknown string>)
By analyzing the heap dump, and then the CSV file based on the dump findings, I noticed that one of the columns in one of the CSV lines was missing its closing quote. This evidently caused the reader to search for the end of the line by appending file content to its internal string buffer until the heap was exhausted.
Anyway, that was the problem, and it was caused by a badly formatted CSV file: once I removed the offending line, the problem disappeared. What I want to achieve is to tell the reader that:
- All content it interprets always ends at a newline character, even if quotes are not closed properly (no multi-line support)
- Alternatively, there is a limit (in bytes) on the length of a CSV line
Is there a clean way to do this in SuperCSV using CsvListReader (preferred in my case)?
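
To make the two requirements concrete, here is a rough sketch of the kind of check I have in mind, written as a separate pre-scan in plain Java rather than anything inside SuperCSV. It assumes that no field ever spans multiple lines, so a well-formed line must contain an even number of quote characters (an escaped quote "" counts as two). The class and method names are made up for illustration, and the length limit is in characters, not bytes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of SuperCSV: flags lines that would break
// a parser under the "no multi-line fields" assumption.
final class QuoteBalanceCheck {

    // True if the line contains an even number of quote characters.
    // An escaped quote ("") adds two, so it does not change the parity.
    static boolean hasBalancedQuotes(String line, char quoteChar) {
        int quotes = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == quoteChar) {
                quotes++;
            }
        }
        return quotes % 2 == 0;
    }

    // Returns the 1-based numbers of lines that are longer than
    // maxLineLength characters or contain an unclosed quote.
    static List<Integer> findSuspectLines(BufferedReader in, char quoteChar, int maxLineLength) {
        List<Integer> suspects = new ArrayList<>();
        try {
            String line;
            int lineNo = 0;
            while ((line = in.readLine()) != null) {
                lineNo++;
                if (line.length() > maxLineLength || !hasBalancedQuotes(line, quoteChar)) {
                    suspects.add(lineNo);
                }
            }
        } catch (IOException e) {
            // Rethrown unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return suspects;
    }
}
```

Running this over the file before handing it to CsvListReader would let me skip or report the bad lines, but it means reading every file twice, which is why I would prefer a setting in the reader itself.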