I am trying to read big CSV and TSV (tab-separated) files with about 1000000 rows or more. I tried to read a TSV containing ~2500000 lines with opencsv, but it throws a java.lang.NullPointerException. It works with smaller TSV files of ~250000 lines. So I was wondering whether there are any other libraries that support reading huge CSV and TSV files. Do you have any ideas?
For anyone interested, here is my code (I shortened it, so the try-catch is obviously invalid):
    InputStreamReader in = null;
    CSVReader reader = null;
    try {
        in = this.replaceBackSlashes();
        reader = new CSVReader(in, this.seperator, '\"', this.offset);
        ret = reader.readAll();
    } finally {
        try {
            reader.close();
        }
    }
Edit: This is the method where I construct the InputStreamReader:
    private InputStreamReader replaceBackSlashes() throws Exception {
        FileInputStream fis = null;
        Scanner in = null;
        try {
            fis = new FileInputStream(this.csvFile);
            in = new Scanner(fis, this.encoding);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            while (in.hasNext()) {
                String nextLine = in.nextLine().replace("\\", "/");
                // nextLine = nextLine.replaceAll(" ", "");
                nextLine = nextLine.replaceAll("'", "");
                out.write(nextLine.getBytes());
                out.write("\n".getBytes());
            }
            return new InputStreamReader(new ByteArrayInputStream(out.toByteArray()));
        } catch (Exception e) {
            in.close();
            fis.close();
            this.logger.error("Problem at replaceBackSlashes", e);
        }
        throw new Exception();
    }
Try switching libraries as suggested by Satish. If that doesn't help, you have to split the whole file into tokens and process them yourself. Assuming your CSV doesn't contain any escape characters for commas, you can then process it directly. Don't forget to trim each token before using it.
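A minimal sketch of the manual approach, assuming the file has no quoted or escaped separators inside fields. The class and method names (SimpleTokenSplitter, splitLine, readAll) are made up for illustration. It reads line by line, splits on the separator, and trims every token:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of opencsv or any other library.
class SimpleTokenSplitter {

    // Splits one line on the separator and trims every token.
    // The limit of -1 keeps trailing empty fields instead of dropping them.
    static String[] splitLine(String line, char separator) {
        String[] tokens = line.split(Pattern.quote(String.valueOf(separator)), -1);
        for (int i = 0; i < tokens.length; i++) {
            tokens[i] = tokens[i].trim();
        }
        return tokens;
    }

    // Reads the input one line at a time and collects the split rows.
    // Assumption: no field contains the separator or a line break.
    static List<String[]> readAll(Reader source, char separator) throws IOException {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                rows.add(splitLine(line, separator));
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        List<String[]> rows = readAll(new StringReader("a , b,c\n1,, 3 \n"), ',');
        for (String[] row : rows) {
            System.out.println(String.join("|", row));
        }
    }
}
```

Note that readAll still keeps every row in memory, so for very large files it may be better to process each row inside the loop instead of collecting them.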
Do not use a CSV parser to parse TSV inputs. It will break if the TSV has fields with a quote character, for example.
uniVocity-parsers comes with a TSV parser. You can parse a billion rows without problems.
Example to parse a TSV input:
If your input is so big it can't be kept in memory, do this:
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
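To illustrate the row-at-a-time idea using only the standard library, here is a minimal sketch. It is not the uniVocity API. TsvRowStreamer and its methods are hypothetical names, and the sketch assumes fields contain no escaped tabs or line breaks. Quote characters are treated as ordinary data, which is the point made above about TSV:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;

// Hypothetical helper for illustration only.
class TsvRowStreamer {

    // Splits on tabs only; quote characters are left untouched as field data.
    // The limit of -1 keeps trailing empty fields.
    static String[] splitTsvLine(String line) {
        return line.split("\t", -1);
    }

    // Hands each row to the handler and does not retain it afterwards,
    // so memory use does not grow with the number of rows.
    // Returns the number of rows processed.
    static long forEachRow(Reader source, Consumer<String[]> handler) throws IOException {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                handler.accept(splitTsvLine(line));
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        String sample = "id\tname\n1\tsay \"hi\"\n";
        long rows = forEachRow(new StringReader(sample),
                row -> System.out.println(row.length + " fields, last = " + row[row.length - 1]));
        System.out.println(rows + " rows");
    }
}
```

A dedicated TSV parser additionally handles escape sequences and odd line endings, which this sketch deliberately leaves out.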
I have not tried it myself, but I looked into Super CSV earlier.
http://sourceforge.net/projects/supercsv/
http://supercsv.sourceforge.net/
Check whether it works for your 2.5 million lines.
I don't know if this question is still active, but here is the one I use successfully. It may still need to implement more interfaces such as Stream or Iterable, however:
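As an illustration of what implementing Iterable could look like for such a reader, here is a hypothetical sketch. DelimitedRows is a made-up name and is not the class from this answer. It assumes fields contain no quoted or escaped separators, and reads lazily so that only the current line is held in memory:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.regex.Pattern;

// Hypothetical single-pass row reader usable in a for-each loop.
class DelimitedRows implements Iterable<String[]> {
    private final BufferedReader reader;
    private final Pattern separator;

    DelimitedRows(Reader source, char separator) {
        this.reader = new BufferedReader(source);
        this.separator = Pattern.compile(Pattern.quote(String.valueOf(separator)));
    }

    // Single-pass: the underlying reader is consumed as the iterator advances.
    @Override
    public Iterator<String[]> iterator() {
        return new Iterator<String[]>() {
            // Read one line ahead so hasNext() can answer without side effects.
            private String nextLine = readLine();

            @Override
            public boolean hasNext() {
                return nextLine != null;
            }

            @Override
            public String[] next() {
                if (nextLine == null) {
                    throw new NoSuchElementException();
                }
                String[] row = separator.split(nextLine, -1);
                nextLine = readLine();
                return row;
            }
        };
    }

    // Wraps the checked IOException because Iterator methods cannot throw it.
    private String readLine() {
        try {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (String[] row : new DelimitedRows(new StringReader("a\tb\n1\t2\n"), '\t')) {
            System.out.println(String.join("|", row));
        }
    }
}
```

The caller remains responsible for closing the Reader passed in, since the sketch does not close it.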