Any idea how I can get proper lines? Some lines are getting glued together, and I can't figure out why or how to stop it. This is the output I get:
col. 0: Date
col. 1: Col2
col. 2: Col3
col. 3: Col4
col. 4: Col5
col. 5: Col6
col. 6: Col7
col. 7: Col7
col. 8: Col8
col. 0: 2017-05-23
col. 1: String
col. 2: lo rem ipsum
col. 3: dolor sit amet
col. 4: mcdonalds.com/online.html
col. 5: null
col. 6: "","-""-""2017-05-23"
col. 7: String
col. 8: lo rem ipsum
col. 9: dolor sit amet
col. 10: burgerking.com
col. 11: https://burgerking.com/
col. 12: 20
col. 13: 2
col. 14: fake
col. 0: 2017-05-23
col. 1: String
col. 2: lo rem ipsum
col. 3: dolor sit amet
col. 4: wendys.com
col. 5: null
col. 6: "","-""-""2017-05-23"
col. 7: String
col. 8: lo rem ipsum
col. 9: dolor sit amet
col. 10: buggagump.com
col. 11: null
col. 12: "","-""-""2017-05-23"
col. 13: String
col. 14: cheese
col. 15: ad eum
col. 16: mcdonalds.com/online.html
col. 17: null
col. 18: "","-""-""2017-05-23"
col. 19: String
col. 20: burger
col. 21: ludus dissentiet
col. 22: www.mcdonalds.com
col. 23: https://www.mcdonalds.com/
col. 24: 25
col. 25: 3
col. 26: fake
col. 0: 2017-05-23
col. 1: String
col. 2: wine
col. 3: id erat utamur
col. 4: bubbagump.com
col. 5: https://buggagump.com/
col. 6: 25
col. 7: 3
col. 8: fake
done
A sample CSV (the \r\n line endings may have been corrupted when copy/pasting) is available here: https://www.dropbox.com/s/86klza4qok4ty2s/malformed%20csv%20r%20n%20small.csv?dl=0
"Date","Col2","Col3","Col4","Col5","Col6","Col7","Col7","Col8"
"2017-05-23","String","lo rem ipsum","dolor sit amet","mcdonalds.com/online.html","","-","-","-"
"2017-05-23","String","lo rem ipsum","dolor sit amet","burgerking.com","https://burgerking.com/","20","2","fake"
"2017-05-23","String","lo rem ipsum","dolor sit amet","wendys.com","","-","-","-"
"2017-05-23","String","lo rem ipsum","dolor sit amet","buggagump.com","","-","-","-"
"2017-05-23","String","cheese","ad eum","mcdonalds.com/online.html","","-","-","-"
"2017-05-23","String","burger","ludus dissentiet","www.mcdonalds.com","https://www.mcdonalds.com/","25","3","fake"
"2017-05-23","String","wine","id erat utamur","bubbagump.com","https://buggagump.com/","25","3","fake"
Building settings:
    CsvParserSettings settings = new CsvParserSettings();
    settings.setDelimiterDetectionEnabled(true);
    settings.setQuoteDetectionEnabled(true);
    settings.setLineSeparatorDetectionEnabled(false); // all the same using `true`
    settings.getFormat().setLineSeparator("\r\n");

    CsvParser parser = new CsvParser(settings);
    List<String[]> rows;
    rows = parser.parseAll(getReader("testFiles/" + "malformed csv small.csv"));

    for (String[] row : rows)
    {
        System.out.println("");
        int i = 0;
        for (String element : row)
        {
            System.out.println("col. " + i++ + ": " + element);
        }
    }
    System.out.println("done");
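Since you are testing the auto-detection process, I suggest printing out the detected format, which the parser exposes through `getDetectedFormat()` once parsing has started. For your sample, the printout shows `-` as the quote escape character.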
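For intuition about why a quote escape of `-` would glue rows together, here is a toy reader written for illustration only. It is not univocity's algorithm (the real parser handles unescaped quotes more leniently, so its exact output differs), and the class and method names are made up for this sketch. The input is shortened from two rows of the sample above.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteEscapeDemo {

    // Toy reader, NOT univocity's algorithm: ',' delimiter, '"' quote and a
    // configurable quote escape. Inside quotes, escape + quote is a literal quote.
    static List<List<String>> parse(String input, char escape) {
        List<List<String>> rows = new ArrayList<>();
        List<String> row = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (inQuotes) {
                boolean nextIsQuote = i + 1 < input.length() && input.charAt(i + 1) == '"';
                if (c == escape && nextIsQuote) {
                    field.append('"'); // escaped quote: keep it and stay inside the field
                    i++;
                } else if (c == '"') {
                    inQuotes = false;  // closing quote
                } else {
                    field.append(c);   // includes \r and \n: the row does not end here
                }
            } else if (c == '"') {
                inQuotes = true;       // opening quote
            } else if (c == ',') {
                row.add(field.toString());
                field.setLength(0);
            } else if (c == '\n') {
                row.add(field.toString());
                field.setLength(0);
                rows.add(row);
                row = new ArrayList<>();
            } else if (c != '\r') {
                field.append(c);
            }
        }
        if (field.length() > 0 || !row.isEmpty()) {
            row.add(field.toString()); // flush whatever is left at end of input
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        String csv = "\"wendys.com\",\"\",\"-\",\"-\",\"-\"\r\n"
                   + "\"burgerking.com\",\"https://burgerking.com/\",\"20\",\"2\",\"fake\"\r\n";
        System.out.println("escape \" -> " + parse(csv, '"').size() + " row(s)");
        System.out.println("escape -  -> " + parse(csv, '-').size() + " row(s)");
    }
}
```

With `"` as the escape, the toy reader returns two rows. With `-`, the first `"-"` field is read as an opening quote followed by an escaped quote, so the field never closes at the line break and both lines end up in a single row.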
As you can see, the parser is not detecting the quote escape correctly. While the format detection process is typically very good, it is not guaranteed to always get it right, especially with small test samples. In your sample I can't see why it would pick up `-` as the escape character, so I opened an issue to investigate what is making it detect that one.

What you can do right now as a workaround, if you know for a fact that none of your input files will ever have `-` as the quote escape, is to detect the format, test what it picked up from the input, correct it if needed, and then parse the contents with the corrected format. With those steps wrapped in a `parse` method, just call it and you will have your data properly extracted. Hope this helps!
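The original code for this workaround did not survive in this post, so the following is a minimal sketch of the steps described above, not necessarily the exact code that was posted. It assumes univocity-parsers is on the classpath. The class name `DetectThenParse`, the shape of the `parse` helper, and taking the file path from the command line are choices made for this example, and the calls should be checked against the library version you use.

```java
import com.univocity.parsers.csv.CsvFormat;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

import java.io.File;
import java.util.List;

public class DetectThenParse {

    // Hypothetical helper: detect the format, fix a wrong quote escape, then parse.
    static List<String[]> parse(File input) {
        // Step 1: start parsing only to trigger the format detection.
        CsvParserSettings detectionSettings = new CsvParserSettings();
        detectionSettings.detectFormatAutomatically();
        CsvParser detector = new CsvParser(detectionSettings);
        detector.beginParsing(input);
        CsvFormat format = detector.getDetectedFormat();
        detector.stopParsing(); // no need to read the rest with this parser

        // Assumption: fall back to the default format if nothing was detected.
        if (format == null) {
            format = new CsvFormat();
        }
        System.out.println(format); // shows what the detection picked up

        // Step 2: this is only safe if '-' is never a real quote escape in your files.
        if (format.getQuoteEscape() == '-') {
            format.setQuoteEscape('"');
        }

        // Step 3: parse again from the start with the corrected format.
        CsvParserSettings settings = new CsvParserSettings();
        settings.setFormat(format);
        return new CsvParser(settings).parseAll(input);
    }

    public static void main(String[] args) {
        List<String[]> rows = parse(new File(args[0])); // path to the CSV file
        System.out.println(rows.size() + " rows parsed");
    }
}
```

The second parser is created with automatic detection left off, so the corrected `CsvFormat` is used as given instead of being detected again.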