Broken CSV, how can I fix it?

2019-06-28 02:17发布

I'm trying to parse a CSV. I'd like to get it into a DB or just parse it with JavaScript, but either way fails due to the broken syntax. My entire CSV file is here:
https://gist.github.com/1023560

If you notice, it breaks where there are double quotes in the double quotes and it also fails when inserting into MySQL. The first break is seen at line 13. It breaks off and instead of returning the full:

 <a href="http://www.facebook.com/pages/Portland-Community-Gardens/139244076118027?v=wall" target="_blank"><img src="/shared/cfm/image.cfm?id=348340" alt="Facebook" width="100" height="31" /></a>

It returns:

<a href="

For JavaScript I was going to just use CSVToArray() by Ben Nadel:
http://www.bennadel.com/blog/1504-Ask-Ben-Parsing-CSV-Strings-With-Javascript-Exec-Regular-Expression-Command.htm

My ultimate goal tho is to put into MySQL so I can echo back a JSON feed with PHP's json_encode().

Things I noticed that could be problematic are that double quotes can be in HTML tags like above, but also as the textNodes of HTML tags, so "<span class="text">"Example"</span>"

The first set of quotes is the CSV column, the 2nd are HTML quotes, and the third are text quotes.

3条回答
够拽才男人
2楼-- · 2019-06-28 02:26

The quotes don't matter as much as the commas do. If comma is the delimiter, then you can't have commas in the values. If you can get the CSV saved using a different delimter, you might get better results.
Use a character like ~ or a ^ instead of a comma as the delimiter.

查看更多
Animai°情兽
3楼-- · 2019-06-28 02:27

Assuming you are either on Windows or can do this on a Windows box, check out Logparser. It is a free command line utility that can parse many data formats including CSV, and can output to many formats including SQL.

查看更多
SAY GOODBYE
4楼-- · 2019-06-28 02:36

You may be able to trick it and use a regex to look for:

"(.*?)"(?=,|$)

But that's kind of hack-ish (basically, only accept an end quote when immediately followed by a comma or an end of line). Same logic would apply to a find-replace. (Again, this all assumes that a "stray" quote will never follow standard CSV rules (e.g. have a comma/line [beginning/end] before or after it))

I assume you have no control over the original data and have to work with what you have?

EDIT

Though I've only tried this on a small sample of your data, this appears to find the "stray" quotes, to which you can use a replace with "" on:

(?<!^|"|,)"(?!"|,|$)
查看更多
登录 后发表回答