CSV parsing for embedded double quotes

I've written a simple CSV file parser. After reading the wiki page on CSV formats, I noticed some "extensions" to the basic format, specifically commas embedded in a field by wrapping the field in double quotes. I've managed to parse those, but there is a second issue: embedded double quotes.

Example:

12345,"ABC, ""IJK"" XYZ" -> [12345] and [ABC, "IJK" XYZ]

I can't seem to find the correct way to distinguish between an embedded double quote and an enclosing one. So my question is: what is the correct way/algorithm to parse CSV formats such as the one above?

Tags: c++ algorithm parsing csv

5 answers

Lonely孤独者°

#2 · 2019-05-07 02:44

A double double-quote ("") is a literal double-quote, while a lone double-quote (") is used for enclosing text (including commas).

Here's a regex for a csv field, if that makes things easier:

([^",\n][^,\n]*)|"((?:[^"]|"")+)"

Group 1 will contain the field if it isn't in quotes, group 2 will contain the field if it is in quotes, minus the surrounding quotes. In that case, just replace all instances of "" with ".
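
As a rough sketch of how that regex could be applied from C++, assuming std::regex with its default ECMAScript grammar and one record per call with no trailing newline. The name split_csv_regex is made up for illustration. Note that the pattern itself never matches an empty field, so the sketch handles an empty unquoted field separately and rejects an empty quoted field ("") as malformed.

```cpp
#include <cstddef>
#include <regex>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: splits one CSV record into fields using the regex above.
std::vector<std::string> split_csv_regex(const std::string& line) {
    static const std::regex field_re(R"re(([^",\n][^,\n]*)|"((?:[^"]|"")+)")re");
    std::vector<std::string> fields;
    auto pos = line.cbegin();
    const auto end = line.cend();
    while (true) {
        std::string field;
        // An empty field (next char is a comma, or end of line) is handled
        // here because the regex requires at least one character.
        if (pos != end && *pos != ',') {
            std::smatch m;
            // match_continuous forces the match to start exactly at pos.
            if (!std::regex_search(pos, end, m, field_re,
                                   std::regex_constants::match_continuous)) {
                throw std::runtime_error("malformed CSV field");
            }
            if (m[1].matched) {
                field = m[1].str();  // unquoted field, taken as-is
            } else {
                // Quoted field: every quote in group 2 is part of a "" pair,
                // so keep the first quote of each pair and skip the second.
                const std::string raw = m[2].str();
                for (std::size_t i = 0; i < raw.size(); ++i) {
                    field += raw[i];
                    if (raw[i] == '"') ++i;
                }
            }
            pos = m[0].second;
        }
        fields.push_back(field);
        if (pos == end) break;
        if (*pos != ',') throw std::runtime_error("expected comma after field");
        ++pos;  // step over the comma
    }
    return fields;
}
```

Calling split_csv_regex on the line from the question should give the two fields 12345 and ABC, "IJK" XYZ.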


我欲成王，谁敢阻挡

#3 · 2019-05-07 02:45

I would do this using a single-character look-ahead: if you're scanning the string and find a double quote, look at the next character to see if it is also a double quote. If it is, then the pair represents a single double-quote character in the output. If it's any other character, you're looking at the end of the quoted string (and hopefully that next character is a comma!). Be sure to account for the end-of-line condition when looking at the next character, too.
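
A minimal sketch of that look-ahead in C++, assuming one record per call with no trailing newline. The function name split_csv_lookahead is invented for this example, and the sketch is lenient: it does not check that a closing quote is actually followed by a comma.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits one CSV record using one character of look-ahead.
std::vector<std::string> split_csv_lookahead(const std::string& line) {
    std::vector<std::string> fields;
    std::string field;
    bool in_quotes = false;
    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (in_quotes) {
            if (c == '"') {
                // Peek at the next character, guarding against end of line.
                if (i + 1 < line.size() && line[i + 1] == '"') {
                    field += '"';  // "" inside quotes is one literal quote
                    ++i;           // consume the second quote of the pair
                } else {
                    in_quotes = false;  // lone quote closes the quoted text
                }
            } else {
                field += c;  // commas are ordinary characters inside quotes
            }
        } else if (c == '"') {
            in_quotes = true;  // opening quote
        } else if (c == ',') {
            fields.push_back(field);  // unquoted comma ends the field
            field.clear();
        } else {
            field += c;
        }
    }
    fields.push_back(field);  // last field has no trailing comma
    return fields;
}
```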


一纸荒年 Trace。

#4 · 2019-05-07 02:53

I suggest reading: Stop Rolling Your Own CSV Parser and this CSV RFC. The first is really just someone who wants you to use their C# CSV parser, but still explains many issues.

Your parser should examine one character at a time. I used a two-bool strategy for my parser in D: one bool records whether the cell is currently quoted, and the other flags a quote that was just seen inside a quoted cell. Each quote toggles whether the string is quoted or not. When you hit a quote inside a quoted cell, set the flag and turn quoting off. If the next character is a quote, turn quoting back on, add a quote to the result and clear the flag. If the next character isn't a quote, clear the flag and leave quoting off.
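
Here is how I would guess the two-bool idea looks in C++ rather than D. This is only a sketch of the description above, not the original parser; the names split_csv_flags, quoted and saw_quote are made up, and it handles a single record with no trailing newline.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: two-flag state machine, no look-ahead needed.
std::vector<std::string> split_csv_flags(const std::string& line) {
    std::vector<std::string> fields;
    std::string field;
    bool quoted = false;     // currently inside a quoted cell
    bool saw_quote = false;  // previous char was a quote seen while quoted
    for (char c : line) {
        if (c == '"') {
            if (saw_quote) {
                // Second quote of a "" pair: emit a literal quote and
                // switch quoting back on.
                field += '"';
                quoted = true;
                saw_quote = false;
            } else if (quoted) {
                // Either a closing quote or the first half of "".
                // Turn quoting off for now and let the next char decide.
                quoted = false;
                saw_quote = true;
            } else {
                quoted = true;  // opening quote
            }
            continue;
        }
        // Any non-quote character means a pending quote was a closing quote.
        saw_quote = false;
        if (c == ',' && !quoted) {
            fields.push_back(field);
            field.clear();
        } else {
            field += c;
        }
    }
    fields.push_back(field);
    return fields;
}
```

The trade-off versus a look-ahead is that the parser only ever looks at the current character, which is handy when the input arrives one character at a time.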


Ridiculous、

#5 · 2019-05-07 02:55

The way I normally think about this is to treat a value as either a single unquoted value or a sequence of quoted segments that are joined by quote characters to form the value. That is,

to parse the next atom in the row:
- read up to the first non whitespace character
- if the current character is not a quote:
  - mark the current spot
  - read up to the next comma or newline
  - return the text between the mark and the character before the comma (strip spaces if appropriate)
- if the current character is a quote:
  - create an empty string buffer
  - while the current character is a quote
    - mark the current position +1 (skip the quote character)
    - read up to the next quote
    - if this is not the first segment, append a quote to the buffer
    - append to the buffer the text between the mark and the character before the current position (to strip both quotes)
    - advance one character (past the just read quote)
  - read up to the next comma or newline
  - return the buffer

Essentially, split the quoted string into its double-quoted segments and then concatenate them with a quote between each pair. Thus "ABC, ""IJK"" XYZ" splits into [ABC, ], [IJK] and [ XYZ], which are joined to give ABC, "IJK" XYZ
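
The steps above could be sketched in C++ roughly as follows. The names next_atom and split_csv_segments are invented for illustration, the input is assumed to be one record with no trailing newline, and an unterminated quote is simply read to the end of the line.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: reads one field starting at pos and leaves pos on the
// following comma, or at line.size() if the record has ended.
std::string next_atom(const std::string& line, std::size_t& pos) {
    // Read up to the first non-whitespace character.
    while (pos < line.size() && (line[pos] == ' ' || line[pos] == '\t')) ++pos;

    if (pos >= line.size() || line[pos] != '"') {
        const std::size_t mark = pos;
        while (pos < line.size() && line[pos] != ',') ++pos;
        return line.substr(mark, pos - mark);  // text between mark and comma
    }

    std::string buffer;
    // Track the first segment with a flag: testing for an empty buffer would
    // drop the quote in a field that starts with an escaped quote.
    bool first = true;
    while (pos < line.size() && line[pos] == '"') {
        const std::size_t mark = pos + 1;          // skip the opening quote
        std::size_t close = line.find('"', mark);  // read up to the next quote
        if (close == std::string::npos) close = line.size();
        if (!first) buffer += '"';                 // segments are joined by "
        buffer.append(line, mark, close - mark);
        first = false;
        pos = (close < line.size()) ? close + 1 : close;  // step past the quote
    }
    // Ignore anything between the closing quote and the next comma.
    while (pos < line.size() && line[pos] != ',') ++pos;
    return buffer;
}

// Hypothetical driver: collects every atom in the record.
std::vector<std::string> split_csv_segments(const std::string& line) {
    std::vector<std::string> fields;
    std::size_t pos = 0;
    while (true) {
        fields.push_back(next_atom(line, pos));
        if (pos >= line.size()) break;
        ++pos;  // step over the comma
    }
    return fields;
}
```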


Melony?

#6 · 2019-05-07 03:00

If you find a double-quote, then you should look for the matching double-quote at the end of the word/string. If you can't find one, then there is an error. The same applies to a quote.

I suggest you try Flex/Bison to write a parser for the CSV file. The two tools generate a parser for you as C files, which you can then call from your C++ program. In Flex, you create a scanner that finds your tokens, like "word" or ""word"". In Bison, you define the syntax.

