Parsing comma-separated values containing quoted c

Posted 2019-01-27 08:13

I have a string with some special characters. The aim is to retrieve a String[] of the values on each line (comma-separated). The special character is the double quote ("): inside a quoted value there can be \n and , characters.

For example, the main string:
Alpha,Beta,Gama,"23-5-2013,TOM",TOTO,"Julie, KameL
Titi",God," timmy, tomy,tony,
tini".

You can see that there are \n characters inside the quotes.

Can anyone help me parse this?

Thanks

More explanation

From the main string I need to separate out these values:

Alpha
Beta
Gama
23-5-2013,TOM
TOTO
Julie,KameL,Titi
God
timmy, tomy,tony,tini

The problem is: for Julie,KameL,Titi there is a line break (\n) between KameL and Titi; similarly, for timmy, tomy,tony,tini there is a line break (\n) between tony and tini.


Now this text is in a file (reading line by line is compulsory):

Alpha,Beta Charli,Delta,Delta Echo ,Frank George,Henry
1234-5,"Ida, John
 ", 25/11/1964, 15/12/1964,"40,000,000.00",0.0975,2,"King, Lincoln 
 ",Mary / New York,123456
12543-01,"Ocean, Peter

Output (I want to remove the " lines):

Alpha
Beta Charli
Delta
Delta Echo
Frank George
Henry
1234-5
Ida
John
"
25/11/1964
15/12/1964
40,000,000.00
0.0975
2
King
Lincoln
"
Mary / New York
123456
12543-01
Ocean
Peter

4 answers
成全新的幸福
#2 · 2019-01-27 08:52

See this related answer for a decent Java-compatible regex for parsing CSV.

It recognizes:

  • Newlines (after values or inside quoted values)
  • Quoted values containing escaped double-quotes like ""this""

In short, you will use this pattern: (?:,|\n|^)("(?:(?:"")*[^"]*)*"|[^",\n]*|(?:\n|$))

Then collect each Matcher group(1) in a find() loop.
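
A minimal Java sketch of that find() loop, assuming the whole input is already held in one String. The class and method names (CsvRegexSketch, parseFields) are made up for illustration. For quoted values group(1) still includes the surrounding quotes, so the sketch strips them and collapses doubled quotes; it does not track where one row ends and the next begins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CsvRegexSketch {
    // The pattern quoted above, escaped for a Java string literal.
    private static final Pattern FIELD = Pattern.compile(
            "(?:,|\\n|^)(\"(?:(?:\"\")*[^\"]*)*\"|[^\",\\n]*|(?:\\n|$))");

    // Hypothetical helper: returns every field in the input as one flat list.
    static List<String> parseFields(String input) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(input);
        while (m.find()) {
            String value = m.group(1);
            // Quoted values come back with their quotes; remove them
            // and turn each doubled quote ("") into a single quote.
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1).replace("\"\"", "\"");
            }
            fields.add(value);
        }
        return fields;
    }

    public static void main(String[] args) {
        String source = "Alpha,Beta,Gama,\"23-5-2013,TOM\",TOTO,\"Julie, KameL\nTiti\",God";
        for (String field : parseFields(source)) {
            // Show embedded line breaks as \n so each field prints on one line.
            System.out.println("[" + field.replace("\n", "\\n") + "]");
        }
    }
}
```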


Note: Although I have posted this answer here about a "decent" regex I discovered, just to save people searching for one, it is by no means robust. I still agree with this answer by user "fgv": a CSV parser is preferable.

虎瘦雄心在
#3 · 2019-01-27 08:58

Description

Consider the following PowerShell example of a universal regex, tested on a Java parser, which requires no extra processing to reassemble the data parts. The first capture group matches an opening quote, if present, and carries it to the end of the match, so you are assured of capturing the entire value between, but not including, the quotes. It also does not capture the commas unless they are embedded in a quote-delimited substring.

(?:^|,\s{0,})(["]?)\s{0,}((?:.|\n|\r)*?)\1(?=[,]\s{0,}|$)

Example

$Matches = @()
$String = 'Alpha,Beta,Gama,"23-5-2013,TOM",TOTO,"Julie, KameL\n
Titi",God,"timmy, \n
tomy,tony,tini"'
$Regex = '(?:^|,\s{0,})(["]?)\s{0,}((?:.|\n|\r)*?)\1(?=[,]\s{0,}|$)'

Write-Host start with 
write-host $String
Write-Host
Write-Host found
([regex]"(?i)(?m)$Regex").matches($String) | foreach {
    write-host "key at $($_.Groups[1].Index) = '$($_.Groups[1].Value)'`t= value at $($_.Groups[2].Index) = '$($_.Groups[2].Value)'"
    } # next match

Yields

start with
Alpha,Beta,Gama,"23-5-2013,TOM",TOTO,"Julie, KameL\n
Titi",God,"timmy, \n
tomy,tony,tini"

found
key at 0 = ''   = value at 0 = 'Alpha'
key at 6 = ''   = value at 6 = 'Beta'
key at 11 = ''  = value at 11 = 'Gama'
key at 16 = '"' = value at 17 = '23-5-2013,TOM'
key at 32 = ''  = value at 32 = 'TOTO'
key at 37 = '"' = value at 38 = 'Julie, KameL\n
Titi'
key at 60 = ''  = value at 60 = 'God'
key at 64 = '"' = value at 65 = 'timmy, \n
tomy,tony,tini'

Summary


  • (?: start non capture group
  • ^ require start of string
  • | or
  • ,\s{0,} a comma followed by any number of white space
  • ) close the non capture group
  • ( start capture group 1
  • ["]? consume a quote if it exists; I like doing it this way in case you want to include characters other than a quote
  • ) close capture group 1
  • \s{0,} consume any spaces if they exist, this means you don't need to trim the value later
  • ( start capture group 2
  • (?:.|\n|\r)*? capture all characters including a new line, non greedy
  • ) close capture group 2
  • \1 if there was a quote it would be stored in group 1, so if there was one then require it here
  • (?= start zero-width lookahead assertion
  • [,]\s{0,} must have a comma followed by optional whitespace
  • | or
  • $ end of the string
  • ) close the zero-width lookahead assertion
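
The same expression can also be driven from Java with java.util.regex. What follows is only a sketch under assumptions: the class and method names (QuoteBackrefSketch, values) are hypothetical, the inline (?i)(?m) flags used in the PowerShell call are left out, and the input is treated as a single record rather than several rows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteBackrefSketch {
    // The expression from the breakdown above, escaped for a Java string literal.
    private static final Pattern VALUE = Pattern.compile(
            "(?:^|,\\s{0,})([\"]?)\\s{0,}((?:.|\\n|\\r)*?)\\1(?=[,]\\s{0,}|$)");

    // Hypothetical helper: collects capture group 2 (the value without quotes).
    static List<String> values(String record) {
        List<String> out = new ArrayList<>();
        Matcher m = VALUE.matcher(record);
        while (m.find()) {
            // Group 1 holds the opening quote (or an empty string);
            // group 2 holds the value itself.
            out.add(m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String record = "Alpha,Beta,Gama,\"23-5-2013,TOM\",TOTO,"
                + "\"Julie, KameL\nTiti\",God,\"timmy, \ntomy,tony,tini\"";
        for (String value : values(record)) {
            // Show embedded line breaks as \n so each value prints on one line.
            System.out.println("'" + value.replace("\n", "\\n") + "'");
        }
    }
}
```

Because the lazy group is tried one character at a time, this approach is best kept to short inputs.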
贪生不怕死
#4 · 2019-01-27 08:59

Try this:

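import java.util.regex.Matcher;
import java.util.regex.Pattern;

String source = "Alpha,Beta,Gama,\"23-5-2013,TOM\",TOTO,\"Julie, KameL\n"
              + "Titi\",God,\" timmy, tomy,tony,\n"
              + "tini\".";

// Group 2: an unquoted value; group 3: the contents of a quoted value (quotes excluded).
Pattern p = Pattern.compile("(([^\"][^,]*)|\"([^\"]*)\"),?");
Matcher m = p.matcher(source);

while(m.find())
{
    // Exactly one of the two groups is non-null per match; embedded newlines are dropped.
    if(m.group(2) != null)
        System.out.println( m.group(2).replace("\n", "") );
    else if(m.group(3) != null)
        System.out.println( m.group(3).replace("\n", "") );
}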

If it matches a string without quotes, the result is returned in group 2. Strings with quotes are returned in group 3. Hence I needed the distinction in the while block. You might find a prettier way.

Output:
Alpha
Beta
Gama
23-5-2013,TOM
TOTO
Julie, KameLTiti
God
timmy, tomy,tony,tini
.

chillily
#5 · 2019-01-27 09:03

Parsing CSV is a whole lot harder than one would imagine at first sight, and that's why your best option is to use a well-designed and tested library to do that work for you. Two such libraries are opencsv and supercsv, and there are many others. Have a look at both and use the one that best fits your requirements and style.
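
To give a sense of the bookkeeping such a library hides, here is a rough hand-written sketch; it is not a replacement for one. It assumes double-quoted fields, "" as the escape for a quote inside a quoted field, and \n or \r\n as row separators. The class and method names (MiniCsvSketch, parse) are made up for illustration, and malformed input such as an unclosed quote is not reported.

```java
import java.util.ArrayList;
import java.util.List;

class MiniCsvSketch {
    // Hypothetical helper: splits CSV text into rows of fields.
    static List<List<String>> parse(String text) {
        List<List<String>> rows = new ArrayList<>();
        List<String> row = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                if (c != '"') {
                    field.append(c);          // commas and newlines are data here
                } else if (i + 1 < text.length() && text.charAt(i + 1) == '"') {
                    field.append('"');        // "" inside quotes is one literal quote
                    i++;
                } else {
                    inQuotes = false;         // closing quote
                }
            } else if (c == '"') {
                inQuotes = true;              // opening quote
            } else if (c == ',') {
                row.add(field.toString());    // end of field
                field.setLength(0);
            } else if (c == '\n') {
                row.add(field.toString());    // end of field and of row
                field.setLength(0);
                rows.add(row);
                row = new ArrayList<>();
            } else if (c != '\r') {
                field.append(c);              // ordinary character (\r is skipped)
            }
        }
        // Flush the last row when the text does not end with a newline.
        if (field.length() > 0 || !row.isEmpty()) {
            row.add(field.toString());
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        String text = "1234-5,\"Ida, John\n \",\"40,000,000.00\",0.0975\n";
        for (List<String> r : parse(text)) {
            System.out.println(r.size() + " fields: " + r);
        }
    }
}
```

Even this leaves out encodings, configurable separators and error reporting, which is part of why a tested library is the safer choice.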
