Source file delimiter issue

2019-07-24 00:20发布

问题:

I am facing one issue with my source file. Consider I have following data in file-

"dfjsdlfkj,fsdkfj,werkj",234234,234234,,"dfsd,etwetr"

here, the delimiter is comma, but some fields have comma as a part of data. Such fields are enclosed in double quotes. I want to extract few columns from the file.

If I use cut -d "," -f 1,3 then I am getting output like-

"dfjsdlfkj,werkj"

回答1:

I suggest you to use a csv parser. For example, python has one as a built-in module, so you only have to import it:

import sys 
import csv 

with open(sys.argv[1], newline='') as csvfile:
    csvreader = csv.reader(csvfile)
    csvwriter = csv.writer(sys.stdout)
    for row in csvreader:
        csvwriter.writerow([row[e] for e in (0,2)])

Assuming your example line is in an input file named infile, run the script as:

python3 script.py infile

That yields:

"dfjsdlfkj,fsdkfj,werkj",234234


回答2:

You could try the following:

awk -f getFields.awk input.txt

where input.txt is your input file, and getFields.awk is:

{
    split("",a)
    splitLine()
    print a[1],a[3]
}

function splitLine(s,indq,t,r,len) {
# Assumptions: 
#  * spaces before or after commas are ignored
#  * spaces at beginning or end of line is ignored

# definition of a quoted parameter:
# - starts with: (^ and $ are regexp characters)
#   a) ^"
#   b) ,"
# - ends with:
#   a) "$
#   b) ",

    s=$0; k=1
    s=removeBlanks(s)
    while (s) {
        if (substr(s,1,1)=="\"")
            indq=2
        else {
            sub(/[[:blank:]]*,[[:blank:]]*"/,",\"",s)
            indq=index(s,",\"")
            if (indq) {
                t=substr(s,1,indq-1)
                splitCommaString(t)
                indq=indq+2
            }
        }
        if (indq) {
            s=substr(s,indq)
            sub(/"[[:blank:]]*,/,"\",",s)
            len=index(s,"\",")  #find closing quote
            if (!len) {
                if (match(s,/"$/)) {
                    len=RSTART-1
                }
                else 
                    len=length(s)
                r=substr(s,1,len)
                s=""
            } else {
                r=substr(s,1,len-1)
                s=substr(s,len+2)
            }
            a[k++]=r
        } else  {
            splitCommaString(s)
            s=""
        }
    }
    k=k-1
}

function splitCommaString(t,b,i) {
    n=split(t,b,",")
    for (i=1; i<=n; i++)
        a[k++]=removeBlanks(b[i])
}       

function removeBlanks(r) {
    sub(/^[[:blank:]]*/,"",r)
    sub(/[[:blank:]]*$/,"",r)
    return r
}