I am facing one issue with my source file. Consider I have following data in file-
"dfjsdlfkj,fsdkfj,werkj",234234,234234,,"dfsd,etwetr"
here, the delimiter is comma, but some fields have comma as a part of data. Such fields are enclosed in double quotes. I want to extract few columns from the file.
If I use cut -d "," -f 1,3
then I am getting output like-
"dfjsdlfkj,werkj"
I suggest you to use a csv
parser. For example, python has one as a built-in module, so you only have to import it:
import sys
import csv
with open(sys.argv[1], newline='') as csvfile:
csvreader = csv.reader(csvfile)
csvwriter = csv.writer(sys.stdout)
for row in csvreader:
csvwriter.writerow([row[e] for e in (0,2)])
Assuming your example line is in an input file named infile
, run the script as:
python3 script.py infile
That yields:
"dfjsdlfkj,fsdkfj,werkj",234234
You could try the following:
awk -f getFields.awk input.txt
where input.txt
is your input file, and getFields.awk
is:
{
split("",a)
splitLine()
print a[1],a[3]
}
function splitLine(s,indq,t,r,len) {
# Assumptions:
# * spaces before or after commas are ignored
# * spaces at beginning or end of line is ignored
# definition of a quoted parameter:
# - starts with: (^ and $ are regexp characters)
# a) ^"
# b) ,"
# - ends with:
# a) "$
# b) ",
s=$0; k=1
s=removeBlanks(s)
while (s) {
if (substr(s,1,1)=="\"")
indq=2
else {
sub(/[[:blank:]]*,[[:blank:]]*"/,",\"",s)
indq=index(s,",\"")
if (indq) {
t=substr(s,1,indq-1)
splitCommaString(t)
indq=indq+2
}
}
if (indq) {
s=substr(s,indq)
sub(/"[[:blank:]]*,/,"\",",s)
len=index(s,"\",") #find closing quote
if (!len) {
if (match(s,/"$/)) {
len=RSTART-1
}
else
len=length(s)
r=substr(s,1,len)
s=""
} else {
r=substr(s,1,len-1)
s=substr(s,len+2)
}
a[k++]=r
} else {
splitCommaString(s)
s=""
}
}
k=k-1
}
function splitCommaString(t,b,i) {
n=split(t,b,",")
for (i=1; i<=n; i++)
a[k++]=removeBlanks(b[i])
}
function removeBlanks(r) {
sub(/^[[:blank:]]*/,"",r)
sub(/[[:blank:]]*$/,"",r)
return r
}