Finding rows from file2 in file1 when one field has an extension

Posted 2019-09-13 00:43

Question:

I have file1 as:

ABC CDEF HAGD CBDGCBAHS:ATSVHC
NBS JHA AUW MNDBE:BWJW
DKW QDW OIW KNDSK:WLKJW
BNSHW JBSS IJS BSHJA
ABC CDEF CBS 234:ATSVHC
DKW QDW FSD 634:WLKJW

and file2:

ABC CDEF HAGD CBDGCBAHS:ATSVHC:THE:123
NBS JHA AUW MNDBE:BWJW:THE:243
DKW QDW OIW KNDSK:WLKJW:THE:253
KAB GCBS YSTW SHSEB:AGTW:THE:193

I want to compare file1 and file2 on columns 1, 2, 3 and 4, except that column 4 in file2 carries an extra suffix that file1 does not have. I tried:

awk 'FNR==NR{seen[$1,$2,$3,$4];next} ($1,$2,$3,$4) in seen' file1 file2

What can I tweak so that the comparison works and the output is the matching lines of file2, namely:

ABC CDEF HAGD CBDGCBAHS:ATSVHC:THE:123
NBS JHA AUW MNDBE:BWJW:THE:243
DKW QDW OIW KNDSK:WLKJW:THE:253

Answer 1:

Just include : in the field separator (FS), so that the colon-separated parts become fields of their own:

$ awk -F'[ :]' 'NR==FNR{a[$1,$2,$3,$4,$5];next} ($1,$2,$3,$4,$5) in a' file1 file2
ABC CDEF HAGD CBDGCBAHS:ATSVHC:THE:123
NBS JHA AUW MNDBE:BWJW:THE:243
DKW QDW OIW KNDSK:WLKJW:THE:253
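
To see why this works, it can help to print how one file2 line is split when both the space and the colon act as separators. The following is only an illustrative sketch and not part of the original answer; the variable names line and fields are made up for the demonstration.

```shell
# Split one sample line from file2 on either a space or a colon
# and print every field with its index.
line='ABC CDEF HAGD CBDGCBAHS:ATSVHC:THE:123'
fields=$(printf '%s\n' "$line" |
  awk -F'[ :]' '{ for (i = 1; i <= NF; i++) printf "%d=%s\n", i, $i }')
printf '%s\n' "$fields"
```

Fields 4 and 5 hold the two parts that file1 also contains, while the extension lands in fields 6 and 7 and is ignored by the comparison. Note that this approach compares only the first two colon-separated parts; for a file1 line without a colon, such as BNSHW JBSS IJS BSHJA, field 5 is simply empty.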


Answer 2:

As I understand it, you want to print the lines from file2 whose fields 1, 2 and 3 match the corresponding fields in file1 and whose field 4 begins with field 4 from file1. In that case:

$ awk 'FNR==NR{seen[$1,$2,$3,$4];next} {a=$4; sub(/:[^:]*:[^:]*$/, "", a)} ($1,$2,$3,a) in seen' file1 file2
ABC CDEF HAGD CBDGCBAHS:ATSVHC:THE:123
NBS JHA AUW MNDBE:BWJW:THE:243
DKW QDW OIW KNDSK:WLKJW:THE:253

How it works

  • FNR==NR{seen[$1,$2,$3,$4];next}

    While reading the first file, file1, we add to the associative array seen a key which is equal to the first four fields. We then skip the rest of the commands and jump to the next line.

  • a=$4; sub(/:[^:]*:[^:]*$/, "", a)

    If we get to here, that means we are working on file2.

    This assigns the value of field 4 to variable a and then removes the last two colon-separated strings from a.

  • ($1,$2,$3,a) in seen

    This prints any line in file2 for which the first three fields and a are a key in associative array seen.
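
The steps above can be tried end to end with the sample data from the question. This is a sketch under a few assumptions: the scratch directory from mktemp and the shell variables dir and matched are illustrative additions, and the awk program is the one from this answer, only spread over several lines with comments.

```shell
# Recreate the sample data from the question in a scratch directory.
dir=$(mktemp -d)
cat > "$dir/file1" <<'EOF'
ABC CDEF HAGD CBDGCBAHS:ATSVHC
NBS JHA AUW MNDBE:BWJW
DKW QDW OIW KNDSK:WLKJW
BNSHW JBSS IJS BSHJA
ABC CDEF CBS 234:ATSVHC
DKW QDW FSD 634:WLKJW
EOF
cat > "$dir/file2" <<'EOF'
ABC CDEF HAGD CBDGCBAHS:ATSVHC:THE:123
NBS JHA AUW MNDBE:BWJW:THE:243
DKW QDW OIW KNDSK:WLKJW:THE:253
KAB GCBS YSTW SHSEB:AGTW:THE:193
EOF

matched=$(awk '
  # First file: remember each four-field key, then skip to the next line.
  FNR == NR { seen[$1, $2, $3, $4]; next }
  # Second file: copy field 4 and drop its last two colon-separated parts.
  { a = $4; sub(/:[^:]*:[^:]*$/, "", a) }
  # Print the line when the trimmed key was stored while reading file1.
  ($1, $2, $3, a) in seen
' "$dir/file1" "$dir/file2")
printf '%s\n' "$matched"
rm -rf "$dir"
```

The KAB line of file2 is not printed because no line in file1 has the key KAB GCBS YSTW SHSEB:AGTW. The sub() pattern assumes the extension always consists of exactly two colon-separated parts, as in the sample data.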