I have file1 as:
ABC CDEF HAGD CBDGCBAHS:ATSVHC
NBS JHA AUW MNDBE:BWJW
DKW QDW OIW KNDSK:WLKJW
BNSHW JBSS IJS BSHJA
ABC CDEF CBS 234:ATSVHC
DKW QDW FSD 634:WLKJW
and file2:
ABC CDEF HAGD CBDGCBAHS:ATSVHC:THE:123
NBS JHA AUW MNDBE:BWJW:THE:243
DKW QDW OIW KNDSK:WLKJW:THE:253
KAB GCBS YSTW SHSEB:AGTW:THE:193
I want to compare file1 and file2 on columns 1, 2, 3 and 4, except that column 4 in file2 carries an extra suffix (such as :THE:123) that column 4 in file1 does not have. I am currently using:
awk 'FNR==NR{seen[$1,$2,$3,$4];next} ($1,$2,$3,$4) in seen' file1 file2
What can I tweak so that the comparison works and the output is the matching lines from file2:
ABC CDEF HAGD CBDGCBAHS:ATSVHC:THE:123
NBS JHA AUW MNDBE:BWJW:THE:243
DKW QDW OIW KNDSK:WLKJW:THE:253
Just include : in the FS:
$ awk -F'[ :]' 'NR==FNR{a[$1,$2,$3,$4,$5];next} ($1,$2,$3,$4,$5) in a' file1 file2
ABC CDEF HAGD CBDGCBAHS:ATSVHC:THE:123
NBS JHA AUW MNDBE:BWJW:THE:243
DKW QDW OIW KNDSK:WLKJW:THE:253
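To see why this works, here is a quick check of how the bracket-expression FS splits one line from each file. It is only a sketch: the two input lines are copied from the question, and the variable names f1 and f2 are mine.

```shell
# With FS set to the regex [ :], both a space and a colon separate fields,
# so the part of column 4 before and after the first colon become $4 and $5.
f1=$(echo 'ABC CDEF HAGD CBDGCBAHS:ATSVHC' | awk -F'[ :]' '{print NF, $5}')
f2=$(echo 'ABC CDEF HAGD CBDGCBAHS:ATSVHC:THE:123' | awk -F'[ :]' '{print NF, $5}')

echo "$f1"   # 5 ATSVHC
echo "$f2"   # 7 ATSVHC
```

The extension in file2 simply lands in $6 and $7, which the comparison never looks at. One thing to keep in mind: because [ :] is a regex, every single space counts as a separator, so input with runs of spaces or leading spaces would produce empty fields.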
As I understand it, you want to print the lines from file2 whose fields 1, 2 and 3 match the corresponding fields in file1 and whose field 4, minus its trailing extension, matches field 4 in file1. In that case:
$ awk 'FNR==NR{seen[$1,$2,$3,$4];next} {a=$4; sub(/:[^:]*:[^:]*$/, "", a)} ($1,$2,$3,a) in seen' file1 file2
ABC CDEF HAGD CBDGCBAHS:ATSVHC:THE:123
NBS JHA AUW MNDBE:BWJW:THE:243
DKW QDW OIW KNDSK:WLKJW:THE:253
How it works
FNR==NR{seen[$1,$2,$3,$4];next}
While reading the first file, file1, we add to the associative array seen a key made of the first four fields. We then skip the rest of the commands and jump to the next line.
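If the FNR==NR idiom is unfamiliar, this small sketch may help. The file names first and second and their contents are made up for the demonstration.

```shell
# FNR restarts at 1 for each input file while NR keeps counting,
# so FNR==NR is true only while the first file is being read.
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/first"
printf 'c\n' > "$tmp/second"

res=$(awk 'FNR==NR{print "first:", $0; next} {print "second:", $0}' "$tmp/first" "$tmp/second")
echo "$res"
# first: a
# first: b
# second: c

rm -rf "$tmp"
```

Note that the idiom assumes the first file is not empty; if it is, FNR==NR also holds while the second file is read.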
a=$4; sub(/:[^:]*:[^:]*$/, "", a)
If we get here, we are working on file2. This assigns the value of field 4 to the variable a and then removes the last two colon-separated strings from a.
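The effect of the sub call can be checked on a single value. This is a sketch using one field 4 from file2; the shell variable trimmed is just a name chosen for the demo.

```shell
# sub() edits the copy in a, so the original field ($1 here) is left untouched.
# The regex is anchored with $, so it can only match the final two ":xxx" parts.
trimmed=$(echo 'CBDGCBAHS:ATSVHC:THE:123' |
  awk '{a=$1; sub(/:[^:]*:[^:]*$/, "", a); print a, $1}')

echo "$trimmed"   # CBDGCBAHS:ATSVHC CBDGCBAHS:ATSVHC:THE:123
```

Working on a copy matters here: modifying $4 directly would change the line that later gets printed.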
($1,$2,$3,a) in seen
This prints any line in file2 for which the first three fields and a form a key in the associative array seen.
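For anyone who wants to reproduce this end to end, the following sketch recreates both files from the question in a temporary directory and runs the command. The use of mktemp -d and the variable names tmp and out are my own choices, not part of the question.

```shell
# Recreate the sample data from the question in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
ABC CDEF HAGD CBDGCBAHS:ATSVHC
NBS JHA AUW MNDBE:BWJW
DKW QDW OIW KNDSK:WLKJW
BNSHW JBSS IJS BSHJA
ABC CDEF CBS 234:ATSVHC
DKW QDW FSD 634:WLKJW
EOF
cat > "$tmp/file2" <<'EOF'
ABC CDEF HAGD CBDGCBAHS:ATSVHC:THE:123
NBS JHA AUW MNDBE:BWJW:THE:243
DKW QDW OIW KNDSK:WLKJW:THE:253
KAB GCBS YSTW SHSEB:AGTW:THE:193
EOF

# Same command as in the answer: remember the keys of file1, then print the
# lines of file2 whose trimmed field 4 (plus fields 1-3) is a known key.
out=$(cd "$tmp" && awk 'FNR==NR{seen[$1,$2,$3,$4];next} {a=$4; sub(/:[^:]*:[^:]*$/, "", a)} ($1,$2,$3,a) in seen' file1 file2)
echo "$out"
# ABC CDEF HAGD CBDGCBAHS:ATSVHC:THE:123
# NBS JHA AUW MNDBE:BWJW:THE:243
# DKW QDW OIW KNDSK:WLKJW:THE:253

rm -rf "$tmp"
```

The KAB line from file2 is dropped because its trimmed field 4, SHSEB:AGTW, never appears in file1.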