I need to merge two files and create a new files

2019-06-14 22:11发布

问题:

input 1

1 10611 2 122 C:0.983607 G:0.0163934

input 2

1 10611 rs146752890 C G 100 PASS AC=184;RSQ=0.8228;AVGPOST=0.9640;AN=2184;ERATE=0.0031;VT=SNP;AA=.;THETA=0.0127;LDAF=0.0902;SNPSOURCE=LOWCOV;AF=0.08;ASN_AF=0.08;AMR_AF=0.14;AFR_AF=0.08;EUR_AF=0.07

here 1st and 2nd column are matching and values before ':' of 5th column of first file and 4th column of 2nd files are equel and 6th column(values before ':') of first and 5th column of second files are equel and output is creating based on this match.Will get the clear idea from input and output line and both files are .gz files

output

1 10611 rs146752890 C G 100 PASS AC=184;RSQ=0.8228;AVGPOST=0.9640;AN=2184;ERATE=0.0031;VT=SNP;AA=.;THETA=0.0127;LDAF=0.0902;SNPSOURCE=LOWCOV;AF=0.08;ASN_AF=0.08;AMR_AF=0.14;AFR_AF=0.08;EUR_AF=0.07;REF=0.983607;ALT=0.0163934;

回答1:

Here's one way using awk:

awk 'FNR==NR { split($5,a,":"); split($6,b,":"); c[$1,$2,a[1],b[1]]="REF=" a[2] ";ALT=" b[2] ";"; next } ($1,$2,$4,$5) in c { print $0 ";" c[$1,$2,$4,$5] }' input1 input2

Result:

1 10611 rs146752890 C G 100 PASS AC=184;RSQ=0.8228;AVGPOST=0.9640;AN=2184;ERATE=0.0031;VT=SNP;AA=.;THETA=0.0127;LDAF=0.0902;SNPSOURCE=LOWCOV;AF=0.08;ASN_AF=0.08;AMR_AF=0.14;AFR_AF=0.08;EUR_AF=0.07;REF=0.983607;ALT=0.0163934;

So for compressed files, try:

awk 'FNR==NR { split($5,a,":"); split($6,b,":"); c[$1,$2,a[1],b[1]]="REF=" a[2] ";ALT=" b[2] ";"; next } ($1,$2,$4,$5) in c { print $0 ";" c[$1,$2,$4,$5] }' <(gzip -dc input1.gz) <(gzip -dc input2.gz) | gzip > output.gz

EDIT:

From the comments below, try this:

awk 'FNR==NR { split($5,a,":"); split($6,b,":"); c[$1,$2,a[1],b[1]]="REF=" a[2] ";ALT=" b[2] ";"; next } ($1,$2,$4,$5) in c { print $1, $2, $3, $4, $5, $6, $7, c[$1,$2,$4,$5] $8 ";" }' file1 file2

Result:

1 10611 rs146752890 C G 100 PASS REF=0.983607;ALT=0.0163934;AC=184;RSQ=0.8228;AVGPOST=0.9640;AN=2184;ERATE=0.0031;VT=SNP;AA=.;THETA=0.0127;LDAF=0.0902;SNPSOURCE=LOWCOV;AF=0.08;ASN_AF=0.08;AMR_AF=0.14;AFR_AF=0.08;EUR_AF=0.07;


回答2:

This should work (assuming you have enough disk space to store the expanded .gz files):

zcat 1 | awk '{print $1$2,$0}' | sort > new1
zcat 2 | awk '{print $1$2,$0}' | sort > new2
join new1 new2 -11 -21 -o "2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 1.6 1.7"|sed 's/ C:/;REF=/'|sed 's/ G:/;ALT=/' > output


标签: shell awk