Remove duplicates from 2 filtered files with awk

Posted 2019-09-05 05:24

I have 2 source files (an English file and an Italian file) with the same number of lines, and I run an awk command to remove all lines from the IT.txt file that have more than 2 words.

EN.txt
Santa Claus
Pigs don't fly
The son of the father
Elf
Santa Claus
Elf
Sabatons
Shoes

IT.txt
Babbo Natale
I maiali non volano
Il figlio del padre
Elfo
Babbo Natale
Elfo
Scarpe
Scarpe

So at this point I have this kind of output:

EN.txt
Santa Claus
Pigs don't fly
The son of the father
Elf
Santa Claus
Elf
Sabatons
Shoes

IT.txt
Babbo Natale
Elfo
Babbo Natale
Elfo
Scarpe
Scarpe

But at the same time, I'd like to remove the corresponding lines from the EN.txt file. I can't simply run another awk command that removes the lines with more than 2 words from the EN file in the same way, because a translation can differ in length from its source string (for example, it may have more words). I thought I could work on the line numbers (for a moment, then I found a better solution). Either way, the filtering must be decided on the IT file alone, and the EN file must simply follow the effect of the command I ran. Therefore, my filtered output must look like this:

EN.txt
Santa Claus
Elf
Santa Claus
Elf
Sabatons
Shoes

IT.txt
Babbo Natale
Elfo
Babbo Natale
Elfo
Scarpe
Scarpe

This is the command I tried (suggested in an answer to a previous question), and it works perfectly:

awk 'NR==FNR { if (NF>2) { a[NR] } else { a[NR]=1; print > "filtered_it.txt" } }
     NR!=FNR && a[FNR] { print > "filtered_en.txt" }' IT.txt EN.txt
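
For readers unfamiliar with the two-file idiom, the line-number approach can be spelled out as a commented sketch. The variable name `max` and the array name `keep` are illustrative and not part of the command above. The sample files are recreated first so that the snippet runs on its own.

```shell
# Recreate the sample inputs shown earlier.
printf '%s\n' "Santa Claus" "Pigs don't fly" "The son of the father" "Elf" \
              "Santa Claus" "Elf" "Sabatons" "Shoes" > EN.txt
printf '%s\n' "Babbo Natale" "I maiali non volano" "Il figlio del padre" "Elfo" \
              "Babbo Natale" "Elfo" "Scarpe" "Scarpe" > IT.txt

awk -v max=2 '
    # First file (IT.txt): NR and FNR are equal only while reading it.
    NR == FNR {
        if (NF <= max) {              # short enough: remember the line number
            keep[FNR] = 1
            print > "filtered_it.txt"
        }
        next
    }
    # Second file (EN.txt): print only the line numbers remembered above.
    FNR in keep { print > "filtered_en.txt" }
' IT.txt EN.txt
```

The word count is checked only while IT.txt is being read. EN.txt is never inspected for length; its lines are kept or dropped purely by position.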

Now I'd like to extend this command to also remove duplicates, while being careful with lines that have the same Italian translation but different source strings (like Sabatons and Shoes, both translated as Scarpe). In other words, a line should be removed only when the English/Italian pair as a whole is a duplicate, so the duplicates must be removed from both files together and not by running a separate command on each file. The output should look like this:

EN.txt
Santa Claus
Elf
Sabatons
Shoes

IT.txt
Babbo Natale
Elfo
Scarpe
Scarpe

Tags: linux bash awk
1 Answer
爱情/是我丢掉的垃圾
#2 · 2019-09-05 06:18

Your spec is very confusing, but I think this is what you wanted. Also, if the two files are supposed to be matched line by line, it is easier to join them first (here with paste) than to operate on two separate files.

$ paste EN.txt IT.txt |
    awk -F'\t' '{ n = split($1, _, " ")     # number of words in the English field
                  m = split($2, _, " ") }   # number of words in the Italian field
         n<3 && m<3 && !a[$0]++ { print $1 > "f_EN.txt"
                                  print $2 > "f_IT.txt" }'

$ cat f_EN.txt 
Santa Claus
Elf
Sabatons
Shoes

$ cat f_IT.txt   
Babbo Natale
Elfo
Scarpe
Scarpe
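
If the word limit should apply to the Italian side only, as the question asks (a translation may be longer or shorter than its source), the same paste-based idea can be adapted. The following is a minimal sketch under that assumption, not a drop-in replacement. The output names f_EN.txt and f_IT.txt follow the command above, while the array names `words` and `seen` are illustrative choices. The sample files are recreated first so that the snippet runs on its own.

```shell
# Recreate the sample inputs from the question.
printf '%s\n' "Santa Claus" "Pigs don't fly" "The son of the father" "Elf" \
              "Santa Claus" "Elf" "Sabatons" "Shoes" > EN.txt
printf '%s\n' "Babbo Natale" "I maiali non volano" "Il figlio del padre" "Elfo" \
              "Babbo Natale" "Elfo" "Scarpe" "Scarpe" > IT.txt

# Pair the files line by line (tab-separated), then filter and deduplicate.
paste EN.txt IT.txt |
awk -F'\t' '
    { m = split($2, words, " ") }     # count words in the Italian field only
    m <= 2 && !seen[$0]++ {           # keep short lines; skip repeated EN/IT pairs
        print $1 > "f_EN.txt"
        print $2 > "f_IT.txt"
    }'
```

Because the deduplication key is the whole pasted line, Sabatons/Scarpe and Shoes/Scarpe count as different pairs and both survive, while the repeated Santa Claus/Babbo Natale pair is printed only once.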

P.S. Either you believe time travel is possible, or you are using "tomorrow" instead of "yesterday" :)
