I have several large text files produced by different people. Each file lists a single title per line. The wording of every line differs, but supposedly they all refer to the same (unknown) set of items.
Given that formats and wording differ, I tried to generate a shorter file of likely matches for manual inspection. I am new to Bash and have tried several commands to compare each line against titles having two or more key words in common. Matching should be case-insensitive, and key words should be longer than 4 characters to exclude articles and the like.
Example:
Input Text File #1
Investigating Amusing King : Expl and/in the Proletariat
Managing Self-Confident Legacy: The Harlem Renaissance and/in the Abject
Inventing Sarcastic Silence: The Harlem Renaissance and/in the Invader
Inventing Random Ethos: The Harlem Renaissance and/in the Marginalized
Loss: Supplementing Transgressive Production and Assimilation
Input Text File #2
Loss: Judging Foolhardy Historicism and Homosexuality
Loss: Developping Homophobic Textuality and Outrage
Loss: Supplement of transgressive production
Loss: Questioning Diligent Verbiage and Mythos
Me Against You: Transgressing Easygoing Materialism and Dialectic
Output Text File
File #1-->Loss: Supplementing Transgressive Production and Assimilation
File #2-->Loss: Supplement of transgressive production
So far I have been able to weed out a few duplicates with the exact same entries...
cat FILE_num*.txt | sort | uniq -d > verbatim_duplicates.txt
...and a few others which had identical annotations between braces
cat FILE_num*.txt | cut -d "{" -f2 | cut -d "}" -f1 | sort | uniq -d > same_annotations.txt
A command that looks very promising is find with regex, but I have not managed to make it work.
Thanks in advance.
In Python 3:
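Here is a minimal sketch of such a script (the helper names keywords and non_blank_lines are just illustrative). It assumes the two file names are passed as command-line arguments and applies the rules from the question: key words are compared case-insensitively, only words longer than 4 characters count, and two or more in common make a likely match.

#!/usr/bin/env python3
import re
import sys

def keywords(line):
    # lowercased words longer than 4 characters, to skip articles and the like
    return {w for w in re.findall(r"[a-z]+", line.lower()) if len(w) > 4}

def non_blank_lines(name):
    with open(name) as f:
        return [line.rstrip("\n") for line in f if line.strip()]

def main():
    lines1 = non_blank_lines(sys.argv[1])
    lines2 = non_blank_lines(sys.argv[2])
    for line1 in lines1:
        k1 = keywords(line1)
        for line2 in lines2:
            # two or more key words in common counts as a likely match
            if len(k1 & keywords(line2)) >= 2:
                print("File #1-->" + line1)
                print("File #2-->" + line2)

if __name__ == "__main__":
    main()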
This gives the expected output: it displays the matching pairs of lines from the two files.
Save this script in a file; I'll refer to it as script.py, but you can name it whatever you like. You can launch it with:
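python3 script.py FILE_num1.txt FILE_num2.txt

(the file names above are only examples, following the FILE_num*.txt pattern from the question)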
You can even use an alias:
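alias titlematch='python3 /path/to/script.py'

(titlematch is an arbitrary alias name; replace /path/to/script.py with wherever you saved the script)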
and then
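titlematch FILE_num1.txt FILE_num2.txt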
I included the features from the discussion below.