I have several large text files produced by different people. Each file lists a single title per line. The wording of every line differs, but supposedly they all refer to the same (unknown) set of items.
Given that formats and wording differ, I tried to generate a shorter file of likely matches for manual inspection. I am new to Bash and have tried several commands to compare each line against titles having two or more key words in common. Matching should be case-insensitive, and key words should be longer than 4 characters to exclude articles and the like.
Example:
Input Text File #1
Investigating Amusing King : Expl and/in the Proletariat
Managing Self-Confident Legacy: The Harlem Renaissance and/in the Abject
Inventing Sarcastic Silence: The Harlem Renaissance and/in the Invader
Inventing Random Ethos: The Harlem Renaissance and/in the Marginalized
Loss: Supplementing Transgressive Production and Assimilation
Input Text File #2
Loss: Judging Foolhardy Historicism and Homosexuality
Loss: Developping Homophobic Textuality and Outrage
Loss: Supplement of transgressive production
Loss: Questioning Diligent Verbiage and Mythos
Me Against You: Transgressing Easygoing Materialism and Dialectic
Output Text File
File #1-->Loss: Supplementing Transgressive Production and Assimilation
File #2-->Loss: Supplement of transgressive production
So far I have been able to weed out a few duplicates with the exact same entries...
cat FILE_num*.txt | sort | uniq -d > verbatim_duplicates.txt
...and a few others which had identical annotations between braces
cat FILE_num*.txt | cut -d "{" -f2 | cut -d "}" -f1 | sort | uniq -d > same_annotations.txt
A command that looks very promising is find with regex, but I have not managed to make it work.
Thanks in advance.
In Python 3:
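Here is a minimal sketch of such a script (the helper names keywords and non_blank_lines are just illustrative). It assumes the two file names are passed as command-line arguments and applies the rules from the question: key words are compared case-insensitively, only words longer than 4 characters count, and two or more in common make a likely match.

#!/usr/bin/env python3
import re
import sys

def keywords(line):
    # lowercased words longer than 4 characters, to skip articles and the like
    return {w for w in re.findall(r"[a-z]+", line.lower()) if len(w) > 4}

def non_blank_lines(name):
    with open(name) as f:
        return [line.rstrip("\n") for line in f if line.strip()]

def main():
    lines1 = non_blank_lines(sys.argv[1])
    lines2 = non_blank_lines(sys.argv[2])
    for line1 in lines1:
        k1 = keywords(line1)
        for line2 in lines2:
            # two or more key words in common counts as a likely match
            if len(k1 & keywords(line2)) >= 2:
                print("File #1-->" + line1)
                print("File #2-->" + line2)

if __name__ == "__main__":
    main()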
This gives the expected output: it displays the matching pairs of lines from the two files.
Save this script in a file; I'll refer to it as script.py, but you can name it whatever you like. You can launch it with:
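python3 script.py FILE_num1.txt FILE_num2.txt

(the file names above are only examples, following the FILE_num*.txt pattern from the question)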
You can even use an alias:
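alias titlematch='python3 /path/to/script.py'

(titlematch is an arbitrary alias name; replace /path/to/script.py with wherever you saved the script)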
and then
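titlematch FILE_num1.txt FILE_num2.txt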
I included the features from the discussion below.