Fast way of finding lines in one file that are not in another

Posted 2019-01-01 09:46

I have two large files (sets of filenames), roughly 30,000 lines each. I am trying to find a fast way of finding lines in file1 that are not present in file2.

For example, if this is file1:

line1
line2
line3

And this is file2:

line1
line4
line5

Then my result/output should be:

line2
line3

This works:

grep -v -f file2 file1

But it is very, very slow when used on my large files.
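
One likely reason for the slowness is that grep -f treats every line of file2 as a regular expression and searches for it anywhere inside each line of file1. A minimal sketch, assuming the lines should be compared literally and in full: -F and -x are standard grep options, while the scratch directory and sample data below are made up for illustration.

```shell
# Hypothetical sample data in a scratch directory
cd "$(mktemp -d)"
printf '%s\n' line1 line2 line3 > file1
printf '%s\n' line1 line4 line5 > file2

# -F: treat patterns as fixed strings, -x: match whole lines only,
# -v: invert the match, -f: read the patterns from file2
result=$(grep -F -x -v -f file2 file1)
printf '%s\n' "$result"
```

Besides being faster in many cases, -F also avoids surprises when filenames contain regex metacharacters such as dots or brackets.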

I suspect there is a good way to do this using diff, but the output should be just the lines, nothing else, and I cannot seem to find a switch for that.

Can anyone help me find a fast way of doing this, using bash and basic linux binaries?

EDIT: To follow up on my own question, this is the best way I have found so far using diff:

diff file2 file1 | grep '^>' | sed 's/^>\ //'

Surely, there must be a better way?

10 answers
心情的温度
#2 · 2019-01-01 10:19

The way I usually do this is with the --suppress-common-lines flag, though note that this only works if you do it in side-by-side format (-y).

diff -y --suppress-common-lines file1.txt file2.txt

无与为乐者.
#3 · 2019-01-01 10:20

The comm command (short for "common") may be useful. From its man page: "comm - compare two sorted files line by line".

# find lines only in file1
comm -23 file1 file2

# find lines only in file2
comm -13 file1 file2

# find lines common to both files
comm -12 file1 file2

The man page is actually quite readable for this.
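
Because comm expects both inputs to be sorted, unsorted files need a sort step first. A minimal sketch of that, assuming the output order does not matter; the .sorted file names and the sample data are hypothetical.

```shell
# Hypothetical, deliberately unsorted sample data in a scratch directory
cd "$(mktemp -d)"
printf '%s\n' line3 line1 line2 > file1
printf '%s\n' line5 line1 line4 > file2

# comm needs sorted input, so sort each file into a temporary copy
sort file1 > file1.sorted
sort file2 > file2.sorted

# -2 hides lines only in file2, -3 hides lines common to both,
# leaving only the lines unique to file1
only_in_file1=$(comm -23 file1.sorted file2.sorted)
printf '%s\n' "$only_in_file1"
```

The result comes out in sorted order rather than in the original order of file1.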

大哥的爱人
#4 · 2019-01-01 10:20

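For me, a plain loop with an if statement worked perfectly. Note that it checks each line of file2 against file1 and runs grep once per line, so it will be slow on large files.

# Read file2 line by line without word splitting or backslash mangling
while IFS= read -r i; do
    # -q: quiet, -i: ignore case, -x: whole line, -F: fixed string
    if grep -qixF -- "$i" file1; then
        echo "$i found" >> Matching_lines.txt
    else
        echo "$i missing" >> missing_lines.txt
    fi
done < file2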
有味是清欢
#5 · 2019-01-01 10:26

You can use Python:

python -c '
# Load every line of file2 into a set for fast membership tests
with open("file2") as f:
    lines_to_remove = {line.strip() for line in f}

# Print the lines of file1 that do not occur in file2
with open("file1") as f:
    for line in f:
        if line.strip() not in lines_to_remove:
            print(line.strip())
'
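
The same hash-set idea can be sketched with awk, for readers who prefer to stay with basic shell tools. This is an illustration rather than part of the answer above: the array name seen and the sample data are made up, and unlike the Python version it compares lines without stripping surrounding whitespace.

```shell
# Hypothetical sample data in a scratch directory
cd "$(mktemp -d)"
printf '%s\n' line1 line2 line3 > file1
printf '%s\n' line1 line4 line5 > file2

# First file (NR == FNR is true only while reading file2): remember each line.
# Second file (file1): print the lines that were not remembered.
result=$(awk 'NR == FNR { seen[$0]; next } !($0 in seen)' file2 file1)
printf '%s\n' "$result"
```

One caveat: if file2 is empty, NR == FNR also holds while reading file1, so nothing is printed; an empty exclude list may need separate handling.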
琉璃瓶的回忆
#6 · 2019-01-01 10:30
$ join -v 1 -t '' file1 file2
line2
line3

The empty -t '' makes sure that it compares the whole line, which matters if some of the lines contain a space. Like comm, join expects both files to be sorted.

琉璃瓶的回忆
#7 · 2019-01-01 10:30

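If you're short of "fancy tools", e.g. in some minimal Linux distribution, there is a solution with just cat, sort and uniq. Listing excludes.txt twice guarantees that each of its lines occurs at least twice, so uniq --unique drops them: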

cat includes.txt excludes.txt excludes.txt | sort | uniq --unique

Test:

seq 1 1 7 | sort --random-sort > includes.txt
seq 3 1 9 | sort --random-sort > excludes.txt
cat includes.txt excludes.txt excludes.txt | sort | uniq --unique

# Output:
1
2

This is also relatively fast, compared to grep.
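
One caveat with this pipeline: a line that occurs more than once in includes.txt is dropped too, even if it is absent from excludes.txt. A minimal sketch of a workaround, assuming duplicates in the result are not wanted anyway; the sample data and the variable names naive and deduped are made up for illustration.

```shell
# Hypothetical sample data in a scratch directory
cd "$(mktemp -d)"
printf '%s\n' a b b c > includes.txt   # "b" is listed twice
printf '%s\n' c d > excludes.txt

# Original pipeline: "b" occurs twice, so uniq -u discards it as well
naive=$(cat includes.txt excludes.txt excludes.txt | sort | uniq -u)

# De-duplicate includes.txt first so each of its lines counts only once
deduped=$(sort -u includes.txt | cat - excludes.txt excludes.txt | sort | uniq -u)

printf 'naive:\n%s\n' "$naive"
printf 'deduped:\n%s\n' "$deduped"
```

Here -u is the short form of --unique, which may be the only spelling available in minimal environments.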
