Fast way of finding lines in one file that are not in another

I have two large files (sets of filenames), roughly 30,000 lines each. I am trying to find a fast way of finding lines in file1 that are not present in file2.

For example, if this is file1:

line1
line2
line3

And this is file2:

line1
line4
line5

Then my result/output should be:

line2
line3

This works:

grep -v -f file2 file1

But it is very, very slow when used on my large files.

I suspect there is a good way to do this using diff, but the output should be just the lines, nothing else, and I cannot seem to find a switch for that.

Can anyone help me find a fast way of doing this, using bash and basic linux binaries?

EDIT: To follow up on my own question, this is the best way I have found so far using diff:

diff file2 file1 | grep '^>' | sed 's/^>\ //'

Surely, there must be a better way?
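
One candidate that stays with grep: by default grep treats every line of file2 as a regular expression and matches it anywhere in a line, which is both slow and subtly wrong (the pattern line1 would also remove line10). Adding -F (fixed strings) and -x (whole-line match) addresses both points and should be considerably faster, though that has not been measured on the files in question. A minimal sketch, with made-up sample files in a temporary directory:

```shell
# Scratch directory with illustrative sample data (not the real files).
tmp=$(mktemp -d)
printf '%s\n' line1 line2 line3 line10 > "$tmp/file1"
printf '%s\n' line1 line4 line5 > "$tmp/file2"

# -F: patterns are fixed strings, not regular expressions
# -x: a pattern must match the whole line
# -v: print lines that do NOT match
# -f: read the patterns from file2
result=$(grep -F -x -v -f "$tmp/file2" "$tmp/file1")
printf '%s\n' "$result"

rm -rf "$tmp"
```

Without -x, line10 would be dropped from the output because line1 matches it as a substring.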

Tags: bash grep find diff

10 Answers

心情的温度

#2 · 2019-01-01 10:19

The way I usually do this is with the --suppress-common-lines flag, though note that it only works if you use the side-by-side format (-y).

diff -y --suppress-common-lines file1.txt file2.txt
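
The side-by-side output still carries column markers and lines from both files. If only the lines unique to the first file are wanted, GNU diff's line-format options are one possibility. A minimal sketch, assuming GNU diffutils and inputs that are sorted first (diff compares line sequences, so ordering matters); the sample files and temporary directory are made up for illustration:

```shell
# Scratch directory with illustrative sample data.
tmp=$(mktemp -d)
printf '%s\n' line3 line1 line2 > "$tmp/file1.txt"
printf '%s\n' line5 line1 line4 > "$tmp/file2.txt"

# Sort both inputs so that identical lines line up.
sort "$tmp/file1.txt" > "$tmp/file1.sorted"
sort "$tmp/file2.txt" > "$tmp/file2.sorted"

# Print lines found only in the first file verbatim (%L) and print nothing
# for lines that are new in the second file or common to both.
# diff exits with status 1 when the files differ, hence the "|| true".
result=$(diff --old-line-format='%L' \
              --new-line-format='' \
              --unchanged-line-format='' \
              "$tmp/file1.sorted" "$tmp/file2.sorted") || true
printf '%s\n' "$result"

rm -rf "$tmp"
```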


无与为乐者.

#3 · 2019-01-01 10:20

The comm command (short for "common") may be useful. From its man page: "comm - compare two sorted files line by line".

#find lines only in file1
comm -23 file1 file2 

#find lines only in file2
comm -13 file1 file2 

#find lines common to both files
comm -12 file1 file2

The man page is actually quite readable for this.
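
Because comm expects sorted input, unsorted files need a sort step first. A minimal sketch, using made-up sample files in a temporary directory and sorting into intermediate files:

```shell
# Scratch directory with illustrative, deliberately unsorted sample data.
tmp=$(mktemp -d)
printf '%s\n' line3 line1 line2 > "$tmp/file1"
printf '%s\n' line5 line1 line4 > "$tmp/file2"

# comm expects sorted input, so sort each file first.
sort "$tmp/file1" > "$tmp/file1.sorted"
sort "$tmp/file2" > "$tmp/file2.sorted"

# -2 suppresses lines unique to file2, -3 suppresses lines common to both,
# leaving only the lines unique to file1.
only_in_file1=$(comm -23 "$tmp/file1.sorted" "$tmp/file2.sorted")
printf '%s\n' "$only_in_file1"

rm -rf "$tmp"
```

In bash the intermediate files can be avoided with process substitution: comm -23 <(sort file1) <(sort file2).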


大哥的爱人

#4 · 2019-01-01 10:20

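I found that a plain loop with an if statement worked for me. Note that it runs grep once per line of file2, so it will be slow on large files.

# Read file2 line by line (no word splitting) and look for each line in file1.
while IFS= read -r line; do
    # -F: fixed string, -x: whole-line match, -q: no output, exit status only
    if grep -Fxq -- "$line" file1; then
        echo "$line found" >> Matching_lines.txt
    else
        echo "$line missing" >> missing_lines.txt
    fi
done < file2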


有味是清欢

#5 · 2019-01-01 10:26

You can use Python:

python -c '
# Collect every line of file2 in a set for fast membership tests.
with open("file2", "r") as f:
    lines_to_remove = set(line.rstrip("\n") for line in f)

# Print each line of file1 that does not appear in file2.
with open("file1", "r") as f:
    for line in f:
        line = line.rstrip("\n")
        if line not in lines_to_remove:
            print(line)
'


琉璃瓶的回忆

#6 · 2019-01-01 10:30

$ join -v 1 -t '' file1 file2
line2
line3

The empty separator given with -t '' makes join compare the whole line, which matters if some of the lines contain spaces. Like comm, join expects both files to be sorted.


琉璃瓶的回忆

#7 · 2019-01-01 10:30

If you're short of "fancy tools", e.g. in some minimal Linux distribution, there is a solution with just cat, sort and uniq:

cat includes.txt excludes.txt excludes.txt | sort | uniq --unique

Test:

seq 1 1 7 | sort --random-sort > includes.txt
seq 3 1 9 | sort --random-sort > excludes.txt
cat includes.txt excludes.txt excludes.txt | sort | uniq --unique

# Output:
1
2

This is also relatively fast, compared to grep.
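
The trick works because listing excludes.txt twice guarantees that each of its lines occurs at least twice in the combined stream, so uniq drops them. One consequence worth knowing: a line that is repeated within includes.txt is dropped as well, even if it never appears in excludes.txt. A minimal sketch of that behaviour and one possible workaround, using made-up sample data in a temporary directory (-u is the short form of --unique):

```shell
# Scratch directory with illustrative sample data.
tmp=$(mktemp -d)
printf '%s\n' a b b c > "$tmp/includes.txt"   # "b" appears twice
printf '%s\n' c d     > "$tmp/excludes.txt"

# Plain version: "b" is lost because it is duplicated in includes.txt.
result=$(cat "$tmp/includes.txt" "$tmp/excludes.txt" "$tmp/excludes.txt" \
         | sort | uniq -u)
printf 'plain:   %s\n' $result

# Workaround: de-duplicate includes.txt first, so only lines that also
# occur in excludes.txt can appear more than once.
deduped=$( { sort -u "$tmp/includes.txt"
             cat "$tmp/excludes.txt" "$tmp/excludes.txt"; } \
           | sort | uniq -u )
printf 'deduped: %s\n' $deduped

rm -rf "$tmp"
```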

