Question:
I have a large file A consisting of email addresses, one per line. I also have another file B that contains another set of addresses.
Which command would I use to remove all the addresses that appear in file B from file A?
So, if file A contained:
A
B
C
and file B contained:
B
D
E
Then file A should be left with:
A
C
I know this question has probably been asked before, but the only command I found online gave me a "bad delimiter" error.
Any help would be much appreciated! Somebody will surely come up with a clever one-liner, but I'm no shell expert.
Answer 1:
comm -23 file1 file2
-23 suppresses the lines that are only in file2 (column 2) and the lines that are in both files (column 3), leaving only the lines unique to file1. Both files have to be sorted (they are in your example); if they are not, pipe them through sort first.
See the comm man page for details.
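As a minimal sketch of the "sort first" step, the following uses throwaway demo files (the directory and file names are assumptions, not part of the question). It sorts both inputs with the same collation (LC_ALL=C) before handing them to comm, since comm expects both inputs in the same order.

```shell
# Demo files with hypothetical names, deliberately unsorted.
tmpdir=$(mktemp -d)
printf '%s\n' C A B > "$tmpdir/fileA"
printf '%s\n' E B D > "$tmpdir/fileB"

# Sort each input with the same collation that comm will use.
LC_ALL=C sort "$tmpdir/fileA" > "$tmpdir/fileA.sorted"
LC_ALL=C sort "$tmpdir/fileB" > "$tmpdir/fileB.sorted"

# -2 drops lines only in fileB, -3 drops lines in both;
# what remains is column 1: lines only in fileA.
comm_result=$(LC_ALL=C comm -23 "$tmpdir/fileA.sorted" "$tmpdir/fileB.sorted")
printf '%s\n' "$comm_result"
```

Note that the result comes out in sorted order, not in the original order of fileA.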
Answer 2:
grep -Fvxf <lines-to-remove> <all-lines>
- works on non-sorted files
- maintains the order
- is POSIX
Example:
cat <<EOF > A
b
1
a
0
01
b
1
EOF
cat <<EOF > B
0
1
EOF
grep -Fvxf B A
Output:
b
a
01
b
Explanation:
-F: use literal strings instead of the default BRE
-x: only consider matches that match the entire line
-v: print non-matching lines
-f file: take patterns from the given file
This method is slower on pre-sorted files than other methods, since it is more general. If speed matters as well, see: Fast way of finding lines in one file that are not in another?
See also: https://unix.stackexchange.com/questions/28158/is-there-a-tool-to-get-the-lines-in-one-file-that-are-not-in-another
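The question asks for file A itself to end up without the removed lines. One way to do that, sketched here with demo files in a temporary directory (all names are assumptions), is to write the grep output to a temporary file and then move it over the original. Redirecting straight into A would not work, because the shell truncates A before grep reads it.

```shell
workdir=$(mktemp -d)
printf '%s\n' A B C > "$workdir/A"
printf '%s\n' B D E > "$workdir/B"

# grep exits with status 1 when it selects no lines, which is a
# legitimate result here (every line was removed), so only a
# status above 1 is treated as a failure.
status=0
grep -Fvxf "$workdir/B" "$workdir/A" > "$workdir/A.tmp" || status=$?
if [ "$status" -le 1 ]; then
    mv "$workdir/A.tmp" "$workdir/A"
fi
cat "$workdir/A"
```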
Answer 3:
awk to the rescue!
This solution doesn't require sorted inputs. You have to provide fileB first.
awk 'NR==FNR{a[$0];next} !($0 in a)' fileB fileA
returns
A
C
How does it work?
The NR==FNR{a[$0];next} idiom stores the first file in an associative array as keys for a later "contains" test.
NR==FNR checks whether we're scanning the first file, where the global line counter (NR) equals the current file's line counter (FNR).
a[$0] adds the current line to the associative array as a key. Note that this behaves like a set: there won't be any duplicate keys.
!($0 in a) applies once we're in the next file(s). in is a contains test; here it checks whether the current line is in the set populated from the first file, and ! negates the condition. What is missing here is the action, which by default is {print} and is usually not written explicitly.
Note that this can now be used to remove blacklisted words.
$ awk '...' badwords allwords > goodwords
With a slight change it can clean multiple lists and create cleaned versions.
$ awk 'NR==FNR{a[$0];next} !($0 in a){print > (FILENAME ".clean")}' bad file1 file2 file3 ...
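Here is a small worked run of the multi-file variant, using made-up lists in a temporary directory (the file names and contents are assumptions). The output file name is parenthesized because some awk implementations parse an unparenthesized concatenation after > differently. One caveat of the NR==FNR idiom: if the first file is empty, the condition is also true for the second file, so this sketch assumes a non-empty removal list.

```shell
awkdir=$(mktemp -d)
printf '%s\n' B D E   > "$awkdir/bad"
printf '%s\n' A B C   > "$awkdir/file1"
printf '%s\n' D X E Y > "$awkdir/file2"

# First file: remember every line as a key in the array "seen".
# Later files: write lines that are not in "seen" to <name>.clean.
awk 'NR==FNR { seen[$0]; next }
     !($0 in seen) { print > (FILENAME ".clean") }' \
    "$awkdir/bad" "$awkdir/file1" "$awkdir/file2"

cat "$awkdir/file1.clean"
cat "$awkdir/file2.clean"
```

Unlike the comm and join approaches, each .clean file keeps the original line order of its source.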
Answer 4:
Another way to do the same thing (also requires sorted input):
join -v 1 fileA fileB
In Bash, if the files are not pre-sorted:
join -v 1 <(sort fileA) <(sort fileB)
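One thing to watch for: by default join compares only the first whitespace-separated field, which is fine for one address per line but not for lines that contain spaces. The sketch below (demo data and file names are assumptions) shows the difference, and one possible workaround: setting the field separator to a character that does not occur in the data, here a tab, so the whole line becomes the join field. It uses temporary sorted files instead of process substitution, so it also runs in a plain POSIX shell.

```shell
joindir=$(mktemp -d)
printf '%s\n' 'bob smith' 'alice jones' 'bob jones' > "$joindir/fileA"
printf '%s\n' 'bob smith' > "$joindir/fileB"

LC_ALL=C sort "$joindir/fileA" > "$joindir/a.sorted"
LC_ALL=C sort "$joindir/fileB" > "$joindir/b.sorted"

# Default: the key is the first field, so "bob jones" is dropped too,
# because its key "bob" also appears in fileB.
by_field=$(LC_ALL=C join -v 1 "$joindir/a.sorted" "$joindir/b.sorted")

# With a tab as separator (assumed absent from the data), the whole
# line is the key, so only the exact line "bob smith" is dropped.
tab=$(printf '\t')
by_line=$(LC_ALL=C join -t "$tab" -v 1 "$joindir/a.sorted" "$joindir/b.sorted")

printf 'first-field match:\n%s\n' "$by_field"
printf 'whole-line match:\n%s\n' "$by_line"
```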
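Answer 5:
You can do this if both files are sorted (diff compares line sequences, so unsorted input can give wrong results). The format options are GNU diff extensions. Write the output to a new file and then replace file-a, because redirecting straight into file-a would truncate it before diff reads it:
diff --new-line-format="" --old-line-format="%L" --unchanged-line-format="" file-a file-b > file-a.new
mv file-a.new file-a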
--new-line-format is for lines that are in file b but not in a.
--old-line-format is for lines that are in file a but not in b.
--unchanged-line-format is for lines that are in both.
%L prints the line exactly as it is, including its trailing newline.
See man diff for more details.
Answer 6:
This refinement of @karakfa's nice answer may be noticeably faster for very large files. As with that answer, neither file needs to be sorted, and lookups are fast thanks to awk's associative arrays. Only the lookup file is held in memory.
This formulation also allows for the possibility that only one particular field ($N) in the input file is to be used in the comparison.
# Print lines in the input unless the value in column $N
# appears in a lookup file, $LOOKUP;
# if $N is 0, then the entire line is used for comparison.
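awk -v N="$N" -v lookup="$LOOKUP" '
# Compare getline against 0 so that an unreadable lookup file (return value -1) ends the loop.
BEGIN { while ( (getline line < lookup) > 0 ) { dictionary[line]=line } }
!($N in dictionary) {print}'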
(Another advantage of this approach is that it is easy to modify the comparison criterion, e.g. to trim leading and trailing white space.)
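To illustrate the field-based comparison, here is a hedged usage sketch with made-up data: a contacts file whose first column is an address, and a lookup file of addresses to drop. The file names, the sample addresses, and the choice N=1 are all assumptions. The loop compares the getline result against 0 so that an unreadable lookup file ends the loop instead of repeating forever.

```shell
lookdir=$(mktemp -d)
printf '%s\n' 'alice@example.com' 'bob@example.com' > "$lookdir/blocked"
printf '%s\n' 'alice@example.com Alice' \
              'carol@example.com Carol' \
              'bob@example.com Bob' > "$lookdir/contacts"

N=1                          # compare on column 1; 0 would mean the whole line
LOOKUP="$lookdir/blocked"

# Load the lookup file into an array, then print every input line
# whose column N is not a key of that array.
kept=$(awk -v N="$N" -v lookup="$LOOKUP" '
BEGIN { while ((getline line < lookup) > 0) dictionary[line] }
!($N in dictionary)' "$lookdir/contacts")

printf '%s\n' "$kept"
```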
Answer 7:
You can use Python:
python -c '
# Collect the lines to remove in a set for fast membership tests.
lines_to_remove = set()
with open("file B", "r") as f:
    for line in f:
        lines_to_remove.add(line.strip())
# Print every line of the other file that is not in the set.
with open("file A", "r") as f:
    for line in f:
        line = line.strip()
        if line not in lines_to_remove:
            print(line)
'
Answer 8:
You can use -
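diff fileA fileB | grep "^<" | cut -c3- > fileA.new; mv fileA.new fileA
In diff's default output, lines that are only in fileA start with "<", so those are the ones to keep. The result goes to a new file first, because redirecting straight into fileA would truncate it before diff reads it. Since diff compares line sequences, this is only reliable when both files are in the same order (for example, sorted).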