The question "Unix command to find lines common in two files" has an answer suggesting the use of the comm command to do the task:
comm -12 1.sorted.txt 2.sorted.txt
This shows the lines common to the two files (the -1 suppresses the lines that are only in the first file, and the -2 suppresses the lines that are only in the second file, leaving just the lines common to both files as output). As the file names suggest, the input files must be in sorted order.
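As a small illustration of the column-suppression options, the sketch below rebuilds the two sample files used later in this post inside a scratch directory and runs comm once per column. The output names both.txt, only-1.txt and only-2.txt are hypothetical, and this approach reads each input three times.

```shell
# Scratch directory so that no existing files are touched.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Two already-sorted inputs (the sample data from this post).
printf '%s\n' 1.line-1 1.line-2 1.line-4 1.line-6 2.line-2 3.line-5 > 1.sorted.txt
printf '%s\n' 1.line-3 2.line-1 2.line-2 2.line-4 2.line-6 3.line-5 > 2.sorted.txt

comm -12 1.sorted.txt 2.sorted.txt > both.txt     # suppress columns 1 and 2
comm -23 1.sorted.txt 2.sorted.txt > only-1.txt   # suppress columns 2 and 3
comm -13 1.sorted.txt 2.sorted.txt > only-2.txt   # suppress columns 1 and 3
```

When a column is suppressed, comm also drops the tab that would have preceded the later columns, so each of these files contains plain, unindented lines.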
In a comment to that question, bapors asks:
How would one have the outputs in different files?
Seeking clarification, I asked:
If you want the lines only in File1 in one file, those only in File2 in another, and those in both in a third, then (provided that none of the lines in the files starts with a tab) you could use sed to split the output to three files.
User bapors confirmed:
It is exactly what I was asking. Would you show an example?
The answer is relatively long-winded and would spoil the simplicity of the answer to the other question (drowning it out with lots of information), so I've asked the question separately here — and provided an answer too.
The basic solution using sed relies on the fact that comm outputs lines found only in the first file with no prefix; it outputs the lines found only in the second file with a single tab; and it outputs the lines found in both files with two tabs.
It also relies on sed's w command to write to files.
Given file 1.sorted.txt containing:
1.line-1
1.line-2
1.line-4
1.line-6
2.line-2
3.line-5
and file 2.sorted.txt containing:
1.line-3
2.line-1
2.line-2
2.line-4
2.line-6
3.line-5
the basic output from comm 1.sorted.txt 2.sorted.txt is:
1.line-1
1.line-2
	1.line-3
1.line-4
1.line-6
	2.line-1
		2.line-2
	2.line-4
	2.line-6
		3.line-5
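The tab prefixes are easy to lose on screen or when text is copied around. One way to make them visible is sed's l command; the sketch below recreates the two sample files in a scratch directory and prints the comm output with tabs shown as \t and each line end marked with $.

```shell
# Scratch directory so that no existing files are touched.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

printf '%s\n' 1.line-1 1.line-2 1.line-4 1.line-6 2.line-2 3.line-5 > 1.sorted.txt
printf '%s\n' 1.line-3 2.line-1 2.line-2 2.line-4 2.line-6 3.line-5 > 2.sorted.txt

# -n suppresses the normal output; l prints each line unambiguously.
comm 1.sorted.txt 2.sorted.txt | sed -n l
```

Lines unique to the first file appear as, for example, 1.line-1$, lines unique to the second as \t1.line-3$, and common lines as \t\t2.line-2$.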
Given a file script.sed containing:
/^\t\t/ {
s///
w file.3
d
}
/^\t/ {
s///
w file.2
d
}
/^[^\t]/ {
w file.1
d
}
you can run the command shown below and get the desired output like this:
$ comm 1.sorted.txt 2.sorted.txt | sed -f script.sed
$ cat file.1
1.line-1
1.line-2
1.line-4
1.line-6
$ cat file.2
1.line-3
2.line-1
2.line-4
2.line-6
$ cat file.3
2.line-2
3.line-5
$
The script works by:
- matching lines that start with 2 tabs, deleting the tabs (the empty pattern in s/// reuses the regular expression that just matched), writing the line to file.3, and deleting the line (so the rest of the script is ignored),
- matching lines that start with 1 tab, deleting the tab, writing the line to file.2, and deleting the line (so the rest of the script is ignored),
- matching lines that do not start with a tab, writing the line to file.1, and deleting the line.
The match and delete operations in step 3 are more for symmetry than anything else; they could be omitted (leaving just w file.1) and the script would write the same three files, although without the d those lines would also be echoed to standard output. However, see script3.sed below for further justification for keeping the symmetry.
As written, that requires GNU sed; BSD sed doesn't recognize the \t escapes. Obviously, the file could be written with actual tabs in place of the \t notation, and then BSD sed is OK with the script.
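One way to get actual tabs into the script file without typing them is to let the shell expand a variable that holds a tab while it writes the file. This is only a sketch: the variable name tab is an assumption of the example, and the sample data is recreated in a scratch directory.

```shell
# Scratch directory so that no existing files are touched.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

printf '%s\n' 1.line-1 1.line-2 1.line-4 1.line-6 2.line-2 3.line-5 > 1.sorted.txt
printf '%s\n' 1.line-3 2.line-1 2.line-2 2.line-4 2.line-6 3.line-5 > 2.sorted.txt

# Hold one literal tab character in a shell variable.
tab=$(printf '\t')

# The here-document delimiter is unquoted, so $tab expands to a real tab
# and the resulting script.sed contains no \t escapes.
cat > script.sed <<EOF
/^$tab$tab/ {
    s///
    w file.3
    d
}
/^$tab/ {
    s///
    w file.2
    d
}
/^[^$tab]/ {
    w file.1
    d
}
EOF

comm 1.sorted.txt 2.sorted.txt | sed -f script.sed
```

The generated script.sed is the same as the one shown earlier except that the tabs are real characters, so it should not depend on the \t extension.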
It is possible to make it work all on the command line, but it is fiddly (and that's being polite about it). Using Bash's ANSI C Quoting, you can write:
$ comm 1.sorted.txt 2.sorted.txt |
> sed -e $'/^\t\t/ { s///\n w file.3\n d\n }' \
> -e $'/^\t/ { s///\n w file.2\n d\n }' \
> -e $'/^[^\t]/ { w file.1\n d\n }'
$
which writes each of the three 'paragraphs' of script.sed in a separate -e option. The w command is fussy; it expects the file name, and only the file name, after it on the same line of the script, hence the use of \n after the file names in the script. There are spaces aplenty that could be eliminated, but the symmetry is clearer with the layout shown. And using the -f script.sed file is probably simpler; it is certainly a technique worth knowing, as it can avoid problems when the sed script must operate on single, double and back-quotes, which makes it difficult to write the script on the Bash command line.
Finally, if the two files can contain lines starting with tabs, this technique requires some more brute force to make it work. One variant solution exploits Bash's process substitution to add a prefix before the lines in the files, and then the post-processing sed script removes the prefixes before writing to the output files.
script3.sed (with tabs replaced by up to 8 spaces); note that this time there is a substitute s/// needed in the third paragraph (the d is still optional, but may as well be included):
/^              X/ {
s///
w file.3
d
}
/^      X/ {
s///
w file.2
d
}
/^X/ {
s///
w file.1
d
}
And the command line:
$ comm <(sed 's/^/X/' 1.sorted.txt) <(sed 's/^/X/' 2.sorted.txt) |
> sed -f script3.sed
$
For the same input files, this produces the same output, but by adding and then removing the X at the start of each line, the code doesn't change the sort order of the data and would handle leading tabs if they were present.
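To see the prefix trick cope with leading tabs, here is a sketch with made-up inputs (in1.txt and in2.txt are hypothetical names) in which some lines begin with a tab. It uses temporary prefixed files instead of process substitution so that it runs in a plain POSIX shell, writes script3.sed with real tab characters, and sets LC_ALL=C so that the tab-led lines count as sorted.

```shell
export LC_ALL=C
workdir=$(mktemp -d) && cd "$workdir" || exit 1
tab=$(printf '\t')

# Hypothetical inputs, sorted in the C locale; some lines start with a tab.
printf '%s\n' "${tab}only-1" "${tab}shared" "plain-1" > in1.txt
printf '%s\n' "${tab}shared" "plain-2" > in2.txt

# Prefix every line with X (temporary files stand in for process substitution).
sed 's/^/X/' in1.txt > in1.prefixed
sed 's/^/X/' in2.txt > in2.prefixed

# Same logic as script3.sed, generated with real tabs via $tab.
cat > script3.sed <<EOF
/^$tab${tab}X/ {
    s///
    w file.3
    d
}
/^${tab}X/ {
    s///
    w file.2
    d
}
/^X/ {
    s///
    w file.1
    d
}
EOF

comm in1.prefixed in2.prefixed | sed -f script3.sed
```

After this runs, file.1 holds the tab-led only-1 line and plain-1, file.2 holds plain-2, and file.3 holds the shared line with its leading tab intact.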
You can also easily write solutions that use Perl or Awk, and those do not even have to use comm (and can be made to work with unsorted files, provided the files fit into memory).
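As a rough sketch of the Awk route without comm, the script below reads the second file into memory, classifies the lines of the first file, and then makes a second pass over the second file. The input names a.txt and b.txt and their contents are invented for the example. It treats the files as sets, so repeated lines are not paired off one for one as comm would do, and an output file is only created if at least one line belongs in it.

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical unsorted inputs.
printf '%s\n' cherry apple banana > a.txt
printf '%s\n' banana date apple > b.txt

awk '
BEGIN {
    f1 = ARGV[1]; f2 = ARGV[2]

    # Remember every line of the second file.
    while ((getline line < f2) > 0) in2[line] = 1
    close(f2)

    # Classify the lines of the first file, remembering them as we go.
    while ((getline line < f1) > 0) {
        in1[line] = 1
        if (line in in2) print line > "file.3"
        else             print line > "file.1"
    }
    close(f1)

    # Second pass over the second file: keep lines the first file lacks.
    while ((getline line < f2) > 0)
        if (!(line in in1)) print line > "file.2"
}' a.txt b.txt
```

Lines are written in the order they appear in the inputs, so file.3 follows the order of the first file.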
comm + awk solution:
Complicated sample files:
1.txt:
1. line-1 with spaces ( | | here
1.line-2
1.line-4 with tabs >
1.line-6
2.line-2
3.line-5 (tabs)
2.txt:
1.line-3
2.line-1 with spaces
2.line-2
2.line-4
2.line-6 with tabs
3.line-5 (tabs)
The job:
comm -12 1.txt 2.txt > file-common
# ARGIND is specific to GNU awk (gawk); it is 2 while reading 1.txt and 3 while reading 2.txt
awk 'NR==FNR{ a[$0]; next } !($0 in a){ print $0 > ("file" ARGIND-1) }' file-common 1.txt 2.txt
Viewing results:
head file*
==> file1 <==
1. line-1 with spaces ( | | here
1.line-2
1.line-4 with tabs >
1.line-6
==> file2 <==
1.line-3
2.line-1 with spaces
2.line-4
2.line-6 with tabs
==> file-common <==
2.line-2
3.line-5 (tabs)
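Since ARGIND is a GNU awk extension, a variant that tests FILENAME instead may be useful with other awk implementations. This is a sketch with simplified, made-up file contents; the file names match the ones used above. Testing FILENAME rather than NR==FNR also keeps the result correct when file-common turns out to be empty.

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Simplified, already-sorted stand-ins for 1.txt and 2.txt.
printf '%s\n' alpha bravo delta > 1.txt
printf '%s\n' bravo charlie delta echo > 2.txt

comm -12 1.txt 2.txt > file-common

awk '
    # Lines of file-common are only recorded, never printed.
    FILENAME == "file-common" { common[$0]; next }
    # Anything else that is not common goes to the file for its source.
    !($0 in common) { print > (FILENAME == "1.txt" ? "file1" : "file2") }
' file-common 1.txt 2.txt
```

The redirection target is parenthesized because an unparenthesized expression after > is parsed differently by different awk implementations.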