I have two files, each listing one file name per line. I merged them, sorted the result, and then checked the comm output and noticed something odd.
$ sort -u -o list1 list1
$ sort -u -o list2 list2
$ cat list1 list2 > combined
$ wc -l list1
18141 list1
$ wc -l list2
21755 list2
$ wc -l combined
39896 combined
$ sort -u -o combined combined
$ wc -l combined
24400 combined
$ comm -23 list1 combined | wc -l
12889
$ comm -13 list1 combined | wc -l
19148
$ comm -12 list1 combined | wc -l
5252
$ comm -23 list2 combined | wc -l
0
$ comm -13 list2 combined | wc -l
2645
$ comm -12 list2 combined | wc -l
21755
(line breaks above for clarity)
What's going on with those last few calls to comm? When I compare list1 to combined the output is wacky, but when I compare list2 to combined the output seems fine.
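To make "wacky" concrete: comm prints every input line in exactly one column, so the counts still add up even when the comparison is wrong. Since list1 was merged into combined, I would expect no lines unique to list1. A small sketch of that arithmetic, using only the numbers from the transcript above (the variable names are just for illustration):

```shell
# Counts copied from the transcript above.
list1=18141
combined=24400
only_list1=12889      # comm -23 list1 combined
only_combined=19148   # comm -13 list1 combined
both=5252             # comm -12 list1 combined

# Every line of each file appears in exactly one column, so these sums hold.
[ "$((only_list1 + both))" -eq "$list1" ] && echo "list1 lines all accounted for"
[ "$((only_combined + both))" -eq "$combined" ] && echo "combined lines all accounted for"

# list1 is a subset of combined, so the expected counts would be:
echo "expected: only_list1=0 both=$list1 only_combined=$((combined - list1))"
```

So the totals are consistent, but the split between the columns is not what a subset relationship should produce.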
I even tried to combine all three lists again and test:
$ cat list1 list2 combined > combined-again
$ wc -l combined-again
64296 combined-again
$ sort -u -o combined-again combined-again
$ wc -l combined-again
24400 combined-again
$ diff combined combined-again
The sorted unique line counts of combined and combined-again match, and there is no output from diff!
$ comm combined combined-again | wc -l
24400
$ comm -12 combined combined-again | wc -l
24400
$ comm -3 combined combined-again | wc -l
0
These comm outputs make sense: there shouldn't be any difference between the two files.
$ comm -23 list1 combined-again | wc -l
12889
$ comm -13 list1 combined-again | wc -l
19148
$ comm -12 list1 combined-again | wc -l
5252
When comparing against list1, we see the same wonky numbers again.
$ comm -23 list2 combined-again | wc -l
0
$ comm -13 list2 combined-again | wc -l
2645
$ comm -12 list2 combined-again | wc -l
21755
When comparing against list2, the numbers are appropriate and correct.
I even took some lines of output from comm -23 list1 combined-again and used grep to search for them in combined-again, and those lines do exist. I'm totally at a loss as to why the comm output is faulty in this case...
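Since grep finds the lines, the remaining suspect is the order of the lines rather than their content. One check that might help, sketched here on made-up sample data rather than my real lists, is asking sort whether a file is already in plain byte order with `sort -c` under `LC_ALL=C`. Whether byte order is the order my comm actually expects is an assumption I have not confirmed.

```shell
# Hypothetical sample: ordered the way a case-insensitive collation might sort it.
tmp=$(mktemp)
printf '%s\n' aaa AAB aac > "$tmp"

# sort -c only checks the order; it exits non-zero if the file is not sorted.
# LC_ALL=C forces byte-wise comparison, where 'A' (65) sorts before 'a' (97).
if LC_ALL=C sort -c "$tmp" 2>/dev/null; then
  verdict="sorted in byte order"
else
  verdict="NOT sorted in byte order"
fi
echo "$verdict"

rm -f "$tmp"
```

Running the same check against list1 and combined would show whether the files are sorted in byte order or in some locale-specific order.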
EDIT1:
$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
None of the files contain weird symbols or characters, just package names in camel case. For example:
$ head list1
AAAAuthentication
AAACorrelationAPI
AAACorrespondence
AAATestSuite
AESDescription
AESImplementation
AESLogging
AESMaster
AESProofSystem
AESTestSuite
EDIT2:
After some more investigation prompted by suggestions in the comments, it seems the issue could be due to the versions of the comm and sort tools.
I ran all of the above commands on a Mac, where comm is from BSD (January 26, 2005) and sort is from GNU coreutils (sort 5.93, November 2005).
On the Linux box, both comm and sort are from GNU coreutils 8.4 (January 2012), and the calls work perfectly.
I guess the question now is: what differs between these versions, and why does it affect the comm output as shown above?
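One experiment that could narrow this down is to take locale out of the picture by running both tools under `LC_ALL=C`, so that sort and comm should both compare lines byte by byte. The sketch below uses small made-up files in place of my real lists; the file contents and the `tmp` directory are hypothetical, and I am only assuming that a collation mismatch is the cause.

```shell
# Hypothetical sample data standing in for list1 and list2.
tmp=$(mktemp -d)
printf '%s\n' AESLogging aesHelper AAATestSuite > "$tmp/list1"
printf '%s\n' AESMaster AAATestSuite Zeta > "$tmp/list2"

# Sort and compare under the same byte-wise collation.
export LC_ALL=C
sort -u -o "$tmp/list1" "$tmp/list1"
sort -u -o "$tmp/list2" "$tmp/list2"
sort -u "$tmp/list1" "$tmp/list2" > "$tmp/combined"

# Every line of list1 is also in combined, so nothing should be unique to list1.
# tr strips the padding that some wc implementations put before the number.
only_in_list1=$(comm -23 "$tmp/list1" "$tmp/combined" | wc -l | tr -d ' ')
echo "lines only in list1: $only_in_list1"

rm -rf "$tmp"
```

If the same approach gives 0 for the real list1 on the Mac, that would point at the two tools disagreeing about sort order rather than at the data.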