I am sorry for the title but I didn't know how to explain this:
I am trying to sort two files because I want to merge them. They look like this:
test1.txt
rs1010735 224915429
rs1010805 38189142
rs10108 114516330
rs1010863 185432942
rs1010891 110712154
rs1010910 61212213
rs1011124 7533164
and
test2.txt
rs1010735 C
rs1010805 T
rs1010863 T
rs1010891 T
rs10108 C
rs1010910 A
rs1011124 A
I used sort -k1 test1.txt
and sort -k1 test2.txt
and got this:
test1_sort.txt
rs1010735 224915429
rs1010805 38189142
rs10108 114516330
rs1010863 185432942
rs1010891 110712154
rs1010910 61212213
rs1011124 7533164
and
test2_sort.txt
rs1010735 C
rs1010805 T
rs1010863 T
rs1010891 T
rs10108 C
rs1010910 A
rs1011124 A
Why is the sorting different if both first columns have the same values?
I also tried sort -n -s k1,1
but got the same result.
There are two issues here.
Locale-aware sorting
At base, the problem here is that you are sorting according to your "locale", which is presumably en_US.UTF-8 (or some other Unicode locale). In theory, a locale-aware sort will produce the ordering that would be expected according to the normal sorting rules for that location, while a non-locale-aware sort will sort according to the "arbitrary" character codes for each character.
In a locale-aware sort, for example, it would be common for a word starting with a capital letter to come just before (or just after) the same word starting with a lower-case letter, whereas a non-locale-aware sort will put all the words starting with a capital letter before any word starting with a lower-case letter. Also, in an English-speaking locale, you would probably find words starting with an ä intermingled with words starting with a, whereas in a Swedish locale, you'd find them after words starting with z because in Swedish, ä is the 28th letter (it comes after å and before ö, in case you're interested).
For all that to work, the locale descriptions on your machine need to actually describe the sorting order which would be expected in each locale, particularly the default locale, which should correspond to what you would expect. As can be seen from this example, that is sometimes not the case. Indeed, it sometimes produces bizarrely unexpected results.
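The capital-letter behaviour of a non-locale-aware sort is easy to check. The sketch below assumes that plain character-code ordering is requested through the LC_ALL environment variable set to C (the C locale is explained further down); the word list is made up for illustration.

```shell
# Hypothetical word list, used only for illustration.
words='banana
Apple
apple
Banana'

# LC_ALL=C asks sort to compare raw character codes, so every
# upper-case ASCII letter sorts before every lower-case one.
c_sorted=$(printf '%s\n' "$words" | LC_ALL=C sort)
printf '%s\n' "$c_sorted"
```

This prints Apple, Banana, apple, banana. Running the same pipeline without LC_ALL=C uses whatever locale is configured, so the result then depends on that locale's collation rules.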
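What is happening in your example is that the locale description for your locale says that whitespace does not participate in sortation. It also indicates that digits come before letters. Now, consider a subset of your data (with both files combined): the two lines whose first field is rs10108 and the two lines whose first field is rs1010863. If we eliminate the whitespace altogether and sort what is left according to normal alphabetic rules, with digits first, then rs10108114516330 comes before rs1010863185432942 (because 1 sorts before 6), but rs10108C comes after rs1010863T (because C is a letter, and every digit sorts before it). Putting the whitespace back does not change that order.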
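As a rough sketch, those steps can be imitated with standard tools. It assumes a six-line subset of the two files (the rs1010805, rs10108 and rs1010863 lines) and approximates a collation that ignores whitespace by sorting on a blank-free copy of each line by byte value, which puts digits before letters. A real locale's collation tables are more involved than this.

```shell
# Assumed subset of test1.txt and test2.txt, combined.
data='rs1010805 38189142
rs1010805 T
rs10108 114516330
rs10108 C
rs1010863 185432942
rs1010863 T'

# Build a key with blanks removed, sort on that key by byte value
# (digits before letters), then drop the key again.
simulated=$(printf '%s\n' "$data" |
  awk '{ key = $0; gsub(/[ \t]/, "", key); print key "\t" $0 }' |
  LC_ALL=C sort -t "$(printf '\t')" -k1,1 |
  cut -f2-)
printf '%s\n' "$simulated"
```

With that subset, rs10108 114516330 sorts ahead of the rs1010863 lines, while rs10108 C ends up last, after both of them.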
Those are the rules sort is following, and the result is that the two lines whose first field is rs10108 do not get sorted together. Counter-intuitive, ¿no?
Probably the correct solution would be to tell whoever built the locale files for your distribution that the normal rule is "nothing (visible) comes before something", which was the alphabetization rule we were taught in school. In other words, a space (nothing visible) comes before any character. Or you could try to fix the collation files yourself.
But in practical terms, the solution is to tell sort to do a non-locale-aware sort by default. I do that by setting the locale used for sorting to C in my bash startup files. (C is the special name of the locale corresponding to the programming language "C", in which symbols are sorted by their internal character codes.) You could also just set it on the command line every time you want to sort something.
The meaning of the -k argument
The -k argument to sort has the basic syntax:
-kstart[,end]
where the positions start (and optionally end) define a range of text to use as a sort key. If end is not specified, the range continues to the end of the line.
The simplest form of a position is just a field number, such as 1, meaning "the first field". But -k1 does nothing, because it means, precisely, "use the text from the first field to the end of the line", which is essentially the same as saying "use the entire line as a sort key", which is the default. So anytime you see -k1 you should know that it is not doing what is expected.
Explicitly specifying the end would be more precise: -k1,1 means that the sort key is the text from the (start of) the first field to the (end of) the first field, or in other words, the first field. That would be better, but it wouldn't provide any hint on how to sort two lines which had the same first field. The standard sort utility is not "stable" by default, so it is not predictable in which order two such lines will be sorted. It would generally be better to add more secondary sort fields:
sort -k1,1 -k2,2
which means, effectively, "sort by the first field, but if the first fields are equal, then compare the second fields."
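To make the difference concrete, here is a small sketch with made-up data in which the two lines sharing a first field are deliberately out of order. It runs under LC_ALL=C (an assumption, to keep locale rules out of the picture) and, for the keyed sorts, passes the -s option that the question already uses, so that lines whose keys compare equal keep their input order and the effect of each key is visible.

```shell
# Made-up sample data: the two "a" lines are deliberately out of order.
data='b 2
a 9
a 1'

# -k1 runs from field 1 to the end of the line, so it matches a plain sort.
plain=$(printf '%s\n' "$data" | LC_ALL=C sort)
k1=$(printf '%s\n' "$data" | LC_ALL=C sort -k1)

# -k1,1 compares only the first field; -s keeps tied lines in input order.
first_only=$(printf '%s\n' "$data" | LC_ALL=C sort -s -k1,1)

# Adding -k2,2 breaks ties on the second field.
two_keys=$(printf '%s\n' "$data" | LC_ALL=C sort -s -k1,1 -k2,2)

printf '%s\n' "-k1:" "$k1" "-s -k1,1:" "$first_only" "-s -k1,1 -k2,2:" "$two_keys"
```

Here -k1 gives the same result as a plain sort, -s -k1,1 leaves a 9 ahead of a 1, and only the last command, the two-key form with -s added for the comparison, orders the a lines by their second field.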
Fields are split at whitespace (even if whitespace is ignored for sortation), so the above is different from sort -k1,2 in that it is guaranteed to put lines with the same value in the first field in consecutive positions.
Appendix: Why locales ignore whitespace in sorting
Unfortunately, sort -k1,1 -k2,2 also might not do what you want, particularly if you do it in the "C" locale, because of the historic definition of sort fields used by sort. Unless an explicit delimiter is specified with the -t option, sort fields start with each whitespace character which follows a non-whitespace character. Consequently, all fields except the first field start with whitespace. That's fine if they all start with the same whitespace, but often fields have been lined up by explicitly adding the right number of space characters. And that almost always produces incorrect sorting on fields other than the first field.
Since that is not generally what is wanted, sort provides a way of suppressing this annoying behaviour: the b sort-key flag (sort key flags go at the end of the -k specification). This flag tells sort to ignore leading whitespace in a sort-key. Also, you can specify -b as a command-line option before any -k option to specify that all sort keys should be treated as having the b flag. That would suggest that the correct invocation of sort would either attach the b flag to each sort key or put -b ahead of the -k options.
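A small sketch of the effect, using made-up lines whose second column is preceded by a different number of spaces. It assumes the C locale (via LC_ALL=C), so that the blanks really do take part in the comparison, and it uses the global -b form.

```shell
# Made-up data: one space before the second field on the first line,
# two spaces on the second line.
data='x a
x  b'

# Without -b the second key includes its leading blanks, so "  b"
# sorts before " a" (a space has a lower character code than "a").
without_b=$(printf '%s\n' "$data" | LC_ALL=C sort -k1,1 -k2,2)

# With -b ahead of the -k options, leading blanks are skipped and the
# second key compares "a" with "b".
with_b=$(printf '%s\n' "$data" | LC_ALL=C sort -b -k1,1 -k2,2)

printf '%s\n' "without -b:" "$without_b" "with -b:" "$with_b"
```

Without -b the x  b line comes first. With -b the lines come out as x a, then x  b. The per-key alternative attaches b to a position inside the -k specification, for example -k2b,2.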
Some people believe that it is irritating to have to specify b all the time (since it is almost always what you want), and complicated to explain to users why they have to do it. As a consequence, it may appear easier to set up the locale definitions to ignore whitespace, which will certainly cause leading whitespace to be ignored. The problem with that "solution" is that it produces results which are at least as confusing as the results caused by having sort include the spaces between fields in the field definition, but which are rather more difficult to fix because there is no simple way to modify a locale's collation order.
Add spaces: