Sorting two files that have the same column gives

Posted 2019-02-27 19:48

I am sorry for the title but I didn't know how to explain this:

I am trying to sort two files because I want to merge them. They look like this:

test1.txt

rs1010735   224915429
rs1010805   38189142
rs10108     114516330
rs1010863   185432942
rs1010891   110712154
rs1010910   61212213
rs1011124   7533164

and

test2.txt

rs1010735 C
rs1010805 T
rs1010863 T
rs1010891 T
rs10108  C
rs1010910 A
rs1011124 A

I ran sort -k1 test1.txt and sort -k1 test2.txt and got this:

test1_sort.txt

rs1010735   224915429
rs1010805   38189142
rs10108 114516330
rs1010863   185432942
rs1010891   110712154
rs1010910   61212213
rs1011124   7533164

and

test2_sort.txt

rs1010735   C
rs1010805   T
rs1010863   T
rs1010891   T
rs10108     C
rs1010910   A
rs1011124   A

Why is the sort order different if both first columns have the same values?

I also tried sort -n -s -k1,1 but got the same result.

2 Answers
看我几分像从前
#2 · 2019-02-27 20:35

There are two issues here.

Locale-aware sorting

At base, the problem here is that you are sorting according to your "locale", which is presumably en_US.UTF-8 (or some other Unicode locale). In theory, a locale-aware sort will produce an ordering which is what would be expected according to the normal sorting rules for that location, while a non-locale-aware sort will sort according to the "arbitrary" character codes for each character.

In a locale-aware sort, for example, it would be common for a word starting with a capital letter to come just before (or just after) the same word starting with a lower-case letter, whereas a non-locale-aware sort will put all the words starting with a capital letter before any word starting with a lower-case letter. Also, in an English-speaking locale, you would probably find words starting with an ä intermingled with words starting with a, whereas in a Swedish locale, you'd find them after words starting with z because in Swedish, ä is the 28th letter (it comes after å and before ö, in case you're interested).

For all that to work, the locale descriptions on your machine need to actually describe the sorting order expected in each locale, and the default locale in particular should match what you would expect. As can be seen from this example, that is sometimes not the case. Indeed, it sometimes produces bizarrely unexpected results.

What is happening in your example is that the locale description for your locale says that whitespace does not participate in sortation. It also indicates that digits come before letters. Now, consider a subset of your data (with both files combined):

rs10108     114516330
rs1010805   38189142
rs1010863   185432942
rs10108     C
rs1010805   T
rs1010863   T

If we eliminate the whitespace altogether, that would be:

rs10108114516330
rs101080538189142
rs1010863185432942
rs10108C
rs1010805T
rs1010863T

And if we then sort that according to normal alphabetic rules, with digits first, we get:

rs101080538189142
rs1010805T
rs10108114516330
rs1010863185432942
rs1010863T
rs10108C

Or, putting the whitespace back:

rs1010805   38189142
rs1010805   T
rs10108     114516330
rs1010863   185432942
rs1010863   T
rs10108     C

Those are the rules sort is following, and the result is that the two lines whose first field is rs10108 do not get sorted together. Counter-intuitive, ¿no?
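The reasoning above can be reproduced without relying on any particular locale being installed. This is only a rough emulation, not what the locale machinery really does: it deletes the spaces with tr and then sorts by byte value in the C locale, where digits happen to come before letters. The variable names sample and squeezed_sorted are made up for the illustration.

```shell
# Combined sample from both files, as in the listing above.
sample='rs10108     114516330
rs1010805   38189142
rs1010863   185432942
rs10108     C
rs1010805   T
rs1010863   T'

# Emulate "whitespace does not participate": delete the spaces, then sort by
# byte value. In ASCII the digits (0x30-0x39) come before the letters.
squeezed_sorted=$(printf '%s\n' "$sample" | tr -d ' ' | LC_ALL=C sort)

# The two rs10108 lines end up separated by the rs1010805 and rs1010863 lines.
printf '%s\n' "$squeezed_sorted"
```

The printed order matches the whitespace-free listing above, with rs10108114516330 in third place and rs10108C last.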

Probably the correct solution would be to tell whoever built the locale files for your distribution that the normal rule is "nothing (visible) comes before something", which was the alphabetization rule we were taught in school. In other words, a space (nothing visible) comes before any character. Or you could try to fix the collation files yourself.

But in practical terms, the solution is to tell sort to do a non-locale-aware sort by default. I do that by putting:

export LC_COLLATE=C

in my bash startup files. (C is the special name of the locale corresponding to the programming language "C", in which symbols are sorted by their internal character codes.) You could also just type that every time you want to sort something:

LC_COLLATE=C sort test1.txt
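Here is a minimal sketch of that applied to the two files from the question. It uses LC_ALL=C rather than LC_COLLATE=C only so the result does not depend on the caller's environment (if LC_ALL is already set, it overrides LC_COLLATE). The temporary directory and the names workdir, ids1 and ids2 are just for the illustration.

```shell
workdir=$(mktemp -d)

# Recreate the two input files from the question.
printf '%s\n' \
  'rs1010735   224915429' 'rs1010805   38189142' 'rs10108     114516330' \
  'rs1010863   185432942' 'rs1010891   110712154' 'rs1010910   61212213' \
  'rs1011124   7533164' > "$workdir/test1.txt"
printf '%s\n' \
  'rs1010735 C' 'rs1010805 T' 'rs1010863 T' 'rs1010891 T' \
  'rs10108  C' 'rs1010910 A' 'rs1011124 A' > "$workdir/test2.txt"

# Sort both files by byte value instead of by the locale's collation rules.
LC_ALL=C sort "$workdir/test1.txt" > "$workdir/test1_sort.txt"
LC_ALL=C sort "$workdir/test2.txt" > "$workdir/test2_sort.txt"

# Compare the order of the first column in the two sorted files.
ids1=$(awk '{print $1}' "$workdir/test1_sort.txt")
ids2=$(awk '{print $1}' "$workdir/test2_sort.txt")
if [ "$ids1" = "$ids2" ]; then
  echo "first columns are in the same order"
fi
```

In both outputs rs10108 now comes second, right after rs1010735, because a space has a lower character code than any digit.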

The meaning of the -k argument

The -k argument to sort has the basic syntax:

-kstart[,end]

where the positions start (and optionally end) define a range of text to use as a sort key. If end is not specified, the range continues to the end of the line.

The simplest form of a position is just a field number, such as 1, meaning "the first field". But -k1 does nothing, because it means, precisely, "use the text from the first field to the end of the line", which is essentially the same as saying "use the entire line as a sort key", which is the default. So anytime you see -k1 you should know that it is not doing what is expected.
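A quick way to convince yourself of this is to compare the two outputs directly. The sketch below uses a three-line sample and the C locale so that it is reproducible; the variable names are illustrative only.

```shell
lines='rs10108     C
rs1010805   T
rs1010735   C'

# Sort once with the default key (the whole line) and once with -k1.
whole_line=$(printf '%s\n' "$lines" | LC_ALL=C sort)
key_from_1=$(printf '%s\n' "$lines" | LC_ALL=C sort -k1)

if [ "$whole_line" = "$key_from_1" ]; then
  echo "-k1 and the default key give identical output"
fi
```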

Explicitly specifying the end would be more precise: -k1,1 means the sort key is the text from the (start of) the first field to the (end of) the first field, or in other words, the first field. That would be better, but it wouldn't say how to order two lines which had the same first field. The standard sort utility is not "stable" by default: lines whose keys compare equal are not kept in input order but are ordered by a last-resort comparison of the entire line, which may not be the order you want. It would generally be better to add more secondary sort fields:

sort -k1,1 -k2,2 

which means, effectively, "sort by the first field, but if the first fields are equal, then compare the second fields."

Fields are split at whitespace (even if whitespace is ignored for sortation), so the above is different from sort -k1,2 in that it is guaranteed to put lines with the same value in the first field in consecutive positions.
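To make the handling of equal first fields concrete, here is a small sketch with made-up three-line input. It assumes a sort that supports -s (GNU sort does, and the question already uses it) and runs in the C locale so the result is reproducible.

```shell
input='b 0
a 2
a 1'

# Stable sort on the first field only: the two "a" lines keep their input order.
stable=$(printf '%s\n' "$input" | LC_ALL=C sort -s -k1,1)

# First field, then second field: the two "a" lines are ordered by field 2.
two_keys=$(printf '%s\n' "$input" | LC_ALL=C sort -k1,1 -k2,2)

printf 'stable:\n%s\n' "$stable"
printf 'two keys:\n%s\n' "$two_keys"
```

The stable run prints a 2 before a 1, while the two-key run prints a 1 before a 2. In both cases the lines starting with a stay together.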


Appendix: Why locales ignore whitespace in sorting

Unfortunately, sort -k1,1 -k2,2 also might not do what you want, particularly if you do it in the "C" locale, because of the historic definition of sort fields used by sort. Unless an explicit delimiter is specified with the -t option, sort fields start with each whitespace character which follows a non-whitespace character. Consequently, all fields except the first field start with whitespace. That's fine if they all start with the same whitespace, but often fields have been lined up by explicitly adding the right number of space characters. And that almost always produces incorrect sorting on fields other than the first field.

Since that is not generally what is wanted, sort provides a way of suppressing this annoying behaviour: the b sort-key flag. Unlike the other sort-key flags, which can simply go at the end of the -k specification, b applies only to the position (start or end) it is attached to, so to ignore leading whitespace in a sort key it has to follow the start position. Also, you can specify -b as a command-line option before any -k option to specify that all sort keys should be treated as having the b flag. That would suggest that the correct invocation of sort would be:

sort -k1,1 -k2b,2

or

sort -b -k1,1 -k2,2
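The effect of the leading blanks is easiest to see when sorting on the second field alone. The sketch below uses two lines from the question's data (with a made-up second value A) in the C locale, and attaches b to the start position of the key. Treat it as an illustration of the mechanism rather than a recommended command line.

```shell
rows='rs10108     C
rs1010805   A'

# Without b, field 2 is "     C" vs "   A": the extra padding sorts first.
no_b=$(printf '%s\n' "$rows" | LC_ALL=C sort -k2,2)

# With b on the start position, the keys are "C" vs "A".
with_b=$(printf '%s\n' "$rows" | LC_ALL=C sort -k2b,2)

# The global -b option applies the same treatment to every key.
global_b=$(printf '%s\n' "$rows" | LC_ALL=C sort -b -k2,2)

printf 'no b:\n%s\n' "$no_b"
printf 'with b:\n%s\n' "$with_b"
```

Without b the C line sorts first, purely because it has more padding. With b (or the global -b) the A line sorts first.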

Some people believe that it is irritating to have to specify b all the time (since it is almost always what you want), and complicated to explain to users why they have to do it. As a consequence, it may appear easier to set up the locale definitions to ignore whitespace, which will certainly cause leading whitespace to be ignored. The problem with that "solution" is that it produces results which are at least as confusing as the results caused by having sort include the spaces between fields in the field definition, but which are rather more difficult to fix because there is no simple way to modify a locale's collation order.

Juvenile、少年°
#3 · 2019-02-27 20:53

Add spaces:

$ sort -k 1,1 /tmp/2
rs1010735 C
rs10108  C
rs1010805 T
rs1010863 T
rs1010891 T
rs1010910 A
rs1011124 A
$ sort -k 1,1 /tmp/1
rs1010735   224915429
rs10108     114516330
rs1010805   38189142
rs1010863   185432942
rs1010891   110712154
rs1010910   61212213
rs1011124   7533164