Locale based sort in Javascript, sort accented let

2019-04-04 05:29发布

问题:

In Finnish we sort W after V (as in English) but because W is not a native Finnish letter, it is considered as a variant of V, which is sorted as it were equal to V, but in cases where the only difference between two words is that V is W, then V-version is sorted first. An example enlights the proper order:

Vatanen, Watanen, Virtanen

In Finnish V and W are collated as A and Á. Á is sorted like A, but in cases where it is the only difference, the unaccented one comes first. The same rule is for all other accented letters, but the Å, Ä and Ö are collated separately after Z.

Question: What would be the best algorithm to sort this like variants in a predefined way? ( eg. [Watanen, Vatanen, Virtanen] to [Vatanen, Watanen, Virtanen] )?

Addition: The question is relevant to extend to cover also other variants in the way they are defined in http://cldr.unicode.org/index/cldr-spec/collation-guidelines, because the technique would in a great probability be the same and the answers to this question benefit the widest possible audience and sort algorithms can be made compatible with collation rules defined in Unicode CLDR. The Unicode CLDR defines three levels of differences between letters: primary level (base letters), secondary level (accented letters) and tertiary level (character case).

I have thought some kind of array preparation like in numerical sort where we could pad all numbers with zeros to make them comparaple as strings. An example: Array [file1000.jpg, file3.jpg, file22.jpg] can be prepared to make it comparable as strings by padding with zeros this way: [file1000.jpg, file0003.jpg, file0022.jpg]. Because of preparation of array, we can sort it very fast using native Array.sort().

The target language is Javascript, which lacks support for collation based sorts, so the custom sort function have to be made self. The algorithm is preferred, but if you have also code it's worth +1.

回答1:

The usual approach to this problem is to use a list of mappings (normally the list doesn't need to be longer than three, and in your case two would do.) Each mapping maps a character onto a sequence point. [Note 3] So in your example,

 primary:      secondary:
  A -> 0         A -> 0
  Á -> 0         Á -> 1
  B -> 1         (irrelevant)
  C -> 2
  D -> 3
  E -> 4
...
  T -> 20
  U -> 21
  V -> 22        V -> 0
  W -> 22        W -> 1
  X -> 23
...

The comparison algorithm essentially first translates each character in the words to using mapping1, and if they are not the same, uses that as the comparison. If they are the same, then it repeats using mapping2 (and so on).

Not all languages are so simple, so there are a bunch of variations (for example, you might reverse the strings in pass 2).

Note that you can achieve the same effect by making comparison keys consisting of the concatenation of the translations. If you do a lot of comparisons, caching this key can be a win. In that case, you would use a special value in the mappings other than the first mapping for "irrelevant". All irrelevant codes can be omitted, which often shortens the comparison key quite a bit.

For example, in your example (but just upper case because it would be tedious to type the whole mapping sequence), we would translate VATANEN using the first mapping to [22, 1, 20, 1, 15, 5, 15] and with the second mapping to [0, 0, --, 0, --, --, --]. WATANEN would be [22, 1, 20, 1, 15, 5, 15] (exactly the same) with the first mapping, and [1, 0, --, 0, --, --, --] with the second. So dropping the --'s [Note 1], the comparison keys would be:

VATANEN:  [22, 1, 20, 1, 15, 5, 15, 0, 0, 0]
VÁTANEN:  [22, 1, 20, 1, 15, 5, 15, 0, 1, 0] (if there were such a place)
WATANEN:  [22, 1, 20, 1, 15, 5, 15, 1, 0, 0]
VIRTANEN: [22, 9, ...]

This can be extended to more than two translation tables.

For example, many applications want to do something like case-insensitive sorting, but where character case makes a difference if there are no other differences (in English, this usually means putting words with upper-case letter before words which are all lower-case, but both options are plausible.)

So in the Finnish case, we could add a third translation table, where all upper case letters are translated to 0, all lower case letters are translated to 1, and all other characters are not translated. Some concatenated translations:

           -------primary---------  --2ary-  ------tertiary-----
VÁTANEN:  [22, 1, 20, 1, 15, 5, 15, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
Vátenen:  [22, 1, 20, 1, 15, 5, 15, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1]
WATANEN:  [22, 1, 20, 1, 15, 5, 15, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

It is not at all obvious that this order is "correct". Nor, indeed, is it obvious what "correct" means for most languages, aside from those which have official linguistic authorities. [Note 2] So the above should just be considered as an example of multi-level encoding, not a definitive guide to alphabetic order. In this case, the tertiary code consists of just a single bit, although there may still be languages (like Dutch) in which there are three cases for a few letters.

The above scheme does not contemplate digraphs and trigraphs, although they are reasonably easy to add, with care. (In the primary ordering, and possibly in secondary and tertiary as well, the digraph needs to have a single code for both characters.) Spanish, contrary to popular belief amongst non-Spanish programmers, has not been an example of this since 1994, almost twenty years ago, when the RAE decreed that 'ch' is alphabetized between 'cg' and 'ci', and not between 'c' and 'd' as it formerly was. I believe that some Dutch speakers still expect to find 'ij' and 'y' together, and Hungarians may still respect the complicated collection of digraphs and trigraphs that comprise their alphabet, but on the whole complicated mechanical schemes for alphabetical ordering are dying out, to be replaced with simple latin ordering, possibly complemented with secondary ordering for diacritics (French, Spanish, apparently Finnish, German dictionaries but not telephone books) or primary ordering of diacritics (Spanish ñ, Danish/Norwegian/Swedish vowels, Turkish).


[Note 1]: It's not necessary to insert "irrelevant" secondary codes because the secondary part of the encoding is only consulted for pairs of words where the primary parts are identical. Since any letter considered irrelevant for the secondary coding would be so considered in all words in a primary equivalence class, it can just be omitted from the secondary coding. Similarly, it is legitimate to reuse codes in different primary equivalence classes, as we do above: [v, w] are [0, 1] and so are [a, á]. Obviously, there is no possibility of ambiguity. Consequently, secondary encodings can be quite short, both in sequence length and bit length.

[Note 2]: English has no such body; the Spanish one is the Real Academia Española, but I couldn't find precise collation rules in any of the RAE's publications on my bookshelf, aside from the laconic observation that accents aren't considered in alphabetical order. However, the RAE's dictionary seems to consistently put unaccented words before any accented word with the same letters, at least in the two cases I could think of -- papa/papá and sabana/sábana.

[Note 3] Of course, we need to keep track of the originals as well, so we have to somehow attach the comparison keys to the strings. As long as no two characters have the same translation in all of the mappings, this can be done with a simple hash table using the comparison key as a key.



回答2:

Since the time you originally asked this question, JavaScript is finally acquiring some decent locale support, including for collation.

Read up on the new EcmaScript 6 / Harmony features Intl and, specifically, Intl.Collator.

The documentation doesn't actually make it very clear that both modern and traditional sort orders are supported for Finnish, but I've tried and they are.

To get a collator for the traditional order you need to pass a "fancy" language code string: fi-u-co-trad. For the "reformed" sort order there is fi-u-co-reformed. This breaks down as:

  • fi - ISO 639 language code for Finnish.
  • u - enables Unicode features / options. (not well documented)
  • co - collation options.
  • trad - traditional sort order. I read about this option for Spanish and only found it works for Finnish as well by testing. (not well documented)
  • reformed - reformed sort order. Seems to be an antonym for 'trad'. If you specify neither trad nor reformed you will get default, which may be trad on some browsers and reformed on others.

Teh codez:

var surnames = ['Watanen', 'Vatanen', 'Virtanen'];

var traColl = new Intl.Collator('fi-u-co-trad');
var refColl = new Intl.Collator('fi-u-co-reformed');
var defColl = new Intl.Collator('fi');

console.log('traditional:', traColl.resolved.requestedLocale + ' -> ' + traColl.resolved.collation, surnames.sort(function (a, b) {
  return traColl.compare(a,b);
}));

console.log('reformed:', refColl.resolved.requestedLocale + ' -> ' + refColl.resolved.collation, surnames.sort(function (a, b) {
  return refColl.compare(a,b);
}));

console.log('default:', defColl.resolved.requestedLocale + ' -> ' + defColl.resolved.collation, surnames.sort(function (a, b) {
  return defColl.compare(a,b);
}));

Outputs:

traditional: fi-u-co-trad -> trad ["Vatanen", "Watanen", "Virtanen"]
reformed: fi-u-co-reformed -> reformed ["Vatanen", "Virtanen", "Watanen"]
default: fi -> default ["Vatanen", "Virtanen", "Watanen"]

Tested in Google Chrome, which, from what I read online, is lagging behind Firefox in this stuff.



回答3:

I had this problem today and came across String.prototype.localeCompare. You can use it with arr.sort() and specify a locale:

var names = ['Andrea', 'Ándrea', 'Àndrea', 'Äiti', 'Özmir', 'åke', 'Zorro', 'Åke'];

// Undesired order:
names.sort()
console.log('default sort', names);

// Desired order:
names.sort((nameA, nameB) => nameA.localeCompare(nameB, 'fi') > 0);
console.log('locale sort', names);

// Or since positive values are truthy, you can omit the `> 0`:
names.sort((nameA, nameB) => nameA.localeCompare(nameB, 'fi'));

// You can also control whether upper or lower case should sort first:
names.sort((nameA, nameB) => nameA.localeCompare(nameB, 'fi', {
  caseFirst: 'upper'
}));
console.log('locale sort with caseFirst option', names);

Looks like the caseFirst option works in Chrome but not in Firefox at the moment.

The MDN page contains more info and available options. localeCompare seems to work fine with sorting Finnish strings, and I suppose it works with many other locales as well.



回答4:

I think this should do it:

var variants = ["AÁÀ", "VW", … ];

// Build a map that links variants with their base letter (for quick access)
var map = {}, chars = "";
for (var i=0; i<variants.length; i++) {
    var variant = variants[i], char = variant.charAt(0);
    for (var j=1; j<variants[i].length; j++)
        map[variant.charAt(j)] = char;
    chars += variant.substr(1);
}
// and a simple regular expression, containing a character class of all variant chars
var regex = new RegExp("["+chars+"]","g");

function sortFinnish(arr) {
    // each word is replaced by an array [literal],
    // containing 0) the word 1) the normalized word
    for (var i=0; i<arr.length; i++)
        arr[i] = [ arr[i], arr[i].replace(regex, function(m) {
            // every variant character is replaced by its base letter
            return map[m];
        }) ];
    // then sort that array with a custom compare function:
    arr.sort(function(a, b) {
        // at first by the normalized words,
        // i.e. variants count the same as their bases
        if (b[1] > a[1]) return -1;
        if (b[1] < a[1]) return 1;
        // else the normalized words are the same
        // - return a comparsion of the actual words
        if (b[0] > a[0]) return -1;
        if (b[0] < a[0]) return 1;
        return 0;
    });
    // after that, replace each of the arrays with the actual word again
    for (var i=0; i<arr.length; i++)
        arr[i] = arr[i][0];
    return arr;
}

@performance: Okay, I found a way to use .sort() without a custom compare function, which could be even a bit faster [in some environments] according to http://jsperf.com/sort-mapped-strings. The trick is to use objects with a .toString() method that returns the string-to-be-sorted-by:

    function SortString(actualvalue) {
        this.val = actualvalue;
        // the value-to-sort-by is a normalized version, concatenated by a space
        // with the actual value so that the actual value is compared when the
        // normalized ones are the same.
        // ! does not work with values that contain spaces !
        // we'd need to use something like \u0001 instead
        var sortval = actualvalue.replace(regex, function(m) {
            // every variant character is replaced by its base letter
            return map[m];
        }) + " " + actualvalue;
        this.toString = function(){ return sortval; };
    }
    for (var i=0; i<arr.length; i++)
        arr[i] = new SortString(arr[i]);
    // when comparing, the sortstring is used as the object's representation:
    arr.sort();
    // after that, replace the objects with the actual words again:
    for (var i=0; i<arr.length; i++)
        arr[i] = arr[i].val;


回答5:

The standard way to multilevel sorting is expressed here: http://unicode.org/reports/tr10/

The principle is to use locale based tailorings to override the order expressed in the Default Unicode Collation Element Table (DUCET, http://www.unicode.org/Public/UCA/latest/allkeys.txt ). DUCET is the base sort order of characters. Tailorings are needed if locale has some special rules that cannot or are not performance-wise to implement in DUCET.

The directory core/common/collation/ in http://unicode.org/Public/cldr/22/core.zip has 87 xml-files. An example of Finnish tailorings in fi.xml:

<collation type="standard" >
    <rules>
    <!-- SNIP -->
        <reset>V</reset>
            <s>w</s>
            <t>W</t>
    <!-- SNIP -->
    </rules>
</collation>

Locale-based sort is fairly tedious to implement, and in order to be fast enough, it is required to use of resources at the lowest possible (machine) level, so I'm coming round to the view that it is best to wait until Javascript supports it natively.

But may be waiting never ends: Javascript lacks still support for numerical sort, which should be very easy to implement at machine level.

If someone coder has enough motivation to implement locale based sort in Javascript, I'll be glad to see results and support it at my side.