In Finnish we sort W
after V
(as in English) but because W
is not a native Finnish letter, it is considered as a variant of V
, which is sorted as it were equal to V
, but in cases where the only difference between two words is that V
is W
, then V
-version is sorted first. An example enlights the proper order:
Vatanen, Watanen, Virtanen
In Finnish V
and W
are collated as A
and Á
. Á
is sorted like A
, but in cases where it is the only difference, the unaccented one comes first. The same rule is for all other accented letters, but the Å
, Ä
and Ö
are collated separately after Z.
Question: What would be the best algorithm to sort this like variants in a predefined way? ( eg. [Watanen, Vatanen, Virtanen]
to [Vatanen, Watanen, Virtanen]
)?
Addition: The question is relevant to extend to cover also other variants in the way they are defined in http://cldr.unicode.org/index/cldr-spec/collation-guidelines, because the technique would in a great probability be the same and the answers to this question benefit the widest possible audience and sort algorithms can be made compatible with collation rules defined in Unicode CLDR. The Unicode CLDR defines three levels of differences between letters: primary level (base letters), secondary level (accented letters) and tertiary level (character case).
I have thought some kind of array preparation like in numerical sort where we could pad all numbers with zeros to make them comparaple as strings. An example: Array [file1000.jpg, file3.jpg, file22.jpg]
can be prepared to make it comparable as strings by padding with zeros this way: [file1000.jpg, file0003.jpg, file0022.jpg]
. Because of preparation of array, we can sort it very fast using native Array.sort().
The target language is Javascript, which lacks support for collation based sorts, so the custom sort function have to be made self. The algorithm is preferred, but if you have also code it's worth +1.
Since the time you originally asked this question, JavaScript is finally acquiring some decent locale support, including for collation.
Read up on the new EcmaScript 6 / Harmony features
Intl
and, specifically,Intl.Collator
.The documentation doesn't actually make it very clear that both modern and traditional sort orders are supported for Finnish, but I've tried and they are.
To get a collator for the traditional order you need to pass a "fancy" language code string:
fi-u-co-trad
. For the "reformed" sort order there isfi-u-co-reformed
. This breaks down as:fi
- ISO 639 language code for Finnish.u
- enables Unicode features / options. (not well documented)co
- collation options.trad
- traditional sort order. I read about this option for Spanish and only found it works for Finnish as well by testing. (not well documented)reformed
- reformed sort order. Seems to be an antonym for 'trad'. If you specify neithertrad
norreformed
you will getdefault
, which may betrad
on some browsers andreformed
on others.Teh codez:
Outputs:
Tested in Google Chrome, which, from what I read online, is lagging behind Firefox in this stuff.
The standard way to multilevel sorting is expressed here: http://unicode.org/reports/tr10/
The principle is to use locale based tailorings to override the order expressed in the Default Unicode Collation Element Table (DUCET, http://www.unicode.org/Public/UCA/latest/allkeys.txt ). DUCET is the base sort order of characters. Tailorings are needed if locale has some special rules that cannot or are not performance-wise to implement in DUCET.
The directory core/common/collation/ in http://unicode.org/Public/cldr/22/core.zip has 87 xml-files. An example of Finnish tailorings in fi.xml:
Locale-based sort is fairly tedious to implement, and in order to be fast enough, it is required to use of resources at the lowest possible (machine) level, so I'm coming round to the view that it is best to wait until Javascript supports it natively.
But may be waiting never ends: Javascript lacks still support for numerical sort, which should be very easy to implement at machine level.
If someone coder has enough motivation to implement locale based sort in Javascript, I'll be glad to see results and support it at my side.
I think this should do it:
@performance: Okay, I found a way to use
.sort()
without a custom compare function, which could be even a bit faster [in some environments] according to http://jsperf.com/sort-mapped-strings. The trick is to use objects with a.toString()
method that returns the string-to-be-sorted-by:The usual approach to this problem is to use a list of mappings (normally the list doesn't need to be longer than three, and in your case two would do.) Each mapping maps a character onto a sequence point. [Note 3] So in your example,
The comparison algorithm essentially first translates each character in the words to using mapping1, and if they are not the same, uses that as the comparison. If they are the same, then it repeats using mapping2 (and so on).
Not all languages are so simple, so there are a bunch of variations (for example, you might reverse the strings in pass 2).
Note that you can achieve the same effect by making comparison keys consisting of the concatenation of the translations. If you do a lot of comparisons, caching this key can be a win. In that case, you would use a special value in the mappings other than the first mapping for "irrelevant". All irrelevant codes can be omitted, which often shortens the comparison key quite a bit.
For example, in your example (but just upper case because it would be tedious to type the whole mapping sequence), we would translate VATANEN using the first mapping to
[22, 1, 20, 1, 15, 5, 15]
and with the second mapping to[0, 0, --, 0, --, --, --]
. WATANEN would be[22, 1, 20, 1, 15, 5, 15]
(exactly the same) with the first mapping, and[1, 0, --, 0, --, --, --]
with the second. So dropping the--
's [Note 1], the comparison keys would be:This can be extended to more than two translation tables.
For example, many applications want to do something like case-insensitive sorting, but where character case makes a difference if there are no other differences (in English, this usually means putting words with upper-case letter before words which are all lower-case, but both options are plausible.)
So in the Finnish case, we could add a third translation table, where all upper case letters are translated to 0, all lower case letters are translated to 1, and all other characters are not translated. Some concatenated translations:
It is not at all obvious that this order is "correct". Nor, indeed, is it obvious what "correct" means for most languages, aside from those which have official linguistic authorities. [Note 2] So the above should just be considered as an example of multi-level encoding, not a definitive guide to alphabetic order. In this case, the tertiary code consists of just a single bit, although there may still be languages (like Dutch) in which there are three cases for a few letters.
The above scheme does not contemplate digraphs and trigraphs, although they are reasonably easy to add, with care. (In the primary ordering, and possibly in secondary and tertiary as well, the digraph needs to have a single code for both characters.) Spanish, contrary to popular belief amongst non-Spanish programmers, has not been an example of this since 1994, almost twenty years ago, when the RAE decreed that 'ch' is alphabetized between 'cg' and 'ci', and not between 'c' and 'd' as it formerly was. I believe that some Dutch speakers still expect to find 'ij' and 'y' together, and Hungarians may still respect the complicated collection of digraphs and trigraphs that comprise their alphabet, but on the whole complicated mechanical schemes for alphabetical ordering are dying out, to be replaced with simple latin ordering, possibly complemented with secondary ordering for diacritics (French, Spanish, apparently Finnish, German dictionaries but not telephone books) or primary ordering of diacritics (Spanish ñ, Danish/Norwegian/Swedish vowels, Turkish).
[Note 1]: It's not necessary to insert "irrelevant" secondary codes because the secondary part of the encoding is only consulted for pairs of words where the primary parts are identical. Since any letter considered irrelevant for the secondary coding would be so considered in all words in a primary equivalence class, it can just be omitted from the secondary coding. Similarly, it is legitimate to reuse codes in different primary equivalence classes, as we do above: [v, w] are [0, 1] and so are [a, á]. Obviously, there is no possibility of ambiguity. Consequently, secondary encodings can be quite short, both in sequence length and bit length.
[Note 2]: English has no such body; the Spanish one is the Real Academia Española, but I couldn't find precise collation rules in any of the RAE's publications on my bookshelf, aside from the laconic observation that accents aren't considered in alphabetical order. However, the RAE's dictionary seems to consistently put unaccented words before any accented word with the same letters, at least in the two cases I could think of -- papa/papá and sabana/sábana.
[Note 3] Of course, we need to keep track of the originals as well, so we have to somehow attach the comparison keys to the strings. As long as no two characters have the same translation in all of the mappings, this can be done with a simple hash table using the comparison key as a key.
I had this problem today and came across
String.prototype.localeCompare
. You can use it witharr.sort()
and specify a locale:Looks like the
caseFirst
option works in Chrome but not in Firefox at the moment.The MDN page contains more info and available options.
localeCompare
seems to work fine with sorting Finnish strings, and I suppose it works with many other locales as well.