For a poor man's implementation of near-collation-correct sorting on the client side I need a JavaScript function that does efficient single character replacement in a string.
Here is what I mean (note that this applies to German text, other languages sort differently):
native sorting gets it wrong: a b c o u z ä ö ü collation-correct would be: a ä b c o ö u ü z
Basically, I need all occurrences of "ä" of a given string replaced with "a" (and so on). This way the result of native sorting would be very close to what a user would expect (or what a database would return).
Other languages have facilities to do just that: Python supplies str.translate()
, in Perl there is tr/…/…/
, XPath has a function translate()
, ColdFusion has ReplaceList()
. But what about JavaScript?
Here is what I have right now.
// s would be a rather short string (something like
// 200 characters at max, most of the time much less)
function makeSortString(s) {
var translate = {
"ä": "a", "ö": "o", "ü": "u",
"Ä": "A", "Ö": "O", "Ü": "U" // probably more to come
};
var translate_re = /[öäüÖÄÜ]/g;
return ( s.replace(translate_re, function(match) {
return translate[match];
}) );
}
For starters, I don't like the fact that the regex is rebuilt every time I call the function. I guess a closure can help in this regard, but I don't seem to get the hang of it for some reason.
Can someone think of something more efficient?
Answers below fall in two categories:
- String replacement functions of varying degrees of completeness and efficiency (what I was originally asking about)
- A late mention of
String#localeCompare
, which is widely supported among JS engines and could solve this category of problem much more elegantly.
Basing on existing answers and some suggestions, I've created this one:
It uses real chars instead of unicode list and works well.
You can use it like
You can easily convert this function to not be string prototype. However, as I'm fan of using string prototype in such cases, you'll have to do it yourself.
Simply should be normalized chain and run a replacement codes:
See normalize
Then you can use this function:
Based on the solution by Jason Bunting, here is what I use now.
The whole thing is for the jQuery tablesorter plug-in: For (nearly correct) sorting of non-English tables with tablesorter plugin it is necessary to make use of a custom
textExtraction
function.This one:
'dd.mm.yyyy'
) to a recognized format ('yyyy-mm-dd'
)Be careful to save the JavaScript file in UTF-8 encoding or it won't work.
You can use it like this:
A simple and easy way:
So do this:
Output:
A direct port to javascript of Kierons solution: https://github.com/rwarasaurus/nano/blob/master/system/helpers.php#L61-73:
And a slightly modified version, using a char-map instead of two arrays:
To compare these two methods I made a simple benchmark: http://jsperf.com/replace-foreign-characters
If you're looking specifically for a way to convert accented characters to non-accented characters, rather than a way to sort accented characters, with a little finagling, the String.localeCompare function can be manipulated to find the basic latin characters that match the extended ones. For example, you might want to produce a human friendly url slug from a page title. If so, you can do something like this:
This should perform quite well, but if further optimization were needed, a binary search could be used with
localeCompare
as the comparator to locate the base character. Note that case is preserved, and options allow for either preserving, replacing, or removing characters that aren't alphabetical, or do not have matching latin characters they can be replaced with. This implementation is faster and more flexible, and should work with new characters as they are added. The disadvantage is that compound characters like 'ꝡ' have to be handled specifically, if they need to be supported.