I am building a search feature and plan to use JavaScript autocomplete with it. I am from Finland, so I have to deal with Finnish special characters such as ä, ö and å.
When the user types text into the search input field, I try to match the text against the data.
Here is a simple example that does not work correctly if the user types, for example, "ää". The same happens with "äl".
var title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";
// Does not work
var searchterm = "äl";
// Does not work
//var searchterm = "ää";
// Works
//var searchterm = "wi";
if (new RegExp("\\b" + searchterm, "gi").test(title)) {
    $("#result").html("Match: (" + searchterm + "): " + title);
} else {
    $("#result").html("nothing found with term: " + searchterm);
}
So how can I get the ä, ö and å characters to work with JavaScript regex?
I think I should use Unicode escapes, but how should I do that? The codes for those characters are: [\u00C4,\u00E4,\u00C5,\u00E5,\u00D6,\u00F6]
=> ÄäÅåÖö
I would recommend using XRegExp when you have to work with a specific set of Unicode characters. The author of this library has mapped all kinds of regional character sets, which makes working with different languages easier.
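XRegExp's own API is not shown here. As a loosely related sketch of the same idea (classifying characters by Unicode category instead of by the ASCII-only `\w`), engines that support ES2018 Unicode property escapes and lookbehind can do something similar natively. Note that this swaps in built-in syntax rather than XRegExp; `startsWord` and `escapeRegExp` are hypothetical helper names, and treating letters, digits and underscore as "word characters" is an assumption.

```javascript
// Escape regex metacharacters so the search term is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// True when `term` occurs at the start of a word in `text`.
// Assumption: a word character is any Unicode letter, digit, or underscore.
function startsWord(text, term) {
  // The negative lookbehind rejects matches preceded by a word character.
  const pattern = new RegExp("(?<![\\p{L}\\p{N}_])" + escapeRegExp(term), "iu");
  return pattern.test(text);
}

const title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";
console.log(startsWord(title, "äl"));
console.log(startsWord(title, "mä"));
```

Older browsers without lookbehind or `\p{...}` support will reject this pattern with a syntax error, which is where a library such as XRegExp may still be the more practical choice.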
My idea is to search with codes representing the Finnish letters:
new RegExp("\\b"+asciiOnly(searchterm), "gi").test(asciiOnly(title))
My original idea was to use plain encodeURI, but the % sign seemed to interfere with the regexp. http://jsfiddle.net/7TsxB/5/
I wrote a crude function that uses encodeURI to encode every character with a code over 128, removes the % signs and adds 'QQ' at the beginning. It is not the best marker, but I couldn't get a non-alphanumeric one to work.
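The function itself lives in the linked fiddle, so the following is only a guess at what such an `asciiOnly` helper could look like, based on the description above. The `matchesWordStart` wrapper and the lowercasing step are my own assumptions, not part of the original answer.

```javascript
// Hypothetical reconstruction of asciiOnly; the code in the fiddle may differ.
// Every non-ASCII character is replaced by "QQ" plus its percent-encoding
// with the % signs removed, e.g. "ä" becomes "QQC3A4".
function asciiOnly(text) {
  return Array.from(text, function (ch) {
    if (ch.codePointAt(0) < 128) {
      return ch;
    }
    return "QQ" + encodeURI(ch).replace(/%/g, "");
  }).join("");
}

// Assumption: lowercase both sides first, because "Ä" and "ä" encode to
// different byte sequences and the "i" flag cannot pair them after encoding.
function matchesWordStart(text, term) {
  const pattern = new RegExp("\\b" + asciiOnly(term.toLowerCase()), "i");
  return pattern.test(asciiOnly(text.toLowerCase()));
}

const title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";
console.log(asciiOnly("älkää"));
console.log(matchesWordStart(title, "äl"));
```

Because the encoded text consists only of ASCII letters and digits, `\b` behaves as expected again. ASCII regex metacharacters in the search term are still passed through unescaped in this sketch, and `encodeURI` throws on unpaired surrogates.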
I noticed something really weird with \b when using Unicode: it appears that the meanings of \b and \B are reversed, but only when used with non-ASCII characters? There might be something deeper going on here, but I'm not sure what it is.
In any case, it seems that the word boundary is the issue, not the Unicode characters themselves. Perhaps you should just replace \b with (^|[\s\\/_&-]), as that seems to work correctly. The hyphen is placed last in the class so that it is a literal hyphen rather than part of a range. (Make your list of symbols more comprehensive than mine, though.)
There appears to be a problem with regex and the word boundary \b matching the beginning of a string whose first character is outside the ASCII range.
Instead of using \b, try using (?:^|\\s)
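A minimal sketch of that substitution applied to the example from the question. The helper names `escapeRegExp` and `startsAfterWhitespace` are hypothetical, and escaping the search term is an addition of mine so that characters such as "." in user input are matched literally.

```javascript
// Escape regex metacharacters so the search term is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// True when `term` appears at the start of the string or right after whitespace.
function startsAfterWhitespace(text, term) {
  return new RegExp("(?:^|\\s)" + escapeRegExp(term), "i").test(text);
}

const title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";
["äl", "ää", "wi", "mä"].forEach(function (term) {
  console.log(term, startsAfterWhitespace(title, term));
});
```

Unlike \b, this group consumes the whitespace character, so a match found with `exec` or `replace` includes the leading space.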
Breakdown of (?:^|\\s):
(?:   parentheses () form a capturing group in regex. Parentheses that start with a question mark and colon, ?:, form a non-capturing group. They just group the terms together.
^     the caret symbol matches the beginning of a string.
|     the bar is the "or" operator.
\s    matches whitespace (it appears as \\s in the string because we have to escape the backslash).
)     closes the group.
So instead of using \b, which matches word boundaries and doesn't work for Unicode characters, we use a non-capturing group which matches the beginning of a string OR whitespace.
The \b assertion in JavaScript RegEx is really only useful with plain ASCII text. \b is a shortcut for the boundary between the \w and \W sets, or between \w and the beginning or end of the string. These character sets only take ASCII "word" characters into account: \w is equal to [a-zA-Z0-9_] and \W is the negation of that class. This makes these RegEx character classes largely useless for dealing with any real language.
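The ASCII-only definition of \w also accounts for the "reversed" behaviour of \b and \B described earlier. The following small check illustrates it; the `checks` object and its property names are only for illustration.

```javascript
const checks = {
  // \w covers [a-zA-Z0-9_] only, so "ä" counts as a non-word character.
  asciiLetterIsWord: /\w/.test("a"),
  aumlIsWord: /\w/.test("ä"),
  // Start of string followed by "ä": non-word on both sides, so no \b here.
  boundaryBeforeLeadingAuml: /\bäl/.test("älkää"),
  // The same position therefore satisfies \B instead.
  nonBoundaryBeforeLeadingAuml: /\Bäl/.test("älkää"),
  // Inside "tämä" there is a \b between "ä" (non-word) and "m" (word).
  boundaryInsideWord: /\bmä/.test("tämä"),
};

console.log(checks);
```

So \b is not broken as such: it reports boundaries between ASCII word characters and everything else, which places them in the middle of Finnish words and removes them from the start of words that begin with ä, ö or å.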
\s should work for what you want to do, provided that search terms are only delimited by whitespace.
I have had a similar problem, but I had to replace an array of terms. None of the solutions I found worked when two terms were next to each other in the text (because their boundaries overlapped). So I had to use a slightly modified approach:
See the code in a fiddle: http://jsfiddle.net/antoninslejska/bvbLpdos/1/
The regular expression is inspired by: http://breakthebit.org/post/3446894238/word-boundaries-in-javascripts-regular
I can't say that I find the solution elegant...
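The fiddle's code is not reproduced here. As a rough sketch of one way to avoid overlapping boundaries, the leading delimiter can be captured and put back while the trailing delimiter is only checked with a lookahead, so it remains available as the leading delimiter of the next term. The names `WORD_CHARS`, `escapeRegExp` and `wrapTerms`, the `<b>` markup and the set of letters are all assumptions for illustration.

```javascript
// Assumption: characters that count as part of a word, ASCII plus Finnish letters.
const WORD_CHARS = "a-zA-Z0-9_äöåÄÖÅ";

// Escape regex metacharacters so each term is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Wrap every whole-word occurrence of any term in <b>...</b>.
function wrapTerms(text, terms) {
  const alternatives = terms.map(escapeRegExp).join("|");
  const pattern = new RegExp(
    "(^|[^" + WORD_CHARS + "])" +     // leading delimiter, captured and re-emitted
    "(" + alternatives + ")" +        // the term itself
    "(?=[^" + WORD_CHARS + "]|$)",    // trailing delimiter, checked but not consumed
    "gi"
  );
  return text.replace(pattern, "$1<b>$2</b>");
}

console.log(wrapTerms("älä älkää", ["älä", "älkää"]));
```

If the trailing delimiter were consumed as well, the single space between two adjacent terms would belong to the first match and the second term would have no leading delimiter left to match against.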