I have a search feature that obtains data from an InnoDB table (utf8_spanish_ci collation) and displays it in an HTML document (UTF-8 charset). The user types a substring and obtains a list of matches where the first occurrence of the substring is highlighted, e.g.:
Matches for "AL":
Álava
<strong>Al</strong>bacete
<strong>Al</strong>mería
Ciudad Re<strong>al</strong>
Málaga
As you can see from the example, the search ignores both case and accent differences (MySQL takes care of that automatically). However, the code I'm using to highlight matches only ignores case, not accents:
<?php
private static function highlightTerm($full_string, $match){
    $start = mb_stripos($full_string, $match);
    $length = mb_strlen($match);

    return
        htmlspecialchars( mb_substr($full_string, 0, $start)) .
        '<strong>' . htmlspecialchars( mb_substr($full_string, $start, $length) ) . '</strong>' .
        htmlspecialchars( mb_substr($full_string, $start+$length) );
}
?>
Is there a sensible way to fix this that doesn't involve hard-coding all possible variations?
Update: System specs are PHP/5.2.14 and MySQL/5.1.48
Use the PEAR package I18N_UnicodeNormalizer-1.0.0:
→ AEIOUaeiou
You could use the Normalizer to normalize the string to Normalization Form KD (NFKD), in which characters are decomposed, so Á (U+00C1) becomes the combination of the letter A (U+0041) and the combining acute accent (U+0301). Then you modify the search pattern to match those optional marks.
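A minimal sketch of these two steps, which is not the original listing. It assumes the intl extension's Normalizer class is available (bundled since PHP 5.3), and buildAccentInsensitivePattern() is a hypothetical helper name chosen for illustration.

```php
<?php
// Sketch only: assumes the intl extension's Normalizer class is available.
// buildAccentInsensitivePattern() is a hypothetical helper name.

function buildAccentInsensitivePattern($match)
{
    // Decompose the search term, e.g. "Á" becomes "A" followed by U+0301.
    $decomposed = Normalizer::normalize($match, Normalizer::FORM_KD);

    // Drop the combining marks so only the base characters remain.
    $base = preg_replace('/\p{Mn}+/u', '', $decomposed);

    // Split into UTF-8 characters and allow optional marks after each one.
    $chars = preg_split('//u', $base, -1, PREG_SPLIT_NO_EMPTY);
    $parts = array();
    foreach ($chars as $char) {
        $parts[] = preg_quote($char, '/') . '\p{Mn}*';
    }

    // "i" ignores case, "u" treats pattern and subject as UTF-8.
    return '/' . implode('', $parts) . '/iu';
}
```

The resulting pattern is meant to be applied to a subject that has also been normalized to NFKD. Note that stripping every mark makes n match ñ as well, which utf8_spanish_ci treats as a distinct letter, so the highlighting may be slightly more lenient than the query itself.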
The replacement is then done with preg_replace, and these steps are combined into one method.
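The following is one possible way to put the steps together, not the answer's original code. It assumes the intl extension's Normalizer class is available, and highlightTermIgnoringAccents() is a hypothetical name. It also swaps preg_replace for preg_match with PREG_OFFSET_CAPTURE, so that the text before, inside, and after the match can each be passed through htmlspecialchars separately.

```php
<?php
// Sketch only: assumes the intl extension (Normalizer class).
// highlightTermIgnoringAccents() is a hypothetical name, not the original method.

function highlightTermIgnoringAccents($full_string, $match)
{
    // Decompose both strings so accents become separate combining marks.
    $subject = Normalizer::normalize($full_string, Normalizer::FORM_KD);
    $term    = Normalizer::normalize($match, Normalizer::FORM_KD);

    // Strip the marks from the term, then allow optional marks after each character.
    $term    = preg_replace('/\p{Mn}+/u', '', $term);
    $chars   = preg_split('//u', $term, -1, PREG_SPLIT_NO_EMPTY);
    $pattern = '';
    foreach ($chars as $char) {
        $pattern .= preg_quote($char, '/') . '\p{Mn}*';
    }

    // Empty term or no match: return the escaped string without highlighting.
    if ($pattern === ''
        || !preg_match('/' . $pattern . '/iu', $subject, $m, PREG_OFFSET_CAPTURE)) {
        return htmlspecialchars($full_string, ENT_QUOTES, 'UTF-8');
    }

    // preg_match() reports byte offsets, so byte-based substr()/strlen() fit here.
    $start  = $m[0][1];
    $length = strlen($m[0][0]);
    $pieces = array(
        substr($subject, 0, $start),
        substr($subject, $start, $length),
        substr($subject, $start + $length),
    );

    // Recompose each piece for display and escape it on its own.
    foreach ($pieces as $i => $piece) {
        $composed   = Normalizer::normalize((string) $piece, Normalizer::FORM_C);
        $pieces[$i] = htmlspecialchars($composed, ENT_QUOTES, 'UTF-8');
    }

    return $pieces[0] . '<strong>' . $pieces[1] . '</strong>' . $pieces[2];
}
```

With the search term "AL", this sketch should highlight the first two letters of "Álava" and the "ál" in "Málaga", while still handling "Albacete" and "Ciudad Real" as before. Because NFKD is a compatibility decomposition, recomposing with FORM_C may not reproduce unusual characters such as ligatures exactly.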