I am loosing my mind on this for two days now...
I would like to use slovenian letters in sphinx search, all english ones + č ž š (and just in case ć)
I was looking all over the net to get the proper chars but I found squat...
so I decided to make my own step by step...
this is my index
index classifieds
{
source = classifieds_src
path = c:\Sphinx\data\classifieds
docinfo = extern
min_infix_len = 2
infix_fields = title,keywords,summary,text
expand_keywords = 1
enable_star = 1
charset_type = utf-8
charset_table = 0..9, a..z, _, A..Z->a..z,-, U+002C, \
U+010C->U+010D, U+0106->U+0107, U+0160->U+0161, U+017D->U+017E, \
U+010D->c,U+0107->c, U+0161->s, U+017E->z, \
U+010D, U+0107, U+0161, U+017E
}
where I mapped big Č, Ć Š Ž to their lowercase counterparts, and added mapping from č into c, ć into c, š into s and ž into z and finally I added those four chars to the table....
these are my classifieds titles:
t1: HP USB optična miška za prenosnik RH304 t2: Čiška PCplus MO-U033+F2 (optična, brezžična, PS/2) t3: Miška Logitech optična Nano M235 siva
db encoding: utf8_general_ci table's encoding: utf8_general_ci title field encoding: utf8_general_ci
test case:
$testcase = array(
"miška",
"mi*ka",
"Čiška",
"čiška",
"miska",
"usb prenosnik",
"prenosnik miska",
"miška usb"
);
//api settings:
$this->sphinx->SetArrayResult(true);
$this->sphinx->setLimits(0, 100);
$this->sphinx->setMatchMode(SPH_MATCH_EXTENDED2);
$this->sphinx->SetSortMode(SPH_SORT_RELEVANCE, '@weight DESC');
$this->sphinx->SetRankingMode(SPH_RANK_PROXIMITY_BM25);
$this->sphinx->SetFieldWeights(array("title"=>100, "keywords"=>80, "summary"=>60,
"text"=>20, "slug"=>10));
and finally the test results:
Keyword (total / total_found) words
miška (0/0)
Array
(
[*miška*] => Array
(
[docs] => 0
[hits] => 0
)
[miška] => Array
(
[docs] => 0
[hits] => 0
)
)
mi*ka (0/0)
Array
(
[*mi*] => Array
(
[docs] => 3
[hits] => 4
)
[mi] => Array
(
[docs] => 1
[hits] => 1
)
[*2aka*] => Array
(
[docs] => 0
[hits] => 0
)
[2aka] => Array
(
[docs] => 0
[hits] => 0
)
)
Čiška (0/0)
Array
(
[*čiška*] => Array
(
[docs] => 0
[hits] => 0
)
[čiška] => Array
(
[docs] => 0
[hits] => 0
)
)
čiška (0/0)
Array
(
[*čiška*] => Array
(
[docs] => 0
[hits] => 0
)
[čiška] => Array
(
[docs] => 0
[hits] => 0
)
)
miska (0/0)
Array
(
[*miska*] => Array
(
[docs] => 0
[hits] => 0
)
[miska] => Array
(
[docs] => 0
[hits] => 0
)
)
usb prenosnik (1/1)
Array
(
[*usb*] => Array
(
[docs] => 1
[hits] => 1
)
[usb] => Array
(
[docs] => 1
[hits] => 1
)
[*prenosnik*] => Array
(
[docs] => 1
[hits] => 1
)
[prenosnik] => Array
(
[docs] => 1
[hits] => 1
)
)
prenosnik miska (0/0)
Array
(
[*prenosnik*] => Array
(
[docs] => 1
[hits] => 1
)
[prenosnik] => Array
(
[docs] => 1
[hits] => 1
)
[*miska*] => Array
(
[docs] => 0
[hits] => 0
)
[miska] => Array
(
[docs] => 0
[hits] => 0
)
)
miška usb (0/0)
Array
(
[*miška*] => Array
(
[docs] => 0
[hits] => 0
)
[miška] => Array
(
[docs] => 0
[hits] => 0
)
[*usb*] => Array
(
[docs] => 1
[hits] => 1
)
[usb] => Array
(
[docs] => 1
[hits] => 1
)
)
You can clearly see I get positive results only in the queries without slovenian special chars
Please, please help I am loosing my mind on this
Here you could find many different charset tables, maybe one of the would be suitable for your language: http://sphinxsearch.com/wiki/doku.php?id=charset_tables
The problem is that sphinx indexer wasn't using utf8 character set by default. Fixed by adding the following to sphinx.conf
References
There are some parameters you should setup. They are ngram_len and ngram_chars.
First, I got the same problem with you. Then I read the book Introduction to search with sphinx , I found the default charset and ngram_chars is for english and russian only. So you have to add those by yourself.
In my case, I made these changs, for Chinese: