I have a document collection with the following structure:
uid, name
It has a text index:
db.Collection.createIndex({name: "text"})
It contains the following data:
1, iphone
2, iphóne
3, iphonë
4, iphónë
When I run a text search for iphone,
I get only two records, which is unexpected.
Actual output
--------------
1, iphone
2, iphóne
If I search for iphonë:
db.Collection.find( { $text: { $search: "iphonë"} } );
I get
---------------------
3, iphonë
4, iphónë
But I actually expect the following output from either of these queries:
db.Collection.find( { $text: { $search: "iphone"} } );
db.Collection.find( { $text: { $search: "iphónë"} } );
Expected output
------------------
1, iphone
2, iphóne
3, iphonë
4, iphónë
Am I missing something here?
How can I get the expected output above when searching for iphone or iphónë?
Since MongoDB 3.2, text indexes are diacritic insensitive. From the documentation:
With version 3, text index is diacritic insensitive. That is, the
index does not distinguish between characters that contain diacritical
marks and their non-marked counterpart, such as é, ê, and e. More
specifically, the text index strips the characters categorized as
diacritics in Unicode 8.0 Character Database Prop List.
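To make the quoted behaviour concrete, here is a minimal sketch of diacritic folding in plain JavaScript. It only approximates the idea: it is not MongoDB's implementation, stripDiacritics is a hypothetical helper name, and it removes all Unicode combining marks rather than exactly the characters in the Unicode 8.0 diacritic list.

```javascript
// Hypothetical helper: approximates diacritic folding.
// This is NOT how MongoDB implements it internally.
function stripDiacritics(text) {
  // NFD splits each accented letter into a base letter plus combining marks;
  // \p{M} matches those combining marks so they can be removed.
  return text.normalize("NFD").replace(/\p{M}/gu, "");
}

const names = ["iphone", "iphóne", "iphonë", "iphónë"];
const folded = names.map(stripDiacritics);
console.log(folded);
```

With folding like this, all four names reduce to the same key, which is why a diacritic-insensitive index is expected to return every record for either search term.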
So the following queries should work:
db.Collection.find( { $text: { $search: "iphone"} } );
db.Collection.find( { name: { $regex: "iphone"} } );
However, there appears to be a bug with the diaeresis ( ¨ ), even though it is categorized as a diacritic in the Unicode 8.0 list (JIRA issue: SERVER-29918).
Solution
Since MongoDB 3.4, you can use a collation, which lets you perform this kind of query.
For example, to get your expected output, run the following query:
db.Collection.find({name: "iphone"}).collation({locale: "en", strength: 1})
This outputs:
{ "_id" : 1, "name" : "iphone" }
{ "_id" : 2, "name" : "iphône" }
{ "_id" : 3, "name" : "iphonë" }
{ "_id" : 4, "name" : "iphônë" }
In the collation, strength is the level of comparison to perform:
- 1 : base characters only (diacritics and case are ignored)
- 2 : diacritic sensitive (case is still ignored)
- 3 : case sensitive + diacritic sensitive
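The three strength levels can be tried outside the database with Intl.Collator, which is also backed by ICU in Node.js. This is only a sketch under an assumption: that the sensitivity values "base", "accent" and "variant" correspond roughly to strength 1, 2 and 3. findEqual and products are hypothetical names used for illustration.

```javascript
const products = ["iphone", "iphóne", "iphonë", "iphónë"];

// Hypothetical helper: returns the values that compare as equal to `term`
// under the given Intl.Collator sensitivity.
function findEqual(values, term, sensitivity) {
  const collator = new Intl.Collator("en", { sensitivity });
  return values.filter((value) => collator.compare(value, term) === 0);
}

// "base" (roughly strength 1): diacritics and case are ignored.
const baseMatches = findEqual(products, "iphone", "base");
// "accent" (roughly strength 2): diacritics matter, case does not.
const accentMatches = findEqual(products, "IPHONE", "accent");
// "variant" (roughly strength 3): both diacritics and case matter.
const variantMatches = findEqual(products, "IPHONE", "variant");

console.log(baseMatches, accentMatches, variantMatches);
```

At the "base" level all four names compare equal to iphone, which mirrors the result of the strength: 1 query above. The stricter levels narrow the match to the unaccented name, and then to nothing once case is taken into account.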