Match similar/variations of words / suffixes in My

2019-04-11 12:59发布

How can I match variations on words in MySQL, for example a search for accountancy should match accountant, accountants, accounting etc. I'm on shared hosting so can't add any functions to MySQL such as levenshtein.

I want something similar to how Google matches 'accounting course' and 'accountancy courses' when searching for 'accountant courses'. Example.

My server language is php, if it's only possible to implement it there and not in SQL.

The current statement is as follows.

SELECT 
  pjs.title,
  MATCH (pjs.title) AGAINST ('accountancy' IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION) AS rel1,
  MATCH (pjs.description) AGAINST ('accountancy' IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION) AS rel2,
  MATCH (
    pjs.benefits,
    pjs.experienceRequirements,
    pjs.incentives,
    pjs.qualifications,
    pjs.responsibilities,
    pjs.skills
  ) AGAINST ('accountancy' IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION) AS rel3
FROM
  pxl_jobsearch AS pjs 
ORDER BY (rel1 * 5) + (rel2 * 1.5) + (rel3) DESC;

标签: mysql sql match
5条回答
Luminary・发光体
2楼-- · 2019-04-11 13:03

Search engines do this by implementing a text processing technique known as stemming. There are many libraries that implement this for you, personally I have user the Snowball stemmer, it does a good enough job.

I'm not familiar enough with the full text search capabilities of MySql, but you could try to apply the stemming algorithms to the search terms. For your example of "accountancy courses" the Snowball stemmer returns "account cours".

查看更多
啃猪蹄的小仙女
3楼-- · 2019-04-11 13:10

The SOUNDEX() function of MySQL comes pretty close. Read more about it here.

Example:

create table test(id int auto_increment, a varchar(255), primary key(id));
insert into test(a) values
('accountancy'),
('accountant'),
('accountants'),
('accounting'),
('accountingc'),
('becounting'),
('asdf'),
('this is a test');

select 
test.*,
SOUNDEX(a),
SOUNDEX('accountancy')
FROM
test 
WHERE a SOUNDS LIKE 'accountancy';

If this doesn't solve it, the levenshtein algorithm is the way to go. Talk to your database administrator that he allows you to create functions. If he does, here's the solution (I didn't write the function, credit goes to anonymous):

DELIMITER //
CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
-- max strlen=255
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF s1_char = SUBSTRING(s2, j, 1) THEN
SET cost = 0; ELSE SET cost = 1;
END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN
SET c = c_temp;
END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
RETURN c;
END//

Again test data:

create table leven(id int auto_increment, a varchar(255), primary key(id));
insert into leven(a) values
('accountancy'),
('accountant'),
('accountants'),
('accounting'),
('accountingc'),
('becounting'),
('asdf'),
('this is a test')
;


select
leven.*,
levenshtein(leven.a, 'accountancy')
from
leven
where levenshtein(leven.a, 'accountancy') <= 3 /*or any value you like*/
查看更多
Bombasti
4楼-- · 2019-04-11 13:11

You may use SQL SOUNDEX(), which is pretty useful for your needs: it searches for words that sound the same, not for grammatically close ones. You may use two approaches, which are quite similar.

  • Recognize common suffixes and replace them with a % in the LIKE clause: using your example, accountancy would become account%.
  • Write a "dictionary" that contains all the variations you need, and use it into PHP (stristr() function): so, accountancy would produce a clause similar to WHERE value='accountancy' or value='accountant' or value='accountancies'.
查看更多
放荡不羁爱自由
5楼-- · 2019-04-11 13:17

I don't know much about MATCH, when I want to select a column with variations I do the following

SELECT pjs.title
FROM pxl_jobsearch AS pjs
WHERE pjs.title LIKE 'account%'

I work mostly in SQL Server but do some MySQL. I imagine that this works in MySQL as well.

查看更多
太酷不给撩
6楼-- · 2019-04-11 13:22

MySQL isn't very good at full text search and you'd probably want to use other engines. My favorite one is Sphinx (http://sphinxsearch.com/) but there are others as well. Most of these support stemming out of the box.

If you have large tables and are going to use stemming the performance of MySQL will probably be very bad.

If you can't use Sphinx, take a look at this php script http://tartarus.org/~martin/PorterStemmer/php.txt

With this you can use stemming, and the search on the stemmed words.

查看更多
登录 后发表回答