We run a C2C website and discourage the sale of branded products on it. We have built a database of brand words such as Nike and D&G, and an algorithm that scans product information for these words and disables any product that contains them.
Our current algorithm removes all whitespace and special characters from the provided text and then checks whether the result contains any word from the database. The following cases must be caught, and the algorithm catches them efficiently:
- i am nike world
- i have n ikee shoes
- i have nikeeshoes
- i sell i-phone casings
- i sell iphone-casings
- you can have iphone
Now the problem is that it also catches the following:
- rapiD Garment factory (for D&G)
- rosNIK Electronics (for Nike)
What can be done to prevent such false matches while still catching the true cases efficiently?
EDIT
Here's the code for those of you who understand code better:
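// $orignal_txt holds the user-supplied product text.
// Strip HTML tags first, then HTML entities such as &amp;
$orignal_txt = preg_replace('/&.{0,}?;/', '', strip_tags($orignal_txt));
// Remove everything except letters, digits and underscores
$orignal_txt_nospace = preg_replace('/\W/', '', $orignal_txt);

$ipr_banned_keywords = array();
$qry_kws = array("nike", "iphone", "d&g");
foreach ($qry_kws as $rs_kw)
{
    // Normalise the keyword the same way as the text ("d&g" becomes "dg")
    $no_space_db_kw = preg_replace('/\W/', '', $rs_kw);
    // Case-insensitive substring test. The text has no special characters
    // left at this point, so only the stripped keyword needs checking.
    if (stristr($orignal_txt_nospace, $no_space_db_kw) !== false)
    {
        $ipr_banned_keywords[] = strtolower($rs_kw);
    }
}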
Here's just an idea.
Why don't you do the matching first and, if a product hits the "branded" filter, put it in a review queue for you to accept or decline, with the matches highlighted for easy discovery?
Humans will be able to spot whether a brand is used almost immediately and accurately. You could even turn this into machine learning, who knows :)
Having said that, this is not a regex problem and can't be solved by nifty expressions; the system needs to be trained, to remember hits (increasing confidence) and to learn from misses.
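To make the review-queue idea a bit more concrete, here is a minimal sketch. The function name `build_review_item`, the `[[ ]]` highlight markers and the `pending_review` status are all my own assumptions, not anything from the question. It keeps the question's loose matching (any non-alphanumeric characters may sit between the letters of a brand word), but runs it on the original text so the match positions can be shown to a reviewer.

```php
<?php
// Hypothetical helper: returns a review-queue entry with every candidate
// brand match wrapped in [[ ]], or null when nothing matched.
function build_review_item(string $text, array $keywords): ?array
{
    $alternatives = [];
    foreach ($keywords as $kw) {
        // Keep only the letters and digits of the keyword ("d&g" => "dg").
        $letters = preg_replace('/[^A-Za-z0-9]/', '', $kw);
        if ($letters === '') {
            continue;
        }
        // Allow any run of non-alphanumeric characters between the letters,
        // so "n ike" and "i-phone" are still flagged.
        $alternatives[] = implode('[^A-Za-z0-9]*', str_split($letters));
    }
    if (!$alternatives) {
        return null;
    }

    $pattern = '/' . implode('|', $alternatives) . '/i';
    $count = 0;
    $highlighted = preg_replace_callback(
        $pattern,
        function ($m) {
            return '[[' . $m[0] . ']]';
        },
        $text,
        -1,
        $count
    );
    if ($count === 0) {
        return null;
    }

    return [
        'status'      => 'pending_review', // assumed queue state
        'highlighted' => $highlighted,
        'match_count' => $count,
    ];
}
```

With the keywords from the question, "rapiD Garment factory" comes back as "rapi[[D G]]arment factory", which a human can decline at a glance, while "i have n ikee shoes" comes back as "i have [[n ike]]e shoes".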
Simple: do the brand match before you remove the spaces and special characters. Then it won't match these weird edge cases.
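A rough sketch of what that could look like, run against the samples from the question. The function name `find_brands_unstripped` is hypothetical, and I'm assuming a plain case-insensitive substring test is what is meant here.

```php
<?php
// Hypothetical helper: case-insensitive substring match on the ORIGINAL
// text, i.e. before any spaces or special characters are removed.
function find_brands_unstripped(string $text, array $keywords): array
{
    $hits = [];
    foreach ($keywords as $kw) {
        if (stripos($text, $kw) !== false) {
            $hits[] = strtolower($kw);
        }
    }
    return $hits;
}

$keywords = ["nike", "iphone", "d&g"];
$samples = [
    "rapiD Garment factory",  // no hit: the false positive is gone
    "rosNIK Electronics",     // no hit: the false positive is gone
    "i am nike world",        // hit: nike
    "i have nikeeshoes",      // hit: nike
    "i sell iphone-casings",  // hit: iphone
    "i have n ikee shoes",    // no hit: this true case is now missed
    "i sell i-phone casings", // no hit: this true case is now missed
];
foreach ($samples as $s) {
    echo $s, " => ", implode(", ", find_brands_unstripped($s, $keywords)), "\n";
}
```

So the two false matches disappear, but the price is that brand words split up by spaces or hyphens slip through again.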
You already know this, but it's worth saying explicitly: Your current algorithm is completely inadequate for the task. It can't deal with even simple cases, let alone cases where people deliberately try to get past your filter. There's only one thing you can do with your current filter, and that's throw it away completely -- it can't be made to work.
While we aren't discussing an obscenity filter here, it is pretty much the same sort of concept, so you would be well advised to read up on some of the worst mistakes made by obscenity filters.
http://www.telegraph.co.uk/news/newstopics/howaboutthat/2667634/The-Clbuttic-Mistake-When-obscenity-filters-go-wrong.html
http://en.wikipedia.org/wiki/Scunthorpe_problem
These articles mostly deal with false positives -- i.e. where the filter makes a match on something that it shouldn't and thus blocks a legitimate entry. This sort of thing can be very damaging, as it will upset your customers, and if it happens a lot it will drive people away from your site. The complexities of natural language make it almost inevitable.
You also need to be aware of false-negatives. These are where your filter fails to pick up something that it should pick up. Your problem here is that spammers have a massive arsenal of techniques for getting past filters. Your current filter would be trivial to get past, but even the most advanced filters can be defeated -- check how much spam you get in your inbox for evidence of this. And they change their techniques all the time, so a static algorithm simply isn't going to work in the long term.
A Bayesian filter would seem to be the best solution for you. These are filters that learn as they go. You need to keep an eye on them and train them to recognise what needs to be filtered, so it'll be a bit of work to set up, but I doubt you'll have a workable solution any other way.
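For a feel of what "learns as it goes" means, here is a toy naive Bayes classifier. Treat it as a sketch only: the class name `TinyBayesFilter`, the labels `brand` and `ok`, and the handful of training sentences are all made up for illustration, and a real filter would need far more training data and probably character-level features to cope with things like "n ikee".

```php
<?php
// Toy naive Bayes text classifier with add-one smoothing.
// Class name, labels and training data are illustrative assumptions.
class TinyBayesFilter
{
    private $wordCounts = ['brand' => [], 'ok' => []];
    private $totalWords = ['brand' => 0, 'ok' => 0];
    private $docCounts  = ['brand' => 0, 'ok' => 0];
    private $vocab = [];

    // Lowercase the text and split it into alphanumeric tokens.
    private function tokens(string $text): array
    {
        return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    }

    // Record one reviewed listing under the label a human gave it.
    public function train(string $text, string $label): void
    {
        $this->docCounts[$label]++;
        foreach ($this->tokens($text) as $t) {
            $this->wordCounts[$label][$t] = ($this->wordCounts[$label][$t] ?? 0) + 1;
            $this->totalWords[$label]++;
            $this->vocab[$t] = true;
        }
    }

    // Return the label with the higher log-probability for this text.
    public function classify(string $text): string
    {
        $totalDocs = array_sum($this->docCounts);
        $vocabSize = count($this->vocab);
        $best = 'ok';
        $bestScore = -INF;
        foreach (['ok', 'brand'] as $label) {
            // Smoothed log prior of the label.
            $score = log(($this->docCounts[$label] + 1) / ($totalDocs + 2));
            foreach ($this->tokens($text) as $t) {
                $count = $this->wordCounts[$label][$t] ?? 0;
                // Smoothed log likelihood; the extra +1 covers unseen words.
                $score += log(($count + 1) / ($this->totalWords[$label] + $vocabSize + 1));
            }
            if ($score > $bestScore) {
                $best = $label;
                $bestScore = $score;
            }
        }
        return $best;
    }
}

$filter = new TinyBayesFilter();
foreach (["i am nike world", "i sell iphone casings",
          "original d&g bags", "nike shoes for sale"] as $t) {
    $filter->train($t, 'brand');
}
foreach (["rapid garment factory", "rosnik electronics",
          "cotton garment for sale", "electronics repair shop"] as $t) {
    $filter->train($t, 'ok');
}

echo $filter->classify("nike casings"), "\n";       // brand
echo $filter->classify("rosnik electronics"), "\n"; // ok
```

Every accept or decline from a human reviewer can be fed back in through `train()`, which is the "keep an eye on them and train them" part.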