Detecting mistyped email addresses in JavaScript

Posted 2020-06-20 05:25

Question:

I notice that users sometimes mistype their email address (in a contact-us form), for example typing @yahho.com, @yhoo.com, or @yahoo.co instead of @yahoo.com.

I feel that this can be corrected on the spot with some JavaScript: check the email address for likely mistakes, such as the ones listed above, so that if the user types his_email@yhoo.com, an unobtrusive message can suggest that he probably means @yahoo.com and ask him to double-check that he typed his email correctly.

The question is:
How can I detect, in JavaScript, that a string is very similar to "yahoo" or "yahoo.com"? Or, in general, how can I measure the level of similarity between two strings?

P.S. (this is a side note) In my specific case, the users are not native English speakers, and most of them are nowhere near fluent. The site itself is not in English.

Answer 1:

Here's a dirty implementation that could get you some simple checks using the Levenshtein distance. Credit for the "levenshteinenator" goes to this link. Add whatever popular domains you want to the domains array; the code checks whether the distance between each of them and the host part of the entered email is 1 or 2, which is close enough to assume there's a typo somewhere.

var levenshteinenator = function(a, b) {
    var cost, tmp, i, j;

    // lengths of the two strings
    var m = a.length;
    var n = b.length;

    // make sure a.length >= b.length; note that the full matrix built below
    // still takes O(m*n) space, not O(min(n,m))
    if (m < n) {
        tmp = a; a = b; b = tmp;
        tmp = m; m = n; n = tmp;
    }

    // r[i][j] = distance between the first i chars of a and the first j chars of b
    var r = [];
    r[0] = [];
    for (j = 0; j < n + 1; j++) {
        r[0][j] = j;
    }

    for (i = 1; i < m + 1; i++) {
        r[i] = [];
        r[i][0] = i;
        for (j = 1; j < n + 1; j++) {
            cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
            // cheapest of: deletion, insertion, substitution
            r[i][j] = minimator(r[i-1][j] + 1, r[i][j-1] + 1, r[i-1][j-1] + cost);
        }
    }

    return r[m][n];
};

// return the smallest of the three values passed in
// (<= so that a tie between x and y does not fall through to a larger z)
var minimator = function(x, y, z) {
    if (x <= y && x <= z) return x;
    if (y <= x && y <= z) return y;
    return z;
};

var domains = ['yahoo.com', 'google.com', 'hotmail.com'];
var email = 'whatever@yahoo.om';
// part after the "@", lowercased; empty string if the "@" is missing
var host = (email.split('@')[1] || '').toLowerCase();
var dist;
for (var x = 0; x < domains.length; x++) {
    dist = levenshteinenator(domains[x], host);
    if (dist == 1 || dist == 2) {
        alert('did you mean ' + domains[x] + '?');
    }
}
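The loop above alerts once for every domain within range. If only a single suggestion is wanted, one possible variation is to keep the closest match. The sketch below is self-contained and only illustrative: editDistance, suggestDomain and the maxDistance parameter are hypothetical names that do not appear in the answer, and the distance function keeps just two rows of the table instead of the full matrix.

```javascript
// Hypothetical helper: Levenshtein distance using two rows of the table.
function editDistance(a, b) {
    var prev = [];
    var curr = [];
    var i, j, cost, tmp;

    // Row 0: building the first j characters of b from nothing takes j inserts.
    for (j = 0; j <= b.length; j++) {
        prev[j] = j;
    }

    for (i = 1; i <= a.length; i++) {
        curr[0] = i;
        for (j = 1; j <= b.length; j++) {
            cost = (a.charAt(i - 1) === b.charAt(j - 1)) ? 0 : 1;
            curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
        }
        // The row just filled becomes the previous row for the next pass.
        tmp = prev; prev = curr; curr = tmp;
    }
    return prev[b.length];
}

// Hypothetical helper: returns the closest known domain, or null when the
// domain is missing, already known, or further than maxDistance from all of them.
function suggestDomain(email, domains, maxDistance) {
    var at = email.lastIndexOf('@');
    if (at === -1 || at === email.length - 1) return null;

    var host = email.slice(at + 1).toLowerCase();
    var best = null;
    var bestDist = maxDistance + 1;

    for (var i = 0; i < domains.length; i++) {
        var dist = editDistance(host, domains[i]);
        if (dist === 0) return null;   // exact match, nothing to suggest
        if (dist < bestDist) {
            best = domains[i];
            bestDist = dist;
        }
    }
    return best;
}

var suggestion = suggestDomain('whatever@yahoo.om',
                               ['yahoo.com', 'google.com', 'hotmail.com'], 2);
```

Here suggestion holds 'yahoo.com', which could then be shown in an unobtrusive message instead of an alert.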


Answer 2:

In addition to soundex, you may also want to have a look at algorithms for determining Levenshtein distance.



Answer 3:

Check out SOUNDEX and DIFFERENCE. If you use Ajax, you can have SQL Server check the Soundex value of the words against the "correct" domains and send suggestions back. It is also possible to write your own version of Soundex (it's not that complicated).

SQL Server's SoundEx function on non-Latin character sets?

Data structure for soundex algorithm?

How do you implement a "Did you mean"?
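For the "own version of soundex" idea, a minimal client-side sketch could look like the following. It assumes the common American Soundex rules (first letter kept, three digits, h and w ignored, vowels breaking runs of equal codes); the function name soundex is hypothetical, and SQL Server's SOUNDEX may differ in edge cases.

```javascript
// Hypothetical helper: American Soundex code of a word, e.g. a domain name
// without its TLD. Returns '' when the word contains no ASCII letters.
function soundex(word) {
    var codes = {
        b: '1', f: '1', p: '1', v: '1',
        c: '2', g: '2', j: '2', k: '2', q: '2', s: '2', x: '2', z: '2',
        d: '3', t: '3',
        l: '4',
        m: '5', n: '5',
        r: '6'
    };
    var letters = word.toLowerCase().replace(/[^a-z]/g, '');
    if (letters.length === 0) return '';

    var result = letters.charAt(0).toUpperCase();
    var lastCode = codes[letters.charAt(0)] || '';

    for (var i = 1; i < letters.length && result.length < 4; i++) {
        var ch = letters.charAt(i);
        if (ch === 'h' || ch === 'w') continue;  // h and w do not break a run of equal codes
        var code = codes[ch] || '';              // vowels and y have no code
        if (code !== '' && code !== lastCode) result += code;
        lastCode = code;                         // an uncoded vowel ends the current run
    }
    while (result.length < 4) result += '0';     // pad to one letter plus three digits
    return result;
}

var sameSound = soundex('hotmial') === soundex('hotmail');
```

Note that Soundex is coarse: 'yahoo', 'yhoo' and 'yahho' all map to the same code, but so would any other word that starts with y and contains only vowels, h and w after it, so it is probably best combined with another check.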



Answer 4:

Of course, as a first step, you could strip out the domain name and do a DNS lookup - that should at least tell you if it appears to be legitimate.



Answer 5:

As others said, the Levenshtein distance is a reliable solution.

There is an excellent JavaScript library that does exactly what you want: Mailcheck from Kicksend.

https://github.com/DimitarChristoff/mailcheck

The library:

  • offers suggestions for domains and top-level domains.
  • can be customized (domains, top-level domains, string distance method).
  • can be used with jQuery.
  • is decoupled from jQuery.

This library uses the sift3 string similarity algorithm for speed. It has been reported that Levenshtein distance produces better results (https://github.com/DimitarChristoff/mailcheck).



Answer 6:

It might be possible to use a regex, but personally, it would take me way too long to write one I'd be happy with that could get all the possible permutations without causing too many false positives.

So, here's what I would do:

  • Hard-code a list of all the common typing errors.
  • Use a case-insensitive string comparison to compare the email's domain to each string in the list.
  • If there's a match, display a warning - "Did you mean yahoo.com?"

Yeah, it's not very pretty, but it doesn't seem (at least from your question) like you'll have that many to check, so it should perform just fine. It also doesn't seem (at least to me) like something worth putting a whole lot of time into, so this is an incredibly simple solution that could be done in about 15-30 minutes.
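The three steps above could be sketched roughly like this. The table only contains the misspellings mentioned in the question, and the names KNOWN_TYPOS and checkKnownTypos are made up for illustration.

```javascript
// Hypothetical table: each known misspelling maps to the domain it most
// likely stands for. Keys are lowercase so the lookup is case-insensitive.
var KNOWN_TYPOS = {
    'yahho.com': 'yahoo.com',
    'yhoo.com': 'yahoo.com',
    'yahoo.co': 'yahoo.com'
};

// Returns the suggested domain, or null when the email has no "@" or its
// domain is not in the list of known typos.
function checkKnownTypos(email) {
    var at = email.lastIndexOf('@');
    if (at === -1) return null;

    var host = email.slice(at + 1).toLowerCase();
    return Object.prototype.hasOwnProperty.call(KNOWN_TYPOS, host)
        ? KNOWN_TYPOS[host]
        : null;
}

var hint = checkKnownTypos('his_email@YHOO.com');
var message = hint ? 'Did you mean ' + hint + '?' : '';
```

With this input, message ends up as 'Did you mean yahoo.com?'. New typos are added by extending the table, which keeps false positives under direct control.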