This is the use case I'm trying to solve.
I have a list of spam subscriptions to a service, and they are killing our conversion rate and distorting other usability studies.
The emails inserted look like the following:
rogerep_dyeepvu@hotmail.com
rogeram_ingramameb@hotmail.com
rogerew_jonesewct@hotmail.com
roger[...]_surname[...]@hotmail.com
What would be your suggestions on spotting these entries by using an automated script? It feels a little more complicated than it actually looks.
Help would be very much appreciated!
I don't think you can easily check for this. It's not likely to be a simple string matching problem that you can throw a regular expression at because I would guess that your use of the name 'Roger' was just an example, and that any number of names can appear in that position. You could also run one of the regular expressions supplied by the other posters, parameterising it with every permutation of obvious first name and last name. This will probably take somewhere between "too long" and "forever", and will flag up plenty of false positives.
Another approach, which works with the pattern you posted above, would be to take the last 4 letters of the username and score how plausible they are as real text. Spotting characters that are random as opposed to arranged sensibly (given a specific language) can be done by training a Markov chain on legitimate text, which then lets you calculate the probability of a given 4 letters appearing in that order in that language. For random letters, this probability will typically come out far lower than for a legitimate name (although if there are special characters or digits in there, all bets are off).
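As a rough illustration of the Markov-chain idea, here is a minimal sketch of a letter-bigram model with add-one smoothing. The training list, the function names and the choice of bigrams are all assumptions made for the example; a usable model would need a much larger corpus of real names or text, and a threshold tuned on your own data.

```python
import math
from collections import defaultdict

def train_bigram_model(words):
    """Count letter-to-letter transitions in a list of legitimate words."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in words:
        word = word.lower()
        for a, b in zip(word, word[1:]):
            counts[a][b] += 1
    return counts

def avg_log_prob(model, text, alphabet_size=26):
    """Average log-probability per transition, using add-one smoothing."""
    text = text.lower()
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        row = model.get(a, {})
        row_total = sum(row.values())
        total += math.log((row.get(b, 0) + 1) / (row_total + alphabet_size))
    return total / len(pairs)

def tail_score(email, model, n=4):
    """Score the last n characters of the part before the @ sign."""
    local_part = email.split("@", 1)[0]
    return avg_log_prob(model, local_part[-n:])

# Toy training data (hypothetical); far too small for real use.
TRAINING_NAMES = ["jones", "ingram", "dyer", "johnson",
                  "smith", "roger", "anderson", "jameson"]
model = train_bigram_model(TRAINING_NAMES)

plausible = tail_score("roger_jones@hotmail.com", model)
suspicious = tail_score("rogerew_jonesewct@hotmail.com", model)
print(plausible > suspicious)
```

A lower score means the tail looks less like the training text, so addresses below some cut-off can be flagged for review rather than rejected outright.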
Another way might be to use a Bayesian filter (e.g. something like Reverend in Python, though there are others) trained on the last 4 letters of legitimate email addresses. This would probably spot 95% of the ones which were just random, provided you made the data usable, e.g. by submitting not just the 4 letters but also each of the 2-letter and 3-letter substrings inside them, to capture the context of each letter. I don't think this would work as well as the Markov-style method though.
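The substring trick could be sketched like this. The function name `substring_features` is made up for the example, and it only prepares the tokens; feeding them to whichever Bayesian filter you choose is left out, since that depends on the library.

```python
def substring_features(tail, sizes=(2, 3)):
    """Return the tail itself plus every contiguous substring of the given sizes."""
    tail = tail.lower()
    features = [tail]
    for size in sizes:
        for start in range(len(tail) - size + 1):
            features.append(tail[start:start + size])
    return features

print(substring_features("ewct"))
# ['ewct', 'ew', 'wc', 'ct', 'ewc', 'wct']
```

Each token would then be submitted to the filter as a separate feature, both when training on legitimate addresses and when scoring a new one.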
Whatever check you do, you can cut false positives by only submitting certain email addresses for it (e.g. only those at webmail domains which contain an underscore, with at least 3 characters before the underscore and at least 5 characters after it).
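A pre-filter along those lines might look like the following sketch. The domain list and the name `worth_checking` are assumptions for illustration; adjust both to whatever your signups actually look like.

```python
# Hypothetical list of webmail domains; extend as needed.
WEBMAIL_DOMAINS = {"hotmail.com", "gmail.com", "yahoo.com"}

def worth_checking(email, min_before=3, min_after=5):
    """Return True only for addresses that fit the rough shape described above."""
    local_part, at_sign, domain = email.lower().rpartition("@")
    if not at_sign or domain not in WEBMAIL_DOMAINS:
        return False
    # Split on the first underscore in the part before the @ sign.
    before, underscore, after = local_part.partition("_")
    if not underscore:
        return False
    return len(before) >= min_before and len(after) >= min_after

print(worth_checking("rogerep_dyeepvu@hotmail.com"))  # True
print(worth_checking("bob@hotmail.com"))              # False
```

Only addresses that pass this test would go on to the more expensive (and more error-prone) randomness check.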
But ultimately, you can never know whether it's a spam address or a real one for sure until it gets used for one purpose or the other. So if possible I'd suggest giving up on trying to analyse the content and fix the problem somewhere else. In what way are they killing conversion rate? If you're counting these dummy accounts in some sort of metric, you'd be best off adding a verification stage first and only caring about metrics for accounts that pass verification. Some people really do have addresses like rogerep_dyeepvu@hotmail.com, after all.
I don't think you can do more than flag it as a potential problem, by checking for:
^roger([a-z]{2})_([a-z]+)@hotmail\.com$
using regular expressions, if that's the pattern that the spammer is using repeatedly.
Looks like they're using 2 lower-case alphabetic characters after roger, so I've built that in. Not sure how you'd go about matching what dictionary of surnames they're using, so matching the last part (which appears to be a surname followed by 4 lower-case alphabetic characters) might be hard, though you could perhaps do:
^roger([a-z]{2})_([a-z]{5,})@hotmail\.com$
which assumes that all their surnames have at least one character.
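For what it's worth, here is a small sketch of using the stricter pattern with its capture groups from Python. The function name and the returned keys are invented for the example, and splitting the tail into "surname" and "random" parts is only a guess based on the samples in the question.

```python
import re

# Dot escaped and end anchored so that only hotmail.com itself matches.
SPAM_PATTERN = re.compile(r"^roger([a-z]{2})_([a-z]{5,})@hotmail\.com$")

def flag_candidate(email):
    """Return the captured pieces if the address fits the pattern, else None."""
    match = SPAM_PATTERN.match(email)
    if match is None:
        return None
    prefix_letters, tail = match.groups()
    return {
        "prefix_letters": prefix_letters,
        "surname_guess": tail[:-4],   # assumed: everything before the last 4 letters
        "random_guess": tail[-4:],    # assumed: the 4 trailing letters
    }

print(flag_candidate("rogerew_jonesewct@hotmail.com"))
# {'prefix_letters': 'ew', 'surname_guess': 'jones', 'random_guess': 'ewct'}
```

Treat a hit as a flag for review, not as proof; a real person could still own an address of this shape.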
Sounds like a job for regular expressions:
import re
if re.match(r"^roger[a-z]+_[a-z]+@hotmail\.com$", email_address):
    print("might be your spammer")
(If you've never used regular expressions, here's a quick rundown of what that means: ^ matches the beginning of the string and $ matches the end, so we're requiring that everything between those symbols is a pattern describing the entire string. [a-z] matches any lower-case letter, and + means "one or more times", so [a-z]+ matches one or more lower-case letters. Putting it all together, our regex matches if the string can be described as "the beginning of the string, followed by the letters roger, followed by one or more lower-case letters, followed by an underscore, followed by one or more lower-case letters, followed by @hotmail.com, followed by the end of the string." If the regex matches, the email address fits the pattern you described in your question.)
Of course, if he catches on and changes up his pattern (for example, by switching first names), this method will fail and you'll have to fall back on more traditional spam-prevention techniques like employing a CAPTCHA.