How can I detect (with regular expressions or heuristics) a web site link in a string of text such as a comment?
The purpose is to prevent spam. HTML is stripped so I need to detect invitations to copy-and-paste. It should not be economical for a spammer to post links because most users could not successfully get to the page. I would like suggestions, references, or discussion on best-practices.
Some objectives:
- The low-hanging fruit like well-formed URLs (
http://some-fqdn/some/valid/path.ext
) - URLs but without the
http://
prefix (i.e. a valid FQDN + valid HTTP path) - Any other funny business
Of course, I am blocking spam, but the same process could be used to auto-link text.
Ideas
Here are some things I'm thinking.
- The content is native-language prose so I can be trigger-happy in detection
- Should I strip out all whitespace first, to catch "
www .example.com
"? Would common users know to remove the space themselves, or do any browsers "do-what-I-mean" and strip it for you? - Maybe multiple passes is a better strategy, with scans for:
- Well-formed URLs
- All non-whitespace followed by '.' followed by any valid TLD
- Anything else?
Related Questions
I've read these and they are now documented here, so you can just references the regexes in those questions if you want.
- replace URL with HTML Links javascript
- What is the best regular expression to check if a string is a valid URL
- Getting parts of a URL (Regex)
Update and Summary
Wow, I there are some very good heuristics listed in here! For me, the best bang-for-the-buck is a synthesis of the following:
- @Jon Bright's technique of detecting TLDs (a good defensive chokepoint)
- For those suspicious strings, replace the dot with a dot-looking character as per @capar
- A good dot-looking character is @Sharkey's subscripted · (i.e. "·"). · is also a word boundary so it's harder to casually copy & paste.
That should make a spammer's CPM low enough for my needs; the "flag as inappropriate" user feedback should catch anything else. Other solutions listed are also very useful:
- Strip out all dotted-quads (@Sharkey's comment to his own answer)
- @Sporkmonger's requirement for client-side Javascript which inserts a required hidden field into the form.
- Pinging the URL server-side to establish whether it is a web site. (Perhaps I could run the HTML through SpamAssassin or another Bayesian filter as per @Nathan..)
- Looking at Chrome's source for its smart address bar to see what clever tricks Google uses
- Calling out to OWASP AntiSAMY or other web services for spam/malware detection.
I'm not sure if detecting URLs with a regex is the right way to solve this problem. Usually you will miss some sort of obscure edge case that spammers will be able to exploit if they are motivated enough.
If your goal is just to filter spam out of comments then you might want to think about Bayesian filtering. It has proved to be very accurate in flagging email as spam, it might be able to do the same for you as well, depending on the volume of text you need to filter.
Check these articles:
Ping the possible URL
If you don't mind a little server side computation, what about something like this?
Here:
extracted_urls takes in a comment and uses a conservative regex to pull out possible candidates
pingable actually uses a system call to determine whether the hostname exists on the web. You could have a simple wrapper parse the output of ping.
[ramanujan:~/base]$ping -c 1 www.google.com
PING www.l.google.com (74.125.19.147): 56 data bytes 64 bytes from 74.125.19.147: icmp_seq=0 ttl=246 time=18.317 ms
--- www.l.google.com ping statistics --- 1 packets transmitted, 1 packets received, 0% packet loss round-trip min/avg/max/stddev = 18.317/18.317/18.317/0.000 ms
[ramanujan:~/base]$ping -c 1 fooalksdflajkd.com
ping: cannot resolve fooalksdflajkd.com: Unknown host
The downside is that if the host gives a 404, you won't detect it, but this is a pretty good first cut -- the ultimate way to verify that an address is a website is to try to navigate to it. You could also try wget'ing that URL, but that's more heavyweight.
Consider incorporating the OWASP AntiSAMY API...
I know this doesn't help with auto-link text but what if you search and replaced all full-stop periods with a character that looks like the same thing, such as the unicode character for hebrew point hiriq (U+05B4)?
The following paragraph is an example:
This might workִ The period looks a bit odd but it is still readableִ The benefit of course is that anyone copying and pasting wwwִgoogleִcom won't get too farִ :)
Having made several attempts at writing this exact piece of code, I can say unequivocally, you won't be able to do this with absolute reliability, and you certainly won't be able to detect all of the URI forms allowed by the RFC. Fortunately, since you have a very limited set of URLs you're interested in, you can use any of the techniques above.
However, the other thing I can say with a great deal of certainty, is that if you really want to beat spammers, the best way to do that is to use JavaScript. Send a chunk of JavaScript that performs some calculation, and repeat the calculation on the server side. The JavaScript should copy the result of the calculation to a hidden field so that when the comment is submitted, the result of the calculation is submitted as well. Verify on the server side that the calculation is correct. The only way around this technique is for spammers to manually enter comments or for them to start running a JavaScript engine just for you. I used this technique to reduce the spam on my site from 100+/day to one or two per year. Now the only spam I ever get is entered by humans manually. It's weird to get on-topic spam.