Detecting a (naughty or nice) URL or link in a text string

Posted 2019-01-30 08:18

How can I detect (with regular expressions or heuristics) a web site link in a string of text such as a comment?

The purpose is to prevent spam. HTML is stripped so I need to detect invitations to copy-and-paste. It should not be economical for a spammer to post links because most users could not successfully get to the page. I would like suggestions, references, or discussion on best-practices.

Some objectives:

  • The low-hanging fruit like well-formed URLs (http://some-fqdn/some/valid/path.ext)
  • URLs but without the http:// prefix (i.e. a valid FQDN + valid HTTP path)
  • Any other funny business

Of course, I am blocking spam, but the same process could be used to auto-link text.

Ideas

Here are some things I'm thinking.

  • The content is native-language prose so I can be trigger-happy in detection
  • Should I strip out all whitespace first, to catch "www .example.com"? Would common users know to remove the space themselves, or do any browsers "do-what-I-mean" and strip it for you?
  • Maybe multiple passes are a better strategy (a rough sketch of this follows the list), with scans for:
    • Well-formed URLs
    • All non-whitespace followed by '.' followed by any valid TLD
    • Anything else?
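
For illustration, here is a Python sketch of that multi-pass idea. The TLD list, the patterns, and the name find_suspicious_urls are all made up for this example; a real filter would load the full IANA TLD list and tighten the patterns.

import re

# Illustrative, hand-picked TLD list; a real filter would load the full IANA list.
TLDS = ("com", "net", "org", "info", "biz", "ru", "cn", "io")

# Pass 1: well-formed URLs with an explicit scheme.
SCHEME_URL = re.compile(r"https?://\S+", re.IGNORECASE)

# Pass 2: bare hostnames -- non-whitespace, a dot, then a known TLD at a word boundary.
BARE_HOST = re.compile(r"\b[\w-]+(?:\.[\w-]+)*\.(?:%s)\b" % "|".join(TLDS), re.IGNORECASE)

def find_suspicious_urls(text):
    """Return anything that looks like a link, with or without a scheme."""
    hits = SCHEME_URL.findall(text)
    remainder = SCHEME_URL.sub(" ", text)  # avoid re-matching hosts inside full URLs
    hits += BARE_HOST.findall(remainder)
    return hits

print(find_suspicious_urls("Visit www.example.com or http://evil.ru/path now"))
# ['http://evil.ru/path', 'www.example.com']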

Related Questions

I've read these and they are now documented here, so you can just reference the regexes in those questions if you want.

Update and Summary

Wow, there are some very good heuristics listed here! For me, the best bang for the buck is a synthesis of the following:

  1. @Jon Bright's technique of detecting TLDs (a good defensive chokepoint)
  2. For those suspicious strings, replace the dot with a dot-looking character as per @capar
  3. A good dot-looking character is @Sharkey's subscripted · (i.e. "·"). · is also a word boundary so it's harder to casually copy & paste.

That should make a spammer's CPM low enough for my needs; the "flag as inappropriate" user feedback should catch anything else. Other solutions listed are also very useful:

  • Strip out all dotted-quads (@Sharkey's comment to his own answer); a sketch of this follows the list
  • @Sporkmonger's requirement for client-side JavaScript, which inserts a required hidden field into the form
  • Pinging the URL server-side to establish whether it is a web site. (Perhaps I could run the HTML through SpamAssassin or another Bayesian filter, as per @Nathan.)
  • Looking at Chrome's source for its smart address bar to see what clever tricks Google uses
  • Calling out to OWASP AntiSAMY or other web services for spam/malware detection
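
As a small illustration of the dotted-quad point, something like the following would blank out raw IP links before other readers see them. The regex and the name strip_dotted_quads are just for this sketch.

import re

# Matches bare IPv4 addresses ("dotted quads"), optionally with a port and/or path.
DOTTED_QUAD = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}(?::\d+)?(?:/\S*)?")

def strip_dotted_quads(text, replacement="[address removed]"):
    # Replace anything that looks like a raw IP link so it cannot be followed.
    return DOTTED_QUAD.sub(replacement, text)

print(strip_dotted_quads("Great deals at 203.0.113.7:8080/buy-now"))
# Great deals at [address removed]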

13 answers
贪生不怕死 (Answer 2) · 2019-01-30 08:43

I'm not sure if detecting URLs with a regex is the right way to solve this problem. Usually you will miss some sort of obscure edge case that spammers will be able to exploit if they are motivated enough.

If your goal is just to filter spam out of comments, then you might want to think about Bayesian filtering. It has proved to be very accurate at flagging email as spam, and it might do the same for you, depending on the volume of text you need to filter.
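
To make the idea concrete, here is a toy naive-Bayes scorer, assuming you already have per-token spam/ham counts gathered from labeled comments. Every name in it is illustrative; in practice an off-the-shelf filter such as SpamAssassin is far less work than maintaining your own.

import math
import re

def tokenize(text):
    return re.findall(r"[a-z0-9$.-]+", text.lower())

def spam_probability(text, spam_counts, ham_counts, n_spam, n_ham):
    # spam_counts / ham_counts map token -> number of spam / ham comments containing it;
    # n_spam / n_ham are the totals. Laplace smoothing keeps unseen tokens from
    # zeroing everything out.
    log_odds = math.log((n_spam + 1) / (n_ham + 1))
    for token in set(tokenize(text)):
        p_spam = (spam_counts.get(token, 0) + 1) / (n_spam + 2)
        p_ham = (ham_counts.get(token, 0) + 1) / (n_ham + 2)
        log_odds += math.log(p_spam / p_ham)
    return 1 / (1 + math.exp(-log_odds))  # posterior probability the comment is spam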

祖国的老花朵 (Answer 4) · 2019-01-30 08:46

Ping the possible URL

If you don't mind a little server side computation, what about something like this?

urls = []
for possible_url in extracted_urls(comment):
    if pingable(possible_url):
        urls.append(possible_url)  # you could do this as a list comprehension, but the OP may not know Python

Here:

  1. extracted_urls takes in a comment and uses a conservative regex to pull out possible candidates

  2. pingable actually uses a system call to determine whether the hostname exists on the web. You could have a simple wrapper parse the output of ping.

    [ramanujan:~/base]$ping -c 1 www.google.com

    PING www.l.google.com (74.125.19.147): 56 data bytes
    64 bytes from 74.125.19.147: icmp_seq=0 ttl=246 time=18.317 ms

    --- www.l.google.com ping statistics ---
    1 packets transmitted, 1 packets received, 0% packet loss
    round-trip min/avg/max/stddev = 18.317/18.317/18.317/0.000 ms

    [ramanujan:~/base]$ping -c 1 fooalksdflajkd.com

    ping: cannot resolve fooalksdflajkd.com: Unknown host

The downside is that if the host gives a 404, you won't detect it, but this is a pretty good first cut -- the ultimate way to verify that an address is a website is to try to navigate to it. You could also try wget'ing that URL, but that's more heavyweight.
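
For concreteness, the two helpers might look roughly like this, assuming a Unix-style ping; the pattern, flags, and names are illustrative rather than anything from the answer.

import re
import subprocess

# Conservative candidate pattern: word.word(.word...), scheme optional.
CANDIDATE = re.compile(r"(?:https?://)?([\w-]+(?:\.[\w-]+)+)", re.IGNORECASE)

def extracted_urls(comment):
    # Pull hostname-looking candidates out of the comment text.
    return [match.group(1) for match in CANDIDATE.finditer(comment)]

def pingable(hostname):
    # True if the system ping can resolve and reach the host with a single packet.
    result = subprocess.run(["ping", "-c", "1", hostname],
                            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

comment = "check out www.google.com and fooalksdflajkd.com"
urls = [candidate for candidate in extracted_urls(comment) if pingable(candidate)]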

chillily (Answer 5) · 2019-01-30 08:47

Consider incorporating the OWASP AntiSAMY API...

贼婆χ (Answer 6) · 2019-01-30 08:50

I know this doesn't help with auto-link text, but what if you searched and replaced all full-stop periods with a character that looks the same, such as the Unicode character HEBREW POINT HIRIQ (U+05B4)?

The following paragraph is an example:

This might workִ The period looks a bit odd but it is still readableִ The benefit of course is that anyone copying and pasting wwwִgoogleִcom won't get too farִ :)
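
A minimal Python sketch of that replacement, restricted here to link-looking tokens so ordinary sentences keep their periods; the answer's version simply replaces every full stop.

import re

HIRIQ = "\u05b4"  # HEBREW POINT HIRIQ, visually close to a period

def defang(text):
    # Swap the dots in anything that looks like host.name.tld for the look-alike mark.
    return re.sub(r"[\w-]+(?:\.[\w-]+)+",
                  lambda match: match.group(0).replace(".", HIRIQ),
                  text)

print(defang("Copy www.google.com into your address bar"))
# Copy wwwִgoogleִcom into your address bar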

Root(大扎) (Answer 7) · 2019-01-30 08:51

Having made several attempts at writing this exact piece of code, I can say unequivocally, you won't be able to do this with absolute reliability, and you certainly won't be able to detect all of the URI forms allowed by the RFC. Fortunately, since you have a very limited set of URLs you're interested in, you can use any of the techniques above.

However, the other thing I can say with a great deal of certainty, is that if you really want to beat spammers, the best way to do that is to use JavaScript. Send a chunk of JavaScript that performs some calculation, and repeat the calculation on the server side. The JavaScript should copy the result of the calculation to a hidden field so that when the comment is submitted, the result of the calculation is submitted as well. Verify on the server side that the calculation is correct. The only way around this technique is for spammers to manually enter comments or for them to start running a JavaScript engine just for you. I used this technique to reduce the spam on my site from 100+/day to one or two per year. Now the only spam I ever get is entered by humans manually. It's weird to get on-topic spam.
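
A bare-bones sketch of that challenge, not the answerer's actual code; the field name, the arithmetic, and both function names are invented for illustration. The server keeps a and b in the session, the inline script fills the hidden field in the browser, and the comment is rejected if the product doesn't match.

import secrets

def make_challenge():
    # Generate two operands plus the markup whose inline script fills the hidden field client-side.
    a, b = secrets.randbelow(1000), secrets.randbelow(1000)
    markup = (
        '<input type="hidden" name="challenge_answer" id="challenge_answer">'
        "<script>"
        f"document.getElementById('challenge_answer').value = {a} * {b};"
        "</script>"
    )
    return a, b, markup  # store a and b in the session alongside the rendered form

def challenge_passed(a, b, submitted_value):
    # Server-side check: reject the comment unless the JavaScript actually ran.
    try:
        return int(submitted_value) == a * b
    except (TypeError, ValueError):
        return False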
