Detecting a (naughty or nice) URL or link in a text string

Posted 2019-01-30 08:18

How can I detect (with regular expressions or heuristics) a web site link in a string of text such as a comment?

The purpose is to prevent spam. HTML is stripped so I need to detect invitations to copy-and-paste. It should not be economical for a spammer to post links because most users could not successfully get to the page. I would like suggestions, references, or discussion on best-practices.

Some objectives:

  • The low-hanging fruit like well-formed URLs (http://some-fqdn/some/valid/path.ext)
  • URLs but without the http:// prefix (i.e. a valid FQDN + valid HTTP path)
  • Any other funny business

Of course, I am blocking spam, but the same process could be used to auto-link text.
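
A rough Python sketch of what I mean by the low-hanging fruit: one pattern for explicit http(s):// links and one for bare "host.tld/path" forms. The TLD alternation is a tiny illustrative sample rather than the real list, and looks_like_link is just a hypothetical helper, not a complete URL grammar.

    import re

    # Explicit links with a scheme: http://... or https://...
    SCHEME_URL = re.compile(r"""\bhttps?://[^\s<>"']+""", re.IGNORECASE)

    # Bare FQDN with an optional path, e.g. www.example.com/foo
    # (sample TLDs only -- swap in a real TLD list in practice).
    BARE_FQDN = re.compile(
        r"""\b(?:[a-z0-9-]+\.)+          # one or more dotted labels
            (?:com|net|org|info|biz|io)  # sample TLDs
            (?:/[^\s<>"']*)?             # optional path
        """,
        re.IGNORECASE | re.VERBOSE,
    )

    def looks_like_link(text: str) -> bool:
        """Return True if the text appears to contain a URL or bare FQDN."""
        return bool(SCHEME_URL.search(text) or BARE_FQDN.search(text))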

Ideas

Here are some things I'm thinking.

  • The content is native-language prose so I can be trigger-happy in detection
  • Should I strip out all whitespace first, to catch "www .example.com"? Would common users know to remove the space themselves, or do any browsers "do-what-I-mean" and strip it for you?
  • Maybe multiple passes are a better strategy (see the sketch after this list), with scans for:
    • Well-formed URLs
    • All non-whitespace followed by '.' followed by any valid TLD
    • Anything else?
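
A sketch of that multi-pass idea, assuming a tiny sample TLD tuple (a real deployment would use the full IANA list) and a made-up suspicious_tokens helper:

    import re

    TLDS = ("com", "net", "org", "info", "biz", "io", "co")  # illustrative sample

    def suspicious_tokens(text: str) -> list[str]:
        hits = []

        # Pass 0: collapse whitespace around dots, so "www .example.com"
        # and "example . com" become "www.example.com" / "example.com".
        # (Deliberately trigger-happy, per the note above.)
        collapsed = re.sub(r"\s*\.\s*", ".", text)

        # Pass 1: well-formed URLs with an explicit scheme.
        hits += re.findall(r"\bhttps?://\S+", collapsed, re.IGNORECASE)

        # Pass 2: any non-whitespace run ending in ".<known TLD>", optional path.
        tld_alt = "|".join(TLDS)
        hits += re.findall(
            rf"\b\S+\.(?:{tld_alt})\b(?:/\S*)?", collapsed, re.IGNORECASE
        )
        return hits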

Related Questions

I've read these and they are now documented here, so you can just reference the regexes in those questions if you want.

Update and Summary

Wow, there are some very good heuristics listed here! For me, the best bang for the buck is a synthesis of the following:

  1. @Jon Bright's technique of detecting TLDs (a good defensive chokepoint)
  2. For those suspicious strings, replace the dot with a dot-looking character as per @capar
  3. A good dot-looking character is @Sharkey's subscripted middle dot (i.e. "·", &middot;); it is also a word boundary, so it's harder to casually copy and paste (a sketch of this combination follows).
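
A minimal sketch of that synthesis (sample TLDs again; defang is a hypothetical name), using the middle dot U+00B7 as the dot-looking replacement:

    import re

    # Detect TLD-bearing tokens, then swap their real dots for the middle
    # dot "·" (U+00B7) so the text still reads naturally but will not paste
    # into a browser as a working link.
    TLD_TOKEN = re.compile(
        r"\b((?:[a-z0-9-]+\.)+(?:com|net|org|info|biz|io))\b", re.IGNORECASE
    )

    def defang(text: str) -> str:
        return TLD_TOKEN.sub(lambda m: m.group(1).replace(".", "\u00b7"), text)

    # defang("visit www.example.com now") -> "visit www·example·com now"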

That combination should make a spammer's CPM low enough for my needs; the "flag as inappropriate" user feedback should catch anything else. Other solutions listed are also very useful:

  • Strip out all dotted-quads (@Sharkey's comment to his own answer); a quick matcher for these is sketched after this list.
  • @Sporkmonger's requirement for client-side Javascript which inserts a required hidden field into the form.
  • Pinging the URL server-side to establish whether it is a web site. (Perhaps I could run the text through SpamAssassin or another Bayesian filter, as per @Nathan.)
  • Looking at Chrome's source for its smart address bar to see what clever tricks Google uses
  • Calling out to OWASP AntiSAMY or other web services for spam/malware detection.
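
For the dotted-quad item, a quick matcher might look like the following (it does not check that each octet is at most 255):

    import re

    # Rough IPv4 dotted-quad matcher; octet range is not validated.
    DOTTED_QUAD = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

    def strip_dotted_quads(text: str) -> str:
        return DOTTED_QUAD.sub("[link removed]", text)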

13 Answers
唯我独甜 · Answered 2019-01-30 09:05

Well, obviously the low-hanging fruit are things that start with http:// and www. Trying to filter out things like "www . g mail . com" leads to interesting philosophical questions about how far you want to go. Do you want to take it to the next step and filter out "www dot gee mail dot com" as well? How about abstract descriptions of a URL, like "The abbreviation for world wide web followed by a dot, followed by the letter g, followed by the word mail followed by a dot, concluded with the TLD abbreviation for commercial"?

It's important to draw the line on what sorts of things you're going to try to filter before you start designing your algorithm. I think the line should be drawn at the level where "gmail.com" is considered a URL but "gmail. com" is not. Otherwise, you're likely to get false positives every time someone fails to capitalize the first letter of a sentence.
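
To make that line concrete (the pattern and TLD list here are purely illustrative), a regex that requires the label, the dot, and the TLD to be contiguous accepts "gmail.com" but rejects "gmail. com":

    import re

    # Draws the line described above: "gmail.com" is link-like because the
    # label, dot, and TLD are contiguous; "gmail. com" is treated as prose.
    LINKISH = re.compile(
        r"\b[a-z0-9-]+(?:\.[a-z0-9-]+)*\.(?:com|net|org)\b", re.IGNORECASE
    )

    assert LINKISH.search("mail me at gmail.com")
    assert not LINKISH.search("mail me at gmail. com")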
