Can a URL contain a semi-colon?

2019-01-14 01:27发布

I am using a regular expression to convert plain text URL to clickable links.

@(https?://([-\w\.]+)+(:\d+)?(/([\w/_\.-]*(\?\S+)?)?)?)@

However, sometimes in the body of the text, URL are enumerated one per line with a semi-colon at the end. The real URL does not contain any ";".

http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=275;
http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=123;
http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=124

Is it permitted to have a semicolon (;) in a URL or can the semicolon be considered a marker of the end of an URL? How would that fit in my regular expression?

标签: regex url
7条回答
We Are One
2楼-- · 2019-01-14 01:34

http://www.ietf.org/rfc/rfc3986.txt covers URLs and what characters may appear in unencoded form. Given that URLs containing semicolons work properly in browsers, your code should support them.

查看更多
爷的心禁止访问
3楼-- · 2019-01-14 01:41

Technically, a semicolon is a legal sub-delimiter in a URL string; plenty of source material is quoted above including http://www.ietf.org/rfc/rfc3986.txt.

And some do use it for legitimate purposes though it's use is likely site-specific (ie, only for use with that site) because it's usage has to be defined by the site using it.

In the real world however, the primary use for semicolons in URLs is to hide a virus or phishing URL behind a legitimate URL.

For example, sending someone an email with this link:

http:// www.yahoo.com/junk/nonsense;0200.0xfe.0x37.0xbf/malicious_file/

will result in the Yahoo! link (www.yahoo.com/junk/nonsense) being ignored because even though it is legitimate (ie, properly formed) no such page exists. But the second link (0200.0xfe.0x37.0xbf/malicious_file/) presumably exists* and the user will be directed to the malicious_file page; whereupon one's corporate IT manager will get a report and one will likely get a pink slip.

And before all the nay-sayers get their dander up, this is exactly how the new Facebook phishing problem works. The names have been changed to protect the guilty as usual.

*No such page actually exists to my knowledge. The link shown is for purposes of this discussion only.

查看更多
\"骚年 ilove
4楼-- · 2019-01-14 01:43

The W3C encourages CGI programs to accept ; as well as & in query strings (i.e. treat ?name=fred&age=50 and ?name=fred;age=50 the same way). This is supposed to be because & has to be encoded as & in HTML whereas ; doesn't.

查看更多
Summer. ? 凉城
5楼-- · 2019-01-14 01:43

Yes, semicolons are valid in URLs. However, if you're plucking them from relatively unstructured prose, it's probably safe to assume a semicolon at the end of a URL is meant as sentence punctuation. The same goes for other sentence-punctuation characters like periods, question marks, quotes, etc..

If you're only interested in URLs with an explicit http[s] protocol, and your regex flavor supports lookbehinds, this regex should suffice:

https?://[\w!#$%&'()*+,./:;=?@\[\]-]+(?<![!,.?;:"'()-])

After the protocol, it simply matches one or more characters that may be valid in a URL, without worrying about structure at all. But then it backs off as many positions as necessary until the final character is not something that might be sentence punctuation.

查看更多
smile是对你的礼貌
6楼-- · 2019-01-14 01:52

Quoting RFCs is not all that helpful in answering this question, because you will encounter URLs with semicolons (and commas for that matter). We had a Regex that did not handle semicolons and commas, and some of our users at NutshellMail complained because URLs containing them do in fact exist in the wild. Try building a dummy URL in Facebook or Twitter that contains a ';' or ',' and you will see that those two services encode the full URL properly.

I replaced the Regex we were using with the following pattern (and have tested that it works):

 string regex = @"((www\.|(http|https|ftp|news|file)+\:\/\/)[&#95;.a-zA-Z0-9-]+\.[a-zA-Z0-9\/&#95;:@=.+?,##%&~_-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])";

This Regex came from http://rickyrosario.com/blog/converting-a-url-into-a-link-in-csharp-using-regular-expressions/ (with a slight modification)

查看更多
beautiful°
7楼-- · 2019-01-14 01:53

A semicolon is reserved and may not be used unencoded except for its special purpose (which depends on the scheme). Section 2.2:

Many URL schemes reserve certain characters for a special meaning: their appearance in the scheme-specific part of the URL has a designated semantics. If the character corresponding to an octet is reserved in a scheme, the octet must be encoded. The characters ";", "/", "?", ":", "@", "=" and "&" are the characters which may be reserved for special meaning within a scheme. No other characters may be reserved within a scheme.

查看更多
登录 后发表回答