My program currently detects hyperlinks in text automatically. I kept it very simple and only look for http:// or www.
However, a user suggested to me that I extend it to other forms, e.g.: https:// or .com
Then I realized it might not stop there: there's ftp, mailto and file, all the other top-level domains, and even email addresses and file paths.
What I think is best is to limit it to what is practical by following some often-used standard set of hyperlink detection rules that are currently in use. Maybe how Microsoft Word does it, or maybe how RichEdit does it or maybe you know of a better standard.
So my question is:
Is there a built-in function that I can call from Delphi to do the detection, and if so, what would the call look like? (I plan to move to FireMonkey in the future, so I would prefer something that works beyond Windows.)
If there isn't a function available, is there some place I can find a documented set of rules of what is detected in Word, in RichEdit, or any other set of rules of what should be detected? That would then allow me to write the detection code myself.
Try the PathIsURL function, which is declared in the ShLwApi unit.
The following regex, taken from RegexBuddy's library, might get you started (I can't make any claims about performance).
Regex
Match; JGsoft; case insensitive:
\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]
Explanation
URL: Find in full text
The final character class makes sure that if a URL is part of some text,
punctuation such as a comma or full stop after the URL is not interpreted as part
of the URL.
Matches (whole or partial)
http://regexbuddy.com
http://www.regexbuddy.com
http://www.regexbuddy.com/
http://www.regexbuddy.com/index.html
http://www.regexbuddy.com/index.html?source=library
You can download RegexBuddy at http://www.regexbuddy.com/download.html.
Does not match
regexbuddy.com
www.regexbuddy.com
"www.domain.com/quoted URL with spaces"
support@regexbuddy.com
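To see how the pattern above behaves on running text, here is a minimal sketch in Python (chosen only because it is easy to run; it is not Delphi code). The helper name `find_urls` is hypothetical. The same pattern string could presumably be handed to Delphi's TRegEx with its ignore-case option, but that is an assumption and is not tested here.

```python
import re

# The RegexBuddy pattern quoted above, compiled case-insensitively.
URL_IN_TEXT = re.compile(
    r"\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)


def find_urls(text):
    """Return a list of (start, end, url) tuples, one per URL found in text.

    Hypothetical helper: the offsets are what an editor control would need
    in order to style or hot-link the matched range.
    """
    # finditer is used instead of findall because the pattern contains a
    # capturing group; group(0) is always the whole match.
    return [(m.start(), m.end(), m.group(0)) for m in URL_IN_TEXT.finditer(text)]


if __name__ == "__main__":
    sample = "You can download RegexBuddy at http://www.regexbuddy.com/download.html."
    for start, end, url in find_urls(sample):
        print(start, end, url)
```

The trailing full stop of the sentence is left out of the match because the final character class does not contain a dot. Bare www. hosts and email addresses are not found, so those would need patterns of their own.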
For a set of rules, you might look into RFC 3986:
A Uniform Resource Identifier (URI) is a compact sequence of
characters that identifies an abstract or physical resource. This
specification defines the generic URI syntax and a process for
resolving URI references that might be in relative form, along with
guidelines and security considerations for the use of URIs on the
Internet.
A regex that validates a URL as specified in RFC 3986 would be the following (written in free-spacing mode, to be matched case insensitively):
^
(# Scheme
[a-z][a-z0-9+\-.]*:
(# Authority & path
//
([a-z0-9\-._~%!$&'()*+,;=]+@)? # User
([a-z0-9\-._~%]+ # Named host
|\[[a-f0-9:.]+\] # IPv6 host
|\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\]) # IPvFuture host
(:[0-9]+)? # Port
(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/? # Path
|# Path without authority
(/?[a-z0-9\-._~%!$&'()*+,;=:@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?)?
)
|# Relative URL (no scheme or authority)
([a-z0-9\-._~%!$&'()*+,;=@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/? # Relative path
|(/[a-z0-9\-._~%!$&'()*+,;=:@]+)+/?) # Absolute path
)
# Query
(\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
# Fragment
(\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
$
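As a rough illustration of what this validating pattern accepts, the sketch below compiles it in Python with the free-spacing and ignore-case flags. Python is used only because it is easy to run, and the helper name `is_rfc3986_reference` is hypothetical. Note that the pattern validates a whole string; it does not search inside text.

```python
import re

# The RFC 3986 pattern quoted above, unchanged. re.VERBOSE enables the
# free-spacing layout and the "#" comments; re.IGNORECASE covers A-Z.
RFC3986 = re.compile(r"""
^
(# Scheme
 [a-z][a-z0-9+\-.]*:
 (# Authority & path
  //
  ([a-z0-9\-._~%!$&'()*+,;=]+@)?              # User
  ([a-z0-9\-._~%]+                            # Named host
  |\[[a-f0-9:.]+\]                            # IPv6 host
  |\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])  # IPvFuture host
  (:[0-9]+)?                                  # Port
  (/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?          # Path
 |# Path without authority
  (/?[a-z0-9\-._~%!$&'()*+,;=:@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?)?
 )
|# Relative URL (no scheme or authority)
 ([a-z0-9\-._~%!$&'()*+,;=@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?  # Relative path
 |(/[a-z0-9\-._~%!$&'()*+,;=:@]+)+/?)                            # Absolute path
)
# Query
(\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
# Fragment
(\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
$
""", re.VERBOSE | re.IGNORECASE)


def is_rfc3986_reference(candidate):
    """Return True if the whole candidate string matches the pattern above."""
    return RFC3986.match(candidate) is not None


if __name__ == "__main__":
    for candidate in (
        "http://www.regexbuddy.com/index.html?source=library",
        "mailto:support@regexbuddy.com",
        "www.regexbuddy.com",
        "http://exa mple.com",
    ):
        print(candidate, is_rfc3986_reference(candidate))
```

Because the pattern also accepts relative references, a bare string such as www.regexbuddy.com passes even though it has no scheme. That makes this pattern better suited to checking a candidate that has already been picked out than to deciding which words in a text are links.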
Regular expressions may be the way to go here: define a pattern for each form that you consider an appropriate hyperlink.