How Can I Implement A Standard Set of Hyperlink De

Posted 2019-06-16 11:34

Question:

I currently do automatic detection of hyperlinks within text in my program. I made it very simple and only look for http:// or www.

However, a user suggested to me that I extend it to other forms, e.g.: https:// or .com

Then I realized it might not stop there because there's ftp and mailto and file, all the other top level domains, and even email addresses and file paths.

What I think is best is to limit it to what is practical by following some often-used standard set of hyperlink detection rules that are currently in use. Maybe how Microsoft Word does it, or maybe how RichEdit does it or maybe you know of a better standard.

So my question is:

Is there a built-in function that I can call from Delphi to do the detection, and if so, what would the call look like? (I plan to move to FireMonkey in the future, so I would prefer something that will work beyond Windows.)

If there isn't a function available, is there some place I can find a documented set of rules of what is detected in Word, in RichEdit, or any other set of rules of what should be detected? That would then allow me to write the detection code myself.

Answer 1:

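Try the PathIsURL function, which is declared in the ShLwApi unit. Note that it is a Windows Shell API, so it will not carry over to non-Windows targets, and it tests whether a given string has URL format rather than locating URLs inside a larger text.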



Answer 2:

The following regex, taken from RegexBuddy's library, might get you started (I can't make any claims about performance).

Regex

Match; JGsoft; case insensitive:  
\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]

Explanation

URL: Find in full text

The final character class makes sure that if a URL is part of some text, punctuation such as a comma or full stop after the URL is not interpreted as part of the URL.

Matches (whole or partial)

http://regexbuddy.com
http://www.regexbuddy.com 
http://www.regexbuddy.com/ 
http://www.regexbuddy.com/index.html 
http://www.regexbuddy.com/index.html?source=library 
You can download RegexBuddy at http://www.regexbuddy.com/download.html.

Does not match

regexbuddy.com
www.regexbuddy.com
"www.domain.com/quoted URL with spaces"
support@regexbuddy.com
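
As a rough illustration of how the pattern above behaves on running text, here is a minimal Python sketch. The names URL_PATTERN and find_urls are made up for this example, and re.IGNORECASE stands in for the "case insensitive" setting. The same pattern string should carry over to other regex engines, such as the one behind Delphi's TRegEx, but that has not been checked here.

```python
import re

# Pattern copied from the answer; the capturing group is made non-capturing
# so that only whole matches are reported.
URL_PATTERN = re.compile(
    r"\b(?:https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)


def find_urls(text):
    """Return every substring of text that the pattern recognises as a URL."""
    return [match.group(0) for match in URL_PATTERN.finditer(text)]


if __name__ == "__main__":
    sample = "You can download RegexBuddy at http://www.regexbuddy.com/download.html."
    # The trailing full stop is left out because it is not in the final character class.
    print(find_urls(sample))
```

Inputs without a scheme, such as www.regexbuddy.com or support@regexbuddy.com, yield an empty list, which matches the "Does not match" cases listed above.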

For a set of rules, you might look into RFC 3986. Its abstract reads:

A Uniform Resource Identifier (URI) is a compact sequence of
characters that identifies an abstract or physical resource. This
specification defines the generic URI syntax and a process for
resolving URI references that might be in relative form, along with
guidelines and security considerations for the use of URIs on the
Internet.

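A regex that validates a URL as specified in RFC 3986 would be the following. It is written for free-spacing mode (whitespace and # comments are ignored) and is meant to be matched case-insensitively: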

^
(# Scheme
 [a-z][a-z0-9+\-.]*:
 (# Authority & path
  //
  ([a-z0-9\-._~%!$&'()*+,;=]+@)?              # User
  ([a-z0-9\-._~%]+                            # Named host
  |\[[a-f0-9:.]+\]                            # IPv6 host
  |\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])  # IPvFuture host
  (:[0-9]+)?                                  # Port
  (/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?          # Path
 |# Path without authority
  (/?[a-z0-9\-._~%!$&'()*+,;=:@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?)?
 )
|# Relative URL (no scheme or authority)
 ([a-z0-9\-._~%!$&'()*+,;=@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?  # Relative path
 |(/[a-z0-9\-._~%!$&'()*+,;=:@]+)+/?)                            # Absolute path
)
# Query
(\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
# Fragment
(\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
$
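
To show how a free-spacing pattern like this one can be applied, here is a minimal Python sketch that compiles it with re.VERBOSE (Python's free-spacing mode) and re.IGNORECASE. The names RFC3986_PATTERN and is_valid_uri_reference are invented for this example. Note that this is a validator for a whole string, not a finder: because the pattern also accepts relative references, a bare word such as www.regexbuddy.com passes too.

```python
import re

# The regex from the answer, unchanged. re.VERBOSE ignores the layout
# whitespace and the trailing "# ..." comments; "\#" is a literal hash.
RFC3986_PATTERN = re.compile(
    r"""
^
(# Scheme
 [a-z][a-z0-9+\-.]*:
 (# Authority & path
  //
  ([a-z0-9\-._~%!$&'()*+,;=]+@)?              # User
  ([a-z0-9\-._~%]+                            # Named host
  |\[[a-f0-9:.]+\]                            # IPv6 host
  |\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])  # IPvFuture host
  (:[0-9]+)?                                  # Port
  (/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?          # Path
 |# Path without authority
  (/?[a-z0-9\-._~%!$&'()*+,;=:@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?)?
 )
|# Relative URL (no scheme or authority)
 ([a-z0-9\-._~%!$&'()*+,;=@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?  # Relative path
 |(/[a-z0-9\-._~%!$&'()*+,;=:@]+)+/?)                            # Absolute path
)
# Query
(\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
# Fragment
(\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
$
""",
    re.VERBOSE | re.IGNORECASE,
)


def is_valid_uri_reference(candidate):
    """Return True if the whole string matches the RFC 3986 pattern."""
    return RFC3986_PATTERN.fullmatch(candidate) is not None


if __name__ == "__main__":
    for text in ("http://www.regexbuddy.com/index.html?source=library",
                 "mailto:support@regexbuddy.com",
                 "http://example.com/a b"):
        print(text, is_valid_uri_reference(text))
```

For detecting links inside text, this pattern would first need its ^ and $ anchors replaced, and the relative-reference branch would probably have to go, since it accepts almost any run of ordinary characters.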


Answer 3:

Regular expressions may be the way to go here: define a pattern for each form that you deem to be an appropriate hyperlink.
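
One possible way to organise this, sketched in Python, is a small table of named patterns joined into a single alternation. Everything here is an assumption made for illustration: the names LINK_RULES and detect_links, the choice of three rules, and in particular the www and email patterns, which are simple approximations rather than any documented standard. Only the url pattern comes from the RegexBuddy example quoted earlier in this thread.

```python
import re

# Hypothetical rule table: (rule name, pattern). Order matters, because the
# first alternative that matches at a given position wins.
LINK_RULES = [
    ("url", r"\b(?:https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]"),
    ("www", r"\bwww\.[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]"),
    ("email", r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b"),
]

# Wrap each rule in a named group so a match can report which rule fired.
COMBINED = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in LINK_RULES),
    re.IGNORECASE,
)


def detect_links(text):
    """Return (rule name, matched text) pairs in order of appearance."""
    return [(match.lastgroup, match.group(0)) for match in COMBINED.finditer(text)]


if __name__ == "__main__":
    sample = ("Mail support@regexbuddy.com or visit www.regexbuddy.com, "
              "or http://www.regexbuddy.com/download.html.")
    for rule, link in detect_links(sample):
        print(rule, link)
```

Keeping the rules in a table makes it easy to add or drop a form (for example mailto: or file paths) without touching the matching code.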