I looked around for a while, but I probably can't "Google" with the proper keywords, so I'm here.
I need to match a URL, stripping out the protocol and everything from the first / onward.
Target: match the substring that starts after http:// and runs to the first / (the trailing / may not exist), or otherwise to the end of the string.
And here comes the problem:
I wrote this regex:
(?<=//)(.*?)(?=/)
but this regex only matches URLs that have at least one '/' after the protocol part.
Here are some URLs to be matched:
- http://www.google.com/ (matched by my regex)
- http://www.google.com
- https://www.google
- xxx://www.google.com/hello/bleh blah....../
- xxx://google.com
- google.com/blah/hello.php?x=11_x.hi
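To make the failure concrete, here is a minimal Java sketch that runs the regex above against two of the sample URLs. The class and method names (OriginalRegexDemo, host) are made up for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OriginalRegexDemo {
    // The regex from the question: lazy match after "//", but a "/" MUST follow.
    private static final Pattern HOST = Pattern.compile("(?<=//)(.*?)(?=/)");

    // Returns the matched host, or null when the regex finds nothing.
    static String host(String url) {
        Matcher m = HOST.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(host("http://www.google.com/")); // www.google.com
        System.out.println(host("http://www.google.com"));  // null (no "/" after the host)
    }
}
```

The second call returns null because the lookahead (?=/) can never succeed when nothing follows the host.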
Something like...
^(https?:\/\/)?((?:[0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,6})\/
I saw this in a book I had. That should account for a variable http/https, disallow whitespace, and probably stop at the first slash.
Comment if I did this wrong.
This works for all your examples except the last one:
(?<=//)[^/\\s]+
[^/\\s] is a negated character class matching every character except / and \s (whitespace, e.g. a space, tab or newline character). The doubled backslash is the Java string-literal escape; the regex engine itself sees \s.
See it here on Regexr
What will not work is the last row, because it has no // to anchor on. How do you want to decide what is a link? If I make the first part optional, the pattern will match any run of characters that contains no / or whitespace.
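As a rough illustration of the lookbehind approach in Java, here is a small sketch. The names LookbehindHost and host are hypothetical; only the pattern itself comes from the answer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookbehindHost {
    // "\\s" in the Java string literal is the regex token \s.
    private static final Pattern HOST = Pattern.compile("(?<=//)[^/\\s]+");

    // Returns the text between "//" and the next "/" or whitespace,
    // or null when the input contains no "//" at all.
    static String host(String url) {
        Matcher m = HOST.matcher(url);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(host("http://www.google.com/"));              // www.google.com
        System.out.println(host("xxx://google.com"));                    // google.com
        System.out.println(host("google.com/blah/hello.php?x=11_x.hi")); // null
    }
}
```

The last call shows the limitation mentioned above: without a protocol there is no // for the lookbehind to find.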
^(?:\w+://)?([\w.-]+)/?.*$
(double the backslashes in a Java string literal)
This seems to work on all your examples, including a plain www.google.com. The host ends up in capture group 1.
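A possible way to use this pattern from Java, sketched with made-up names (OptionalProtocolHost, host), is to read capture group 1 after a successful match:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OptionalProtocolHost {
    // Optional "scheme://" prefix, then the host in group 1, then the rest of the line.
    private static final Pattern URL =
            Pattern.compile("^(?:\\w+://)?([\\w.-]+)/?.*$");

    // Returns capture group 1 (the host), or null when the input does not match.
    static String host(String url) {
        Matcher m = URL.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(host("https://www.google"));                  // www.google
        System.out.println(host("google.com/blah/hello.php?x=11_x.hi")); // google.com
        System.out.println(host("www.google.com"));                      // www.google.com
    }
}
```

Because the protocol group is optional, this variant also handles the protocol-less example, at the cost of accepting almost any string that starts with word characters, dots or hyphens.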
It seems like you have the right answer, but you're missing the possibility of not having a trailing "/". Try this:
(?<=//)(.*?)(?=/|$)
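Here is a short Java sketch of that variant, assuming hypothetical names (LookaheadOrEndHost, host). The only change from the question's regex is the |$ alternative inside the lookahead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaheadOrEndHost {
    // Lazy match after "//" that stops before the first "/" or at end of input.
    private static final Pattern HOST = Pattern.compile("(?<=//)(.*?)(?=/|$)");

    // Returns the host, or null when the input contains no "//".
    static String host(String url) {
        Matcher m = HOST.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(host("http://www.google.com/")); // www.google.com
        System.out.println(host("http://www.google.com"));  // www.google.com
        System.out.println(host("xxx://google.com"));       // google.com
    }
}
```

Like the other lookbehind-based pattern, this still needs a // in the input, so the protocol-less example returns null.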