Regular Expression - Extract subdomain & domain [d

Posted 2019-01-14 07:30

I'm trying to form a regular expression (javascript/node.js) which will extract the sub-domain & domain part from any given URL. This is what I ended up with:

[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)

Right now, I'm only considering http and https as protocols, and I exclude the "www." portion from the subdomain+domain part of a URL. I checked the expression and it almost works, but here is the issue:

Success

'http://mplay.google.co.in/sadfask/asdkfals?dk=10'.match(/[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i)

'http://lplay.google.co.in/sadfask/asdkfals?dk=10'.match(/[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i)

Failure

'http://play.google.co.in/sadfask/asdkfals?dk=10'.match(/[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i)

'http://tplay.google.co.in/sadfask/asdkfals?dk=10'.match(/[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i)

I just use the first element from the result array. I don't understand why "play." and "tplay." don't work. Could anyone please help me with this?

Do "/p" and "/t" have any special meaning to the regular expression evaluator?

Is there any other way of extracting sub-domain & domain from any given URL using a regular expression?

Edit -

Example:

https://play.google.com/store/apps/details?id=com.skgames.trafficracer => play.google.com

https://mail.google.com/mail/u/0/#inbox => mail.google.com

5 answers
对你真心纯属浪费
#2 · 2019-01-14 07:53

You are about the one millionth person to try to parse URLs in JavaScript. I'm a little bit surprised you didn't see any of the existing questions on SO dating back years. The last thing you want to do is write yet another broken regexp, with all due respect to those who provided answers to your question.

There are many well documented libraries and approaches to handling this. Google it. The simplest way is to create an a element in memory, assign it an href, and then access its hostname and other properties. See http://tutorialzine.com/2013/07/quick-tip-parse-urls/. If that does not float your boat, then use a library like uri.js.

If you really don't want to use a library, and insist on reinventing the wheel, then at least do something like the following:

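function get_domain_from_url(url) {
    // Create a detached anchor element; the browser parses its href for us.
    var a = document.createElement('a');
    a.setAttribute('href', url);
    // hostname is the subdomain + domain, without protocol, port, or path.
    return a.hostname;
}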

Essentially, you are delegating the extraction of the subdomain/domain part of the URL to the browser's URL parsing logic, which is MUCH better than anything you will ever write.
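
Since the question also mentions node.js, where document.createElement is not available, the same delegation can be sketched with the WHATWG URL class that is built into both Node.js and modern browsers. The function name getHost and the choice to strip a leading "www." are assumptions made for illustration; they are not part of the answer above.

```javascript
// Hypothetical helper: returns subdomain + domain, or null if the input
// is not an absolute URL that the built-in parser accepts.
function getHost(url) {
    let parsed;
    try {
        parsed = new URL(url); // throws a TypeError on input it cannot parse
    } catch (e) {
        return null;
    }
    // hostname excludes protocol, credentials, port, path, query and fragment.
    return parsed.hostname.replace(/^www\./i, '');
}

console.log(getHost('https://play.google.com/store/apps/details?id=com.skgames.trafficracer')); // play.google.com
console.log(getHost('https://mail.google.com/mail/u/0/#inbox')); // mail.google.com
```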

Also see Parse URL with jquery/ javascript?, Parse URL with Javascript, How do I parse a URL into hostname and path in javascript?, or parse URL with JavaScript or jQuery. How did you miss those? Sorry, I have to vote to close this as a duplicate.

兄弟一词,经得起流年.
#3 · 2019-01-14 07:55

Your regex doesn't seem correct. Try this regex:

/^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n?]+)/img

RegEx Demo
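
A short usage sketch for this pattern, assuming only the first capture group is wanted. The g flag is dropped here because String.prototype.match with g returns whole matches rather than capture groups. The names HOST_RE and extractHost are hypothetical.

```javascript
// Same pattern as above, without the g flag so that match() exposes capture groups.
const HOST_RE = /^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n?]+)/im;

// Hypothetical helper: returns capture group 1 (the host) or null when nothing matches.
function extractHost(url) {
    const m = url.match(HOST_RE);
    return m ? m[1] : null;
}

console.log(extractHost('https://play.google.com/store/apps/details?id=com.skgames.trafficracer')); // play.google.com
console.log(extractHost('https://mail.google.com/mail/u/0/#inbox')); // mail.google.com
```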

forever°为你锁心
#4 · 2019-01-14 08:03

Here's a solution ignoring everything before ://

.*\://?([^\/]+)

In case you want to ignore www.

.*\://(?:www\.)?([^\/]+)
Anthone
#5 · 2019-01-14 08:08

Your regular expression is nearly right. You only need to remove the square brackets, which turn the alternation into a negated character class, and anchor the pattern with ^. The final expression is:

^(?:http:\/\/|www\.|https:\/\/)([^\/]+)

Hope it's useful!
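
To see why the square brackets matter: inside [^...], the original alternation is read as a negated character class, so it matches one single character that is none of ( ? : h t p / | w . s ), and the capture group only starts after that character. The sketch below compares both patterns on a failing URL from the question; the variable names are illustrative.

```javascript
const failingUrl = 'http://play.google.co.in/sadfask/asdkfals?dk=10';

// Original: [^...] is a negated character class, not a group.
const original = /[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i;
// Corrected: a non-capturing group anchored at the start of the string.
const corrected = /^(?:http:\/\/|www\.|https:\/\/)([^\/]+)/i;

// "h", "t", "p", ":" and "/" are all in the excluded set, so the first
// character the class can match is the "l" of "play".
console.log(failingUrl.match(original)[0]);  // lay.google.co.in
console.log(failingUrl.match(corrected)[1]); // play.google.co.in
```

Note that the corrected pattern strips only one prefix, so an input such as http://www.example.com still yields www.example.com.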

孤傲高冷的网名
#6 · 2019-01-14 08:14

This is the same RegExp as in anubhava's accepted answer, with added support for protocol-relative URLs like //google.com:

/^(?:https?:)?(?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n]+)/im

RegEx Demo
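
A brief check of this variant, assuming the host is read from capture group 1. The names RELATIVE_HOST_RE and hostOf are hypothetical.

```javascript
// Pattern from this answer: the protocol and the "//" are both optional.
const RELATIVE_HOST_RE = /^(?:https?:)?(?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n]+)/im;

// Hypothetical helper: returns capture group 1 (the host) or null when nothing matches.
function hostOf(url) {
    const m = url.match(RELATIVE_HOST_RE);
    return m ? m[1] : null;
}

console.log(hostOf('//google.com/search'));      // google.com
console.log(hostOf('https://www.google.com/x')); // google.com
```

One thing to watch for: the final character class here does not exclude ?, so a query string that directly follows the host with no path in between would be captured along with it.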
