substring regex for first part of url

Posted 2019-06-08 19:36

Question:

I've got a large database of projects and issue trackers, some of which have urls.

I'd like to query it to figure out a list of urls for each project, but many have extra data I'd like to avoid.

I'd like to do something like this:

substring(tracker_extra_field_data.field_data FROM 'http://([^/]*).*')

Except some urls are https, and I'd like to capture that as well as the first subdirectory.

For example, given the url:

https://dev.foo.com/bar/action/?param=val

I'd like the select to return:

https://dev.foo.com/bar/

Is there a semi-simple way to do this with substring/regex in pgsql?

Answer 1:

Try this:

select substring('https://dev.foo.com/bar/action/?param=val' from '(https?://([^/]*/){1,2})');

template1=# select substring('https://dev.foo.com/bar/action/?param=val' from '(https?://([^/]*/){1,2})');
        substring
-------------------------
 https://dev.foo.com/bar/
(1 row)

template1=# select substring('http://dev.foo.com/bar/action/?param=val' from '(https?://([^/]*/){1,2})');
       substring
------------------------
 http://dev.foo.com/bar/
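
Applied to a column rather than a literal, the same pattern could be used roughly like this. This is only a sketch: the table and column names are taken from the question, while the CTE with sample rows is a made-up stand-in so the query runs on its own. Drop the CTE to run it against the real table.

```sql
-- Stand-in for the real table; the sample rows are assumptions for illustration.
WITH tracker_extra_field_data(field_data) AS (
    VALUES ('https://dev.foo.com/bar/action/?param=val'),
           ('http://dev.foo.com/bar/action/?param=val'),
           ('not a url')
)
SELECT field_data,
       -- substring() returns the part matched by the first parenthesized
       -- group, which here is the outer group wrapping the whole pattern.
       substring(field_data FROM '(https?://([^/]*/){1,2})') AS base_url
FROM   tracker_extra_field_data;
-- base_url is NULL for rows where the pattern does not match ('not a url').
```

Note that the pattern is not anchored with ^, so it will also pick up a url that appears in the middle of a longer string.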


Answer 2:

Updated, since I didn't read the question properly at first.

Use the pattern

^https?://[^/]+(?:/[^/]+)?/?

^ .. start of string
? .. zero or one of the preceding atom
(?:) .. non-capturing parentheses
[^/]+ .. any character except /, one or more of them

This only accepts URLs starting with http:// or https:// (the protocol prefix is required).

->SQLfiddle with a bigger test case.
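
For a quick check without the fiddle, something like the following shows how the pattern above behaves on a few inputs. The sample urls are invented for illustration and are not from the original test case.

```sql
-- Sample rows are made up; the pattern is the one given in this answer.
SELECT url,
       substring(url FROM '^https?://[^/]+(?:/[^/]+)?/?') AS base_url
FROM  (VALUES
         ('https://dev.foo.com/bar/action/?param=val'),  -- https://dev.foo.com/bar/
         ('https://dev.foo.com/'),                       -- https://dev.foo.com/
         ('https://dev.foo.com'),                        -- https://dev.foo.com
         ('https://dev.foo.com/bar?param=val'),          -- whole string: no / after bar
         ('ftp://dev.foo.com/bar/')                      -- NULL: scheme not accepted
      ) AS t(url);
```

The fourth row is worth keeping in mind: because [^/]+ stops only at a slash, a query string that directly follows the first path segment is kept in the result.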