I've got a large database of projects and issue trackers, some of which have urls.
I'd like to query it to figure out a list of urls for each project, but many have extra data I'd like to avoid.
I'd like to do something like this:
substring(tracker_extra_field_data.field_data FROM 'http://([^/]*).*')
Except some urls are https, and I'd like to capture the scheme as well as the first subdirectory.
For example, given the url:
https://dev.foo.com/bar/action/?param=val
I'd like the select to return:
https://dev.foo.com/bar/
Is there a semi-simple way to do this with substring/regex in pgsql?
Try this:
select substring('https://dev.foo.com/bar/action/?param=val' from '(https?://([^/]*/){1,2})');
template1=# select substring('https://dev.foo.com/bar/action/?param=val' from '(https?://([^/]*/){1,2})');
substring
-------------------------
https://dev.foo.com/bar/
(1 row)
template1=# select substring('http://dev.foo.com/bar/action/?param=val' from '(https?://([^/]*/){1,2})');
substring
------------------------
http://dev.foo.com/bar/
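Applied to the column from the question, it might look something like the sketch below. The table name tracker_extra_field_data is only inferred from the qualified column name in the question, so adjust it to the real schema. Note that with this pattern a url with no slash after the host (e.g. https://dev.foo.com) does not match at all and substring returns NULL, which is why the sketch filters those rows out.

```sql
-- Assumption: the table is called tracker_extra_field_data, as the
-- qualified column name in the question suggests.
SELECT DISTINCT
       substring(field_data FROM '(https?://([^/]*/){1,2})') AS base_url
FROM   tracker_extra_field_data
-- Keep only rows that have at least one slash after the host;
-- other rows would yield NULL from substring().
WHERE  field_data ~ 'https?://[^/]*/';
```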
Updated, since I didn't read the question properly at first.
Use the pattern:
^https?://[^/]+(?:/[^/]+)?/?
^ .. start of string
? .. zero or one atoms
(?:) .. non-capturing parens
[^/]+ .. any character except /, 1 or more of them
This only accepts URLs starting with http:// or https:// (protocol header required).
->SQLfiddle with a bigger test case.
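For a quick inline check without the fiddle, something like the following sketch could be used. The sample urls and the aliases t, url and base_url are made up for illustration. Because the pattern contains only non-capturing parens, substring returns the whole match.

```sql
-- Sample urls are hypothetical; only the first one comes from the question.
SELECT url,
       substring(url FROM '^https?://[^/]+(?:/[^/]+)?/?') AS base_url
FROM  (VALUES
         ('https://dev.foo.com/bar/action/?param=val'),
         ('http://dev.foo.com/bar/action/?param=val'),
         ('ftp://dev.foo.com/bar/')   -- not http(s): no match
      ) AS t(url);
```

The first two rows should come back as https://dev.foo.com/bar/ and http://dev.foo.com/bar/, and the ftp row as NULL, since the pattern requires the http:// or https:// prefix.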