Given the following SSH URLs:
git@github.com:james/example
git@github.com:007/example
git@github.com:22/james/example
git@github.com:22/007/example
How can I extract the following parts?
{user}@{host}:{optional port}{path (user/repo)}
As you can see in the examples, one of the usernames is numeric and is NOT a port, and I can't figure out how to work around that. The port is also not always present in the URL.
My current regex is:
^(?P<user>[^@]+)@(?P<host>[^:\s]+)?:(?:(?P<port>\d{1,5})\/)?(?P<path>[^\\].*)$
Not sure what else to try.
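To make the failure concrete, here is a minimal sketch that runs the pattern above against the four URLs. Python's `re` module is assumed, and the name `CURRENT` is only illustrative. The second URL is the one that goes wrong: `007` is captured as the port and the path loses its first segment.

```python
import re

# The regex from the question, unchanged.
CURRENT = re.compile(
    r"^(?P<user>[^@]+)@(?P<host>[^:\s]+)?:(?:(?P<port>\d{1,5})\/)?(?P<path>[^\\].*)$"
)

urls = [
    "git@github.com:james/example",
    "git@github.com:007/example",      # misparsed: port='007', path='example'
    "git@github.com:22/james/example",
    "git@github.com:22/007/example",
]

for url in urls:
    match = CURRENT.match(url)
    print(url, "->", match.groupdict() if match else None)
```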
Lazy quantifiers to the rescue!
This seems to work well and satisfies the optional port:
^
(?P<user>.*?)@
(?P<host>.*?):
(?:(?P<port>.*?)/)?
(?P<path>.*?/.*?)
$
The line breaks are not part of the regex because the /x modifier is enabled. Remove all line breaks if you are not using /x.
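In Python, the /x behaviour corresponds to the `re.VERBOSE` flag. The following is a minimal sketch of the pattern above applied to the four URLs from the question. The names `SSH_URL` and `parse_ssh_url` are illustrative only and not part of the original answer.

```python
import re

# The lazy-quantifier pattern, compiled with re.VERBOSE so that the
# line breaks and indentation inside the pattern are ignored.
SSH_URL = re.compile(
    r"""
    ^
    (?P<user>.*?)@
    (?P<host>.*?):
    (?:(?P<port>.*?)/)?
    (?P<path>.*?/.*?)
    $
    """,
    re.VERBOSE,
)


def parse_ssh_url(url):
    """Return the named groups as a dict, or None if the URL does not match."""
    match = SSH_URL.match(url)
    return match.groupdict() if match else None


urls = [
    "git@github.com:james/example",
    "git@github.com:007/example",
    "git@github.com:22/james/example",
    "git@github.com:22/007/example",
]

for url in urls:
    print(parse_ssh_url(url))
```

The port group is only kept when the remaining text still contains a slash for the path, which is why `007/example` is parsed as a path with no port. Because every group uses `.*?`, a non-numeric first segment (for example `a` in `git@host:a/b/c`) would also be captured as the port; replacing the port group's `.*?` with `\d+` should tighten this if it matters.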
https://regex101.com/r/wdE30O/5
Thank you @Jan for the optimizations.
If you're on Python, you could write your very own parser, here using the parsimonious library:
from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor

data = """git@github.com:james/example
git@github.com:007/example
git@github.com:22/james/example
git@github.com:22/007/example"""


class GitVisitor(NodeVisitor):
    grammar = Grammar(
        r"""
        expr    = user at domain colon rest
        user    = word+
        domain  = ~"[^:]+"
        rest    = (port path) / path
        path    = word slash word
        port    = digits slash
        slash   = "/"
        colon   = ":"
        at      = "@"
        digits  = ~"\d+"
        word    = ~"\w+"
        """)

    def generic_visit(self, node, visited_children):
        return visited_children or node

    def visit_user(self, node, visited_children):
        return {"user": node.text}

    def visit_domain(self, node, visited_children):
        return {"domain": node.text}

    def visit_rest(self, node, visited_children):
        child = visited_children[0]
        if isinstance(child, list):
            # first branch, port and path
            return {"port": child[0], "path": child[1]}
        else:
            return {"path": child}

    def visit_path(self, node, visited_children):
        return node.text

    def visit_port(self, node, visited_children):
        digits, _ = visited_children
        return digits.text

    def visit_expr(self, node, visited_children):
        out = {}
        for child in visited_children:
            if isinstance(child, dict):
                out.update(child)
        return out


gv = GitVisitor()
for line in data.split("\n"):
    result = gv.parse(line)
    print(result)
This would yield:
{'user': 'git', 'domain': 'github.com', 'path': 'james/example'}
{'user': 'git', 'domain': 'github.com', 'path': '007/example'}
{'user': 'git', 'domain': 'github.com', 'port': '22', 'path': 'james/example'}
{'user': 'git', 'domain': 'github.com', 'port': '22', 'path': '007/example'}
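A parser can cope with this kind of ambiguity, which you obviously have here: the rule rest = (port path) / path tries the port-plus-path reading first and falls back to a plain path when that fails.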
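The ordered choice in `rest = (port path) / path` can also be imitated without a parser library. The sketch below is not the parsimonious approach itself but a plain `re` approximation of the same idea: try the alternative with a port first, then fall back to path only. The names `WITH_PORT`, `PATH_ONLY` and `parse_rest` are hypothetical.

```python
import re

# Alternatives in priority order, mirroring "(port path) / path".
WITH_PORT = re.compile(r"(?P<port>\d+)/(?P<path>\w+/\w+)")
PATH_ONLY = re.compile(r"(?P<path>\w+/\w+)")


def parse_rest(rest):
    """Parse the part after the colon; the first alternative that matches wins."""
    for pattern in (WITH_PORT, PATH_ONLY):
        match = pattern.fullmatch(rest)
        if match:
            return match.groupdict()
    return None


for rest in ["james/example", "007/example", "22/james/example", "22/007/example"]:
    print(rest, "->", parse_rest(rest))
```

As in the grammar, `\w+` does not match hyphens or dots, so paths such as `my-user/my-repo.git` would need a wider character class.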