Regex to match SSH url parts

Posted 2020-02-15 07:05

Question:

Given the following SSH urls:

git@github.com:james/example
git@github.com:007/example
git@github.com:22/james/example
git@github.com:22/007/example

How can I pull the following:

{user}@{host}:{optional port}{path (user/repo)}

As you can see in the examples, one of the usernames is numeric and is NOT a port. I can't figure out how to work around that. A port isn't always present in the URL either.

My current regex is:

^(?P<user>[^@]+)@(?P<host>[^:\s]+)?:(?:(?P<port>\d{1,5})\/)?(?P<path>[^\\].*)$

Not sure what else to try.
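
To make the failure concrete, here is a minimal Python sketch that runs the pattern above against two of the sample URLs. Python is an assumption here (the question does not name a language, although the `(?P<name>...)` group syntax fits it), and `parse_current` is a hypothetical helper name used only for illustration.

```python
import re

# The pattern from the question, unchanged.
CURRENT = re.compile(
    r"^(?P<user>[^@]+)@(?P<host>[^:\s]+)?:(?:(?P<port>\d{1,5})\/)?(?P<path>[^\\].*)$"
)

def parse_current(url):
    """Return the named groups captured by the question's pattern, or None."""
    m = CURRENT.match(url)
    return m.groupdict() if m else None

for url in ("git@github.com:james/example", "git@github.com:007/example"):
    print(url, "->", parse_current(url))
```

For the second URL the optional port group greedily takes `007`, leaving only `example` as the path, which is exactly the misreading described above.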

Answer 1:

Lazy quantifiers to the rescue!

This seems to work well and satisfies the optional port:

^
(?P<user>.*?)@
(?P<host>.*?):
(?:(?P<port>.*?)/)?
(?P<path>.*?/.*?)
$

The line breaks are not part of the regex: with the /x (extended) modifier enabled, whitespace inside the pattern is ignored. Remove all line breaks if you are not using /x.
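
As a rough sketch of how this could look in Python (an assumption, since the answer only links to a regex101 demo), the same pattern can be compiled with `re.VERBOSE`, which is Python's counterpart to /x. The names `SSH_URL` and `parse_ssh_url` are hypothetical and not part of the original answer.

```python
import re

# The answer's pattern; re.VERBOSE makes the engine ignore the line breaks
# and indentation inside the pattern string.
SSH_URL = re.compile(
    r"""
    ^
    (?P<user>.*?)@
    (?P<host>.*?):
    (?:(?P<port>.*?)/)?
    (?P<path>.*?/.*?)
    $
    """,
    re.VERBOSE,
)

def parse_ssh_url(url):
    """Return a dict of user, host, port (None if absent) and path, or None."""
    m = SSH_URL.match(url)
    return m.groupdict() if m else None

for url in (
    "git@github.com:james/example",
    "git@github.com:007/example",
    "git@github.com:22/james/example",
    "git@github.com:22/007/example",
):
    print(parse_ssh_url(url))
```

The port group only matches when a further slash remains for the path, so `007/example` falls through to the path alone. Note that the port group is `.*?` rather than digits, so a three-segment input such as `git@github.com:org/team/repo` would report `org` as the port; replacing it with `\d+` is one possible tightening.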

https://regex101.com/r/wdE30O/5


Thank you @Jan for the optimizations.



Answer 2:

If you're on Python, you could write your own parser, for example with the parsimonious library:

from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor

data = """git@github.com:james/example
git@github.com:007/example
git@github.com:22/james/example
git@github.com:22/007/example"""

class GitVisitor(NodeVisitor):
    grammar = Grammar(
        r"""
        expr        = user at domain colon rest

        user        = word+
        domain      = ~"[^:]+"
        rest        = (port path) / path

        path        = word slash word
        port        = digits slash

        slash       = "/"
        colon       = ":"
        at          = "@"
        digits      = ~"\d+"
        word        = ~"\w+"

        """)

    def generic_visit(self, node, visited_children):
        return visited_children or node

    def visit_user(self, node, visited_children):
        return {"user": node.text}

    def visit_domain(self, node, visited_children):
        return {"domain": node.text}

    def visit_rest(self, node, visited_children):
        child = visited_children[0]
        if isinstance(child, list):
            # first branch, port and path
            return {"port": child[0], "path": child[1]}
        else:
            return {"path": child}

    def visit_path(self, node, visited_children):
        return node.text

    def visit_port(self, node, visited_children):
        digits, _ = visited_children
        return digits.text

    def visit_expr(self, node, visited_children):
        # Merge the dicts returned by the child rules; skip literal nodes ("@", ":").
        out = {}
        for child in visited_children:
            if isinstance(child, dict):
                out.update(child)
        return out

gv = GitVisitor()
for line in data.split("\n"):
    result = gv.parse(line)
    print(result)

This yields:

{'user': 'git', 'domain': 'github.com', 'path': 'james/example'}
{'user': 'git', 'domain': 'github.com', 'path': '007/example'}
{'user': 'git', 'domain': 'github.com', 'port': '22', 'path': 'james/example'}
{'user': 'git', 'domain': 'github.com', 'port': '22', 'path': '007/example'}

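A parser copes with the ambiguity you have here: the ordered choice in the rest rule tries the port-plus-path branch first and falls back to a bare path when that fails, so a numeric username such as 007 is not mistaken for a port.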
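
If pulling in a parsing library is not an option, the same "try port plus path first, then path alone" idea can be approximated with the standard library only. The following is a sketch under that assumption, not a replacement for the grammar above; `WITH_PORT`, `WITHOUT_PORT` and `parse_ordered` are hypothetical names, and the character classes simply mirror the grammar's `\w+`, `\d+` and `[^:]+` rules.

```python
import re

# One pattern per branch of the grammar's `rest` rule, tried in order.
WITH_PORT = re.compile(
    r"^(?P<user>\w+)@(?P<domain>[^:]+):(?P<port>\d+)/(?P<path>\w+/\w+)$"
)
WITHOUT_PORT = re.compile(
    r"^(?P<user>\w+)@(?P<domain>[^:]+):(?P<path>\w+/\w+)$"
)

def parse_ordered(url):
    """Try the port+path form first, then the path-only form (ordered choice)."""
    for pattern in (WITH_PORT, WITHOUT_PORT):
        m = pattern.match(url)
        if m:
            return m.groupdict()
    return None  # neither form matched

print(parse_ordered("git@github.com:007/example"))
print(parse_ordered("git@github.com:22/007/example"))
```

Because the path must contain exactly one slash, `007/example` cannot satisfy the first pattern and is picked up by the second, while `22/007/example` matches the first with port `22`.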