Parse URL in shell script

2020-01-29 06:02发布

I have url like:

sftp://user@host.net/some/random/path

I want to extract user, host and path from this string. Any part can be random length.

11条回答
Lonely孤独者°
2楼-- · 2020-01-29 06:09

I did further parsing, expanding the solution given by @Shirkrin:

#!/bin/bash

parse_url() {
    local query1 query2 path1 path2

    # extract the protocol
    proto="$(echo $1 | grep :// | sed -e's,^\(.*://\).*,\1,g')"

    if [[ ! -z $proto ]] ; then
            # remove the protocol
            url="$(echo ${1/$proto/})"

            # extract the user (if any)
            login="$(echo $url | grep @ | cut -d@ -f1)"

            # extract the host
            host="$(echo ${url/$login@/} | cut -d/ -f1)"

            # by request - try to extract the port
            port="$(echo $host | sed -e 's,^.*:,:,g' -e 's,.*:\([0-9]*\).*,\1,g' -e 's,[^0-9],,g')"

            # extract the uri (if any)
            resource="/$(echo $url | grep / | cut -d/ -f2-)"
    else
            url=""
            login=""
            host=""
            port=""
            resource=$1
    fi

    # extract the path (if any)
    path1="$(echo $resource | grep ? | cut -d? -f1 )"
    path2="$(echo $resource | grep \# | cut -d# -f1 )"
    path=$path1
    if [[ -z $path ]] ; then path=$path2 ; fi
    if [[ -z $path ]] ; then path=$resource ; fi

    # extract the query (if any)
    query1="$(echo $resource | grep ? | cut -d? -f2-)"
    query2="$(echo $query1 | grep \# | cut -d\# -f1 )"
    query=$query2
    if [[ -z $query ]] ; then query=$query1 ; fi

    # extract the fragment (if any)
    fragment="$(echo $resource | grep \# | cut -d\# -f2 )"

    echo "url: $url"
    echo "   proto: $proto"
    echo "   login: $login"
    echo "    host: $host"
    echo "    port: $port"
    echo "resource: $resource"
    echo "    path: $path"
    echo "   query: $query"
    echo "fragment: $fragment"
    echo ""
}

parse_url "http://login:password@example.com:8080/one/more/dir/file.exe?a=sth&b=sth#anchor_fragment"
parse_url "https://example.com/one/more/dir/file.exe#anchor_fragment"
parse_url "http://login:password@example.com:8080/one/more/dir/file.exe#anchor_fragment"
parse_url "ftp://user@example.com:8080/one/more/dir/file.exe?a=sth&b=sth"
parse_url "/one/more/dir/file.exe"
parse_url "file.exe"
parse_url "file.exe#anchor"
查看更多
Root(大扎)
3楼-- · 2020-01-29 06:12

[EDIT 2019] This answer is not meant to be a catch-all, works for everything solution it was intended to provide a simple alternative to the python based version and it ended up having more features than the original.


It answered the basic question in a bash-only way and then was modified multiple times by myself to include a hand full of demands by commenters. I think at this point however adding even more complexity would make it unmaintainable. I know not all things are straight forward (checking for a valid port for example requires comparing hostport and host) but I would rather not add even more complexity.


[Original answer]

Assuming your URL is passed as first parameter to the script:

#!/bin/bash

# extract the protocol
proto="$(echo $1 | grep :// | sed -e's,^\(.*://\).*,\1,g')"
# remove the protocol
url="$(echo ${1/$proto/})"
# extract the user (if any)
user="$(echo $url | grep @ | cut -d@ -f1)"
# extract the host and port
hostport="$(echo ${url/$user@/} | cut -d/ -f1)"
# by request host without port    
host="$(echo $hostport | sed -e 's,:.*,,g')"
# by request - try to extract the port
port="$(echo $hostport | sed -e 's,^.*:,:,g' -e 's,.*:\([0-9]*\).*,\1,g' -e 's,[^0-9],,g')"
# extract the path (if any)
path="$(echo $url | grep / | cut -d/ -f2-)"

echo "url: $url"
echo "  proto: $proto"
echo "  user: $user"
echo "  host: $host"
echo "  port: $port"
echo "  path: $path"

I must admit this is not the cleanest solution but it doesn't rely on another scripting language like perl or python. (Providing a solution using one of them would produce cleaner results ;) )

Using your example the results are:

url: user@host.net/some/random/path
  proto: sftp://
  user: user
  host: host.net
  port:
  path: some/random/path

This will also work for URLs without a protocol/username or path. In this case the respective variable will contain an empty string.

[EDIT]
If your bash version won't cope with the substitutions (${1/$proto/}) try this:

#!/bin/bash

# extract the protocol
proto="$(echo $1 | grep :// | sed -e's,^\(.*://\).*,\1,g')"

# remove the protocol -- updated
url=$(echo $1 | sed -e s,$proto,,g)

# extract the user (if any)
user="$(echo $url | grep @ | cut -d@ -f1)"

# extract the host and port -- updated
hostport=$(echo $url | sed -e s,$user@,,g | cut -d/ -f1)

# by request host without port
host="$(echo $hostport | sed -e 's,:.*,,g')"
# by request - try to extract the port
port="$(echo $hostport | sed -e 's,^.*:,:,g' -e 's,.*:\([0-9]*\).*,\1,g' -e 's,[^0-9],,g')"

# extract the path (if any)
path="$(echo $url | grep / | cut -d/ -f2-)"
查看更多
乱世女痞
4楼-- · 2020-01-29 06:15

This solution in principle works the same as Adam Ryczkowski's, in this thread - but has improved regular expression based on RFC3986, (with some changes) and fixes some errors (e.g. userinfo can contain '_' character). This can also understand relative URIs (e.g. to extract query or fragment).

# !/bin/bash

# Following regex is based on https://tools.ietf.org/html/rfc3986#appendix-B with
# additional sub-expressions to split authority into userinfo, host and port
#
readonly URI_REGEX='^(([^:/?#]+):)?(//((([^:/?#]+)@)?([^:/?#]+)(:([0-9]+))?))?(/([^?#]*))(\?([^#]*))?(#(.*))?'
#                    ↑↑            ↑  ↑↑↑            ↑         ↑ ↑            ↑ ↑        ↑  ↑        ↑ ↑
#                    |2 scheme     |  ||6 userinfo   7 host    | 9 port       | 11 rpath |  13 query | 15 fragment
#                    1 scheme:     |  |5 userinfo@             8 :…           10 path    12 ?…       14 #…
#                                  |  4 authority
#                                  3 //…

parse_scheme () {
    [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[2]}"
}

parse_authority () {
    [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[4]}"
}

parse_user () {
    [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[6]}"
}

parse_host () {
    [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[7]}"
}

parse_port () {
    [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[9]}"
}

parse_path () {
    [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[10]}"
}

parse_rpath () {
    [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[11]}"
}

parse_query () {
    [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[13]}"
}

parse_fragment () {
    [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[15]}"
}
查看更多
Root(大扎)
5楼-- · 2020-01-29 06:17

I did not like above methods and wrote my own. It is for ftp link, just replace ftp with http if your need it. First line is a small validation of link, link should look like ftp://user:pass@host.com/path/to/something.

if ! echo "$url" | grep -q '^[[:blank:]]*ftp://[[:alnum:]]\+:[[:alnum:]]\+@[[:alnum:]\.]\+/.*[[:blank:]]*$'; then return 1; fi

login=$(  echo "$url" | sed 's|[[:blank:]]*ftp://\([^:]\+\):\([^@]\+\)@\([^/]\+\)\(/.*\)[[:blank:]]*|\1|' )
pass=$(   echo "$url" | sed 's|[[:blank:]]*ftp://\([^:]\+\):\([^@]\+\)@\([^/]\+\)\(/.*\)[[:blank:]]*|\2|' )
host=$(   echo "$url" | sed 's|[[:blank:]]*ftp://\([^:]\+\):\([^@]\+\)@\([^/]\+\)\(/.*\)[[:blank:]]*|\3|' )
dir=$(    echo "$url" | sed 's|[[:blank:]]*ftp://\([^:]\+\):\([^@]\+\)@\([^/]\+\)\(/.*\)[[:blank:]]*|\4|' )

My actual goal was to check ftp access by url. Here is the full result:

#!/bin/bash

test_ftp_url()  # lftp may hang on some ftp problems, like no connection
    {
    local url="$1"

    if ! echo "$url" | grep -q '^[[:blank:]]*ftp://[[:alnum:]]\+:[[:alnum:]]\+@[[:alnum:]\.]\+/.*[[:blank:]]*$'; then return 1; fi

    local login=$(  echo "$url" | sed 's|[[:blank:]]*ftp://\([^:]\+\):\([^@]\+\)@\([^/]\+\)\(/.*\)[[:blank:]]*|\1|' )
    local pass=$(   echo "$url" | sed 's|[[:blank:]]*ftp://\([^:]\+\):\([^@]\+\)@\([^/]\+\)\(/.*\)[[:blank:]]*|\2|' )
    local host=$(   echo "$url" | sed 's|[[:blank:]]*ftp://\([^:]\+\):\([^@]\+\)@\([^/]\+\)\(/.*\)[[:blank:]]*|\3|' )
    local dir=$(    echo "$url" | sed 's|[[:blank:]]*ftp://\([^:]\+\):\([^@]\+\)@\([^/]\+\)\(/.*\)[[:blank:]]*|\4|' )

    exec 3>&2 2>/dev/null
    exec 6<>"/dev/tcp/$host/21" || { exec 2>&3 3>&-; echo 'Bash network support is disabled. Skipping ftp check.'; return 0; }

    read <&6
    if ! echo "${REPLY//$'\r'}" | grep -q '^220'; then exec 2>&3  3>&- 6>&-; return 3; fi   # 220 vsFTPd 3.0.2+ (ext.1) ready...

    echo -e "USER $login\r" >&6; read <&6
    if ! echo "${REPLY//$'\r'}" | grep -q '^331'; then exec 2>&3  3>&- 6>&-; return 4; fi   # 331 Please specify the password.

    echo -e "PASS $pass\r" >&6; read <&6
    if ! echo "${REPLY//$'\r'}" | grep -q '^230'; then exec 2>&3  3>&- 6>&-; return 5; fi   # 230 Login successful.

    echo -e "CWD $dir\r" >&6; read <&6
    if ! echo "${REPLY//$'\r'}" | grep -q '^250'; then exec 2>&3  3>&- 6>&-; return 6; fi   # 250 Directory successfully changed.

    echo -e "QUIT\r" >&6

    exec 2>&3  3>&- 6>&-
    return 0
    }

test_ftp_url 'ftp://fz223free:fz223free@ftp.zakupki.gov.ru/out/nsi/nsiProtocol/daily'
echo "$?"
查看更多
乱世女痞
6楼-- · 2020-01-29 06:19

Using Python (best tool for this job, IMHO):

#!/usr/bin/env python

import os
from urlparse import urlparse

uri = os.environ['NAUTILUS_SCRIPT_CURRENT_URI']
result = urlparse(uri)
user, host = result.netloc.split('@')
path = result.path
print('user=', user)
print('host=', host)
print('path=', path)

Further reading:

查看更多
我欲成王,谁敢阻挡
7楼-- · 2020-01-29 06:19

If you have access to Node.js:

export MY_URI=sftp://user@host.net/some/random/path
node -e "console.log(url.parse(process.env.MY_URI).user)"
node -e "console.log(url.parse(process.env.MY_URI).host)"
node -e "console.log(url.parse(process.env.MY_URI).path)"

This will output:

user
host.net
/some/random/path
查看更多
登录 后发表回答