Extract filename and path from URL in bash script

Posted 2019-03-09 02:49

Question:

In my bash script I need to extract just the path from the given URL. For example, from the variable containing string:

http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth

I want to extract to some other variable only the:

/one/more/dir/file.exe

part. Of course login, password, filename and parameters are optional.

Since I am new to sed and awk, I am asking for help. Please advise me how to do it. Thank you!

Answer 1:

In bash:

URL='http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'
URL_NOPRO=${URL:7}
URL_REL=${URL_NOPRO#*/}
echo "/${URL_REL%%\?*}"

Works only if the URL starts with http:// or another scheme prefix of the same length. Otherwise, it's probably easier to use sed, grep or cut ...
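
If the scheme length is not known in advance, one possible variation is to strip everything up to the first "://" rather than a fixed seven characters. This is only a sketch: the lowercase variable names are mine, and it assumes the URL has at least one slash after the host.

```shell
url='https://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'

no_scheme=${url#*://}     # drop the shortest prefix ending in "://"
rel=${no_scheme#*/}       # drop user info, host and port up to the first "/"
path="/${rel%%\?*}"       # cut at the first "?" and restore the leading slash

printf '%s\n' "$path"
```

For the URL above this prints /one/more/dir/file.exe.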



Answer 2:

There are built-in functions in bash to handle this, e.g., the string pattern-matching operators:

  1. '#' remove minimal matching prefixes
  2. '##' remove maximal matching prefixes
  3. '%' remove minimal matching suffixes
  4. '%%' remove maximal matching suffixes

For example:

FILE=/home/user/src/prog.c
echo ${FILE#/*/}  # ==> user/src/prog.c
echo ${FILE##/*/} # ==> prog.c
echo ${FILE%/*}   # ==> /home/user/src
echo ${FILE%%/*}  # ==> (empty string)
echo ${FILE%.c}   # ==> /home/user/src/prog

All this is from the excellent book "A Practical Guide to Linux Commands, Editors, and Shell Programming" by Mark G. Sobell (http://www.sobell.com/)
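
Applied to the path from the question, the same four operators can split off the directory, file name and extension. A small sketch; the variable names are made up for illustration.

```shell
path='/one/more/dir/file.exe'

dir=${path%/*}      # shortest suffix "/..." removed  -> /one/more/dir
file=${path##*/}    # longest prefix ".../" removed   -> file.exe
stem=${file%.*}     # shortest suffix ".xxx" removed  -> file
ext=${file##*.}     # longest prefix "xxx." removed   -> exe

printf '%s\n' "$dir" "$file" "$stem" "$ext"
```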



Answer 3:

This uses bash and cut as another way of doing this. It's ugly, but it works (at least for the example). Sometimes I like to use what I call cut sieves to whittle down the information that I am actually looking for.

Note: Performance-wise, this may be a problem.

Given those caveats:

First let's echo the line:

echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'

Which gives us:

http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth

Then let's cut the line at the @ as a convenient way to strip out the http://login:password:

echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
cut -d@ -f2

That gives us this:

example.com/one/more/dir/file.exe?a=sth&b=sth

To get rid of the hostname, let's do another cut and use the / as the delimiter while asking cut to give us the second field and everything after (essentially, to the end of the line). It looks like this:

echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
cut -d@ -f2 | \
cut -d/ -f2-

Which, in turn, results in:

one/more/dir/file.exe?a=sth&b=sth

And finally, we want to strip off all the parameters from the end. Again, we'll use cut and this time the ? as the delimiter and tell it to give us just the first field. That brings us to the end and looks like this:

echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
cut -d@ -f2 | \
cut -d/ -f2- | \
cut -d'?' -f1

And the output is:

one/more/dir/file.exe

This is just another way to do it: you whittle away the data you don't need, one interactive step at a time, until only what you do need is left. Note that the result has no leading slash, so prepend one if you need it.

If I wanted to stuff this into a variable in a script, I'd do something like this:

#!/bin/bash

url="http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth"
file_path=$(echo "${url}" | cut -d@ -f2 | cut -d/ -f2- | cut -d'?' -f1)
echo "${file_path}"
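
Since the login and password are optional, a possible variation is to skip the cut on @ and take everything from the fourth /-separated field onward. This is a sketch that assumes neither the login nor the password contains a slash.

```shell
url='http://example.com/one/more/dir/file.exe?a=sth&b=sth'

# Fields split on "/": 1="http:", 2="", 3=host (with optional login), 4...=path parts
file_path="/$(printf '%s\n' "$url" | cut -d/ -f4- | cut -d'?' -f1)"

printf '%s\n' "$file_path"
```

This prints /one/more/dir/file.exe whether or not login:password@ is present.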

Hope it helps.



Answer 4:

gawk

echo "http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth" | awk -F"/" '
{
 $1=$2=$3=""
 gsub(/\?.*/,"",$NF)
 print substr($0,3)
}' OFS="/"

output

# ./test.sh
/one/more/dir/file.exe


Answer 5:

If you have gawk:

$ echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
  gawk '$0=gensub(/http:\/\/[^/]+(\/[^?]+)\?.*/,"\\1",1)'

or

$ echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
  gawk -F'(http://[^/]+|[?])' '$0=$2'

GNU awk can use a regular expression as the field separator (FS).
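
If gensub is not available, something similar can probably be done with two plain sub() calls. A sketch, assuming any awk that accepts string regular expressions in sub(); the variable name path is mine.

```shell
url='http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'

path=$(printf '%s\n' "$url" | awk '{
  sub("^[^:]*://[^/]*", "")       # drop scheme, login, host and port
  sub("[?#].*", "")               # drop query string and fragment
  print ($0 == "" ? "/" : $0)     # nothing left means the path is just "/"
}')

printf '%s\n' "$path"
```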



Answer 6:

The Perl snippet is intriguing and, since Perl is present in most Linux distros, quite useful, but it doesn't do the job completely. Specifically, it does not decode the percent-encoded UTF-8 characters of the URL/URI into a plain file-system path. Let me give an example of the problem. The original URI may be:

file:///home/username/Music/Jean-Michel%20Jarre/M%C3%A9tamorphoses/01%20-%20Je%20me%20souviens.mp3

The corresponding path would be:

/home/username/Music/Jean-Michel Jarre/Métamorphoses/01 - Je me souviens.mp3

%20 became space, %C3%A9 became 'é'. Is there a Linux command, bash feature, or Perl script that can handle this transformation, or do I have to write a humongous series of sed substring substitutions? What about the reverse transformation, from path to URL/URI?

(Follow-up)

Looking at http://search.cpan.org/~gaas/URI-1.54/URI.pm, I first saw the as_iri method, but that was apparently missing from my Linux (or is not applicable, somehow). Turns out the solution is to replace the "->path" part with "->file". You can then break that further down using basename and dirname, etc. The solution is thus:

path=$( echo "$url" | perl -MURI -le 'chomp($url = <>); print URI->new($url)->file' )

Oddly, using "->dir" instead of "->file" does NOT extract the directory part: rather, it formats the URI so it can be used as an argument to mkdir and the like.

(Further follow-up)

Any reason why the line cannot be shortened to this?

path=$( echo "$url" | perl -MURI -le 'print URI->new(<>)->file' )
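
For the percent-decoding step alone, a pure-bash sketch is also possible. urldecode is a hypothetical helper name, not something from URI.pm. It assumes every % in the input starts a valid two-digit hex escape and that the input contains no backslashes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decode %XX escapes using bash's printf.
urldecode() {
  # Turn every "%" into "\x" so that printf's %b expands the hex escapes.
  printf '%b' "${1//%/\\x}"
}

uri='file:///home/username/Music/Jean-Michel%20Jarre/M%C3%A9tamorphoses/01%20-%20Je%20me%20souviens.mp3'
path=$(urldecode "${uri#file://}")   # strip the scheme, then decode

printf '%s\n' "$path"
```

The decoded bytes are emitted as-is, so the é shows up correctly only in a UTF-8 terminal.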


Answer 7:

Best bet is to find a language that has a URL parsing library:

url="http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth"
path=$( echo "$url" | ruby -ruri -e 'puts URI.parse(gets.chomp).path' )

or

path=$( echo "$url" | perl -MURI -le 'chomp($url = <>); print URI->new($url)->path' )


Answer 8:

How about this?

echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
sed 's|.*://[^/]*/\([^?]*\)?.*|/\1|g'
  • .*://[^/]*/ : http://login:password@example.com/
  • \([^?]*\) : one/more/dir/file.exe
  • ?.* : ?a=sth&b=sth
  • /\1 : /one/more/dir/file.exe
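
The pattern above contains a literal ?, so a URL without parameters is left unchanged. Since the parameters are optional, one possible workaround is two smaller substitutions. This is a sketch, not a full URL parser.

```shell
url='http://login:password@example.com/one/more/dir/file.exe'

# 1st expression: drop scheme, login, host and port.
# 2nd expression: drop a query string or fragment, if there is one.
path=$(printf '%s\n' "$url" | sed -e 's|^[^:]*://[^/]*||' -e 's|[?#].*$||')

printf '%s\n' "$path"
```

This prints /one/more/dir/file.exe with or without a trailing ?a=sth&b=sth.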


Answer 9:

I agree that "cut" is a wonderful tool on the command line. However, a more purely bash solution is to use a powerful feature of variable expansion in bash. For example:

pass_first_last='password,firstname,lastname'

pass=${pass_first_last%%,*}

first_last=${pass_first_last#*,}

first=${first_last%,*}

last=${first_last#*,}

or, alternatively,

last=${pass_first_last##*,}
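
The same expansions can pull the login, password and host out of the URL from the question. A sketch with made-up variable names; it assumes the login:password@ part is actually present.

```shell
url='http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'

rest=${url#*://}            # login:password@example.com/one/more/...
authority=${rest%%/*}       # login:password@example.com
userinfo=${authority%@*}    # login:password
host=${authority##*@}       # example.com
user=${userinfo%%:*}        # login
password=${userinfo#*:}     # password

printf '%s\n' "$user" "$password" "$host"
```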


Answer 10:

I wrote a function that will extract any part of the URL. I've only tested it in bash. Usage:

url_parse <url> [url-part]

example:

$ url_parse "http://example.com:8080/home/index.html" path
home/index.html

code:

url_parse() {
  local -r url=$1 url_part=$2
  #define url tokens and url regular expression
  local -r protocol='^[^:]+' user='[^:@]+' password='[^@]+' host='[^:/?#]+' \
    port='[0-9]+' path='\/([^?#]*)' query='\?([^#]+)' fragment='#(.*)'
  local -r auth="($user)(:($password))?@"
  local -r connection="($auth)?($host)(:($port))?"
  local -r url_regex="($protocol):\/\/($connection)?($path)?($query)?($fragment)?$"
  #parse url and create an array (the 3-argument match() is a gawk extension)
  local -a url_arr
  IFS=',' read -r -a url_arr <<< "$(printf '%s\n' "$url" | gawk -v OFS=, \
    "{match(\$0,/$url_regex/,a);print a[1],a[4],a[6],a[7],a[9],a[11],a[13],a[15]}")"

  [[ ${url_arr[0]} ]] || { echo "Invalid URL: $url" >&2 ; return 1 ; }

  case $url_part in
    protocol) echo ${url_arr[0]} ;;
    auth)     echo ${url_arr[1]}:${url_arr[2]} ;; # ex: john.doe:1234
    user)     echo ${url_arr[1]} ;;
    password) echo ${url_arr[2]} ;;
    host-port)echo ${url_arr[3]}:${url_arr[4]} ;; #ex: example.com:8080
    host)     echo ${url_arr[3]} ;;
    port)     echo ${url_arr[4]} ;;
    path)     echo ${url_arr[5]} ;;
    query)    echo ${url_arr[6]} ;;
    fragment) echo ${url_arr[7]} ;;
    info)     echo -e "protocol:${url_arr[0]}\nuser:${url_arr[1]}\npassword:${url_arr[2]}\nhost:${url_arr[3]}\nport:${url_arr[4]}\npath:${url_arr[5]}\nquery:${url_arr[6]}\nfragment:${url_arr[7]}";;
    "")       ;; # used to validate url
    *)        echo "Invalid URL part: $url_part" >&2 ; return 1 ;;
  esac
}
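
If only the path is needed, bash's own [[ =~ ]] operator and BASH_REMATCH may be enough, with no awk at all. This is a rough sketch: url_path is a hypothetical name, and the regular expression is much looser than the one above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the path component of a URL, or "/" if it has none.
url_path() {
  # scheme "://" authority, then an optional path running up to "?" or "#"
  local re='^[^:/?#]+://[^/?#]*(/[^?#]*)?'
  [[ $1 =~ $re ]] || return 1
  printf '%s\n' "${BASH_REMATCH[1]:-/}"
}

url_path 'http://example.com:8080/home/index.html'
```

Unlike url_parse above, this keeps the leading slash, so the call prints /home/index.html.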


Answer 11:

Using only bash builtins:

path="/${url#*://*/}" && [[ "/${url}" == "${path}" ]] && path="/"

What this does is:

  1. remove the prefix *://*/ (so this would be your protocol and hostname+port)
  2. check if we actually succeeded in removing anything - if not, then this implies there was no third slash (assuming this is a well-formed URL)
  3. if there was no third slash, then the path is just /

note: the quotation marks aren't actually needed here, but I find it easier to read with them in
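
The one-liner keeps the query string. If it should go as well, one more expansion can be added. A sketch of the three steps plus that extra one, again assuming a well-formed URL.

```shell
url='http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'

path="/${url#*://*/}"                       # step 1: remove scheme, host and port
[ "/${url}" = "${path}" ] && path="/"       # steps 2 and 3: nothing removed, so no third slash
path=${path%%[?#]*}                         # extra step: drop the query string and fragment

printf '%s\n' "$path"
```

For the example URL this prints /one/more/dir/file.exe.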



Answer 12:

url="http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth"

GNU grep

$ grep -Po '\w\K/\w+[^?]+' <<<$url
/one/more/dir/file.exe

BSD grep

$ grep -o '\w/\w\+[^?]\+' <<<$url | tail -c+2
/one/more/dir/file.exe

ripgrep

$ rg -o '\w(/\w+[^?]+)' -r '$1' <<<$url
/one/more/dir/file.exe

To get other parts of the URL, see: Getting parts of a URL (Regex).



Answer 13:

This perl one-liner works for me on the command line, so could be added to your script.

echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | perl -n -e 'm{http://[^/]+(/[^?]+)};print $1'

Note that the capture stops at the first '?' character. If the URL has no '?', it runs to the end of the line and includes the trailing newline.



Tags: bash url parsing