Extract filename and path from URL in bash script

In my bash script I need to extract just the path from the given URL. For example, from the variable containing string:

http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth

I want to extract to some other variable only the:

/one/more/dir/file.exe

part. Of course login, password, filename and parameters are optional.

Since I am new to sed and awk I ask you for help. Please, advice me how to do it. Thank you!

标签： bash url parsing

13条回答

姐就是有狂的资本

2楼-- · 2019-03-09 02:41

This perl one-liner works for me on the command line, so could be added to your script.

echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | perl -n -e 'm{http://[^/]+(/[^?]+)};print $1'

Note that this assumes there will always be a '?' character at the end of the string you want to extract.

0人赞添加讨论(0) 举报

姐就是有狂的资本

3楼-- · 2019-03-09 02:54

Best bet is to find a language that has a URL parsing library:

url="http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth"
path=$( echo "$url" | ruby -ruri -e 'puts URI.parse(gets.chomp).path' )

path=$( echo "$url" | perl -MURI -le 'chomp($url = <>); print URI->new($url)->path' )

0人赞添加讨论(0) 举报

Summer. ? 凉城

4楼-- · 2019-03-09 02:55

In bash:

URL='http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'
URL_NOPRO=${URL:7}
URL_REL=${URL_NOPRO#*/}
echo "/${URL_REL%%\?*}"

Works only if URL starts with http:// or a protocol with the same length Otherwise, it's probably easier to use regex with sed, grep or cut ...

0人赞添加讨论(0) 举报

太酷不给撩

5楼-- · 2019-03-09 02:56

I wrote a function to that will extract any part or the URL. I've only tested it in bash. Usage:

url_parse <url> [url-part]

example:

$ url_parse "http://example.com:8080/home/index.html" path
home/index.html

code:

url_parse() {
  local -r url=$1 url_part=$2
  #define url tokens and url regular expression
  local -r protocol='^[^:]+' user='[^:@]+' password='[^@]+' host='[^:/?#]+' \
    port='[0-9]+' path='\/([^?#]*)' query='\?([^#]+)' fragment='#(.*)'
  local -r auth="($user)(:($password))?@"
  local -r connection="($auth)?($host)(:($port))?"
  local -r url_regex="($protocol):\/\/($connection)?($path)?($query)?($fragment)?$"
  #parse url and create an array
  IFS=',' read -r -a url_arr <<< $(echo $url | awk -v OFS=, \
    "{match(\$0,/$url_regex/,a);print a[1],a[4],a[6],a[7],a[9],a[11],a[13],a[15]}")

  [[ ${url_arr[0]} ]] || { echo "Invalid URL: $url" >&2 ; return 1 ; }

  case $url_part in
    protocol) echo ${url_arr[0]} ;;
    auth)     echo ${url_arr[1]}:${url_arr[2]} ;; # ex: john.doe:1234
    user)     echo ${url_arr[1]} ;;
    password) echo ${url_arr[2]} ;;
    host-port)echo ${url_arr[3]}:${url_arr[4]} ;; #ex: example.com:8080
    host)     echo ${url_arr[3]} ;;
    port)     echo ${url_arr[4]} ;;
    path)     echo ${url_arr[5]} ;;
    query)    echo ${url_arr[6]} ;;
    fragment) echo ${url_arr[7]} ;;
    info)     echo -e "protocol:${url_arr[0]}\nuser:${url_arr[1]}\npassword:${url_arr[2]}\nhost:${url_arr[3]}\nport:${url_arr[4]}\npath:${url_arr[5]}\nquery:${url_arr[6]}\nfragment:${url_arr[7]}";;
    "")       ;; # used to validate url
    *)        echo "Invalid URL part: $url_part" >&2 ; return 1 ;;
  esac
}

0人赞添加讨论(0) 举报

一夜七次

6楼-- · 2019-03-09 02:58

gawk

echo "http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth" | awk -F"/" '
{
 $1=$2=$3=""
 gsub(/\?.*/,"",$NF)
 print substr($0,3)
}' OFS="/"

output

# ./test.sh
/one/more/dir/file.exe

0人赞添加讨论(0) 举报

小情绪 Triste *

7楼-- · 2019-03-09 02:59

This uses bash and cut as another way of doing this. It's ugly, but it works (at least for the example). Sometimes I like to use what I call cut sieves to whittle down the information that I am actually looking for.

Note: Performance wise, this may be a problem.

Given those caveats:

First let's echo the the line:

echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'

Which gives us:

http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth

Then let's cut the line at the @ as a convenient way to strip out the http://login:password:

echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
cut -d@ -f2

That give us this:

example.com/one/more/dir/file.exe?a=sth&b=sth

To get rid of the hostname, let's do another cut and use the / as the delimiter while asking cut to give us the second field and everything after (essentially, to the end of the line). It looks like this:

echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
cut -d@ -f2 | \
cut -d/ -f2-

Which, in turn, results in:

one/more/dir/file.exe?a=sth&b=sth

And finally, we want to strip off all the parameters from the end. Again, we'll use cut and this time the ? as the delimiter and tell it to give us just the first field. That brings us to the end and looks like this:

echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
cut -d@ -f2 | \
cut -d/ -f2- | \
cut -d? -f1

And the output is:

one/more/dir/file.exe

Just another way to do it and this approach is one way to whittle away that data you don't need in an interactive way to come up with something you do need.

If I wanted to stuff this into a variable in a script, I'd do something like this:

#!/bin/bash

url="http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth"
file_path=$(echo ${url} | cut -d@ -f2 | cut -d/ -f2- | cut -d? -f1)
echo ${file_path}

Hope it helps.

0人赞添加讨论(0) 举报

1 2 3 下一页

Extract filename and path from URL in bash script

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间