Python: Get URL path sections

Posted 2019-01-08 16:46

How do I get specific path sections from a URL? For example, I want a function that operates on this:

http://www.mydomain.com/hithere?image=2934

and returns "hithere"

or operates on this:

http://www.mydomain.com/hithere/something/else

and returns the same thing ("hithere")

I know this will probably use urllib or urllib2 but I can't figure out from the docs how to get only a section of the path.

Tags: python url
6 answers
叛逆
#2 · 2019-01-08 17:09

A combination of urlparse and os.path.split will do the trick. The following script stores all sections of a URL path in a list, in reverse order.

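import os.path
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

def generate_sections_of_url(url):
    path = urlparse(url).path
    sections = []
    # Also stop on '' so a URL with no path does not loop forever.
    while path not in ('/', ''):
        path, tail = os.path.split(path)
        sections.append(tail)
    return sections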

For http://www.mydomain.com/hithere/something/else this would return: ["else", "something", "hithere"]

手持菜刀,她持情操
#3 · 2019-01-08 17:14

Extract the path component of the URL with urlparse:

>>> import urlparse  # Python 2; on Python 3: from urllib import parse as urlparse
>>> path = urlparse.urlparse('http://www.example.com/hithere/something/else').path
>>> path
'/hithere/something/else'

Split the path into components with os.path.split:

>>> import os.path
>>> os.path.split(path)
('/hithere/something', 'else')

The dirname and basename functions give you the two pieces of the split; perhaps use dirname in a while loop:

>>> while os.path.dirname(path) not in ('/', ''):  # '' guards against an empty path
...     path = os.path.dirname(path)
... 
>>> path
'/hithere'
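
When only the first section is wanted, as in the question, a plain string split may be simpler than the dirname loop. The following is a minimal sketch for Python 3, not the answer's own method: the helper name first_path_section is hypothetical, and returning None for an empty path is an assumption.

```python
from urllib.parse import urlparse

def first_path_section(url):
    """Return the first non-empty segment of the URL path, or None."""
    # urlparse keeps the query string and fragment out of .path
    path = urlparse(url).path
    # Drop empty strings produced by leading, trailing, or doubled slashes
    segments = [segment for segment in path.split('/') if segment]
    return segments[0] if segments else None

print(first_path_section('http://www.mydomain.com/hithere?image=2934'))      # hithere
print(first_path_section('http://www.mydomain.com/hithere/something/else'))  # hithere
```

Both URLs from the question give "hithere"; a URL with no path at all gives None instead of looping.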
Deceive 欺骗
#4 · 2019-01-08 17:17

Note that in Python 3 the import has changed to from urllib.parse import urlparse (see the documentation). Here is an example:

>>> from urllib.parse import urlparse
>>> url = 's3://bucket.test/my/file/directory'
>>> p = urlparse(url)
>>> p
ParseResult(scheme='s3', netloc='bucket.test', path='/my/file/directory', params='', query='', fragment='')
>>> p.scheme
's3'
>>> p.netloc
'bucket.test'
>>> p.path
'/my/file/directory'
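
The same ParseResult also keeps the query string separate from the path, which is what makes the first URL in the question easy to handle. A short sketch using only urlparse and parse_qs from urllib.parse; the variable names are illustrative:

```python
from urllib.parse import urlparse, parse_qs

p = urlparse('http://www.mydomain.com/hithere?image=2934')

path = p.path               # the query string is not part of the path
query = parse_qs(p.query)   # maps each parameter name to a list of values
first = path.strip('/').split('/')[0]

print(path)    # /hithere
print(query)   # {'image': ['2934']}
print(first)   # hithere
```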
手持菜刀,她持情操
#5 · 2019-01-08 17:21

Python 3.4+ solution:

url_path = PurePosixPath(urllib.parse.unquote(urllib.parse.urlparse(url).path))
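
A self-contained sketch of the one-liner above, with the imports it needs and the sections read from the parts attribute. The example URL is taken from the question; the variable name sections is an assumption for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse

url = 'http://www.mydomain.com/hithere/something/else'  # example URL from the question
url_path = PurePosixPath(unquote(urlparse(url).path))

# For an absolute path, parts[0] is the root '/', so the sections start at index 1
sections = url_path.parts[1:]
print(sections)     # ('hithere', 'something', 'else')
print(sections[0])  # hithere
```

For a URL with no path, sections is an empty tuple, so it is worth checking that it is non-empty before indexing.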
爱情/是我丢掉的垃圾
#6 · 2019-01-08 17:31

>>> import urlparse  # Python 2; on Python 3: from urllib import parse as urlparse
>>> output = urlparse.urlparse('http://www.example.com/temp/something/happen/index.html').path
>>> output
'/temp/something/happen/index.html'

Split the path with the string's built-in rpartition method, which splits at the last slash:

>>> output.rpartition('/')[0]
'/temp/something/happen'
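
rpartition drops the last section; to get the first section instead, as the question asks, str.partition can be applied once the leading slash is removed. A minimal sketch with the path hard-coded from the example above:

```python
output = '/temp/something/happen/index.html'

# partition splits at the first '/', so strip the leading one beforehand
first, _, rest = output.lstrip('/').partition('/')

print(first)  # temp
print(rest)   # something/happen/index.html
```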
叛逆
#7 · 2019-01-08 17:35

The best option is to use the posixpath module when working with the path component of URLs. This module has the same interface as os.path and consistently operates on POSIX paths, whether it is used on POSIX or Windows NT-based platforms.


Sample Code:

#!/usr/bin/env python3

import urllib.parse
import sys
import posixpath
import ntpath
import json

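def path_parse( path_string, *, normalize = True, module = posixpath ):
    result = []
    if normalize:
        tmp = module.normpath( path_string )
    else:
        tmp = path_string
    # Stop when split() no longer shrinks the path (root or empty string);
    # comparing against "/" alone loops forever on an empty or relative path.
    while True:
        ( head, item ) = module.split( tmp )
        if head == tmp:
            break
        result.insert( 0, item )
        tmp = head
    return result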

def dump_array( array ):
    string = "[ "
    for index, item in enumerate( array ):
        if index > 0:
            string += ", "
        string += "\"{}\"".format( item )
    string += " ]"
    return string

def test_url( url, *, normalize = True, module = posixpath ):
    url_parsed = urllib.parse.urlparse( url )
    path_parsed = path_parse( urllib.parse.unquote( url_parsed.path ),
        normalize=normalize, module=module )
    sys.stdout.write( "{}\n  --[n={},m={}]-->\n    {}\n".format( 
        url, normalize, module.__name__, dump_array( path_parsed ) ) )

test_url( "http://eg.com/hithere/something/else" )
test_url( "http://eg.com/hithere/something/else/" )
test_url( "http://eg.com/hithere/something/else/", normalize = False )
test_url( "http://eg.com/hithere/../else" )
test_url( "http://eg.com/hithere/../else", normalize = False )
test_url( "http://eg.com/hithere/../../else" )
test_url( "http://eg.com/hithere/../../else", normalize = False )
test_url( "http://eg.com/hithere/something/./else" )
test_url( "http://eg.com/hithere/something/./else", normalize = False )
test_url( "http://eg.com/hithere/something/./else/./" )
test_url( "http://eg.com/hithere/something/./else/./", normalize = False )

test_url( "http://eg.com/see%5C/if%5C/this%5C/works", normalize = False )
test_url( "http://eg.com/see%5C/if%5C/this%5C/works", normalize = False,
    module = ntpath )

Code output:

http://eg.com/hithere/something/else
  --[n=True,m=posixpath]-->
    [ "hithere", "something", "else" ]
http://eg.com/hithere/something/else/
  --[n=True,m=posixpath]-->
    [ "hithere", "something", "else" ]
http://eg.com/hithere/something/else/
  --[n=False,m=posixpath]-->
    [ "hithere", "something", "else", "" ]
http://eg.com/hithere/../else
  --[n=True,m=posixpath]-->
    [ "else" ]
http://eg.com/hithere/../else
  --[n=False,m=posixpath]-->
    [ "hithere", "..", "else" ]
http://eg.com/hithere/../../else
  --[n=True,m=posixpath]-->
    [ "else" ]
http://eg.com/hithere/../../else
  --[n=False,m=posixpath]-->
    [ "hithere", "..", "..", "else" ]
http://eg.com/hithere/something/./else
  --[n=True,m=posixpath]-->
    [ "hithere", "something", "else" ]
http://eg.com/hithere/something/./else
  --[n=False,m=posixpath]-->
    [ "hithere", "something", ".", "else" ]
http://eg.com/hithere/something/./else/./
  --[n=True,m=posixpath]-->
    [ "hithere", "something", "else" ]
http://eg.com/hithere/something/./else/./
  --[n=False,m=posixpath]-->
    [ "hithere", "something", ".", "else", ".", "" ]
http://eg.com/see%5C/if%5C/this%5C/works
  --[n=False,m=posixpath]-->
    [ "see\", "if\", "this\", "works" ]
http://eg.com/see%5C/if%5C/this%5C/works
  --[n=False,m=ntpath]-->
    [ "see", "if", "this", "works" ]

Notes:

  • On Windows NT-based platforms, os.path is ntpath.
  • On Unix/POSIX-based platforms, os.path is posixpath.
  • ntpath treats the backslash (\) as a path separator, so it does not handle backslashes in URL paths correctly (see the last two cases in the code/output), which is why posixpath is recommended.
  • Remember to use urllib.parse.unquote to decode percent-escapes in the path.
  • Consider using posixpath.normpath to resolve . and .. segments and drop trailing slashes.
  • The semantics of multiple path separators (/) are not defined by RFC 3986. However, posixpath.normpath collapses multiple adjacent path separators (i.e. it treats ///, // and / the same), except that exactly two leading slashes are preserved.
  • Even though POSIX and URL paths have similar syntax and semantics, they are not identical.
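
The notes on normpath, repeated separators, and unquote can be checked directly. This is a small sketch of the behaviour of CPython's posixpath and urllib.parse; the paths are made-up examples and the variable names are illustrative.

```python
import posixpath
from urllib.parse import unquote, urlparse

# normpath collapses repeated separators and resolves '.' and '..'
normalized = posixpath.normpath('/hithere//something/./else/../')
print(normalized)  # /hithere/something

# Exactly two leading slashes are preserved; three or more become one
print(posixpath.normpath('//hithere'))   # //hithere
print(posixpath.normpath('///hithere'))  # /hithere

# unquote decodes percent-escapes: %5C becomes a literal backslash,
# which posixpath keeps as part of the section name
path = unquote(urlparse('http://eg.com/see%5C/if').path)
print(posixpath.split(path))  # ('/see\\', 'if')
```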
