Python: Get URL path sections

How do I get specific path sections from a url? For example, I want a function which operates on this:

http://www.mydomain.com/hithere?image=2934

and returns "hithere"

or operates on this:

http://www.mydomain.com/hithere/something/else

and returns the same thing ("hithere")

I know this will probably use urllib or urllib2 but I can't figure out from the docs how to get only a section of the path.

Tags: python url

6 Answers

叛逆

#2 · 2019-01-08 17:09

A combination of urlparse and os.path.split will do the trick. The following script stores all sections of a url in a list, backwards.

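import os.path, urlparse  # Python 2; on Python 3: from urllib import parse as urlparse

def generate_sections_of_url(url):
    path = urlparse.urlparse(url).path
    sections = []
    # Stop at the root ('/') or at an empty path; checking only for '/'
    # would loop forever on a URL that has no path at all.
    while path not in ('/', ''):
        path, tail = os.path.split(path)
        sections.append(tail)
    return sections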

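For http://www.mydomain.com/hithere/something/else this returns ["else", "something", "hithere"], so the first path section ("hithere") is the last element of the returned list.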


手持菜刀，她持情操

#3 · 2019-01-08 17:14

Extract the path component of the URL with urlparse:

>>> import urlparse
>>> path = urlparse.urlparse('http://www.example.com/hithere/something/else').path
>>> path
'/hithere/something/else'

Split the path into components with os.path.split:

>>> import os.path
>>> os.path.split(path)
('/hithere/something', 'else')

The dirname and basename functions give you the two pieces of the split; perhaps use dirname in a while loop:

>>> while os.path.dirname(path) != '/':
...     path = os.path.dirname(path)
... 
>>> path
'/hithere'


Deceive 欺骗

#4 · 2019-01-08 17:17

Note that in Python 3 the import has changed to from urllib.parse import urlparse (see the documentation). Here is an example:

>>> from urllib.parse import urlparse
>>> url = 's3://bucket.test/my/file/directory'
>>> p = urlparse(url)
>>> p
ParseResult(scheme='s3', netloc='bucket.test', path='/my/file/directory', params='', query='', fragment='')
>>> p.scheme
's3'
>>> p.netloc
'bucket.test'
>>> p.path
'/my/file/directory'
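
Applied to the first URL from the question, the same call keeps the query string out of the path, so ?image=2934 never needs to be stripped by hand. A minimal sketch, assuming Python 3; the variable names are only illustrative:

```python
from urllib.parse import urlparse, parse_qs

url = 'http://www.mydomain.com/hithere?image=2934'
p = urlparse(url)

# The query string gets its own field and does not appear in .path
print(p.path)             # /hithere
print(p.query)            # image=2934

# parse_qs turns the query into a dict mapping each key to a list of values
print(parse_qs(p.query))  # {'image': ['2934']}
```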


手持菜刀，她持情操

#5 · 2019-01-08 17:21

Python 3.4+ solution:

import urllib.parse
from pathlib import PurePosixPath

# url is the URL string to parse
url_path = PurePosixPath(urllib.parse.unquote(urllib.parse.urlparse(url).path))
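
To get individual sections out of the resulting PurePosixPath, its parts attribute can be used. A sketch, assuming the second example URL from the question; note that for an absolute path, parts includes the root '/' as its first element:

```python
import urllib.parse
from pathlib import PurePosixPath

url = 'http://www.mydomain.com/hithere/something/else'
url_path = PurePosixPath(urllib.parse.unquote(urllib.parse.urlparse(url).path))

# parts is a tuple; for an absolute path the first element is the root '/'
print(url_path.parts)   # ('/', 'hithere', 'something', 'else')

# Guard against URLs whose path has no sections at all
first_section = url_path.parts[1] if len(url_path.parts) > 1 else None
print(first_section)    # hithere
```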


爱情/是我丢掉的垃圾

#6 · 2019-01-08 17:31

import urlparse

output = urlparse.urlparse('http://www.example.com/temp/something/happen/index.html').path

output

'/temp/something/happen/index.html'

Split the path with the string's built-in rpartition method:

output.rpartition('/')[0]

'/temp/something/happen'
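
The question asks for the first section rather than the parent directory; the counterpart method str.partition can give that. A sketch assuming Python 3 imports, where first_path_section is a hypothetical helper name and not part of any library:

```python
from urllib.parse import urlparse

def first_path_section(url):
    """Return the first section of the URL path, or '' if there is none."""
    path = urlparse(url).path
    # Strip the leading '/', then keep everything before the next '/'
    return path.lstrip('/').partition('/')[0]

print(first_path_section('http://www.mydomain.com/hithere?image=2934'))      # hithere
print(first_path_section('http://www.mydomain.com/hithere/something/else'))  # hithere
print(repr(first_path_section('http://www.mydomain.com')))                   # ''
```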


叛逆

#7 · 2019-01-08 17:35

The best option is to use the posixpath module when working with the path component of URLs. This module has the same interface as os.path and consistently operates on POSIX paths when used on POSIX and Windows NT based platforms.

Sample Code:

#!/usr/bin/env python3

import urllib.parse
import sys
import posixpath
import ntpath
import json

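def path_parse( path_string, *, normalize = True, module = posixpath ):
    result = []
    if normalize and path_string:
        # normpath( "" ) would return ".", so leave an empty path alone
        tmp = module.normpath( path_string )
    else:
        tmp = path_string
    # Stop when split() makes no more progress (root or empty path);
    # comparing against "/" alone loops forever on "" or on ntpath's "\\".
    while True:
        ( head, item ) = module.split( tmp )
        if head == tmp:
            break
        result.insert( 0, item )
        tmp = head
    return result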

def dump_array( array ):
    string = "[ "
    for index, item in enumerate( array ):
        if index > 0:
            string += ", "
        string += "\"{}\"".format( item )
    string += " ]"
    return string

def test_url( url, *, normalize = True, module = posixpath ):
    url_parsed = urllib.parse.urlparse( url )
    path_parsed = path_parse( urllib.parse.unquote( url_parsed.path ),
        normalize=normalize, module=module )
    sys.stdout.write( "{}\n  --[n={},m={}]-->\n    {}\n".format( 
        url, normalize, module.__name__, dump_array( path_parsed ) ) )

test_url( "http://eg.com/hithere/something/else" )
test_url( "http://eg.com/hithere/something/else/" )
test_url( "http://eg.com/hithere/something/else/", normalize = False )
test_url( "http://eg.com/hithere/../else" )
test_url( "http://eg.com/hithere/../else", normalize = False )
test_url( "http://eg.com/hithere/../../else" )
test_url( "http://eg.com/hithere/../../else", normalize = False )
test_url( "http://eg.com/hithere/something/./else" )
test_url( "http://eg.com/hithere/something/./else", normalize = False )
test_url( "http://eg.com/hithere/something/./else/./" )
test_url( "http://eg.com/hithere/something/./else/./", normalize = False )

test_url( "http://eg.com/see%5C/if%5C/this%5C/works", normalize = False )
test_url( "http://eg.com/see%5C/if%5C/this%5C/works", normalize = False,
    module = ntpath )

Code output:

http://eg.com/hithere/something/else
  --[n=True,m=posixpath]-->
    [ "hithere", "something", "else" ]
http://eg.com/hithere/something/else/
  --[n=True,m=posixpath]-->
    [ "hithere", "something", "else" ]
http://eg.com/hithere/something/else/
  --[n=False,m=posixpath]-->
    [ "hithere", "something", "else", "" ]
http://eg.com/hithere/../else
  --[n=True,m=posixpath]-->
    [ "else" ]
http://eg.com/hithere/../else
  --[n=False,m=posixpath]-->
    [ "hithere", "..", "else" ]
http://eg.com/hithere/../../else
  --[n=True,m=posixpath]-->
    [ "else" ]
http://eg.com/hithere/../../else
  --[n=False,m=posixpath]-->
    [ "hithere", "..", "..", "else" ]
http://eg.com/hithere/something/./else
  --[n=True,m=posixpath]-->
    [ "hithere", "something", "else" ]
http://eg.com/hithere/something/./else
  --[n=False,m=posixpath]-->
    [ "hithere", "something", ".", "else" ]
http://eg.com/hithere/something/./else/./
  --[n=True,m=posixpath]-->
    [ "hithere", "something", "else" ]
http://eg.com/hithere/something/./else/./
  --[n=False,m=posixpath]-->
    [ "hithere", "something", ".", "else", ".", "" ]
http://eg.com/see%5C/if%5C/this%5C/works
  --[n=False,m=posixpath]-->
    [ "see\", "if\", "this\", "works" ]
http://eg.com/see%5C/if%5C/this%5C/works
  --[n=False,m=ntpath]-->
    [ "see", "if", "this", "works" ]

Notes:

On Windows NT based platforms os.path is ntpath
On Unix/Posix based platforms os.path is posixpath
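ntpath treats backslashes (\) as path separators, so it will not handle them correctly in URL paths (see the last two cases in the code and output), which is why posixpath is recommended.
Remember to use urllib.parse.unquote to decode the path.
Consider using posixpath.normpath to resolve . and .. sections.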
The semantics of multiple path separators (/) are not defined by RFC 3986. However, posixpath collapses multiple adjacent path separators (i.e. it treats ///, // and / the same), except that normpath preserves exactly two leading slashes.
Even though POSIX and URL paths have similar syntax and semantics, they are not identical.
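
One caveat on the unquote advice above: decoding the whole path before splitting turns an encoded slash (%2F) into a real separator. If that matters, a possible alternative is to split on / first and decode each section afterwards. A sketch, with split_then_unquote as a hypothetical helper name and a made-up example URL:

```python
from urllib.parse import urlparse, unquote

def split_then_unquote(url):
    """Split the raw path on '/' and percent-decode each section separately."""
    raw_path = urlparse(url).path
    # Skip empty pieces produced by leading, trailing or doubled slashes
    return [unquote(piece) for piece in raw_path.split('/') if piece]

url = 'http://eg.com/a%2Fb/c%20d'
print(split_then_unquote(url))                 # ['a/b', 'c d']

# Decoding first loses the distinction between '%2F' and '/'
print(unquote(urlparse(url).path).split('/'))  # ['', 'a', 'b', 'c d']
```

Unlike posixpath.normpath, this sketch does not resolve . or .. sections; it only separates and decodes them.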

