How do I get specific path sections from a URL? For example, I want a function which operates on this:
http://www.mydomain.com/hithere?image=2934
and returns "hithere"
or operates on this:
http://www.mydomain.com/hithere/something/else
and returns the same thing ("hithere")
I know this will probably use urllib or urllib2 but I can't figure out from the docs how to get only a section of the path.
Extract the path component of the URL with urlparse:
>>> import urlparse  # Python 2; in Python 3 use: from urllib import parse as urlparse
>>> path = urlparse.urlparse('http://www.example.com/hithere/something/else').path
>>> path
'/hithere/something/else'
Split the path into components with os.path.split:
>>> import os.path
>>> os.path.split(path)
('/hithere/something', 'else')
The dirname and basename functions give you the two pieces of the split; perhaps use dirname in a while loop:
>>> while os.path.dirname(path) != '/':
...     path = os.path.dirname(path)
...
>>> path
'/hithere'
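The same idea can be wrapped in a function for Python 3. This is only a sketch: the name first_path_segment is made up for illustration, and it uses posixpath rather than os.path so the result does not depend on the platform. The loop stops as soon as the parent consists only of slashes (or is empty), so it also terminates for URLs with no path at all.

```python
import posixpath
from urllib.parse import urlparse


def first_path_segment(url):
    """Return the first segment of the URL's path ('' if there is none).

    Hypothetical helper name; not part of any library.
    """
    path = urlparse(url).path
    parent = posixpath.dirname(path)
    # Walk up until the parent is the root ("/") or nothing is left.
    while parent.strip("/"):
        path = parent
        parent = posixpath.dirname(path)
    return posixpath.basename(path)


print(first_path_segment("http://www.mydomain.com/hithere?image=2934"))
print(first_path_segment("http://www.mydomain.com/hithere/something/else"))
```

Both calls print hithere, which is what the question asks for.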
The best option is to use the posixpath module when working with the path component of URLs. This module has the same interface as os.path and consistently operates on POSIX paths when used on POSIX and Windows NT based platforms.
Sample Code:
#!/usr/bin/env python3
import urllib.parse
import sys
import posixpath
import ntpath

def path_parse( path_string, *, normalize = True, module = posixpath ):
    """Split path_string into a list of components using the given path module."""
    result = []
    if normalize:
        tmp = module.normpath( path_string )
    else:
        tmp = path_string
    while True:
        ( head, item ) = module.split( tmp )
        if head == tmp:
            # split() made no progress: tmp is the root ("/") or empty
            break
        result.insert( 0, item )
        tmp = head
    return result

def dump_array( array ):
    """Format a list of strings as [ "a", "b" ] without escaping."""
    string = "[ "
    for index, item in enumerate( array ):
        if index > 0:
            string += ", "
        string += "\"{}\"".format( item )
    string += " ]"
    return string

def test_url( url, *, normalize = True, module = posixpath ):
    """Parse url, split its unquoted path and print the result."""
    url_parsed = urllib.parse.urlparse( url )
    path_parsed = path_parse( urllib.parse.unquote( url_parsed.path ),
        normalize=normalize, module=module )
    sys.stdout.write( "{}\n --[n={},m={}]-->\n {}\n".format(
        url, normalize, module.__name__, dump_array( path_parsed ) ) )

test_url( "http://eg.com/hithere/something/else" )
test_url( "http://eg.com/hithere/something/else/" )
test_url( "http://eg.com/hithere/something/else/", normalize = False )
test_url( "http://eg.com/hithere/../else" )
test_url( "http://eg.com/hithere/../else", normalize = False )
test_url( "http://eg.com/hithere/../../else" )
test_url( "http://eg.com/hithere/../../else", normalize = False )
test_url( "http://eg.com/hithere/something/./else" )
test_url( "http://eg.com/hithere/something/./else", normalize = False )
test_url( "http://eg.com/hithere/something/./else/./" )
test_url( "http://eg.com/hithere/something/./else/./", normalize = False )
test_url( "http://eg.com/see%5C/if%5C/this%5C/works", normalize = False )
test_url( "http://eg.com/see%5C/if%5C/this%5C/works", normalize = False,
    module = ntpath )
Code output:
http://eg.com/hithere/something/else
--[n=True,m=posixpath]-->
[ "hithere", "something", "else" ]
http://eg.com/hithere/something/else/
--[n=True,m=posixpath]-->
[ "hithere", "something", "else" ]
http://eg.com/hithere/something/else/
--[n=False,m=posixpath]-->
[ "hithere", "something", "else", "" ]
http://eg.com/hithere/../else
--[n=True,m=posixpath]-->
[ "else" ]
http://eg.com/hithere/../else
--[n=False,m=posixpath]-->
[ "hithere", "..", "else" ]
http://eg.com/hithere/../../else
--[n=True,m=posixpath]-->
[ "else" ]
http://eg.com/hithere/../../else
--[n=False,m=posixpath]-->
[ "hithere", "..", "..", "else" ]
http://eg.com/hithere/something/./else
--[n=True,m=posixpath]-->
[ "hithere", "something", "else" ]
http://eg.com/hithere/something/./else
--[n=False,m=posixpath]-->
[ "hithere", "something", ".", "else" ]
http://eg.com/hithere/something/./else/./
--[n=True,m=posixpath]-->
[ "hithere", "something", "else" ]
http://eg.com/hithere/something/./else/./
--[n=False,m=posixpath]-->
[ "hithere", "something", ".", "else", ".", "" ]
http://eg.com/see%5C/if%5C/this%5C/works
--[n=False,m=posixpath]-->
[ "see\", "if\", "this\", "works" ]
http://eg.com/see%5C/if%5C/this%5C/works
--[n=False,m=ntpath]-->
[ "see", "if", "this", "works" ]
Notes:
- On Windows NT based platforms os.path is ntpath
- On Unix/Posix based platforms os.path is posixpath
- ntpath will not handle backslashes (\) correctly, because it treats them as path separators (see last two cases in code/output) - which is why posixpath is recommended.
- remember to use urllib.parse.unquote
- consider using posixpath.normpath
- The semantics of multiple path separators (/) are not defined by RFC 3986. However, posixpath collapses multiple adjacent path separators (i.e. it treats ///, // and / the same)
- Even though POSIX and URL paths have similar syntax and semantics, they are not identical.
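To make the unquote and normpath notes concrete, here is a minimal sketch. The helper name clean_url_path is an assumption for illustration only; it just chains urllib.parse.unquote and posixpath.normpath on the path returned by urlparse.

```python
import posixpath
from urllib.parse import unquote, urlparse


def clean_url_path(url):
    """Return the URL's path, percent-decoded and POSIX-normalized.

    Hypothetical helper name; not part of any library.
    """
    raw_path = urlparse(url).path
    # unquote turns %20 into a space; normpath resolves "." and ".."
    # and collapses repeated separators inside the path.
    return posixpath.normpath(unquote(raw_path))


print(clean_url_path("http://eg.com/hi%20there//x/./y/../z/"))
print(clean_url_path("http://eg.com/a//b"))
```

The first call prints /hi there/x/z and the second prints /a/b. Whether collapsing // is appropriate depends on the server, since RFC 3986 does not define it.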
Normative References:
- IEEE Std 1003.1, 2013 - Vol. 1: Base Definitions - Section 4.12: Pathname Resolution
- The GNU C Library Reference Manual - Section 11.2: File Names
- IETF RFC 3986: Uniform Resource Identifier (URI): Generic Syntax - Section 3.3: Path
- IETF RFC 3986: Uniform Resource Identifier (URI): Generic Syntax - Section 6: Normalization and Comparison
- Wikipedia: URL normalization
Python 3.4+ solution:
from pathlib import PurePosixPath
import urllib.parse
url_path = PurePosixPath(urllib.parse.unquote(urllib.parse.urlparse(url).path))
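Once the path is a PurePosixPath, its parts attribute gives the individual sections. The sketch below assumes a helper named url_path_parts (a made-up name) and drops the leading root entry so that only the named segments remain.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def url_path_parts(url):
    """Return the segments of the URL's path as a list of strings.

    Hypothetical helper name; not part of any library.
    """
    path = PurePosixPath(unquote(urlparse(url).path))
    parts = path.parts
    # For an absolute path the first entry of .parts is the root; skip it.
    return list(parts[1:]) if path.root else list(parts)


parts = url_path_parts("http://www.mydomain.com/hithere/something/else")
print(parts)
print(parts[0])
```

This prints ['hithere', 'something', 'else'] and then hithere. Note that PurePosixPath drops "." segments and repeated slashes but leaves ".." untouched.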
>>> import urlparse  # Python 2; in Python 3 use: from urllib import parse as urlparse
>>> output = urlparse.urlparse('http://www.example.com/temp/something/happen/index.html').path
>>> output
'/temp/something/happen/index.html'
Split the path with the built-in str.rpartition method to drop the last segment:
>>> output.rpartition('/')[0]
'/temp/something/happen'
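rpartition splits at the last slash, so it removes the final segment. For the first segment, which is what the question asks for, the same string-only approach can split from the left instead. A small Python 3 sketch, with first_segment as a hypothetical name:

```python
from urllib.parse import urlparse


def first_segment(url):
    """Return the first path segment using only str methods.

    Hypothetical helper name; not part of any library.
    """
    path = urlparse(url).path
    # Drop the leading slash(es), then keep everything before the next "/".
    head, _, _ = path.lstrip("/").partition("/")
    return head


print(first_segment("http://www.mydomain.com/hithere?image=2934"))
print(first_segment("http://www.mydomain.com/hithere/something/else"))
```

Both calls print hithere. This does no percent-decoding or normalization, so it is only suitable when the path is known to be simple.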
A combination of urlparse and os.path.split will do the trick. The following script stores all sections of a URL in a list, in reverse order.
import os.path, urlparse
def generate_sections_of_url(url):
    path = urlparse.urlparse(url).path
    sections = []
    # Stop at the root, or at an empty path (a URL with no path component)
    while path.strip('/'):
        path, tail = os.path.split(path)
        sections.append(tail)
    return sections
For http://www.mydomain.com/hithere/something/else this would return: ["else", "something", "hithere"]
Note that in Python 3 the import has changed to from urllib.parse import urlparse (see the documentation). Here is an example:
>>> from urllib.parse import urlparse
>>> url = 's3://bucket.test/my/file/directory'
>>> p = urlparse(url)
>>> p
ParseResult(scheme='s3', netloc='bucket.test', path='/my/file/directory', params='', query='', fragment='')
>>> p.scheme
's3'
>>> p.netloc
'bucket.test'
>>> p.path
'/my/file/directory'
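Applied to the URL from the question, the same attributes show that the query string is already separated from the path, so no extra stripping is needed. A short sketch, where split_path_and_query is a made-up helper name:

```python
from urllib.parse import urlparse


def split_path_and_query(url):
    """Return the (path, query) pair that urlparse extracts from url.

    Hypothetical helper name; not part of any library.
    """
    parsed = urlparse(url)
    return parsed.path, parsed.query


path, query = split_path_and_query("http://www.mydomain.com/hithere?image=2934")
print(path)
print(query)
```

This prints /hithere followed by image=2934.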