Parsing URI parameter and keyword value pairs

Posted 2019-02-25 14:38

I would like to parse the parameter and keyword values from the URIs/URLs in a text file. Parameters without values should also be included. Python is fine, but I am open to suggestions using other tools such as Perl, or a one-liner that would also do the trick.

Example source:

www.domain.com/folder/page.php?date=2012-11-20
www2.domain.edu/folder/folder/page.php?l=user&x=0&id=1&page=http%3A//domain.com/page.html&unique=123456&refer=http%3A//domain2.net/results.aspx%3Fq%3Dbob+test+1.21+some%26file%3Dname&text=
www.domain.edu/some/folder/image.php?l=adm&y=5&id=2&page=http%3A//support.domain.com/downloads/index.asp&unique=12345
blog.news.org/news/calendar.php?view=month&date=2011-12-10

Example output:

date=2012-11-20
l=user
x=0
id=1
page=http%3A//domain.com/page.html
unique=123456
refer=http%3A//domain2.net/results.aspx%3Fq%3Dbob+test+1.21+some%26file%3Dname
text=
l=adm
y=5
id=2
page=http%3A//support.domain.com/downloads/index.asp
unique=12345
view=month
date=2011-12-10

3 answers
老娘就宠你
#2 · 2019-02-25 15:21

You can use a regular expression to extract all the pairs.

>>> import re
>>> url = 'www2.domain.edu/folder/folder/page.php?l=user&x=0&id=1&page=http%3A//domain.com/page.html&unique=123456&refer=http%3A//domain2.net/results.aspx%3Fq%3Dbob+test+1.21+some%26file%3Dname&text='
>>> p = re.compile(r'[?&]([^=&]+)=([^&]*)')
>>> m = p.findall(url)
>>> m
[('l', 'user'), ('x', '0'), ('id', '1'), ('page', 'http%3A//domain.com/page.html'), ('unique', '123456'), ('refer', 'http%3A//domain2.net/results.aspx%3Fq%3Dbob+test+1.21+some%26file%3Dname'), ('text', '')]

You can even use a dict comprehension to package all the data together.

>>> dic = {k: v for k, v in m}
>>> dic
{'l': 'user', 'x': '0', 'id': '1', 'page': 'http%3A//domain.com/page.html', 'unique': '123456', 'refer': 'http%3A//domain2.net/results.aspx%3Fq%3Dbob+test+1.21+some%26file%3Dname', 'text': ''}

And then if all you want to do is print them out:

>>> for k, v in dic.items():
...     print(k, '-->', v)
...
l --> user
x --> 0
id --> 1
page --> http%3A//domain.com/page.html
unique --> 123456
refer --> http%3A//domain2.net/results.aspx%3Fq%3Dbob+test+1.21+some%26file%3Dname
text --> 
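
A plain dict keeps only one value per key, so running the same idea over several URLs would silently drop repeated parameters such as date or id. Here is a minimal sketch, assuming Python 3, that groups the values by key instead. The helper name collect_params and the two-URL list are illustrative, and the pattern is a variant that also picks up the first parameter after the '?':

```python
import re
from collections import defaultdict

# A key follows '?' or '&'; the value is everything after '=' up to the next '&'.
PAIR_RE = re.compile(r'[?&]([^=&]+)=([^&]*)')

def collect_params(urls):
    """Group every value seen for each key across all URLs (hypothetical helper)."""
    grouped = defaultdict(list)
    for url in urls:
        for key, value in PAIR_RE.findall(url.strip()):
            grouped[key].append(value)
    return dict(grouped)

# Two of the URLs from the question, used here as sample input.
urls = [
    'www.domain.com/folder/page.php?date=2012-11-20',
    'blog.news.org/news/calendar.php?view=month&date=2011-12-10',
]
params = collect_params(urls)
for key, values in params.items():
    print(key, '-->', values)
```

Each key then maps to a list, so both date values survive.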
Fickle 薄情
#3 · 2019-02-25 15:22

I would use a regular expression like this (first code then explanation):

import re

# s holds the contents of the text file, one URL per line
pairs = re.findall(r'(\w+)=(.*?)(?:\n|&)', s, re.S)
for k, v in pairs:
    print('{0} = {1}'.format(k, v))

The re.findall() call is where the action happens. The regular expression finds every occurrence of a word followed by an equals sign and then a string terminated by either an & or a newline character. The returned pairs is a list of tuples, where each tuple contains the word (the keyword) and the value. I didn't capture the = sign; instead I print it in the loop.

Explaining the regex:

\w+ means one or more word characters. The parentheses around it capture it and return that value as part of the result.

= is the equals sign that must follow the word.

.*? matches zero or more characters in a non-greedy manner, that is, until a newline or an & appears, as specified by \n|&. The (?:...) group means that the \n or & is not captured.

Since we capture two things in the regex, the keyword and everything after the = sign, a list of 2-tuples is returned.

re.S tells the regex engine to let the match-anything character . also match the newline character, so that a match may span multiple lines (which is not the default behavior). Because every value must be followed by \n or &, the last value in the input is found only if the text ends with a newline.
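
To see the pattern at work, here is a small sketch on a made-up two-line input. It assumes Python 3; the sample string and the names PAIR_RE and sample are illustrative, not from the question. It also adds $ as a third terminator, which is a change to the pattern above, so that a final value with no trailing newline is still picked up:

```python
import re

# (\w+) captures the key, '=' is the separator, (.*?) lazily captures the value,
# and the non-capturing group is the terminator. The '$' alternative is an
# addition so the last pair still matches when the text has no trailing newline.
PAIR_RE = re.compile(r'(\w+)=(.*?)(?:\n|&|$)', re.S)

# Made-up sample: two URLs separated by a newline, the last value is empty.
sample = 'a.com/p.php?date=2012-11-20\nb.org/c.php?view=month&text='

pairs = PAIR_RE.findall(sample)
for key, value in pairs:
    print('{0} = {1}'.format(key, value))
```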

倾城 Initia
#4 · 2019-02-25 15:26

You don't need to dive into the fragile world of regular expressions.

urllib.parse.parse_qsl() is the tool for the job (in Python 2 it lives in the urlparse module); urllib.parse.quote() re-escapes the special characters after decoding:

from urllib.parse import parse_qsl, quote, urlparse


with open('links.txt') as f:
    for url in f:
        # keep_blank_values=True keeps parameters such as "text=" that have no value
        params = parse_qsl(urlparse(url.strip()).query, keep_blank_values=True)
        for key, value in params:
            print("%s=%s" % (key, quote(value)))

Prints:

date=2012-11-20
l=user
x=0
id=1
page=http%3A//domain.com/page.html
unique=123456
refer=http%3A//domain2.net/results.aspx%3Fq%3Dbob%20test%201.21%20some%26file%3Dname
text=
l=adm
y=5
id=2
page=http%3A//support.domain.com/downloads/index.asp
unique=12345
view=month
date=2011-12-10
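
Note that parse_qsl() decodes each value and quote() re-encodes it, which is why bob+test comes back as bob%20test in the refer line. If the values should be printed exactly as they appear in the file, one option is to split the query string by hand. This is a rough sketch, assuming Python 3; raw_pairs is a hypothetical helper name and the URL is a shortened version of one from the question:

```python
from urllib.parse import urlsplit

def raw_pairs(url):
    """Yield (key, value) pairs from the query string without decoding them."""
    query = urlsplit(url.strip()).query
    for field in query.split('&'):
        if not field:
            continue  # skip empty fields, e.g. from a trailing '&' or an empty query
        key, _, value = field.partition('=')
        yield key, value  # value is '' when nothing follows '=' or '=' is absent

# Shortened sample URL (illustrative).
url = 'www2.domain.edu/folder/page.php?l=user&refer=http%3A//domain2.net/results.aspx%3Fq%3Dbob+test&text='
for key, value in raw_pairs(url):
    print('%s=%s' % (key, value))
```

Because nothing is decoded, the + and the percent-escapes pass through untouched.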

Hope that helps.
