I would like to parse the parameter names and values from the URIs/URLs in a text file. Parameters without values should also be included. Python is fine, but I am open to suggestions using other tools such as Perl, or a one-liner that would also do the trick.
Example source:
www.domain.com/folder/page.php?date=2012-11-20
www2.domain.edu/folder/folder/page.php?l=user&x=0&id=1&page=http%3A//domain.com/page.html&unique=123456&refer=http%3A//domain2.net/results.aspx%3Fq%3Dbob+test+1.21+some%26file%3Dname&text=
www.domain.edu/some/folder/image.php?l=adm&y=5&id=2&page=http%3A//support.domain.com/downloads/index.asp&unique=12345
blog.news.org/news/calendar.php?view=month&date=2011-12-10
Example output:
date=2012-11-20
l=user
x=0
id=1
page=http%3A//domain.com/page.html
unique=123456
refer=http%3A//domain2.net/results.aspx%3Fq%3Dbob+test+1.21+some%26file%3Dname
text=
l=adm
y=5
id=2
page=http%3A//support.domain.com/downloads/index.asp
unique=12345
view=month
date=2011-12-10
You can use a regular expression to extract all the pairs.
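The answer's original snippet is not preserved here, so the following is only a minimal sketch of the idea. The pattern and the names `PAIR_RE`, `text`, and `pairs` are assumptions made for illustration, and the sample text is two of the URLs from the question.

```python
import re

# Assumed pattern: a name that follows "?" or "&", then "=", then
# everything up to the next "&" or whitespace (the value may be empty).
PAIR_RE = re.compile(r"[?&](\w+)=([^&\s]*)")

text = (
    "www.domain.com/folder/page.php?date=2012-11-20\n"
    "blog.news.org/news/calendar.php?view=month&date=2011-12-10\n"
)

# findall returns one (name, value) tuple per match.
pairs = PAIR_RE.findall(text)
print(pairs)
# [('date', '2012-11-20'), ('view', 'month'), ('date', '2011-12-10')]
```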
You can even use a dict comprehension to package all the data together.
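A sketch of what that could look like for a single URL, using an assumed pattern and the hypothetical names `PAIR_RE`, `url`, and `params`. Note that a dict keeps only the last value for a repeated key, so this probably works best one URL at a time.

```python
import re

# Assumed pattern: name after "?" or "&", "=", value up to the next "&" or whitespace.
PAIR_RE = re.compile(r"[?&](\w+)=([^&\s]*)")

url = ("www.domain.edu/some/folder/image.php?l=adm&y=5&id=2"
       "&page=http%3A//support.domain.com/downloads/index.asp&unique=12345")

# Dict comprehension: parameter name -> raw (still percent-encoded) value.
params = {key: value for key, value in PAIR_RE.findall(url)}
print(params["page"])
# http%3A//support.domain.com/downloads/index.asp
```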
And then if all you want to do is print them out:
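For printing, something along these lines should do. Again this is a sketch with an assumed pattern and hypothetical names (`PAIR_RE`, `url`, `lines`), run on the second URL from the question so that the empty `text=` parameter shows up.

```python
import re

# Assumed pattern: name after "?" or "&", "=", value up to the next "&" or whitespace.
PAIR_RE = re.compile(r"[?&](\w+)=([^&\s]*)")

url = ("www2.domain.edu/folder/folder/page.php?l=user&x=0&id=1"
       "&page=http%3A//domain.com/page.html&unique=123456"
       "&refer=http%3A//domain2.net/results.aspx%3Fq%3Dbob+test+1.21+some%26file%3Dname"
       "&text=")

# One "name=value" line per pair; a parameter with no value prints as "name=".
lines = [f"{key}={value}" for key, value in PAIR_RE.findall(url)]
print("\n".join(lines))
```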
I would use a regular expression like this (first code then explanation):
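The code that originally appeared here is not preserved. The sketch below is a reconstruction from the explanation that follows, so treat the exact pattern and the sample `text` as assumptions rather than the answer's own code.

```python
import re

# Sample input: two of the URLs from the question, each ending in a newline.
text = (
    "www.domain.com/folder/page.php?date=2012-11-20\n"
    "blog.news.org/news/calendar.php?view=month&date=2011-12-10\n"
)

# Keyword, "=", then a lazy match up to the next newline or "&".
# re.S is kept to match the explanation; the lazy ".*?" stops at the
# first "\n" or "&" either way.
pairs = re.findall(r"(\w+)=(.*?)(?:\n|&)", text, re.S)

for keyword, value in pairs:
    print(f"{keyword}={value}")
```

One thing to watch with this pattern: a value is only matched when it is followed by `&` or a newline, so the last line of the input needs a trailing newline.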
The first line is where the action happens. The regular expression finds all occurrences of a word followed by an equal sign and then a string that is terminated either by `&` or by a newline character. The returned `pairs` is a list of tuples, where each tuple contains the word (the keyword) and the value. I didn't capture the `=` sign; instead I print it in the loop.
Explaining the regex:
`\w+` means one or more word characters. The parentheses around it mean to capture it and return that value as a result.
`=` - the equal sign that must follow the word.
`.*?` - zero or more characters in a non-greedy manner, that is, until a newline or the `&` sign appears, which is designated by `\n|&`. The `(?:...)` pattern means that the `\n` or `&` should not be captured.
Since we capture two things in the regex, the keyword and everything after the `=` sign, a list of 2-tuples is returned.
The `re.S` flag tells the regex engine to let the match-all code `.` include the newline character as well, that is, to allow the search to span multiple lines (which is not the default behavior).
You don't need to dive into the fragile world of regex.
`urlparse.parse_qsl()` is the tool for the job, and `urllib.quote()` helps to escape special characters. (In Python 3 these are `urllib.parse.parse_qsl()` and `urllib.parse.quote()`.)
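The answer's original code and output are not preserved, so here is a minimal Python 3 sketch of the same approach. The helper `query_pairs` and the `urls` list are hypothetical names for illustration. Splitting on the first `?` rather than running a full URL parser is an assumption that fits the scheme-less sample URLs.

```python
from urllib.parse import parse_qsl, quote


def query_pairs(url):
    """Return the (name, value) pairs from the query string of one URL."""
    # Everything after the first "?" is treated as the query string.
    _, _, query = url.strip().partition("?")
    # keep_blank_values=True keeps parameters such as "text=" that have no value;
    # by default parse_qsl drops them.
    return parse_qsl(query, keep_blank_values=True)


urls = [
    "www.domain.com/folder/page.php?date=2012-11-20",
    "www.domain.edu/some/folder/image.php?l=adm&y=5&id=2"
    "&page=http%3A//support.domain.com/downloads/index.asp&unique=12345",
]

for url in urls:
    for name, value in query_pairs(url):
        # parse_qsl decodes percent-escapes; quote() re-encodes them for display.
        print(f"{name}={quote(value)}")
```

Because `parse_qsl` decodes `+` to a space and `quote()` writes a space as `%20`, values that contain `+` will not come back exactly as they appeared in the input.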
Hope that helps.