Here is the code I have:
a='<title>aaa</title><title>aaa2</title><title>aaa3</title>'
import re
re.findall(r'<(title)>(.*)<(/title)>', a)
The result is:
[('title', 'aaa</title><title>aaa2</title><title>aaa3', '/title')]
If I ever designed a crawler to get me titles of web sites, I might end up with something like this rather than a title for the web site.
My question is, how do I limit findall to a single <title></title>?
Use re.search instead of re.findall if you only want one match. If you want all the tags, consider making the pattern non-greedy (i.e. .*? instead of .*). But really, consider using BeautifulSoup, lxml, or a similar library to parse HTML.
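The two regex suggestions in this answer could look roughly like the sketch below. It reuses the string a from the question; the simplified one-group pattern and the names m, first_title and all_titles are assumptions made for illustration, not the answerer's original code.

```python
import re

a = '<title>aaa</title><title>aaa2</title><title>aaa3</title>'

# re.search stops at the first match; the non-greedy .*? keeps the
# captured group inside a single <title>...</title> pair.
m = re.search(r'<title>(.*?)</title>', a)
first_title = m.group(1) if m else None
print(first_title)  # aaa

# With a non-greedy pattern, findall returns one entry per tag pair.
all_titles = re.findall(r'<title>(.*?)</title>', a)
print(all_titles)  # ['aaa', 'aaa2', 'aaa3']
```

Note that re.search returns None when nothing matches, which is why the sketch checks m before calling group().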
Use a non-greedy search instead, i.e. .*? rather than .* in the pattern.
The question mark says to match as few characters as possible. Now your findall() will return each of the results you want.
http://docs.python.org/2/howto/regex.html#greedy-versus-non-greedy
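To make the difference concrete, here is a small side-by-side sketch that keeps the three-group pattern from the question and changes only the quantifier. The variable names greedy and lazy are illustrative assumptions.

```python
import re

a = '<title>aaa</title><title>aaa2</title><title>aaa3</title>'

# Greedy: .* runs to the last </title>, so everything collapses into one match.
greedy = re.findall(r'<(title)>(.*)<(/title)>', a)
print(len(greedy))  # 1

# Non-greedy: .*? stops at the first </title> it can, giving one match per tag.
lazy = re.findall(r'<(title)>(.*?)<(/title)>', a)
print(lazy)
# [('title', 'aaa', '/title'), ('title', 'aaa2', '/title'), ('title', 'aaa3', '/title')]
```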
It will be much easier using the BeautifulSoup module.
https://pypi.python.org/pypi/beautifulsoup4
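As a rough illustration of the parser-based approach, the sketch below uses the standard library's html.parser as a stand-in for BeautifulSoup, purely so that it runs without a third-party install; it is more verbose than BeautifulSoup would be. The class name TitleCollector and its attributes are hypothetical, introduced only for this example.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Hypothetical helper: collects the text of every <title> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
            self.titles.append("")  # start a new, empty title

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current title.
        if self._in_title:
            self.titles[-1] += data


a = '<title>aaa</title><title>aaa2</title><title>aaa3</title>'
parser = TitleCollector()
parser.feed(a)
parser.close()

titles = parser.titles
first_title = titles[0] if titles else None
print(first_title)  # aaa
```

Unlike the regex, a parser tracks tag boundaries itself, so there is no greedy or non-greedy choice to get wrong.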
Add a ? after the *, so it will be non-greedy.