For a class, I have an exercise where I need to count the number of images on any given web page. I know that every image starts with <img, so I am using a regexp to try to locate them. But I keep getting a count of one, which I know is wrong. What is wrong with my code?
import sys
import urllib
import urllib.request
import re

img_pat = re.compile('<img.*>',re.I)

def get_img_cnt(url):
    try:
        w = urllib.request.urlopen(url)
    except IOError:
        sys.stderr.write("Couldn't connect to %s " % url)
        sys.exit(1)
    contents = str(w.read())
    img_num = len(img_pat.findall(contents))
    return (img_num)

print (get_img_cnt('http://www.americascup.com/en/schedules/races'))
Ahhh regular expressions.
Your regex pattern <img.*> says "Find me something that starts with <img, then some stuff, and make sure it ends with >."
Regular expressions are greedy by default, though; it'll fill that .* with literally everything it can while leaving a single > character somewhere afterwards to satisfy the pattern. In this case, it would go all the way to the end, </html>, and say "look! I found a > right there!"
You should come up with the right count by making .* non-greedy, like this:
<img.*?>
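To see the difference without hitting the network, here is a minimal sketch on a made-up two-line page (the `raw` bytes and the file names in it are assumptions, not taken from the real site). It also suggests why the count in the question comes out as exactly one: in Python 3, `str()` on the bytes from `w.read()` returns their repr, so every newline becomes a literal backslash followed by `n` and the whole page ends up on a single line for `.` to run across.

```python
import re

# Hypothetical two-line page standing in for the bytes returned by w.read()
raw = b'<img src="a.png">\n<img src="b.png">'

as_repr = str(raw)             # repr of the bytes: the newline is escaped
as_text = raw.decode('utf-8')  # real text containing a real newline

greedy = re.compile(r'<img.*>', re.I)
lazy = re.compile(r'<img.*?>', re.I)

# Greedy .* runs to the last > on the (single) line, so only one match
greedy_on_repr = len(greedy.findall(as_repr))
# Non-greedy .*? stops at the first > after each <img
lazy_on_repr = len(lazy.findall(as_repr))
# With real newlines, . stops at each line end, which hides the problem here
greedy_on_text = len(greedy.findall(as_text))

print(greedy_on_repr, lazy_on_repr, greedy_on_text)  # 1 2 2
```

Decoding the bytes instead of calling `str()` on them is usually the safer choice, but the greedy pattern would still overmatch on any line that holds more than one tag.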
Don't ever use regex for parsing HTML; use an HTML parser, like lxml or BeautifulSoup. Here's a working example of how to get the img tag count using BeautifulSoup and requests:
from bs4 import BeautifulSoup
import requests

def get_img_cnt(url):
    response = requests.get(url)
    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(response.content, 'html.parser')
    return len(soup.find_all('img'))

print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Here's a working example using lxml and requests:
from lxml import etree
import requests

def get_img_cnt(url):
    response = requests.get(url)
    parser = etree.HTMLParser()
    root = etree.fromstring(response.content, parser=parser)
    return int(root.xpath('count(//img)'))

print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Both snippets print 106.
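If installing third-party packages is not an option for the class, the same idea can be sketched with the standard library's html.parser module. This is only an illustration: the `ImgCounter` class, the `count_imgs` helper and the sample markup are made up for this example, and it counts tags in a string rather than fetching the page.

```python
from html.parser import HTMLParser

class ImgCounter(HTMLParser):
    """Counts <img> start tags (hypothetical helper for illustration)."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <IMG> is counted too
        if tag == 'img':
            self.count += 1

def count_imgs(html_text):
    parser = ImgCounter()
    parser.feed(html_text)
    parser.close()
    return parser.count

# Made-up sample: two real images plus one that is commented out
sample = ('<html><body><IMG src="a.png"><p>x</p>'
          '<!-- <img src="c.png"> --><img src="b.png" /></body></html>')
print(count_imgs(sample))  # 2
```

The commented-out tag is skipped because the parser treats it as a comment, whereas the <img.*?> regex would count it as a third image.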
Also see:
- Python Regex - Parsing HTML
- Python regular expression for HTML parsing (BeautifulSoup)
Hope that helps.
Your regular expression is greedy, so it matches much more than you want. I suggest using an HTML parser.
img_pat = re.compile('<img.*?>',re.I)
will do the trick if you must do it the regex way. The ? makes it non-greedy.
- A good website for checking what your regex matches on the fly: http://www.pyregex.com/
- Learn more about regexes: http://docs.python.org/2/library/re.html