Count the number of images on a webpage, given its URL

Published 2019-05-17 02:30

Question:

For a class, I have an exercise where I need to count the number of images on any given web page. I know that every image starts with <img, so I am using a regexp to try to locate them. But I keep getting a count of one, which I know is wrong. What is wrong with my code?

import re
import sys
import urllib.request
img_pat = re.compile('<img.*>',re.I)

def get_img_cnt(url):
  try:
      w =  urllib.request.urlopen(url)
  except IOError:
      sys.stderr.write("Couldn't connect to %s " % url)
      sys.exit(1)
  contents =  str(w.read())
  img_num = len(img_pat.findall(contents))
  return (img_num)

print (get_img_cnt('http://www.americascup.com/en/schedules/races'))

Answer 1:

Ahhh regular expressions.

Your regex pattern <img.*> says "Find me something that starts with <img, followed by any stuff, and make sure it ends with >."

Regular expression quantifiers are greedy by default, though; .* will swallow literally everything it can while still leaving a single > character somewhere afterwards to satisfy the pattern. In this case, it would go all the way to the end, </html>, and say "look! I found a > right there!"

You should come up with the right count by making .* non-greedy, like this:

<img.*?>
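
To see the difference, here is a minimal sketch that runs both patterns over a short, made-up HTML string (the sample markup and variable names are assumptions for illustration, not the real page). It is also worth noting that in the question's code, str(w.read()) turns the bytes into their repr, where newlines show up as the two characters \n, so the whole page is effectively one long line and .* can run across all of it.

```python
import re

# Hypothetical sample markup standing in for a downloaded page.
html = '<html><body><img src="a.png"><p>text</p><img src="b.png"></body></html>'

greedy = re.compile(r'<img.*>', re.I)   # .* grabs as much as possible
lazy = re.compile(r'<img.*?>', re.I)    # .*? stops at the first >

greedy_matches = greedy.findall(html)
lazy_matches = lazy.findall(html)

print(len(greedy_matches))  # 1: one match running from the first <img to the final >
print(len(lazy_matches))    # 2: one match per <img ...> tag
print(lazy_matches)
```

The greedy pattern returns a single match that starts at the first <img and ends at the closing > of </html>, which is why the count comes out as one.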



Answer 2:

Don't ever use regex for parsing HTML; use an HTML parser such as lxml or BeautifulSoup. Here's a working example that gets the img tag count using BeautifulSoup and requests:

from bs4 import BeautifulSoup
import requests


def get_img_cnt(url):
    response = requests.get(url)
    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(response.content, 'html.parser')

    return len(soup.find_all('img'))


print(get_img_cnt('http://www.americascup.com/en/schedules/races'))

Here's a working example using lxml and requests:

from lxml import etree
import requests


def get_img_cnt(url):
    response = requests.get(url)
    parser = etree.HTMLParser()
    root = etree.fromstring(response.content, parser=parser)

    return int(root.xpath('count(//img)'))


print(get_img_cnt('http://www.americascup.com/en/schedules/races'))

Both snippets printed 106 for this page when I ran them.

Also see:

  • Python Regex - Parsing HTML
  • Python regular expression for HTML parsing (BeautifulSoup)

Hope that helps.



Answer 3:

Your regular expression is greedy, so it matches much more than you want. I suggest using an HTML parser.

img_pat = re.compile('<img.*?>',re.I) will do the trick if you must do it the regex way. The ? makes it non-greedy.

  • A good website for checking what your regex matches on the fly: http://www.pyregex.com/
  • Learn more about regexes: http://docs.python.org/2/library/re.html
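
If you go the HTML parser route and would rather not install anything, a rough sketch with the standard library's html.parser could look like the following. The ImgCounter class, the count_imgs helper and the sample string are hypothetical names made up for this example; for a real page you would pass in the decoded response body instead.

```python
from html.parser import HTMLParser


class ImgCounter(HTMLParser):
    """Hypothetical helper: counts <img> tags as the parser meets them."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case, so <IMG> is counted too.
        # Self-closing tags like <img ... /> also reach this method by default.
        if tag == 'img':
            self.count += 1


def count_imgs(html_text):
    """Return the number of img tags found in an HTML string."""
    counter = ImgCounter()
    counter.feed(html_text)
    counter.close()
    return counter.count


# Made-up sample markup for illustration only.
sample = '<html><body><IMG src="a.png"><p>x</p><img src="b.png"/></body></html>'
print(count_imgs(sample))  # 2
```

Unlike the regex, this only counts tags the parser actually recognizes as img elements, so a stray "<img" inside text or an attribute value should not inflate the count.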