Counting HTML images with Python

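I need some feedback on how to count the images in an HTML page with Python 3.01 after extracting them. Maybe I am not using my regular expression properly.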

Here is my code:

import re, os
import urllib.request
def get_image(url):
  url = 'http://www.google.com'
  total = 0
  try:
    f = urllib.request.urlopen(url)
    for line in f.readline():
      line = re.compile('<img.*?src="(.*?)">')
      if total > 0:
        x = line.count(total)
        total += x
        print('Images total:', total)

  except:
    pass

Answer 1:

A couple of points about your code:

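It is much easier to use a dedicated HTML parsing library to parse your pages (that is the Python way). I personally prefer Beautiful Soup.
You are overwriting your line variable inside the loop, so the text you just read is replaced by a compiled pattern.
total will always be 0 with your current logic, because it is only updated inside "if total > 0".
There is no need to compile your RE, since the re module caches recently used patterns.
You are discarding your exception, so you get no clue about what went wrong in the code!
There may be other attributes on the <img> tag, so your regex is a bit basic. Also, use the re.findall() method to catch multiple instances on the same line.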
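To illustrate the last point, here is a minimal sketch of re.findall() catching several <img> tags on one line while tolerating extra attributes. The SAMPLE string and the image_sources name are made up for the example, and the pattern only handles double-quoted src values, so treat it as a demonstration rather than a robust parser.

```python
import re

# Hypothetical sample markup: two <img> tags on one line, each with extra attributes.
SAMPLE = '<p><img class="a" src="one.png" alt="1"><img src="two.png" width="10"></p>'

# Allows other attributes before and after src; only double-quoted values are handled.
IMG_SRC_PATTERN = r'<img\b[^>]*?\ssrc="([^"]*)"[^>]*>'

def image_sources(html):
    """Return the src value of every <img> tag found in html."""
    return re.findall(IMG_SRC_PATTERN, html, re.IGNORECASE)

sources = image_sources(SAMPLE)
print(len(sources), sources)
```

Because the pattern contains a single capture group, findall() returns just the captured src strings, and len() of that list is the image count.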

Changing your code around a little, I get:

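import re
from urllib.request import urlopen

def get_image(url):
    total = 0
    # readlines() returns the response as a list of bytes objects, one per line
    page = urlopen(url).readlines()

    for line in page:
        # Decode the bytes first; findall() catches several tags on one line
        text = line.decode('utf-8', errors='replace')
        hit = re.findall(r'<img\b[^>]*>', text, re.IGNORECASE)
        total += len(hit)

    print('{0} Images total: {1}'.format(url, total))

get_image("http://google.com")
get_image("http://flickr.com")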
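On the point about discarded exceptions, one possible way to keep the error visible is sketched below. The count_images name and the fetch parameter are assumptions added for illustration (fetch only exists so the function can be tried without a network connection); they are not part of the code above.

```python
import re
import sys
from urllib.error import URLError
from urllib.request import urlopen

def count_images(url, fetch=urlopen):
    """Return the number of <img> tags at url, or None if the fetch fails.

    fetch is a hypothetical hook that defaults to urlopen; passing another
    callable lets you exercise the function offline.
    """
    try:
        with fetch(url) as response:
            html = response.read().decode('utf-8', errors='replace')
    except (URLError, ValueError) as exc:
        # Report the failure instead of hiding it behind a bare "except: pass".
        print('Could not fetch {0}: {1}'.format(url, exc), file=sys.stderr)
        return None
    return len(re.findall(r'<img\b[^>]*>', html, re.IGNORECASE))
```

Calling count_images("http://google.com") would then either return a count or print the reason for the failure to stderr and return None.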

Answer 2:

Use beautifulsoup4 (an HTML parser) instead of a regular expression:

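import urllib.request

import bs4  # beautifulsoup4

html = urllib.request.urlopen('http://www.imgur.com/').read()
# Name the parser explicitly; bs4 warns when it has to guess one
soup = bs4.BeautifulSoup(html, 'html.parser')
images = soup.find_all('img')  # find_all is the bs4 spelling of findAll
print(len(images))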
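If installing a third-party package is not an option, roughly the same count can be obtained with html.parser from the standard library. This is a swapped-in alternative, not what the answer above uses, and the ImgCounter and count_img_tags names are invented for the sketch.

```python
from html.parser import HTMLParser

class ImgCounter(HTMLParser):
    """Counts <img> start tags as the parser walks the document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <IMG> is counted as well.
        # Self-closing tags such as <img /> also reach this method through
        # the default handle_startendtag().
        if tag == 'img':
            self.count += 1

def count_img_tags(html):
    """Return the number of <img> tags in an HTML string."""
    parser = ImgCounter()
    parser.feed(html)
    parser.close()
    return parser.count

print(count_img_tags('<p><img src="a.png"><IMG src="b.png"/></p>'))
```

Unlike the regex approach, a parser does not count markup that sits inside an HTML comment.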
