Question:
I am a web scraping beginner.
I first followed https://www.youtube.com/watch?v=ZAUNEEtzsrg to download images with a specific tag (e.g. cat), and it works!
But I ran into a new problem: I can only download about 100 images. This looks like an AJAX issue, where only the first page of HTML is loaded rather than all of it. So it seems we must simulate scrolling down to load the next 100 images or more (see the sketch after the list below).
My code: https://drive.google.com/file/d/0Bwjk-LKe_AohNk9CNXVQbGRxMHc/edit?usp=sharing
To sum up, my problems are the following:
How can I download all the images in a Google Image search with Python source code? (Please give me some examples :) )
Are there any web scraping techniques I need to know?
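From what I understand, the scroll simulation would look something like this Selenium sketch (Selenium is not in my original code; the URL and the number of scrolls are just guesses):
import time
from selenium import webdriver

driver = webdriver.Firefox()  # needs a local Firefox/geckodriver setup
driver.get('https://www.google.com/search?q=cat&tbm=isch')

# Scroll to the bottom a few times so more thumbnails get loaded via AJAX.
for _ in range(5):  # 5 scrolls is an arbitrary guess
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)  # give the new results time to load

html = driver.page_source  # now contains more than the first ~100 results
driver.quit()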
Answer 1:
My final solution is using icrawler.
from icrawler.examples import GoogleImageCrawler

google_crawler = GoogleImageCrawler('your_image_dir')
google_crawler.crawl(keyword='sunny', offset=0, max_num=1000,
                     date_min=None, date_max=None, feeder_thr_num=1,
                     parser_thr_num=1, downloader_thr_num=4,
                     min_size=(200, 200), max_size=None)
The advantage is that this framework contains five built-in crawlers (Google, Bing, Baidu, Flickr, and a general greedy crawler), but it still only provides about 100 images when crawling from Google.
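If Google caps out at 100, one of the other built-in crawlers can be tried the same way. A sketch, assuming BingImageCrawler takes the same constructor and crawl arguments as the GoogleImageCrawler above (I have not verified this against that icrawler version):
from icrawler.examples import BingImageCrawler

# Assumed to mirror GoogleImageCrawler's interface in this icrawler version.
bing_crawler = BingImageCrawler('your_image_dir')
bing_crawler.crawl(keyword='sunny', offset=0, max_num=1000,
                   parser_thr_num=1, downloader_thr_num=4,
                   min_size=(200, 200), max_size=None)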
Answer 2:
Use the Google API to get the results, so replace your URL with something like this:
https://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=cat&rsz=8&start=0
You will get 8 results; then call the service again with start=8 to get the next ones, and so on, until you receive an error.
The returned data is in JSON format.
Here is a Python example I found on the web:
import urllib2
import simplejson

url = ('https://ajax.googleapis.com/ajax/services/search/images?' +
       'v=1.0&q=barack%20obama&userip=INSERT-USER-IP')

# The API asks callers to identify themselves via the Referer header.
request = urllib2.Request(url, None, {'Referer': 'http://www.example.com/'})  # enter the URL of your site here
response = urllib2.urlopen(request)

# Process the JSON string.
results = simplejson.load(response)
# now have some fun with the results...
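To page through all the results as described above, a minimal loop might look like this sketch (the responseStatus, responseData, and unescapedUrl field names follow the old AJAX API's documented JSON format; treat the details as assumptions, since the API has long been deprecated):
import urllib2
import simplejson

start = 0
while True:
    url = ('https://ajax.googleapis.com/ajax/services/search/images?'
           'v=1.0&q=cat&rsz=8&start=%d' % start)
    request = urllib2.Request(url, None, {'Referer': 'http://www.example.com/'})
    results = simplejson.load(urllib2.urlopen(request))

    # The API reports errors (including "no more results") via responseStatus.
    data = results.get('responseData')
    if results.get('responseStatus') != 200 or not data or not data['results']:
        break

    for item in data['results']:
        print item['unescapedUrl']  # 'unescapedUrl' holds the full image URL

    start += 8  # advance by one page of rsz=8 results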
As for web scraping techniques, there is this page:
http://jakeaustwick.me/python-web-scraping-resource
Hope it helps.
Answer 3:
To get 100 results, try this:
from urllib import FancyURLopener
import re
import posixpath
import urlparse

class MyOpener(FancyURLopener, object):
    # Pretend to be a mobile browser so Google serves plain, scrapable HTML.
    version = "Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30"

myopener = MyOpener()
page = myopener.open('https://www.google.pt/search?q=love&biw=1600&bih=727&source=lnms&tbm=isch&sa=X&tbs=isz:l&tbm=isch')
html = page.read()

# Pull each image URL out of the result page and download it.
for match in re.finditer(r'<a href="http://www\.google\.pt/imgres\?imgurl=(.*?)&imgrefurl', html, re.IGNORECASE | re.DOTALL | re.MULTILINE):
    path = urlparse.urlsplit(match.group(1)).path
    filename = posixpath.basename(path)
    myopener.retrieve(match.group(1), filename)
You can tweak biw=1600&bih=727 to get bigger or smaller images.
Answer 4:
For any questions about icrawler, you can raise an issue on GitHub, which may get a faster response.
The limit on the number of Google search results seems to be 1000. A workaround is to define a date range like the following.
from datetime import date
from icrawler.builtin import GoogleImageCrawler
google_crawler = GoogleImageCrawler(
    parser_threads=2,
    downloader_threads=4,
    storage={'root_dir': 'your_image_dir'})
google_crawler.crawl(
    keyword='sunny',
    max_num=1000,
    date_min=date(2014, 1, 1),
    date_max=date(2015, 1, 1))
google_crawler.crawl(
    keyword='sunny',
    max_num=1000,
    date_min=date(2015, 1, 1),
    date_max=date(2016, 1, 1))
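To cover a longer period without hitting the 1000-result cap, the same idea can be wrapped in a loop. A sketch using the same API as above (the year range is arbitrary, for illustration only):
from datetime import date
from icrawler.builtin import GoogleImageCrawler

google_crawler = GoogleImageCrawler(
    parser_threads=2,
    downloader_threads=4,
    storage={'root_dir': 'your_image_dir'})

# Crawl one year at a time so each query stays under the ~1000-result cap.
for year in range(2014, 2017):
    google_crawler.crawl(
        keyword='sunny',
        max_num=1000,
        date_min=date(year, 1, 1),
        date_max=date(year + 1, 1, 1))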