Download images from google image search (python)

Posted 2019-04-17 11:48

I am a web scraping beginner. I first followed https://www.youtube.com/watch?v=ZAUNEEtzsrg to download images with a specific tag (e.g. cat), and it works! But I ran into a new problem: I can only download about 100 images. This seems to be caused by AJAX, which only loads the first page of results rather than all of them. It therefore seems that we must simulate scrolling down to download the next 100 images or more.

My code: https://drive.google.com/file/d/0Bwjk-LKe_AohNk9CNXVQbGRxMHc/edit?usp=sharing

To sum up, the problems are the following:

  1. How can I download all the images in a Google image search with Python source code? (Please give me some examples :) )

  2. Are there any web scraping techniques I need to know?

4 Answers
混吃等死
#2 · 2019-04-17 12:12

Use the Google API to get results, so replace your URL with something like this:

https://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=cat&rsz=8&start=0

You will get 8 results; then call the service again with start=8 to get the next ones, and so on until you receive an error.

The returned data is in JSON format.

Here is a Python example I found on the web:

import urllib2
import simplejson

url = ('https://ajax.googleapis.com/ajax/services/search/images?' +
       'v=1.0&q=barack%20obama&userip=INSERT-USER-IP')

# Set the Referer header to the URL of your own site.
request = urllib2.Request(url, None, {'Referer': 'http://example.com'})
response = urllib2.urlopen(request)

# Process the JSON string.
results = simplejson.load(response)
# now have some fun with the results...

As for web scraping techniques, there is this page: http://jakeaustwick.me/python-web-scraping-resource

Hope it helps.
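The paging described above can be sketched as a loop over start offsets. This is an illustrative sketch only (the `page_starts` helper is hypothetical, not part of the API), assuming each request returns `rsz` results and `start` advances by that page size:

```python
def page_starts(total, page_size=8):
    """Return the start offsets needed to fetch `total` results
    in pages of `page_size` (e.g. the rsz=8 pages above)."""
    return list(range(0, total, page_size))

# Build the request URLs for the first 24 results, 8 at a time.
base = 'https://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=cat&rsz=8'
urls = [base + '&start=%d' % s for s in page_starts(24)]
```

You would then fetch each URL in turn, stopping as soon as the service returns an error, as the answer suggests.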

老娘就宠你
#3 · 2019-04-17 12:12

To get 100 results, try this:

from urllib import FancyURLopener
import re
import posixpath
import urlparse

# Spoof a mobile browser user agent so Google serves a simple HTML page.
class MyOpener(FancyURLopener, object):
    version = "Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30"

myopener = MyOpener()

page = myopener.open('https://www.google.pt/search?q=love&biw=1600&bih=727&source=lnms&tbm=isch&sa=X&tbs=isz:l&tbm=isch')
html = page.read()

# Each result links to /imgres?imgurl=<actual image URL>&imgrefurl=...;
# capture the imgurl value and download it.
for match in re.finditer(r'<a href="http://www\.google\.pt/imgres\?imgurl=(.*?)&amp;imgrefurl', html, re.IGNORECASE | re.DOTALL | re.MULTILINE):
    path = urlparse.urlsplit(match.group(1)).path
    filename = posixpath.basename(path)  # last path component as the local filename
    myopener.retrieve(match.group(1), filename)

You can tweak biw=1600&bih=727 to get bigger or smaller images.
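The filename step in the loop above can be isolated into a small helper. A minimal Python 3 sketch (the `urllib.parse` module replaces the Python 2 `urlparse` used above; `filename_from_url` is a name chosen here for illustration):

```python
import posixpath
from urllib.parse import urlsplit

def filename_from_url(url):
    """Return the last path component of a URL, the same trick used
    above to pick a local filename for each downloaded image."""
    return posixpath.basename(urlsplit(url).path)

print(filename_from_url('http://example.com/images/cat.jpg'))  # cat.jpg
```

Note that the query string is dropped by `urlsplit(...).path`, so URLs like `.../cat.jpg?size=large` still yield a clean filename.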

做个烂人
#4 · 2019-04-17 12:14

For any questions about icrawler, you can raise an issue on GitHub, which may get a faster response.

The limit on the number of Google search results seems to be 1000. A workaround is to define date ranges, like the following.

from datetime import date
from icrawler.builtin import GoogleImageCrawler

google_crawler = GoogleImageCrawler(
    parser_threads=2, 
    downloader_threads=4,
    storage={'root_dir': 'your_image_dir'})
google_crawler.crawl(
    keyword='sunny',
    max_num=1000,
    date_min=date(2014, 1, 1),
    date_max=date(2015, 1, 1))
google_crawler.crawl(
    keyword='sunny',
    max_num=1000,
    date_min=date(2015, 1, 1),
    date_max=date(2016, 1, 1))
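The year-by-year workaround above can be generalized so you don't repeat the crawl call by hand. A hedged sketch, where `year_ranges` is a hypothetical helper (not part of icrawler) that yields the (date_min, date_max) pairs to pass to crawl():

```python
from datetime import date

def year_ranges(first_year, last_year):
    """Yield (date_min, date_max) pairs covering one year each,
    matching the manual ranges in the snippet above."""
    for y in range(first_year, last_year):
        yield date(y, 1, 1), date(y + 1, 1, 1)

# e.g. crawl each year separately to dodge the ~1000-result cap:
# for dmin, dmax in year_ranges(2014, 2016):
#     google_crawler.crawl(keyword='sunny', max_num=1000,
#                          date_min=dmin, date_max=dmax)
```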
一纸荒年 Trace。
#5 · 2019-04-17 12:25

My final solution is to use icrawler.

from icrawler.examples import GoogleImageCrawler

google_crawler = GoogleImageCrawler('your_image_dir')
google_crawler.crawl(keyword='sunny', offset=0, max_num=1000,
                     date_min=None, date_max=None, feeder_thr_num=1,
                     parser_thr_num=1, downloader_thr_num=4,
                     min_size=(200,200), max_size=None)

The advantage is that the framework contains five built-in crawlers (Google, Bing, Baidu, Flickr, and a general crawler), but it still only provides 100 images when crawling from Google.
