How to find number of image results for a given se

2019-09-13 23:16发布

I have been messing around with HTML from within Excel, trying to approximate how common images are at different resolutions. I'm hoping to get something dynamic - a user enters a search term and the code loops through a set of pre-defined image resolutions, ranking how common images are for that search term between the specified resolutions.

Step one though is to get a reliable (and speedy) way of returning the number of images at a specific resolution. I wrote this code:

Sub GoogleWithURL() 'requires Microsoft HTML Object Library

    Dim url As String, searchTerm As String
    Dim objIE As InternetExplorer 'special object variable representing the IE browser
    Dim ws As Worksheet
    Set ws = ThisWorkbook.Sheets("sheet1")
    Dim currPage As HTMLDocument
    Dim xRes As Integer, yRes As Integer
    With ws
        xRes = .Range("XRes")
        yRes = .Range("YRes")
        searchTerm = .Range("search")
    End With

    'create URL to page with these image criteria
    url = WorksheetFunction.Concat("https://www.google.com/search?q=", searchTerm, _
                        "&tbm=isch&source=lnt&tbs=isz:ex,iszw:", xRes, "iszh:", yRes)

    'initiating a new instance of Internet Explorer and asigning it to objIE
    Set objIE = New InternetExplorer
    'objIE.Visible = True 'for debugging purposes

    'Google images search
    objIE.navigate url
    Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
    Set currPage = objIE.document

    'Count image results
    Set valueResult = currPage.getElementById("rg_s").getElementsByTagName("IMG")
    MsgBox WorksheetFunction.Concat("'", searchTerm, "' returns ", valueResult.Length _
    , " images @ ", xRes, "x", yRes, "px.") 'returns number of loaded images on page

    'close the browser
    On Error Resume Next 'required when the browser is visible and I close it manually half way
    objIE.Quit

End Sub

It navigates an internet explorer object to a specific resolution google image search, counts the number of images in the rg_s ID (these are the image results as opposed to the banner images etc.). Then it returns that count as a message box. (When I eventually implement this, I will return the values in a column on the sheet, looping through 30ish different resolutions)

Problems

The main problems with this code are:

  • It doesn't give a very useful count. The resolution is low because it only counts the images that have loaded - meaning most search terms at common resolutions like 1920x1080 or 1366x768 return the max 100 images.

  • It is slow. To me, navigating to pages, counting image tags, it all seems a lot like using .Select in VBA. It's like a manual approach, what a human would do, and therefore inefficient.

Solutions

I can think of some approaches to solve these problems

  1. Resolution of data/ getting a more useful count

    • Scroll down. If I can load more images, it's likely I'll be able to differentiate a little better. I found that scrolling as far down as I can (up to the 'Load more results' button) gives a cap of 400 not 100 - if there are at least that many images for a given resolution then I'm happy and I'll give it the top rank. Doesn't help with problem 2 though. Still, how would I do that?

    • Narrow results. If 100 is returned, I could change the filetype: in the URL I send, like appending filetype:png to maybe halve the number of images returned, giving me a better spread in the 0-100 range. Not ideal though as I would have to iterate through multiple filetypes for some resolutions, slowing down the code and even then, not necessarily giving me what I want.

    • Use Google's (or another search engine's) own values for this. I've asked this on various sites and in various forms, is there any data on image count available direct from Google - i.e. without returning (and slowly loading) the images themselves. Like the about 1,300,500 results in 0.03 seconds for a regular search, only for images? If I could use a pre-calculated value each time that samples a bigger array than the 100 results, I might get a much more detailed picture.

  2. Slowness

    • Try a different kind of HTTP request. Right now I open an instance of internet explorer and navigate to a page. That sounds very human style, I would prefer a computer style request. What I mean is, instead of using my laptop to trawl through images one by one, I get Google's supercomputers to do the leg work by only asking for the count, as oppose to the images themselves. Not a clue how to do this. I'm aware of another 2 ways to search the web from within Excel; web query and CreateObject("MSXML2.serverXMLHTTP"). Don't know either of those much but if you think they would be a better way to go then I'll look into them more closely.

Summary

Hopefully that's plenty to go on, I think my train of thought should be fairly clear. Actual answers on how to scroll down/ load more images/ get Google to return a count rather than images themselves would be best, advice on what to pursue would be useful too.

2条回答
男人必须洒脱
2楼-- · 2019-09-13 23:46

After a few more questions, and now feeling somewhat wiser, I've made a UDF for this:

Public Function GOOGLE_COUNT(searchTerm As String, xRes As Long, yRes As Long, Optional timeout As Long = 10) As Long

    Dim url As String
    Dim objIE As InternetExplorer
    Dim currPage As HTMLDocument
    Dim stTimer As Double, tElapsed As Single
    Dim valueResult As IHTMLElementCollection

    'create URL to page with these image criteria
    url = "https://www.google.com/search?q=" & searchTerm & _
                        "&tbm=isch&source=lnt&tbs=isz:ex,iszw:" & xRes & ",iszh:" & yRes

    'initiating a new instance of Internet Explorer and asigning it to objIE
    Set objIE = New InternetExplorer

    'Google images search
    objIE.navigate url
    Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
    Set currPage = objIE.document
    Dim myDiv As HTMLDivElement: Set myDiv = currPage.getElementById("fbar")
    Dim elemRect As IHTMLRect: Set elemRect = myDiv.getBoundingClientRect
    stTimer = Timer
    'Scroll until bottom of page is in view
    Do Until elemRect.bottom > 0 Or tElapsed > timeout 'timeout after n seconds
        currPage.parentWindow.scrollBy 0, 10000
        Set elemRect = myDiv.getBoundingClientRect
        tElapsed = Timer - stTimer
    Loop
    myDiv.ScrollIntoView
    'Count the images
    Set valueResult = currPage.getElementById("rg_s").getElementsByTagName("IMG")
    GOOGLE_COUNT = valueResult.Length
    objIE.Quit

End Function

Employed like this: to search "St. Mary" at 1366:768 image size then

=GOOGLE_COUNT("St. Mary", 1366, 768)

Or with a 10 second timeout (the search stops scrolling if 10 seconds have passed, and just counts the images loaded)

=GOOGLE_COUNT("St. Mary", 1366, 768, 10)

I explain in another question how the scrolling works, it's now messy but functional.

Important:

As @John Muggins points out, the significant time taken is in loading, not counting. In particular, opening and closing InternetExplorer. So to avoid huge re-calculation times; if (like me) you want to check for more than one term/resolution, put this code in a macro, not a function (comment if you think I should post this). This UDF is only for a one off search

Hope it's useful, I thought I should re-visit the question to post the answer I got to.

Final Note:

  1. Your computer (probably) isn't crashing, the function is just calculating.

  2. For search term, type anything you'd enter in a google search bar - e.g. "Jaguar -car" returns images of the animal, not the car company

  3. The result is a number 0-400; 0-399 is the actual number of images counted (as long as you set timeout to be big enough - auto is 10s). 400 is the max, so there may be more than 400 images availible for that term, at that resolution.

查看更多
爷的心禁止访问
3楼-- · 2019-09-14 00:06

Your bottleneck isn't in the for loop. It's in opening a browser and directing it to a location. If you are concerned about time then you should grab a browser that is already open to that page and not close it until you have run all your searches on it. You should save at least 2 seconds per search. I ran the following code and got these times:

Time to open and set Explorer: 2.41 seconds.

Time to Count 100 Photos(1): 0.1 seconds.

Time to Count 100 Photos(2): 0.11 seconds.

The difference between our approaches is 1/100th of a second.

Also, Google images requires the user to page down for it to call the next 100 images. If you can find the ajax or javascript statement to make that happen then you will be able to make it think it has pages down. That's why you're only getting 100 images.

Or you can open a browser, type in your search term, and page down until there are 299 images on the screen, when you will find a button that says "Show More Images". And then snag that open webpage.

If you are running several search terms than your time bottleneck is in opening and closing browsers, not counting the images.

Sub GoogleWithURL() 'requires Microsoft HTML Object Library
' https://www.google.com/search?q=St+Mary&source=lnms&tbm=isch&sa=X&ved=0ahUKEwj99ay14aPSAhWDMSYKHadiCjkQ_AUICSgC&biw=1600&bih=840
    Dim url As String
    Dim objIE As InternetExplorer 'special object variable representing the IE browser
    Dim ws As Worksheet
    Set ws = ThisWorkbook.Sheets("Sheet1")
    Dim currPage As HTMLDocument
    Dim StartTime As Double, SecondsElapsed As Double

    '****************************************
    '   Hard code url to search images of St Mary
    url = "https://www.google.com/search?q=St+Mary&source=lnms&tbm=" & _
            "isch&sa=X&ved=0ahUKEwj99ay14aPSAhWDMSYKHadiCjkQ_AUICSgC&biw=1600&bih=840"

    StartTime = Timer
    Set objIE = New InternetExplorer
    objIE.Visible = True
    objIE.navigate url
    Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
    Set currPage = objIE.document
    SecondsElapsed = Round(Timer - StartTime, 2)
    Debug.Print "Time to open and set Explorer:  " & SecondsElapsed & " seconds."


    StartTime = Timer
    Set valueResult = currPage.getElementById("rg_s").getElementsByTagName("IMG")
    For Each pic In valueResult
        counter = counter + 1
    Next pic
    SecondsElapsed = Round(Timer - StartTime, 2)

    Debug.Print "Time to Count " & counter & " Photos(1):  " & SecondsElapsed & " seconds."

    counter = 0
    StartTime = Timer
    Set valueResult = currPage.getElementsByTagName("IMG")
    For Each pic In valueResult
        If InStr(1, pic.className, "rg") > 0 Then
            counter = counter + 1
        End If
    Next pic
    SecondsElapsed = Round(Timer - StartTime, 2)

    Debug.Print "Time to Count " & counter & " Photos(2):  " & SecondsElapsed & " seconds."

    On Error Resume Next 'required when the browser is visible and I close it manually half way
    objIE.Quit

End Sub
查看更多
登录 后发表回答