I have been messing around with HTML from within Excel, trying to approximate how common images are at different resolutions. I'm hoping to get something dynamic - a user enters a search term and the code loops through a set of pre-defined image resolutions, ranking how common images are for that search term between the specified resolutions.
Step one though is to get a reliable (and speedy) way of returning the number of images at a specific resolution. I wrote this code:
Sub GoogleWithURL() 'requires Microsoft HTML Object Library
Dim url As String, searchTerm As String
Dim objIE As InternetExplorer 'special object variable representing the IE browser
Dim ws As Worksheet
Set ws = ThisWorkbook.Sheets("sheet1")
Dim currPage As HTMLDocument
Dim xRes As Integer, yRes As Integer
With ws
xRes = .Range("XRes")
yRes = .Range("YRes")
searchTerm = .Range("search")
End With
'create URL to page with these image criteria
url = WorksheetFunction.Concat("https://www.google.com/search?q=", searchTerm, _
"&tbm=isch&source=lnt&tbs=isz:ex,iszw:", xRes, "iszh:", yRes)
'initiating a new instance of Internet Explorer and asigning it to objIE
Set objIE = New InternetExplorer
'objIE.Visible = True 'for debugging purposes
'Google images search
objIE.navigate url
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
Set currPage = objIE.document
'Count image results
Set valueResult = currPage.getElementById("rg_s").getElementsByTagName("IMG")
MsgBox WorksheetFunction.Concat("'", searchTerm, "' returns ", valueResult.Length _
, " images @ ", xRes, "x", yRes, "px.") 'returns number of loaded images on page
'close the browser
On Error Resume Next 'required when the browser is visible and I close it manually half way
objIE.Quit
End Sub
It navigates an internet explorer object to a specific resolution google image search, counts the number of images in the rg_s
ID (these are the image results as opposed to the banner images etc.). Then it returns that count as a message box. (When I eventually implement this, I will return the values in a column on the sheet, looping through 30ish different resolutions)
Problems
The main problems with this code are:
It doesn't give a very useful count. The resolution is low because it only counts the images that have loaded - meaning most search terms at common resolutions like 1920x1080 or 1366x768 return the max 100 images.
It is slow. To me, navigating to pages, counting image tags, it all seems a lot like using
.Select
in VBA. It's like a manual approach, what a human would do, and therefore inefficient.
Solutions
I can think of some approaches to solve these problems
Resolution of data/ getting a more useful count
Scroll down. If I can load more images, it's likely I'll be able to differentiate a little better. I found that scrolling as far down as I can (up to the 'Load more results' button) gives a cap of 400 not 100 - if there are at least that many images for a given resolution then I'm happy and I'll give it the top rank. Doesn't help with problem 2 though. Still, how would I do that?
Narrow results. If 100 is returned, I could change the
filetype:
in the URL I send, like appendingfiletype:png
to maybe halve the number of images returned, giving me a better spread in the 0-100 range. Not ideal though as I would have to iterate through multiple filetypes for some resolutions, slowing down the code and even then, not necessarily giving me what I want.Use Google's (or another search engine's) own values for this. I've asked this on various sites and in various forms, is there any data on image count available direct from Google - i.e. without returning (and slowly loading) the images themselves. Like the
about 1,300,500 results in 0.03 seconds
for a regular search, only for images? If I could use a pre-calculated value each time that samples a bigger array than the 100 results, I might get a much more detailed picture.
Slowness
- Try a different kind of HTTP request. Right now I open an instance of internet explorer and navigate to a page. That sounds very human style, I would prefer a computer style request. What I mean is, instead of using my laptop to trawl through images one by one, I get Google's supercomputers to do the leg work by only asking for the count, as oppose to the images themselves. Not a clue how to do this. I'm aware of another 2 ways to search the web from within Excel; web query and
CreateObject("MSXML2.serverXMLHTTP")
. Don't know either of those much but if you think they would be a better way to go then I'll look into them more closely.
- Try a different kind of HTTP request. Right now I open an instance of internet explorer and navigate to a page. That sounds very human style, I would prefer a computer style request. What I mean is, instead of using my laptop to trawl through images one by one, I get Google's supercomputers to do the leg work by only asking for the count, as oppose to the images themselves. Not a clue how to do this. I'm aware of another 2 ways to search the web from within Excel; web query and
Summary
Hopefully that's plenty to go on, I think my train of thought should be fairly clear. Actual answers on how to scroll down/ load more images/ get Google to return a count rather than images themselves would be best, advice on what to pursue would be useful too.
After a few more questions, and now feeling somewhat wiser, I've made a UDF for this:
Employed like this: to search "St. Mary" at 1366:768 image size then
Or with a 10 second timeout (the search stops scrolling if 10 seconds have passed, and just counts the images loaded)
I explain in another question how the scrolling works, it's now messy but functional.
Important:
As @John Muggins points out, the significant time taken is in loading, not counting. In particular, opening and closing
InternetExplorer
. So to avoid huge re-calculation times; if (like me) you want to check for more than one term/resolution, put this code in a macro, not a function (comment if you think I should post this). This UDF is only for a one off searchHope it's useful, I thought I should re-visit the question to post the answer I got to.
Final Note:
Your computer (probably) isn't crashing, the function is just calculating.
For search term, type anything you'd enter in a google search bar - e.g. "Jaguar -car" returns images of the animal, not the car company
The result is a number 0-400; 0-399 is the actual number of images counted (as long as you set timeout to be big enough - auto is 10s). 400 is the max, so there may be more than 400 images availible for that term, at that resolution.
Your bottleneck isn't in the for loop. It's in opening a browser and directing it to a location. If you are concerned about time then you should grab a browser that is already open to that page and not close it until you have run all your searches on it. You should save at least 2 seconds per search. I ran the following code and got these times:
Time to open and set Explorer: 2.41 seconds.
Time to Count 100 Photos(1): 0.1 seconds.
Time to Count 100 Photos(2): 0.11 seconds.
The difference between our approaches is 1/100th of a second.
Also, Google images requires the user to page down for it to call the next 100 images. If you can find the ajax or javascript statement to make that happen then you will be able to make it think it has pages down. That's why you're only getting 100 images.
Or you can open a browser, type in your search term, and page down until there are 299 images on the screen, when you will find a button that says "Show More Images". And then snag that open webpage.
If you are running several search terms than your time bottleneck is in opening and closing browsers, not counting the images.