How can JavaScript/jQuery be used to identify the description or title corresponding to an image on a webpage with multiple images and descriptions?
The page title can be extracted very easily, but it may not correspond to the image, especially if there are many images on the page:
var title = document.title;
I believe this has been done successfully by Pinterest's Pin-it bookmarklet. I'm guessing it uses an algorithm to find the nearest `h1`, `h2`, `h3`, or the image's `alt` attribute, then falls back to `document.title` if the algorithm fails to identify the image's description on the page.
Any ideas greatly appreciated!
EDIT
This is for data scraping other websites
The best answer is: Look at how Pinterest does it.
For jQuery, look at the `closest()` function.
Here is some quick and dirty untested code to give you a starting point for thinking about this, but this is a very open-ended question, and the intelligence in your code can be as complex and robust or as simple as you want it to be.
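A minimal sketch of the heuristic guessed at in the question (the image's own `alt`/`title`, then the nearest heading, then the page title). The function name `describeImage` and the search order are assumptions for illustration, not Pinterest's actual logic. It uses plain DOM properties rather than jQuery so it can be read in isolation.

```javascript
// Sketch: pick a description for an image element.
// `img` can be any object exposing getAttribute, parentElement and
// querySelector, which real DOM elements do.
function describeImage(img, fallbackTitle) {
  // 1. Prefer text attached to the image itself.
  var own = (img.getAttribute('alt') || img.getAttribute('title') || '').trim();
  if (own) {
    return own;
  }

  // 2. Walk up the ancestors and take the first heading found inside one.
  for (var node = img.parentElement; node; node = node.parentElement) {
    var heading = node.querySelector('h1, h2, h3');
    if (heading && heading.textContent.trim()) {
      return heading.textContent.trim();
    }
  }

  // 3. Nothing nearby: fall back to the page title.
  return fallbackTitle;
}
```

In a browser this could be called as `describeImage(someImg, document.title)`. The ancestor walk returns the first heading in the smallest enclosing container that has one, which is only a rough stand-in for "nearest".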
The OP has provided a great question to expand on. I recently created a jsFiddle for another SO answer to data scrape the URL, title, and thumbnail from the new Yahoo! Screen Video Player webpages.
I have just rewritten that jsFiddle so it's Pinterest-specific and have made direct use of Metatag Object Numbers (more on that later), which makes this jsFiddle very different from that one.
The overall process involves using Yahoo's Query Language (YQL) along with jQuery's `.ajax()` function to get the desired scraped data, usually available in the metatag section of the webpage's source.
First, let me explain a few things.
The Pinterest Link that I will use is a direct link to a pinned item. This means that the webpage will contain the primary pinned item along with many other smaller pinned items, unlike the homepage, which contains only a multitude of pinned items.
That Pinterest Link has for its webpage title the pinned item's Title along with a few words that make up the pinned item's Description. This is most likely not desired; just the pinned item's Title is all that's needed.
Viewing the HTML source for the Pinterest Link shows us the metatags that are currently used. Here's most of them:
As you can see, those metatags contain the `og:title` and `og:image` data that we are after. These og metatags are therefore a direct target for the data scraping process.
To be sure, the `og:image` content link above is for the full-size image version, via `_c.jpg`. The thumbnail versions use `_b.jpg`. Essentially, you have two unique image sizes per pinned item.
Since the data scraping process does not return these og property names, only Metatag Object Numbers, we need to analyze the returned content associated with each Metatag Object Number.
Looking at the above metatag source, it's clear that the image will always be located at a URL starting with `http://media-`. Those 13 characters are unique among all the metatags, so when they are matched, the entire URL is the image location.
Of course, should Pinterest use more than one URL template for their images, then things will need to be adjusted accordingly.
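The prefix match described above could look roughly like this. The shape of `metas` (an array of objects with a `content` string, in scrape order) and the function names are assumptions for illustration; the real YQL result may need unwrapping first.

```javascript
// Prefix taken from the answer above; adjust if Pinterest changes its image host.
var IMAGE_PREFIX = 'http://media-';

// `metas` is assumed to be an array of objects with a `content` string,
// one per metatag, in the order the scrape returned them.
function findImageUrl(metas) {
  for (var i = 0; i < metas.length; i++) {
    var content = metas[i] && metas[i].content;
    if (typeof content === 'string' && content.indexOf(IMAGE_PREFIX) === 0) {
      return content; // the whole content value is the image URL
    }
  }
  return null; // no metatag matched the prefix
}

// Swap the full-size suffix for the thumbnail suffix described above.
function toThumbnail(imageUrl) {
  return imageUrl.replace(/_c\.jpg$/, '_b.jpg');
}
```

Returning `null` when nothing matches gives the caller a clear signal that the URL template has probably changed.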
Looking at `og:title`, you immediately realize that there is no unique string of characters in the content portion to indicate that this tag is the image's title. Therefore, assuming all metatags follow a template that will not change for some time, we will rely on Metatag Object Number 7 to provide the Pinterest pinned item's image title. To be clear, this number 7 is based on the `.ajax()` and YQL results from this script's process, not the source HTML structure as seen above.
Again, if Pinterest changes their template for the head section, then adjustments may be required.
What follows now is a live step-by-step tutorial I wrote, based on data scraping techniques/script seen in this online article.
jsFiddle Pinterest Data Scraping DEMO
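Separately from the demo, here is a rough sketch of the two pieces discussed above: assembling the YQL request and reading the title by its Metatag Object Number. The endpoint is the public YQL endpoint as it was documented at the time, and `buildYqlUrl`, `pickTitle`, and the `metas` array shape are hypothetical names and assumptions, not the demo's actual code.

```javascript
// Public YQL endpoint as documented when this answer was written (assumption).
var YQL_ENDPOINT = 'https://query.yahooapis.com/v1/public/yql';

// Metatag Object Number used for the title in the answer; template-dependent.
var TITLE_INDEX = 7;

// Build the request URL for a YQL query against the `html` table.
function buildYqlUrl(pageUrl, xpath) {
  var query = 'select * from html where url="' + pageUrl +
              '" and xpath="' + xpath + '"';
  return YQL_ENDPOINT + '?q=' + encodeURIComponent(query) + '&format=json';
}

// `metas` is assumed to be the array of returned metatag objects,
// each carrying a `content` string.
function pickTitle(metas) {
  var tag = metas[TITLE_INDEX];
  return tag && typeof tag.content === 'string' ? tag.content : null;
}
```

With jQuery, the URL would typically be passed to `$.ajax({ url: buildYqlUrl(pinUrl, '//head/meta'), dataType: 'jsonp' })`, and the success handler would hand the returned metatag list to `pickTitle`. Because the index is hard-coded, `pickTitle` returns `null` rather than guessing when fewer tags come back.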
Tip:
Although not demonstrated, you have at your disposal a numeric value for the total number of metatags found, which can be checked against a predetermined value for what the page should contain; a mismatch indicates that the head section has changed. For example, the current metatag count is 25 items. If the returned value is not equal to this on any other Pinterest pinned item webpage, you know there is a different head section in use, which may affect the script, since it expects exactly 25 and calls two of them directly by their Metatag Object Number.
Something extra:
If you're curious how to retrieve the current Pinterest pinned items as seen on the homepage, first understand how this jsFiddle DEMO works. Then, you'll need to make your own jsFiddle version for testing, using the Pinterest homepage URL and changing the XPATH in the `.ajax()` call to data scrape only the relevant divs in the body section. To learn more about XPATH basics, click HERE. Then you can understand: XPATH for Select Divs in Body on YQL Playground.
For example, the body section contains a maximum total of 50 pins in this format:
Those href fragments will serve as a starting point in recreating the URLs. Important note: some pins may be repins, which means you will have fewer than 50 pins returned.
For those that read this far, here it is:
Something Extra jsFiddle DEMO.
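As a rough illustration of recreating the URLs from the href fragments mentioned above: the fragment shape (`/pin/123/`), the base origin, and the name `rebuildPinUrls` are all hypothetical, since the exact format is not shown here.

```javascript
// Assumed site origin; the answer does not spell out the base URL.
var PINTEREST_BASE = 'http://pinterest.com';

// `fragments` is assumed to be an array of relative hrefs such as '/pin/123/'
// (a hypothetical shape). Empty or non-string entries are skipped.
function rebuildPinUrls(fragments, base) {
  var origin = base || PINTEREST_BASE;
  return fragments
    .filter(function (fragment) {
      return typeof fragment === 'string' && fragment.length > 0;
    })
    .map(function (fragment) {
      // Resolve the relative href against the site origin.
      return new URL(fragment, origin).href;
    });
}
```

The result can hold fewer than 50 entries, so callers should loop over its actual length rather than a fixed count.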
Here is an improved XPATH for Select Divs in Body on YQL Playground, but do understand how the longer one above works.
Also see my other Pinterest SO Answers for:
Custom Pinterest button for custom URL (Text-Link, Image, or Both)
How can I duplicate Pinterest website's modal effect?