This question sounds easy, but it is not as simple as it sounds.
Brief summary of what's wrong
For an example, use this board: http://pinterest.com/dodo/web-designui-and-mobile/
Examining the HTML for the board itself (inside the div with the class GridItems) at the top of the page yields:
<div class="variableHeightLayout padItems GridItems Module centeredWithinWrapper" style="..">
<!-- First div with a displayed board image -->
<div class="item" style="top: 0px; left: 0px; visibility: visible;">..</div>
...
<!-- Last div with a displayed board image -->
<div class="item" style="top: 3343px; left: 1000px; visibility: visible;">..</div>
</div>
Yet at the bottom of the page, after activating the infinite scroll a couple of times, we get this as the HTML:
<div class="variableHeightLayout padItems GridItems Module centeredWithinWrapper" style="..">
<!-- First div with a displayed board image -->
<div class="item" style="top: 12431px; left: 750px; visibility: visible;">..</div>
...
<!-- Last div with a displayed board image -->
<div class="item" style="top: 19944px; left: 750px; visibility: visible;">..</div>
</div>
As you can see, some of the containers for the images higher up on the page have disappeared, and not all of the containers for the images load upon first loading the page.
What I want to do
I want to create a C# script (or one in any server-side language, for now) that downloads the page's full HTML (i.e., the markup for every image on the board) and then downloads each image from its URL. Downloading the webpage and applying an appropriate XPath is easy; the real challenge is getting the full HTML that covers every image.
Is there a way I can emulate scrolling to the bottom of the page, or is there an even easier way to retrieve every image? I imagine that Pinterest uses AJAX to change the HTML; is there a way I can programmatically trigger the events to receive all the HTML? Thank you in advance for suggestions and solutions, and kudos for even reading this very long question if you do not have any!
Pseudo code
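using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        string pinterestUrl = "http://www.pinterest.com/...";
        string xPath = ".../img";
        // HtmlWeb fetches the page over HTTP; HtmlDocument.Load only reads local files or streams.
        HtmlDocument doc = new HtmlWeb().Load(pinterestUrl);
        var imageLinks = new List<string>();
        // Currently only downloads the first 25 images.
        // SelectNodes returns null when nothing matches the XPath.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xPath);
        if (nodes != null)
        {
            foreach (HtmlNode img in nodes)
            {
                imageLinks.Add(img.GetAttributeValue("src", ""));
                // Use image links
            }
        }
    }
}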
You can trigger the JSON endpoint by making a request with this header:
X-Requested-With:XMLHttpRequest
Try this command in the console:
You will see the pin data in the JSON output. You should be able to parse it and grab the next images that you need.
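For anyone scripting this rather than using the console, here is a minimal Python sketch of a request that carries that header. The helper names (build_xhr_request, fetch_json) are made up for illustration, and whether the endpoint still answers this way is an assumption, not something verified here.

```python
import json
import urllib.request


def build_xhr_request(url: str) -> urllib.request.Request:
    """Build a GET request that identifies itself as an AJAX call."""
    return urllib.request.Request(
        url,
        headers={"X-Requested-With": "XMLHttpRequest"},
    )


def fetch_json(url: str, timeout: float = 10.0) -> dict:
    """Send the request and parse the body as JSON (assumes a JSON response)."""
    with urllib.request.urlopen(build_xhr_request(url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling fetch_json with the endpoint URL should hand back the parsed pin data as a dictionary, provided the server responds with JSON.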
As for this bit: &_=1377658213300. I speculate that this is the id of the last pin of the previous list. You should be able to replace this on every call with the last pin from the previous response.
A couple of people have suggested using JavaScript to emulate scrolling.
I don't think you need to emulate scrolling at all. I think you just need to find out the format of the URIs called via AJAX whenever scrolling occurs, and then you can get each "page" of results sequentially. A little reverse engineering is required.
Using the network tab of Chrome inspector I can see that once I reach a certain distance down the page, this URI is called:
http://pinterest.com/resource/BoardFeedResource/get/?source_url=%2Fdodo%2Fweb-designui-and-mobile%2F&data=%7B%22options%22%3A%7B%22board_id%22%3A%22158400180582875562%22%2C%22access%22%3A%5B%5D%2C%22bookmarks%22%3A%5B%22LT4xNTg0MDAxMTE4NjcxMTM2ODk6MjV8ZWJjODJjOWI4NTQ4NjU4ZDMyNzhmN2U3MGQyZGJhYTJhZjY2ODUzNTI4YTZhY2NlNmY0M2I1ODYwYjExZmQ3Yw%3D%3D%22%5D%7D%2C%22context%22%3A%7B%22app_version%22%3A%22fb43cdb%22%7D%2C%22module%22%3A%7B%22name%22%3A%22GridItems%22%2C%22options%22%3A%7B%22scrollable%22%3Atrue%2C%22show_grid_footer%22%3Atrue%2C%22centered%22%3Atrue%2C%22reflow_all%22%3Atrue%2C%22virtualize%22%3Atrue%2C%22item_options%22%3A%7B%22show_rich_title%22%3Afalse%2C%22squish_giraffe_pins%22%3Afalse%2C%22show_board%22%3Afalse%2C%22show_via%22%3Afalse%2C%22show_pinner%22%3Afalse%2C%22show_pinned_from%22%3Atrue%7D%2C%22layout%22%3A%22variable_height%22%7D%7D%2C%22append%22%3Atrue%2C%22error_strategy%22%3A1%7D&_=1377092055381
If we decode that, we see that it's mostly JSON.
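One way to do the decoding, sketched in Python with only the standard library. The function name decode_feed_url is made up for illustration; the URL is the one captured above, split across lines purely for readability.

```python
import json
from urllib.parse import parse_qs, urlsplit


def decode_feed_url(url: str) -> dict:
    """Split the query string and parse the JSON held in the 'data' parameter."""
    query = parse_qs(urlsplit(url).query)  # parse_qs also undoes the %xx escapes
    return {
        "source_url": query["source_url"][0],
        "data": json.loads(query["data"][0]),
        "_": query["_"][0],
    }


FEED_URL = (
    "http://pinterest.com/resource/BoardFeedResource/get/"
    "?source_url=%2Fdodo%2Fweb-designui-and-mobile%2F"
    "&data=%7B%22options%22%3A%7B%22board_id%22%3A%22158400180582875562%22"
    "%2C%22access%22%3A%5B%5D%2C%22bookmarks%22%3A%5B%22"
    "LT4xNTg0MDAxMTE4NjcxMTM2ODk6MjV8ZWJjODJjOWI4NTQ4NjU4ZDMyNzhmN2U3MGQy"
    "ZGJhYTJhZjY2ODUzNTI4YTZhY2NlNmY0M2I1ODYwYjExZmQ3Yw%3D%3D%22%5D%7D"
    "%2C%22context%22%3A%7B%22app_version%22%3A%22fb43cdb%22%7D"
    "%2C%22module%22%3A%7B%22name%22%3A%22GridItems%22%2C%22options%22%3A%7B"
    "%22scrollable%22%3Atrue%2C%22show_grid_footer%22%3Atrue%2C%22centered%22%3Atrue"
    "%2C%22reflow_all%22%3Atrue%2C%22virtualize%22%3Atrue%2C%22item_options%22%3A%7B"
    "%22show_rich_title%22%3Afalse%2C%22squish_giraffe_pins%22%3Afalse"
    "%2C%22show_board%22%3Afalse%2C%22show_via%22%3Afalse%2C%22show_pinner%22%3Afalse"
    "%2C%22show_pinned_from%22%3Atrue%7D%2C%22layout%22%3A%22variable_height%22%7D%7D"
    "%2C%22append%22%3Atrue%2C%22error_strategy%22%3A1%7D"
    "&_=1377092055381"
)

if __name__ == "__main__":
    # Pretty-print the three query parameters, with 'data' expanded as JSON.
    print(json.dumps(decode_feed_url(FEED_URL), indent=2))
```

The parts that look worth watching between requests are options.board_id, options.bookmarks, error_strategy and the trailing _ value.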
Scroll down until we get a second request, and we see this
As you can see, not much has changed. The board_id is the same. error_strategy is now 2, and the &_ at the end is different.
The &_ parameter is key here. I would bet that it tells the page where to begin the next set of photos. I can't find a reference to it in either of the responses or in the original page HTML, but it has to be in there somewhere, or be generated by JavaScript on the client side. Either way, the page or browser has to know what to ask for next, so this information is something you should be able to get at.
Probably a bit late, but with the open-source py3-pinterest project you can do it easily:
First, get all pins from the board as objects; they also include the original image URL.
Then you can obtain the image URLs and download them, or do whatever you like with them.
Full code example: https://github.com/bstoilov/py3-pinterest/blob/master/download_board_images.py
Yes, it's Python, but if you still insist on C#, it should be easy to port :)
Okay, so I think this may be (with a few alterations) what you need.
Caveats:
Points of interest:
The _ parameter takes a timestamp in JavaScript format, i.e. like Unix time but in milliseconds. It's not actually used for pagination.
Pagination uses the bookmarks property, so you make the first request to the 'new' endpoint, which doesn't require it, then take the bookmarks from the result and use it in your request to get the next 'page' of results, take the bookmarks from those results to fetch the next page after that, and so on until you run out of results or reach your pre-set limit (or you hit the server max for script execution time). I'd be curious to know exactly what the bookmarks field encodes. I would like to think there's some fun secret sauce beyond just a pin ID or some other page marker.
Let me know if you run into problems getting this adapted to your particular endpoints. Apologies for any sloppiness in the code; it didn't make it to production originally.
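For reference, the bookmark-chaining loop described in the points above might be sketched like this in Python. Everything named here is hypothetical: fetch_page stands in for whatever function performs the HTTP call, and the "pins" and "bookmarks" keys are an assumed response shape, not the real Pinterest payload.

```python
import time
from typing import Callable, Iterator, Optional


def js_timestamp() -> int:
    """Milliseconds since the Unix epoch, like Date.now() in JavaScript."""
    return int(time.time() * 1000)


def iter_pages(
    fetch_page: Callable[[Optional[list], int], dict],
    max_pages: int = 50,
) -> Iterator[list]:
    """Yield one list of pins per page, chaining bookmarks between requests.

    fetch_page(bookmarks, timestamp) is a caller-supplied (hypothetical)
    function that performs the request and returns a dict with the assumed
    keys "pins" and "bookmarks".
    """
    bookmarks = None  # the first request needs no bookmarks
    for _ in range(max_pages):  # pre-set limit so the loop always terminates
        response = fetch_page(bookmarks, js_timestamp())
        pins = response.get("pins", [])
        if not pins:
            break  # ran out of results
        yield pins
        bookmarks = response.get("bookmarks")
        if not bookmarks:
            break  # server gave no marker for a next page
```

Passing the fetcher in as an argument keeps the loop independent of any particular HTTP library, and it means the paging logic can be exercised with a stub before pointing it at a live endpoint.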