HTML Snapshot for crawler - Understanding how it works

Published 2019-05-11 05:35

Question:

I'm reading this article today. To be honest, I'm really interested in point "2. Much of your content is created by a server-side technology such as PHP or ASP.NET".

I want to check whether I've understood it correctly :)

I create a PHP script (gethtmlsnapshot.php) that includes the server-side AJAX page (getdata.php) and escapes the parameters (for security). Then I add it at the end of the static HTML page (index-movies.html). Right? Now...

1 - Where do I put that gethtmlsnapshot.php? In other words, I need to call that page (or rather, the crawler needs to). But if I don't have a link to it on the main page, the crawler can't call it :O How can the crawler call the page with the _escaped_fragment_ parameters? It can't know them if I don't specify them somewhere :)

2 - How can the crawler call that page with the parameters? As before, I would need links to that script with the parameters, so the crawler can browse each page and save the content of the dynamic result.

Can you help me? And what do you think about this technique? Wouldn't it be better if the crawler developers built their bots some other way? :)

Let me know what you think. Cheers

Answer 1:

I think you've got something wrong, so I'll try to explain what's going on here, including the background and alternatives, as this is indeed a very important topic that most of us stumble upon (or at least something similar) from time to time.

Using AJAX, or rather asynchronous incremental page updating (most pages actually exchange JSON rather than XML), has enriched the web and provided a great user experience.

It has however also come at a price.

The main problem was clients that didn't support the XMLHttpRequest object, or JavaScript at all. In the beginning you had to provide backwards compatibility. This was usually done by providing regular links, capturing the onclick event, and firing an AJAX call instead of reloading the page (if the client supported it).

Today almost every client supports the necessary functions.

So the problem today is search engines. Because they don't support JavaScript. Well, that's not entirely true, because they partly do (especially Google), but for other purposes. Google evaluates certain JavaScript code to prevent black-hat SEO (for example, a link pointing somewhere but with JavaScript opening a completely different webpage... or HTML keywords that are invisible to the client because they are removed by JavaScript, or the other way round).

But to keep it simple, it's best to think of a search engine crawler as a very basic browser with no CSS or JS support (it's similar with CSS: it is partly parsed, for special reasons).

So if you have "AJAX links" on your website and the web crawler doesn't support following them using JavaScript, they just don't get crawled. Or do they? Well, plain JavaScript links (like document.location = whatever) do get followed; Google is often intelligent enough to guess the target. But AJAX calls are not made, simply because they return partial content, and no meaningful whole page can be constructed from it: the context is unknown, and no unique URI represents the location of the content.

So there are basically 3 strategies to work around that.

  1. have an onclick handler on links with a normal href attribute as fallback (IMO the best option, as it solves the problem for clients as well as search engines)
  2. submit the content pages via your sitemap so they get indexed, but completely apart from your site's links (pages usually provide a permalink to these URLs so that external pages link to them for the PageRank)
  3. the AJAX crawling scheme
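Strategy 1 can be sketched like this (hypothetical function names; the loader is injected so the routing logic stays testable). The href keeps working for crawlers and no-JS clients, while JS-capable browsers upgrade the click to a partial update:

```javascript
// HTML counterpart (illustrative):
//   <a href="/imprint.php" onclick="return handleClick(event, loadPartial)">Imprint</a>

function handleClick(event, loader) {
  // event.currentTarget is the <a>; its href is the crawler/no-JS fallback.
  const url = event.currentTarget.href;
  loader(url);              // e.g. fetch(url) and swap the content region
  event.preventDefault();   // stop the full page reload for JS-capable clients
  return false;             // belt and braces for inline onclick handlers
}
```

If JavaScript is disabled, the handler never runs and the browser simply follows the href, which is exactly what the crawler does too.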

the idea of the AJAX crawling scheme is to entangle your JavaScript XMLHttpRequest calls with corresponding href attributes that look like this: www.example.com/ajax.php#!key=value

so the link looks like:

<a href="http://www.example.com/ajax.php#!page=imprint" onclick="handleajax()">go to my imprint</a>

the function handleajax could evaluate the document.location variable to fire the incremental asynchronous page update. it's also possible to pass an id or URL or whatever.
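A minimal sketch of what handleajax's internals could do (parseHashbang is a hypothetical helper, not from the original article): read the #! fragment from the URL and turn it into key/value pairs that drive the partial update.

```javascript
// Extract the AJAX-crawling-scheme parameters from a #! URL.
function parseHashbang(href) {
  const i = href.indexOf('#!');
  if (i === -1) return null;                 // not a hashbang URL
  const params = {};
  for (const pair of href.slice(i + 2).split('&')) {
    const [key, value] = pair.split('=');
    params[key] = value;
  }
  return params;
}

// parseHashbang('http://www.example.com/ajax.php#!page=imprint')
//   → { page: 'imprint' }
```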

the crawler however recognises the AJAX crawling scheme format and automatically fetches http://www.example.com/ajax.php?_escaped_fragment_=page=imprint instead of http://www.example.com/ajax.php#!page=imprint. the query string then contains the hash fragment, from which you can tell which partial content has been requested. so you just have to make sure that http://www.example.com/ajax.php?_escaped_fragment_=page=imprint returns a full website that looks exactly like the website should look to the user after the XMLHttpRequest update has been made.
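The URL rewriting the crawler performs can be sketched as below. This is a simplified model of the (now deprecated) Google AJAX crawling scheme: the real scheme also percent-escapes special characters in the fragment, and appends with '&' if the URL already has a query string; this sketch ignores both.

```javascript
// Rewrite a #! URL into the _escaped_fragment_ form the crawler fetches.
// Simplified: no percent-escaping, assumes no pre-existing query string.
function toEscapedFragment(url) {
  const i = url.indexOf('#!');
  if (i === -1) return url;  // not a hashbang URL, fetch as-is
  return url.slice(0, i) + '?_escaped_fragment_=' + url.slice(i + 2);
}

// toEscapedFragment('http://www.example.com/ajax.php#!page=imprint')
//   → 'http://www.example.com/ajax.php?_escaped_fragment_=page=imprint'
```

This also answers question 1 above: the crawler derives the _escaped_fragment_ URL itself from the #! links it finds on your pages, so you never link to the snapshot script directly.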

a very elegant solution is also to pass the &lt;a&gt; element itself to the handler function, which then fetches the same URL the crawler would have fetched, using AJAX but with additional parameters. your server-side script then decides whether to deliver the whole page or just the partial content.
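The server-side decision just described could look like the following sketch (the `partial` parameter name is a hypothetical choice, not from the article): one endpoint serves the full snapshot to crawlers and direct visitors, and only the fragment to the site's own AJAX calls.

```javascript
// Decide what to render for a given set of query parameters.
function chooseResponse(query) {
  if ('_escaped_fragment_' in query) {
    return 'full';      // crawler hitting the escaped-fragment URL: full snapshot
  }
  if (query.partial === '1') {
    return 'partial';   // the site's own AJAX call asked for just the fragment
  }
  return 'full';        // direct navigation by a regular browser
}
```

The appeal is that full and partial views are produced by one script from one set of parameters, so the snapshot can't drift out of sync with the AJAX content.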

It's a very creative approach indeed, and here comes my personal pro/con analysis:

pro:

  • partially updated pages receive a unique identifier, at which point they are fully qualified resources in the semantic web
  • partially updated pages receive a unique identifier that can be presented in search engines

con:

  • it's just a fallback solution for search engines, not for clients without JavaScript
  • it provides opportunities for black-hat SEO, so Google for sure won't adopt it fully, or rank pages using this technique highly, without proper verification of the content.

conclusion:

  • plain links with legacy, working href attributes plus an onclick handler are a better approach, because they also provide functionality for old browsers.

  • the main advantage of the AJAX crawling scheme is that partially updated pages get a unique URI, and you don't have to create duplicate content that somehow serves as the indexable and linkable counterpart.

  • you could argue that an AJAX crawling scheme implementation is more consistent and easier to implement, but I think that is a question of your application design.