Is it possible to write a web crawler in JavaScript?

Posted 2019-03-10 16:16

I want to crawl a page, check it for hyperlinks, follow those hyperlinks, and capture data from the pages they lead to.

10 answers
Explosion°爆炸
#2 · 2019-03-10 16:37

Google's Chrome team released Puppeteer in August 2017. It is a Node library that provides a high-level API for both headless and non-headless Chrome (headless Chrome has been available since version 59).

It downloads a bundled version of Chromium, so it should work out of the box. If you want to use a specific Chrome version, you can do so by launching Puppeteer with an executable path as a parameter, such as:

const browser = await puppeteer.launch({executablePath: '/path/to/Chrome'});

An example of navigating to a webpage and taking a screenshot of it shows how simple it is (taken from the GitHub page):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({path: 'example.png'});

  await browser.close();
})();
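
To get from a screenshot to the crawling the question asks about, the same API can read the links on each page and follow them. The following is a minimal sketch, not taken from the Puppeteer documentation: the function names collectLinks and crawl and the maxPages parameter are assumptions made for illustration, and `browser` is expected to be the object returned by puppeteer.launch().

```javascript
// Sketch: collect the absolute URLs of every hyperlink on a page.
// `page` is expected to be a Puppeteer Page; page.$$eval runs the callback
// inside the browser against all elements matching the selector.
async function collectLinks(page, url) {
  await page.goto(url);
  const hrefs = await page.$$eval('a[href]', anchors => anchors.map(a => a.href));
  // Drop duplicates and non-HTTP schemes such as mailto: or javascript:
  return [...new Set(hrefs)].filter(href => /^https?:/.test(href));
}

// Breadth-first crawl. `maxPages` (hypothetical parameter) caps the number
// of pages visited so the crawl always terminates.
async function crawl(browser, startUrl, maxPages = 10) {
  const page = await browser.newPage();
  const seen = new Set([startUrl]);
  const queue = [startUrl];
  const visited = [];
  while (queue.length > 0 && visited.length < maxPages) {
    const url = queue.shift();
    visited.push(url);
    for (const link of await collectLinks(page, url)) {
      if (!seen.has(link)) {
        seen.add(link);
        queue.push(link);
      }
    }
  }
  await page.close();
  return visited;
}
```

Inside the async wrapper shown above, this could be called as crawl(browser, 'https://example.com', 5) before browser.close(). Data capture would go next to the collectLinks call, for example another page.$$eval with a selector for the content you need.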
Summer. ? 凉城
#3 · 2019-03-10 16:47

There is a client-side approach for this, using the Firefox Greasemonkey extension. With Greasemonkey you can create scripts that are executed each time you open specified URLs.

Here is an example.

If you have URLs like these:

http://www.example.com/products/pages/1

http://www.example.com/products/pages/2

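then you can use something like this to open all the pages containing a product list (execute this manually):

// Open product list pages 1 to 4, one every 15 seconds.
// `let` gives each iteration its own `i`, so the timer callback sees the right page number.
for (let i = 1; i < 5; i++) {
  setTimeout(function () {
    window.open('http://www.example.com/products/pages/' + i, '_blank');
  }, 15000 * i);
}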

Then you can create a script that, for each product list page, opens every product in a new window, and register this URL pattern in Greasemonkey for it:

http://www.example.com/products/pages/*

Finally, add a script for each product page that extracts the data, passes it to a web service, closes the window, and so on.

Deceive 欺骗
#4 · 2019-03-10 16:51

Generally, browser JavaScript can only crawl within the domain of its origin, because fetching pages would be done via Ajax, which is restricted by the Same-Origin Policy.

If the page running the crawler script is on www.example.com, then that script can crawl all the pages on www.example.com, but not the pages of any other origin (unless some edge case applies, e.g., the Access-Control-Allow-Origin header is set for pages on the other server).
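
As a rough illustration of that rule, the check below compares origins (scheme, host, and port) with the standard URL class. The function names isSameOrigin and crawlableLinks are made up for this sketch; it only predicts which links an in-page crawler could fetch and does not account for servers that opt in via Access-Control-Allow-Origin.

```javascript
// Two URLs share an origin when scheme, host, and port all match.
function isSameOrigin(urlA, urlB) {
  return new URL(urlA).origin === new URL(urlB).origin;
}

// Keep only the links a same-origin crawler running on `pageUrl` could fetch.
// Relative links are resolved against the page URL first.
function crawlableLinks(pageUrl, hrefs) {
  return hrefs
    .map(href => new URL(href, pageUrl).href)
    .filter(href => isSameOrigin(href, pageUrl));
}

console.log(crawlableLinks('http://www.example.com/a/', [
  '/b',
  'c.html',
  'http://other.example.org/x',
  'https://www.example.com/d', // different scheme, so a different origin
]));
```

Only the first two links survive the filter, resolved to http://www.example.com/b and http://www.example.com/a/c.html.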

If you really want to write a fully featured crawler in browser JS, you could write a browser extension: for example, Chrome extensions are packaged web applications that run with special permissions, including cross-origin Ajax. The difficulty with this approach is that you'll have to write multiple versions of the crawler if you want to support multiple browsers. (If the crawler is just for personal use, that's probably not an issue.)

爷的心禁止访问
#5 · 2019-03-10 16:53

My typical setup is a browser extension with cross-origin privileges set, which injects both the crawler code and jQuery into the page.

Another take on JavaScript crawlers is to use a headless browser such as PhantomJS, or CasperJS (which builds on PhantomJS and extends its capabilities).

Explosion°爆炸
#6 · 2019-03-10 16:53

I made an example JavaScript crawler on GitHub.

It's event driven and uses an in-memory queue to store all the resources (i.e. URLs).

How to use it in your Node environment:

var Crawler = require('../lib/crawler')
var crawler = new Crawler('http://www.someUrl.com');

// crawler.maxDepth = 4;
// crawler.crawlInterval = 10;
// crawler.maxListenerCurrency = 10;
// crawler.redisQueue = true;
crawler.start();

Here I'm just showing you the two core methods of a JavaScript crawler.

Crawler.prototype.run = function() {
  var crawler = this;
  process.nextTick(() => {
    //the run loop
    crawler.crawlerIntervalId = setInterval(() => {

      crawler.crawl();

    }, crawler.crawlInterval);
    //kick off first one
    crawler.crawl();
  });

  crawler.running = true;
  crawler.emit('start');
}


Crawler.prototype.crawl = function() {
  var crawler = this;

  if (crawler._openRequests >= crawler.maxListenerCurrency) return;


  //go get the item
  crawler.queue.oldestUnfetchedItem((err, queueItem, index) => {
    if (queueItem) {
      //got the item start the fetch
      crawler.fetchQueueItem(queueItem, index);
    } else if (crawler._openRequests === 0) {
      crawler.queue.complete((err, completeCount) => {
        if (err)
          throw err;
        crawler.queue.getLength((err, length) => {
          if (err)
            throw err;
          if (length === completeCount) {
            //no open Request, no unfetcheditem stop the crawler
            crawler.emit("complete", completeCount);
            clearInterval(crawler.crawlerIntervalId);
            crawler.running = false;
          }
        });
      });
    }

  });
};
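
The crawl() method above leans on three queue callbacks: oldestUnfetchedItem, complete, and getLength. The class below is a hypothetical in-memory queue with that shape, written only to show how the pieces fit together. It is not the repository's implementation. The add and markDone methods and the status field are assumptions, and as a simplification this sketch marks an item as in flight inside oldestUnfetchedItem so the same URL is never handed out twice.

```javascript
// Hypothetical in-memory queue exposing the callbacks crawl() relies on.
class MemoryQueue {
  constructor() {
    this.items = [];        // entries look like { url, status }
    this.known = new Set(); // every URL ever queued, for de-duplication
  }

  // Returns false when the URL was already queued.
  add(url) {
    if (this.known.has(url)) return false;
    this.known.add(url);
    this.items.push({ url, status: 'queued' });
    return true;
  }

  // Hands the oldest queued item to the callback, or null when none is left.
  oldestUnfetchedItem(callback) {
    const index = this.items.findIndex(item => item.status === 'queued');
    if (index === -1) return callback(null, null, -1);
    this.items[index].status = 'fetching';
    callback(null, this.items[index], index);
  }

  // Called by the fetcher once a response has been processed.
  markDone(index) {
    this.items[index].status = 'done';
  }

  // Number of items that have finished fetching.
  complete(callback) {
    callback(null, this.items.filter(item => item.status === 'done').length);
  }

  // Total number of items ever queued.
  getLength(callback) {
    callback(null, this.items.length);
  }
}
```

With a queue like this, the crawl is finished exactly when the count from complete equals the count from getLength, which is the condition crawl() checks before emitting "complete".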

Here is the GitHub link: https://github.com/bfwg/node-tinycrawler. It is a JavaScript web crawler written in under 1000 lines of code. This should put you on the right track.

等我变得足够好
#7 · 2019-03-10 16:55

If you use server-side JavaScript, it is possible. You should take a look at Node.js.

An example of a crawler can be found at the link below:

http://www.colourcoding.net/blog/archive/2010/11/20/a-node.js-web-spider.aspx
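
For a sense of what a server-side crawler involves, here is a small dependency-free sketch. It is not the code from the linked article. It assumes Node 18 or later for the global fetch function, and the names extractLinks, crawlSite, maxDepth, and fetchHtml are all invented for this example. Link extraction uses a regular expression for brevity; an HTML parser would be more robust.

```javascript
// Pull href values out of anchor tags and resolve them against the page URL.
function extractLinks(html, baseUrl) {
  const links = [];
  const pattern = /<a\s[^>]*?href\s*=\s*["']([^"']+)["']/gi;
  let match;
  while ((match = pattern.exec(html)) !== null) {
    try {
      const url = new URL(match[1], baseUrl);
      url.hash = ''; // "#section" links point at the same document
      if (url.protocol === 'http:' || url.protocol === 'https:') {
        links.push(url.href);
      }
    } catch (e) {
      // Skip hrefs that cannot be parsed as URLs.
    }
  }
  return links;
}

// Default fetcher: relies on the global fetch available in Node 18+.
async function defaultFetchHtml(url) {
  const response = await fetch(url);
  return response.text();
}

// Crawl level by level, staying on the start URL's host and stopping
// after `maxDepth` hops. `fetchHtml` can be swapped out, e.g. for tests.
async function crawlSite(startUrl, { maxDepth = 2, fetchHtml = defaultFetchHtml } = {}) {
  const host = new URL(startUrl).host;
  const seen = new Set([startUrl]);
  const results = [];
  let frontier = [startUrl];
  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next = [];
    for (const url of frontier) {
      let html;
      try {
        html = await fetchHtml(url);
      } catch (e) {
        continue; // unreachable page: skip it and keep crawling
      }
      const links = extractLinks(html, url);
      results.push({ url, depth, linkCount: links.length });
      for (const link of links) {
        if (new URL(link).host === host && !seen.has(link)) {
          seen.add(link);
          next.push(link);
        }
      }
    }
    frontier = next;
  }
  return results;
}
```

Calling crawlSite('https://example.com/', { maxDepth: 1 }) resolves to one record per fetched page. The place to capture data is where results.push is called, since the page's HTML is available there.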
