jQuery to access DOM in a site

Posted 2020-04-17 08:33

Question:

I am trying to scrape various elements from a table on this site to teach myself scraping with node.js, cheerio and request.

I am having trouble getting the items in the table. Essentially, I want to get 'rank', 'company' and '3-year growth' from the table. How do I do this?

Based on an online tutorial, I have developed my scraping.js script to look like this:

    var request = require('request'),
        cheerio = require('cheerio');
    request('http://www.inc.com/inc5000/index.html', function (error, response, html) {
      if (!error && response.statusCode === 200) {
        var $ = cheerio.load(html);
        $('tr.ng-scope').each(function (i, element) { // problem probably lies here
          var a = $(this).get(0);
          console.log(a);
        });
      }
    });

However, I am fairly sure the commented line above is not right. Is there a better way to access the values in the table?

I notice the XPaths are as follows:

//*[@id="col-r"]/table/tbody/tr[2]/td[1] -- rank

//*[@id="col-r"]/table/tbody/tr[2]/td[2]/a -- name of company

//*[@id="col-r"]/table/tbody/tr2/td[3] -- 3 year growth rate

I am just trying to figure out how to access these values.

Answer 1:

You're on the right track.

The .get(0) call returns the underlying element itself rather than a cheerio-wrapped object. In your case, var a is the whole TR. That's not necessarily what you want.

What you need to do is further subdivide each row into its individual TDs. I did this using $(this).find('td'). Then I take each TD one by one and extract its text, building an object whose keys are the fields of the table. These objects are collected into an array, but you can use the same basic idea to build whatever data structure suits you.

    var request = require('request'),
        cheerio = require('cheerio');

    request('http://www.inc.com/inc5000/index.html', function (error, response, html) {
        // Bail out on a network error or a non-200 response.
        if (error || response.statusCode !== 200) return;

        var $ = cheerio.load(html);
        var DATA = [];

        // Each matching row becomes one object keyed by column name.
        $('tr.ng-scope').each(function () {
            var $tds = $(this).find('td');

            DATA.push({
                rank:     $tds.eq(0).text(),
                company:  $tds.eq(1).text(),
                growth:   $tds.eq(2).text(),
                revenue:  $tds.eq(3).text(),
                industry: $tds.eq(4).text()
            });
        });

        console.log(DATA);
    });
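
If you would rather not hard-code one property per column, the "build whatever data structure you see fit" idea can be pulled into a small helper. This is only a sketch: rowsToRecords and the sample rows are made up for illustration, and it assumes you have already collected each row's cell text into a plain array of strings (the comment shows one way that might be done with cheerio).

```javascript
// Hypothetical helper: turn rows of cell text into objects keyed by field name.
// With cheerio, one row's cell text could be collected with something like:
//   $(row).find('td').map(function () { return $(this).text(); }).get()
function rowsToRecords(rows, fields) {
  return rows.map(function (cells) {
    var record = {};
    fields.forEach(function (field, i) {
      // A missing cell becomes an empty string; surrounding whitespace is trimmed.
      record[field] = (cells[i] || '').trim();
    });
    return record;
  });
}

// Made-up sample data standing in for scraped cell text.
var sampleRows = [
  [' 1 ', 'Example Co', '1,000%'],
  ['2', 'Sample Inc ', '900%']
];

var records = rowsToRecords(sampleRows, ['rank', 'company', 'growth']);
console.log(records);
```

Keeping the field list as an argument means that asking for only rank, company and growth (as in the question) is just a shorter array, and the trimming takes care of stray whitespace that .text() tends to pick up from the markup.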