Scraping links from website using Node.js, request

Posted 2019-08-11 18:38

Question:

I'm trying to scrape links on my school's course schedule website using Node.js, request, and cheerio. However, my code is not reaching all subject links.

Link to course schedule website here.

Below is my code:

var express = require('express');
var request = require('request');
var cheerio = require('cheerio');

var app = express();

app.get('/subjects', function(req, res) {
  var URL = 'http://courseschedules.njit.edu/index.aspx?semester=2016s';

  request(URL, function(error, response, body) {
    if(!error) {
      var $ = cheerio.load(body);

      $('.courseList_section a').each(function() {
        var text = $(this).text();
        var link = $(this).attr('href');

        console.log(text + ' --> ' + link);
      });
    }
    else {
      console.log('There was an error!');
    }
  });
});

app.listen('8080');
console.log('Magic happens on port 8080!');

My output can be found here.

As you can see from my output, some links are missing: specifically, the links from sections 'A', 'I (Continued)', and 'R (Continued)'. These are also the first sections of each column.

Each section is contained in its own div with the class name 'courseList_section', so I don't understand why '.courseList_section a' doesn't loop through all the links. Am I missing something obvious? Any and all insight is much appreciated.

Thank you in advance!

Answer 1:

The problem isn't your code; it's the site you're trying to parse. Its HTML is invalid. You're selecting everything inside .courseList_section, but the markup looks like this:

<span> <!-- Opening tag -->
    <div class='courseList_section'>
      <a href='index.aspx?semester=2016s&subjectID=ACC '>ACC  - Accounting/Essex CC</a>
      </span> <!-- Invalid closing tag for the first span, meaning that .courseList_section will be closed instead -->

<!-- Suddenly this link is outside the .courseList_section tag, meaning that it will be ignored by cheerio -->
<a href='index.aspx?semester=2016s&subjectID=ACCT'>ACCT - Accounting</a>
  <!-- and so on -->

The solution: fetch all links and ignore those that aren't related to any course.

var request = require('request');
var cheerio = require('cheerio');

var URL = 'http://courseschedules.njit.edu/index.aspx?semester=2016s';

request(URL, function(error, response, body) {
  // Bail out early and include the underlying error in the log.
  if (error) { return console.error('There was an error!', error); }

  var $ = cheerio.load(body);

  $('a').each(function() {
    var text = $(this).text();
    var link = $(this).attr('href');

    // Keep only links that point at a subject page.
    if (link && link.match(/subjectID/)) {
      console.log(text + ' --> ' + link);
    }
  });
});
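
The hrefs printed by the code above are relative (for example index.aspx?semester=2016s&subjectID=ACCT), so they have to be resolved against the page URL before they can be requested. Here is a minimal sketch of that step, assuming Node's built-in WHATWG URL class is available as a global (Node 10 or later). toSubjectLinks is a hypothetical helper name, not part of request or cheerio, and the sample hrefs are made up for illustration. Note that it checks for a subjectID query parameter rather than matching the string anywhere in the href, which is slightly stricter than the regex above.

```javascript
// Page the relative hrefs were scraped from (same URL as in the code above).
var BASE_URL = 'http://courseschedules.njit.edu/index.aspx?semester=2016s';

// Hypothetical helper: resolve each href against baseUrl and keep only
// those that carry a subjectID query parameter.
function toSubjectLinks(hrefs, baseUrl) {
  var links = [];
  hrefs.forEach(function(href) {
    // Anchors without an href attribute come back as undefined.
    if (typeof href !== 'string') { return; }
    var url;
    try {
      // trim() drops stray whitespace such as the trailing space in 'subjectID=ACC '.
      url = new URL(href.trim(), baseUrl);
    } catch (e) {
      return; // skip hrefs that cannot be parsed as a URL
    }
    if (url.searchParams.has('subjectID')) {
      links.push(url.href);
    }
  });
  return links;
}

// Illustrative input only; in the scraper these would come from $(this).attr('href').
var sampleHrefs = [
  'index.aspx?semester=2016s&subjectID=ACCT',
  'index.aspx?semester=2016s',
  undefined
];

console.log(toSubjectLinks(sampleHrefs, BASE_URL));
// → [ 'http://courseschedules.njit.edu/index.aspx?semester=2016s&subjectID=ACCT' ]
```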

Next time, try looking directly at the HTML to see whether it looks okay. If it looks like a mess, pass it through an HTML beautifier and inspect it again. In this case not even the beautifier could handle the markup, which indicated that something was wrong with the tags.
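
If you'd rather not eyeball the markup, a rough check for mismatched closing tags can be scripted. The sketch below is an assumption-laden illustration, not a real HTML parser: findMismatchedTags is a hypothetical helper that walks the tags with a regex and reports every closing tag that doesn't match the most recently opened element. It will be confused by attribute values containing '>', by script contents, and by elements whose end tags are legitimately optional (such as li or p), so treat its output as a hint about where to look.

```javascript
// Elements that never take a closing tag in HTML.
var VOID_TAGS = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
                 'link', 'meta', 'param', 'source', 'track', 'wbr'];

// Hypothetical helper: report closing tags that do not match the element
// that was opened most recently.
function findMismatchedTags(html) {
  var stack = [];
  var problems = [];
  // Drop comments first so tags mentioned inside them are not counted.
  var stripped = html.replace(/<!--[\s\S]*?-->/g, '');
  var tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(\/?)>/g;
  var match;

  while ((match = tagPattern.exec(stripped)) !== null) {
    var isClosing = match[1] === '/';
    var name = match[2].toLowerCase();
    var isSelfClosing = match[3] === '/';

    // Void and self-closing elements never sit on the stack.
    if (VOID_TAGS.indexOf(name) !== -1 || isSelfClosing) { continue; }

    if (!isClosing) {
      stack.push(name);
      continue;
    }

    var open = stack[stack.length - 1];
    if (open === name) {
      stack.pop();
    } else {
      problems.push('</' + name + '> closes while <' + (open || 'nothing') + '> is open');
    }
  }
  return problems;
}

// Same shape as the broken markup described in the answer.
var snippet =
  "<span>" +
  "<div class='courseList_section'>" +
  "<a href='index.aspx?semester=2016s&subjectID=ACCT'>ACCT - Accounting</a>" +
  "</span>";

console.log(findMismatchedTags(snippet));
// → [ '</span> closes while <div> is open' ]
```

In the scraper you could run this on body inside the request callback, before handing it to cheerio.load.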