Scraping links from website using Node.js, request

I'm trying to scrape links on my school's course schedule website using Node.js, request, and cheerio. However, my code is not reaching all subject links.

Link to course schedule website here.

Below is my code:

var express = require('express');
var request = require('request');
var cheerio = require('cheerio');

var app = express();

app.get('/subjects', function(req, res) {
  var URL = 'http://courseschedules.njit.edu/index.aspx?semester=2016s';

  request(URL, function(error, response, body) {
    if(!error) {
      var $ = cheerio.load(body);

      $('.courseList_section a').each(function() {
        var text = $(this).text();
        var link = $(this).attr('href');

        console.log(text + ' --> ' + link);
      });
    }
    else {
      console.log('There was an error!');
    }
  });
});

app.listen('8080');
console.log('Magic happens on port 8080!');

My output can be found here.

As you can see from my output, some links are missing. More specifically, the links from sections 'A', 'I (Continued)', and 'R (Continued)' are missing. These are also the first sections of each column.

Each section is contained in its own div with the class name 'courseList_section', so I don't understand why '.courseList_section a' doesn't loop through all the links. Am I missing something obvious? Any and all insight is greatly appreciated.

Thank you in advance!

Tags: javascript html node.js web-scraping cheerio

1 Answer

我命由我不由天

Post #2 · 2019-08-11 19:10

The problem isn't your code; it's the site you're trying to parse. Its HTML is invalid. You're selecting everything inside .courseList_section, but the markup looks like this:

<span> <!-- Opening tag -->
    <div class='courseList_section'>
      <a href='index.aspx?semester=2016s&subjectID=ACC '>ACC  - Accounting/Essex CC</a>
      </span> <!-- Invalid closing tag for the first span, meaning that .courseList_section will be closed instead -->

<!-- Suddenly this link is outside the .courseList_section tag, meaning that it will be ignored by cheerio -->
<a href='index.aspx?semester=2016s&subjectID=ACCT'>ACCT - Accounting</a>
  <!-- and so on -->

The solution: fetch all links and ignore those that aren't related to a course.

var request = require('request');
var cheerio = require('cheerio');

var URL = 'http://courseschedules.njit.edu/index.aspx?semester=2016s';

request(URL, function(error, response, body) {
  if (error) { return console.error('There was an error!', error); }

  var $ = cheerio.load(body);

  $('a').each(function() {
    var text = $(this).text();
    var link = $(this).attr('href');

    // Only course links carry a subjectID query parameter.
    if (link && link.match(/subjectID/)) {
      console.log(text + ' --> ' + link);
    }
  });
});
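
The links printed this way are relative (for example index.aspx?semester=2016s&subjectID=ACCT). If absolute URLs are needed for follow-up requests, something like the following might work. It is only a sketch using Node's built-in URL class; the helper name toSubjectUrl and the BASE constant are made up for illustration and are not part of the original code.

```javascript
// Hypothetical helper, not part of the original answer.
// BASE is assumed to be the page the links were scraped from.
var BASE = 'http://courseschedules.njit.edu/index.aspx?semester=2016s';

// Returns an absolute URL string for a subject link, or null for any other link.
function toSubjectUrl(href, base) {
  if (!href || !/subjectID=/.test(href)) { return null; }
  // new URL(relative, base) resolves the href against the page URL.
  return new URL(href.trim(), base).href;
}

// A subject link is resolved against BASE; an unrelated link yields null.
console.log(toSubjectUrl('index.aspx?semester=2016s&subjectID=ACCT', BASE));
console.log(toSubjectUrl('http://example.com/', BASE));
```

Inside the .each() callback, the console.log call could then print toSubjectUrl(link, URL) instead of the raw href.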

Next time, try looking directly at the HTML to see whether it looks okay. If it looks like a mess, pass it through an HTML beautifier and inspect it again. In this case not even the beautifier could handle the markup, which indicated that something was wrong with the tags.
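
If inspecting the markup by eye is tedious, a rough automated check can point at the suspicious spots. The following is a minimal sketch, not a real HTML parser: the function name findMismatchedTags and the sample string are assumptions for illustration, and the regex ignores comments, scripts, and tags whose closing tag is optional.

```javascript
// Rough sanity check: reports closing tags that do not match the most
// recently opened tag. Regex-based sketch only, not a real HTML parser.
var VOID_TAGS = ['br', 'hr', 'img', 'input', 'link', 'meta'];

function findMismatchedTags(html) {
  var stack = [];      // names of currently open tags
  var problems = [];   // human-readable mismatch reports
  var tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(\/?)>/g;
  var match;

  while ((match = tagPattern.exec(html)) !== null) {
    var isClosing = match[1] === '/';
    var name = match[2].toLowerCase();
    var selfClosing = match[3] === '/' || VOID_TAGS.indexOf(name) !== -1;

    if (isClosing) {
      var open = stack.pop();
      if (open !== name) {
        problems.push('</' + name + '> closes <' + (open || 'nothing') + '>');
      }
    } else if (!selfClosing) {
      stack.push(name);
    }
  }
  return problems;
}

// Simplified version of the broken markup described above.
var sample = "<span><div class='courseList_section'><a href='#'>ACC</a></span>";
console.log(findMismatchedTags(sample));
```

For the simplified sample, the check reports that the closing span tag arrives while the div is still open, which is the same mismatch that pushes the later links outside .courseList_section.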
