Set session to scrape page

Posted 2019-07-31 01:55

URL1: https://duapp3.drexel.edu/webtms_du/

URL2: https://duapp3.drexel.edu/webtms_du/Colleges.asp?Term=201125&univ=DREX

URL3: https://duapp3.drexel.edu/webtms_du/Courses.asp?SubjCode=CS&CollCode=E&univ=DREX

As a personal programming project, I want to scrape my University's course catalog and provide it as a RESTful API.

However, I'm running into the following issue.

The page that I need to scrape is URL3. But URL3 only returns meaningful information after I have visited URL2 (which sets the term via Colleges.asp?Term=201125), and URL2 can only be visited after visiting URL1.

I tried monitoring the HTTP traffic in both directions using Fiddler and I don't think they are using cookies. Closing the browser instantly resets everything, so I suspect they are using a server-side session.

How can I scrape URL3? I tried programmatically visiting URL1 and URL2 first and then calling file_get_contents(url3), but that doesn't work (probably because the three requests register as three different sessions).

1 answer
小情绪 Triste *
#2 · 2019-07-31 02:35

A session still needs a mechanism to identify you. Popular methods include cookies and a session ID in the URL.

A curl -v on URL 1 reveals a session cookie is indeed being set.

Set-Cookie: ASPSESSIONIDASBRRCCS=LKLLPGGDFBGGNFJBKKHMPCDA; path=/

You need to send this cookie back to the server on any subsequent requests to keep your session alive.

If you want to use file_get_contents, you need to manually create a context for it with stream_context_create so that the cookies are included with the request.
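A minimal sketch of that file_get_contents approach, in case it helps. The helper names extract_cookies, cookie_context and fetch_with_session are made up for illustration, and it assumes the server only needs the name=value part of each Set-Cookie header sent back (no path, domain or expiry handling).

```php
<?php
// Hypothetical helper: collect name => value pairs from Set-Cookie response headers.
function extract_cookies(array $headers): array
{
    $cookies = [];
    foreach ($headers as $header) {
        // Header names are case-insensitive. Keep only the "name=value" part
        // before the first ";" (attributes such as path are not sent back).
        if (stripos($header, 'Set-Cookie:') === 0) {
            $pair = trim(explode(';', substr($header, strlen('Set-Cookie:')), 2)[0]);
            [$name, $value] = array_pad(explode('=', $pair, 2), 2, '');
            $cookies[$name] = $value;
        }
    }
    return $cookies;
}

// Hypothetical helper: build a stream context that sends the given cookies.
function cookie_context(array $cookies)
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    $options = ['http' => ['method' => 'GET']];
    if ($pairs) {
        $options['http']['header'] = 'Cookie: ' . implode('; ', $pairs) . "\r\n";
    }
    return stream_context_create($options);
}

// Hypothetical helper: request the URLs in order, carrying cookies forward.
// Returns the body of the last URL, or false if any request fails.
function fetch_with_session(array $urls)
{
    $cookies = [];
    $body = false;
    foreach ($urls as $url) {
        $body = file_get_contents($url, false, cookie_context($cookies));
        if ($body === false) {
            return false;
        }
        // The HTTP wrapper fills $http_response_header in the local scope.
        $cookies = array_replace($cookies, extract_cookies($http_response_header));
    }
    return $body;
}
```

Calling fetch_with_session() with URL1, URL2 and URL3 in that order should then return the body of URL3, provided the session cookie is all the server checks.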

An alternative (which I would personally prefer) would be to use curl functions conveniently provided by PHP. (It can even take care of the cookie traffic for you!) But that's just my preference.

Edit:

Here's a working example to scrape the path in your question.

$scrape = array(
    "https://duapp3.drexel.edu/webtms_du/",
    "https://duapp3.drexel.edu/webtms_du/Colleges.asp?Term=201125&univ=DREX",
    "https://duapp3.drexel.edu/webtms_du/Courses.asp?SubjCode=CS&CollCode=E&univ=DREX"
);

$data = '';
$ch = curl_init();

// Set the cookie jar to a temporary file. We never read the file ourselves,
// but setting CURLOPT_COOKIEJAR (or CURLOPT_COOKIEFILE) is what enables
// curl's cookie engine. Without it, curl neither stores received cookies
// nor includes them in subsequent requests.
curl_setopt($ch, CURLOPT_COOKIEJAR, tempnam(sys_get_temp_dir(), 'curl'));

// We don't want direct output by curl
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// Then run along the scrape path
foreach ($scrape as $url) {
    curl_setopt($ch, CURLOPT_URL, $url);
    $data = curl_exec($ch);
}

curl_close($ch);

echo $data;
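
A possible variation on the example above, offered only as a sketch: it wraps the loop in a function with the made-up name scrape_session_path, enables the cookie engine in memory by setting CURLOPT_COOKIEFILE to an empty string (so no temporary file is left behind), and stops with an exception if a request fails. Following redirects is an assumption here, not something the original needed.

```php
<?php
// Hypothetical helper, not part of the original answer.
// Requests each URL in order on one curl handle so the session cookie set by
// an earlier response is sent with the later requests. Returns the last body.
function scrape_session_path(array $urls): string
{
    $data = '';
    if (!$urls) {
        return $data;
    }

    $ch = curl_init();
    curl_setopt_array($ch, [
        // An empty string enables the cookie engine without reading a file;
        // cookies are kept in memory for the lifetime of the handle.
        CURLOPT_COOKIEFILE     => '',
        // Return the body instead of printing it.
        CURLOPT_RETURNTRANSFER => true,
        // Assumption: follow redirects in case the site issues any.
        CURLOPT_FOLLOWLOCATION => true,
    ]);

    foreach ($urls as $url) {
        curl_setopt($ch, CURLOPT_URL, $url);
        $data = curl_exec($ch);
        if ($data === false) {
            throw new RuntimeException("Request to $url failed: " . curl_error($ch));
        }
    }

    // The handle is released when $ch goes out of scope.
    return $data;
}
```

With the $scrape array from the example, echo scrape_session_path($scrape); would print the body of the last URL.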