URL1: https://duapp3.drexel.edu/webtms_du/
URL2: https://duapp3.drexel.edu/webtms_du/Colleges.asp?Term=201125&univ=DREX
URL3: https://duapp3.drexel.edu/webtms_du/Courses.asp?SubjCode=CS&CollCode=E&univ=DREX
As a personal programming project, I want to scrape my university's course catalog and expose it as a RESTful API.
However, I've run into the following issue.
The page that I need to scrape is URL3. But URL3 only returns meaningful information after I visit URL2 (the term is set there via Colleges.asp?Term=201125), and URL2 can only be visited after visiting URL1.
I tried monitoring the HTTP traffic in both directions using Fiddler, and I don't think they are using cookies. Closing the browser instantly resets everything, so I suspect they are using a server-side session.
How can I scrape URL3? I tried programmatically visiting URL1 and URL2 first and then calling file_get_contents(url3), but that doesn't work (probably because it registers as three different sessions).
A session needs a mechanism to identify you as well. Popular methods include cookies and a session ID in the URL.
Running curl -v on URL1 reveals that a session cookie is indeed being set:
Set-Cookie: ASPSESSIONIDASBRRCCS=LKLLPGGDFBGGNFJBKKHMPCDA; path=/
You need to send this cookie back to the server on any subsequent requests to keep your session alive.
If you want to use file_get_contents, you need to manually create a context for it with stream_context_create so that the cookies are included with the request.
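As a rough sketch of that approach (not tested against the Drexel site): the helper names extractCookies, cookieHeader and scrapeWithContext below are made up for illustration. It assumes the session cookie arrives in a plain Set-Cookie header and reads it from $http_response_header, which the HTTP wrapper fills in after each request (PHP 8.4 and later also offer http_get_last_response_headers() for the same purpose).

```php
<?php
// Hypothetical helper: pull "name => value" pairs out of raw response header lines.
function extractCookies(array $headers): array
{
    $cookies = [];
    foreach ($headers as $header) {
        // Keep only the name=value part; attributes such as "path=/" are ignored
        if (preg_match('/^Set-Cookie:\s*([^=;\s]+)=([^;]*)/i', $header, $m)) {
            $cookies[$m[1]] = trim($m[2]);
        }
    }
    return $cookies;
}

// Hypothetical helper: build the Cookie request header line (empty string if no cookies).
function cookieHeader(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return $pairs ? 'Cookie: ' . implode('; ', $pairs) . "\r\n" : '';
}

// Hypothetical helper: visit each URL in order, carrying cookies forward,
// and return the body of the last page (or false if a request fails).
function scrapeWithContext(array $urls)
{
    $cookies = [];
    $body = false;
    foreach ($urls as $url) {
        $context = stream_context_create([
            'http' => ['method' => 'GET', 'header' => cookieHeader($cookies)],
        ]);
        $body = file_get_contents($url, false, $context);
        if ($body === false) {
            return false;
        }
        // The HTTP wrapper fills $http_response_header in the calling scope
        $cookies = array_replace($cookies, extractCookies($http_response_header ?? []));
    }
    return $body;
}
```

Calling scrapeWithContext() with your three URLs in order should then return the HTML of URL3, provided the server accepts the session. This is considerably more bookkeeping than the curl route, which is part of why I'd lean towards curl.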
An alternative (which I would personally prefer) would be to use the curl functions conveniently provided by PHP. (They can even take care of the cookie traffic for you!) But that's just my preference.
Edit:
Here's a working example to scrape the path in your question.
$scrape = array(
"https://duapp3.drexel.edu/webtms_du/",
"https://duapp3.drexel.edu/webtms_du/Colleges.asp?Term=201125&univ=DREX",
"https://duapp3.drexel.edu/webtms_du/Courses.asp?SubjCode=CS&CollCode=E&univ=DREX"
);
$data = '';
$ch = curl_init();
// Point the cookie jar at a temporary file. Setting CURLOPT_COOKIEJAR
// enables curl's cookie engine; without it, curl neither stores cookies
// nor includes them in subsequent requests
curl_setopt($ch, CURLOPT_COOKIEJAR, tempnam(sys_get_temp_dir(), 'curl'));
// We don't want direct output by curl
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Then run along the scrape path
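foreach ($scrape as $url) {
    curl_setopt($ch, CURLOPT_URL, $url);
    $data = curl_exec($ch);
    // curl_exec() returns false on failure; stop instead of continuing with a broken session
    if ($data === false) {
        die('Request to ' . $url . ' failed: ' . curl_error($ch));
    }
}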
curl_close($ch);
echo $data;
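If you'd rather not leave a temporary cookie file behind, a possible variation is to pass an empty string to CURLOPT_COOKIEFILE, which turns on curl's cookie engine and keeps the cookies in memory only. The sketch below wraps the same loop in a function; the name fetchLastPage is hypothetical, and I have not run it against the Drexel site.

```php
<?php
// Hypothetical helper: walk the URLs in order on one curl handle and return the last body.
function fetchLastPage(array $urls): string
{
    $ch = curl_init();
    // An empty CURLOPT_COOKIEFILE enables the cookie engine without reading or writing a file
    curl_setopt($ch, CURLOPT_COOKIEFILE, '');
    // Return the response body instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $data = '';
    foreach ($urls as $url) {
        curl_setopt($ch, CURLOPT_URL, $url);
        $data = curl_exec($ch);
        if ($data === false) {
            throw new RuntimeException('Request to ' . $url . ' failed: ' . curl_error($ch));
        }
    }
    return $data;
}
```

You would call it as fetchLastPage($scrape) with the same array as above. The cookies only live as long as the handle, which is all this scrape path needs.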