URL1: https://duapp3.drexel.edu/webtms_du/
URL2: https://duapp3.drexel.edu/webtms_du/Colleges.asp?Term=201125&univ=DREX
URL3: https://duapp3.drexel.edu/webtms_du/Courses.asp?SubjCode=CS&CollCode=E&univ=DREX
As a personal programming project, I want to scrape my University's course catalog and provide it as a RESTful API.
However, I'm running into the following issue.
The page that I need to scrape is URL3. But URL3 only returns meaningful information after I visit URL2 (which sets the term via Colleges.asp?Term=201125), and URL2 can only be visited after visiting URL1.
I tried monitoring the HTTP traffic with Fiddler and I don't think they are using cookies. Closing the browser instantly resets everything, so I suspect they are using a server-side session.
How can I scrape URL3? I tried programmatically visiting URL1 and URL2 first and then calling file_get_contents(url3), but that doesn't work (probably because it registers as three different sessions).
A session needs a mechanism to identify you as well. Popular methods include: cookies, session id in URL.
Running curl -v on URL1 reveals that a session cookie is indeed being set. You need to send this cookie back to the server on every subsequent request to keep your session alive.
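To make the cookie round trip concrete, here is a minimal sketch using file_get_contents. It assumes the session cookie is set on the first response and does not change afterwards; the helper names extractCookies, cookieContext and fetchInSession are hypothetical and chosen only for illustration.

```php
<?php
// Collect the "name=value" part of every Set-Cookie response header.
function extractCookies(array $headers): string
{
    $pairs = [];
    foreach ($headers as $header) {
        // Header names are case-insensitive; attributes after ';' are dropped.
        if (preg_match('/^Set-Cookie:\s*([^;]+)/i', $header, $m)) {
            $pairs[] = trim($m[1]);
        }
    }
    return implode('; ', $pairs);
}

// Build a stream context that sends the given cookies with the request.
function cookieContext(string $cookies)
{
    return stream_context_create([
        'http' => ['header' => "Cookie: $cookies\r\n"],
    ]);
}

// Hypothetical helper: visit the URLs in order, replaying the cookies
// from the first response on every later request (assumption: the
// server does not rotate the session cookie in between).
function fetchInSession(array $urls)
{
    $body = file_get_contents(array_shift($urls));
    // PHP fills $http_response_header in the local scope after an HTTP request.
    $cookies = extractCookies($http_response_header ?? []);
    foreach ($urls as $url) {
        $body = file_get_contents($url, false, cookieContext($cookies));
    }
    return $body;
}
```

Calling fetchInSession with URL1, URL2 and URL3 in that order would return the body of URL3, provided the assumptions above hold for this server.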
If you want to use file_get_contents, you need to manually create a context for it with stream_context_create so that the cookies are included with the request.
An alternative (which I would personally prefer) would be to use the curl functions conveniently provided by PHP. (They can even take care of the cookie traffic for you!) But that's just my preference.
Edit:
Here's a working example to scrape the path in your question.
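A minimal sketch of the curl approach follows, assuming the session is carried by a cookie and that visiting the three URLs in order is all the server requires. The function name scrapeWithSession is hypothetical. Reusing one curl handle with the cookie engine enabled lets curl store and resend the session cookie automatically.

```php
<?php
// Hypothetical helper: request each URL in order on a single curl handle
// so that cookies set by earlier responses are sent with later requests.
function scrapeWithSession(array $urls): string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects, if any
        CURLOPT_COOKIEFILE     => '',   // empty string enables the in-memory cookie engine
    ]);

    $body = '';
    foreach ($urls as $url) {
        curl_setopt($ch, CURLOPT_URL, $url);
        $body = curl_exec($ch);
        if ($body === false) {
            throw new RuntimeException("Request to $url failed: " . curl_error($ch));
        }
    }
    // The handle is released when $ch goes out of scope.
    return $body;
}

// Usage (performs real network requests, so it is left commented out):
// $html = scrapeWithSession([
//     'https://duapp3.drexel.edu/webtms_du/',
//     'https://duapp3.drexel.edu/webtms_du/Colleges.asp?Term=201125&univ=DREX',
//     'https://duapp3.drexel.edu/webtms_du/Courses.asp?SubjCode=CS&CollCode=E&univ=DREX',
// ]);
```

The returned string is the body of the last URL in the list; parsing the course table out of that HTML is a separate step.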