I have been doing web crawling for the last couple of weeks. Using a PHP library (PHP Simple DOM), I'm running a PHP script (from the terminal) to fetch some URLs and extract JSON data from them. This has been working very nicely so far.
Recently I wanted to expand the crawling to a specific site and encountered the following problem:
Unlike any other site so far, this one only echoes barebones markup server-side and instead relies on a single JS script to build up the relevant markup on load.
Obviously my PHP script can't handle that (it doesn't execute the JS, so the page stays mostly blank as far as it can tell), and so I can't crawl the site, since the content is never created.
I'm unsure how to proceed. Is it actually possible to make my current PHP script "compatible" with that site, or do I need to change gears and incorporate a browser, i.e. pick a completely different route?
I'm currently thinking I would need to create an HTML/JS page which opens the URL in an iframe, so that I could run a JS function manually via the console to extract the data. However, I'm hoping there is a more feasible way.
Thanks.
When I need to scrape a website I normally:

1 - Navigate the target website in a normal browser (FF, Chrome, etc.) while monitoring/logging any `POST`/`GET` requests containing pertinent info via `Developer Tools` -> `Network Tab`. Pay special attention to `XHR` requests, as they normally contain `json`-encoded data. Here's a small video I've made exemplifying this:

https://www.youtube.com/watch?v=JbiZBGt8cos

You can mimic the `request headers` made previously (explained in the video) and use them in a `curl` request.

2 - In some cases, it's impossible to crawl certain URLs without a JavaScript-enabled client. When this happens, I normally use Selenium with `Chrome` or `Firefox`. You can also use PhantomJS, a headless browser; the latest versions of GeckoDriver (used by Selenium) also support headless browsing. I'm aware the question is about `PHP`, but if the OP needs to use `Selenium`, `Python` is way more intuitive, I'd say. Based on that, here's a `Selenium` example in `Python`:

Example Src
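For step 1, once you have copied the headers of the interesting XHR request from the Network tab, you can replay it outside the browser. A minimal sketch using Python's standard library (the header values here are placeholders, not ones any particular site requires; substitute whatever your browser actually sent):

```python
import json
import urllib.request

# Headers copied from DevTools -> Network -> (request) -> Headers.
# These values are placeholders; use the ones captured in your browser.
HEADERS = {
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",
}

def fetch_xhr_json(url):
    """Replay an XHR request with the captured headers and decode its JSON body."""
    req = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

The equivalent `curl` call simply passes each captured header with `-H`, e.g. `curl -H 'X-Requested-With: XMLHttpRequest' <url>`.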
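For step 2, a Selenium-in-Python example is, roughly, along these lines (a sketch, assuming the `selenium` package and a matching chromedriver are installed; the function name is my own):

```python
def fetch_rendered_html(url):
    """Load a page in headless Chrome and return the DOM after the JS has run."""
    # Imported lazily so the rest of a script still works without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    opts = Options()
    opts.add_argument("--headless")
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(url)
        # page_source now reflects the JS-built markup,
        # not the barebones server response.
        return driver.page_source
    finally:
        driver.quit()
```

The returned HTML can then be fed into whatever parser you already use.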
I see two possible paths:

1. In case the JavaScript that builds up the DOM fetches the data through one or more AJAX calls, you might as well scrape those URLs directly (and this tends to be easier anyway, e.g. if it talks to a JSON API).

2. Simulate a browser, e.g. using Selenium. For example, this article discusses the exact challenge you mention and provides a solution using Selenium and Python.
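The first path is attractive because a JSON API response is already structured, so there is no DOM parsing step at all. A small illustration (the payload shape is invented, standing in for whatever the site's endpoint returns):

```python
import json

# A made-up payload of the kind a site's XHR endpoint might return.
raw = '{"items": [{"title": "First", "price": 9.99}, {"title": "Second", "price": 4.5}]}'

data = json.loads(raw)
titles = [item["title"] for item in data["items"]]
print(titles)  # ['First', 'Second']
```

In PHP the same step is just `json_decode()` on the response body.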