I'm trying to develop a simple web scraper. I want to extract plain text without the HTML markup. I've achieved this goal, but I've seen that on some pages where JavaScript is loaded I don't obtain good results.
For example, if some JavaScript code adds some text, I can't see it, because when I call
response = urllib2.urlopen(request)
I get the original text without the added content (because JavaScript is executed on the client).
So, I'm looking for some ideas to solve this problem.
Using PyQt5
This also seems to be a good solution, taken from a great blog post:
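A minimal sketch of that approach (class and variable names are my own; it loads the page in an off-screen browser engine and grabs the HTML once JavaScript has finished running):

```python
import sys

from PyQt5.QtCore import QUrl
from PyQt5.QtWebEngineWidgets import QWebEnginePage
from PyQt5.QtWidgets import QApplication


class Render(QWebEnginePage):
    """Load a URL and capture the HTML after JavaScript has run."""

    def __init__(self, url):
        self.app = QApplication(sys.argv)
        super().__init__()
        self.html = None
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))
        self.app.exec_()

    def _on_load_finished(self, ok):
        # toHtml() is asynchronous: it delivers the rendered DOM to a callback.
        self.toHtml(self._store_html)

    def _store_html(self, html):
        self.html = html
        self.app.quit()


page = Render('https://example.com')  # placeholder URL
print(page.html)
```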
If you have ever used the `Requests` module for Python before, I recently found out that the developer created a new module called `Requests-HTML`, which now also has the ability to render JavaScript. You can visit https://html.python-requests.org/ to learn more about this module, or if you're only interested in rendering JavaScript, you can visit https://html.python-requests.org/?#javascript-support to learn directly how to use the module to render JavaScript with Python.
I recently learned about this from a YouTube video which demonstrates how the module works. Essentially, once you correctly install the `Requests-HTML` module, the following example, which is shown on the above link, shows how you can use this module to scrape a website and render the JavaScript contained within it:
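A minimal sketch of that usage, modeled on the documentation's JavaScript-support example (the URL is a placeholder):

```python
from requests_html import HTMLSession

session = HTMLSession()
resp = session.get('https://example.com')  # placeholder URL

# render() starts a headless Chromium (downloaded on first use) and
# executes the page's JavaScript before returning.
resp.html.render()

# resp.html now reflects the DOM after the JavaScript has run.
print(resp.html.html)
```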
I personally prefer using Scrapy and Selenium and dockerizing both in separate containers. That way you can install both with minimal hassle and crawl modern websites, which almost all contain JavaScript in one form or another. Here's an example:
Use the `scrapy startproject` command to create your scraper and write your spider; the skeleton can be as simple as this:
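A minimal sketch of such a spider (the spider name, URL, and selector are placeholders):

```python
import scrapy


class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['https://example.com']  # placeholder URL

    def parse(self, response):
        # By the time parse() runs, the response has already been rendered
        # by Selenium via the downloader middleware below.
        for title in response.css('h1::text').getall():
            yield {'title': title}
```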
The real magic happens in middlewares.py. Overwrite two methods in the downloader middleware, `__init__` and `process_request`, in the following way:
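Here is a hedged sketch of what that middleware can look like (the class name and exact connection details are assumptions; it uses the Selenium 3-style remote webdriver API):

```python
import os

from scrapy.http import HtmlResponse
from selenium import webdriver


class SeleniumMiddleware:
    def __init__(self):
        # SELENIUM_LOCATION is the hostname of the selenium container,
        # passed in through docker-compose (see below).
        selenium_host = os.environ.get('SELENIUM_LOCATION', 'localhost')
        self.driver = webdriver.Remote(
            command_executor=f'http://{selenium_host}:4444/wd/hub',
            desired_capabilities={'browserName': 'chrome'},
        )

    def process_request(self, request, spider):
        # Let the remote Chrome instance fetch and render the page, then
        # hand the fully rendered HTML back to Scrapy as the response.
        self.driver.get(request.url)
        return HtmlResponse(
            self.driver.current_url,
            body=self.driver.page_source,
            encoding='utf-8',
            request=request,
        )
```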
Don't forget to enable this middleware by uncommenting the next lines in the settings.py file:
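Something like the following (the module path depends on your project name, which is an assumption here):

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'my_scraper.middlewares.SeleniumMiddleware': 543,
}
```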
Next, for dockerization. Create your `Dockerfile` from a lightweight image (I'm using Python Alpine here), copy your project directory to it, and install the requirements:
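A sketch of such a Dockerfile (the Python tag and paths are placeholders; Scrapy's compiled dependencies may need extra build packages on Alpine):

```dockerfile
FROM python:3.9-alpine

WORKDIR /app

# Install dependencies first so Docker can cache this layer.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the scraper project itself.
COPY . .
```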
And finally, bring it all together in `docker-compose.yaml`:
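A sketch of the compose file (service names and project layout are assumptions; the container name referenced below depends on your compose project name):

```yaml
version: '3'

services:
  selenium:
    image: selenium/standalone-chrome
    volumes:
      # Give Chrome enough shared memory to avoid crashes.
      - /dev/shm:/dev/shm

  my_scraper:
    build: .
    depends_on:
      - selenium
    environment:
      - SELENIUM_LOCATION=samplecrawler_selenium_1
    # Keep the container alive so we can exec into it and run the spider.
    command: tail -f /dev/null
```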
Run `docker-compose up -d`. If you're doing this for the first time, it will take a while to fetch the latest selenium/standalone-chrome image and then build your scraper image as well.

Once it's done, you can check that your containers are running with `docker ps`, and also check that the name of the selenium container matches the environment variable we passed to our scraper container (here, it was `SELENIUM_LOCATION=samplecrawler_selenium_1`).

Enter your scraper container with `docker exec -ti YOUR_CONTAINER_NAME sh`; the command for me was `docker exec -ti samplecrawler_my_scraper_1 sh`. Then cd into the right directory and run your scraper with `scrapy crawl my_spider`.

The entire thing is on my GitHub page and you can get it from here.
You'll want to use urllib, Requests, BeautifulSoup, and the Selenium WebDriver in your script for different parts of the page (to name a few).
Sometimes you'll get what you need with just one of these modules.
Sometimes you'll need two, three, or all of these modules.
Sometimes you'll need to switch off JavaScript in your browser.
Sometimes you'll need header info in your script.
No two websites can be scraped the same way, and no website can be scraped the same way forever without modifying your crawler, usually after a few months. But they can all be scraped! Where there's a will, there's a way for sure.
If you need the scraped data continuously into the future, just scrape everything you need and store it in .dat files with pickle.
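For example, a minimal sketch (the data here is a stand-in for whatever you scraped):

```python
import pickle

scraped = {'url': 'https://example.com', 'text': 'extracted page text'}

# Persist the scraped data to disk...
with open('scraped.dat', 'wb') as f:
    pickle.dump(scraped, f)

# ...and load it back later.
with open('scraped.dat', 'rb') as f:
    restored = pickle.load(f)
```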
Just keep researching what to try with these modules, and keep copying and pasting your errors into Google.
Selenium is the best for scraping JS and Ajax content.
Check this article: https://likegeeks.com/python-web-scraping/
Then download the Chrome webdriver.
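A minimal sketch (assumes chromedriver is on your PATH; the URL is a placeholder):

```python
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://example.com')  # placeholder URL

# page_source holds the DOM after JavaScript and Ajax calls have run.
print(driver.page_source)
driver.quit()
```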
Easy, right?