How to web scrape daily news once a day using Pyth

Posted 2020-07-21 19:35

Question:

I am trying to build an application that needs a daily news feed from several websites. One way to do this is with Python's BeautifulSoup library. However, that only works well for sites that have all their news on one static page.

Let's consider a site like http://www.techcrunch.com. They show only their headlines, and for more news you need to click on "Read more". Several other news websites work the same way. How do I extract such information and dump it in a file (.txt, .dmp, or any other kind of file)? What tool should I use? What approach should I take to implement this in Python?

I need this script to automatically download news from several websites ONCE EVERY SINGLE DAY and store it in a file with fields such as heading, date, content, etc. I will be uploading this script to an apache2 server. Any suggestions?

Answer 1:

How do I extract such information and dump it in a file- txt/.dmp or any other kind of file? What tool should I use?

for more news you need to click on "Read more".

The tools you might leverage are Selenium, which is pure browser automation, or iMacros.

  1. Here is an example of leveraging Selenium in Python, server side.
  2. Here is a post (and video) on data extraction using iMacros. Since you need it only once a day, you might schedule it to run regularly on Windows or Mac.
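
The linked Selenium example is not reproduced here, so the following is only a rough sketch of how the Selenium route might look in Python, not the code from that link. It assumes Selenium 4 with Chrome and a matching driver are installed, and that each article is reachable through a link whose text contains "Read more". The function names (collect_articles, save_records, main), the news output directory, and the JSON Lines file format are illustrative choices, not part of any library.

```python
import json
from datetime import date
from pathlib import Path


def collect_articles(driver, listing_url, link_text="Read more"):
    """Follow every link on listing_url whose text contains link_text."""
    # Imported lazily so save_records() can be used without Selenium installed.
    from selenium.webdriver.common.by import By

    driver.get(listing_url)
    # Read the hrefs before navigating: leaving the page invalidates the elements.
    links = driver.find_elements(By.PARTIAL_LINK_TEXT, link_text)
    urls = [link.get_attribute("href") for link in links]

    records = []
    for url in dict.fromkeys(u for u in urls if u):  # drop blanks and duplicates
        driver.get(url)
        records.append({
            "heading": driver.title,
            "date": date.today().isoformat(),  # day of scraping, not of publication
            "url": url,
            # Whole visible page text; a real scraper would target the article element.
            "content": driver.find_element(By.TAG_NAME, "body").text,
        })
    return records


def save_records(records, out_dir="news"):
    """Append records to one JSON Lines file per day, named after today's date."""
    out_path = Path(out_dir) / f"{date.today().isoformat()}.jsonl"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("a", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path


def main():
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # no display is needed on a server
    driver = webdriver.Chrome(options=options)
    try:
        # Placeholder list: add the sites you want to poll.
        for site in ["http://www.techcrunch.com"]:
            save_records(collect_articles(driver, site))
    finally:
        driver.quit()
```

To run this once a day, one option is to call main() at the bottom of a script and let the operating system's scheduler start it: a cron entry on a Linux server, or Task Scheduler on Windows. The heading and content fields above are deliberately crude (page title and full body text), so each site will likely need its own selectors.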