I am trying to build an application that needs a daily news feed from several websites. One way to do this is with Python's BeautifulSoup library. However, that only works well for sites that list their news on a single static page.
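For reference, this is roughly what I am doing for the static-page case (just a minimal sketch; the URL and the `h2.headline` selector are placeholders, not from a real site):

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL and selector -- each site needs its own.
URL = "http://example.com/news"

response = requests.get(URL, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
for headline in soup.select("h2.headline"):
    print(headline.get_text(strip=True))
```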
Now consider a site like http://www.techcrunch.com. The front page shows only the headlines, and to get the full story you have to click "Read more". Several other news websites work the same way. How do I extract that information and dump it into a file (.txt, .dmp, or any other format)? What tool should I use, and what approach should I take to implement this in Python?
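What I am imagining is a two-step scrape: first collect the article links from the listing page, then fetch each article separately. A rough sketch of that idea, with placeholder selectors (I don't know the actual markup of these sites):

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

LISTING_URL = "http://example.com/news"  # placeholder listing page

# Step 1: collect links to the full articles from the listing page.
listing = BeautifulSoup(requests.get(LISTING_URL, timeout=10).text, "html.parser")
article_links = [urljoin(LISTING_URL, a["href"])
                 for a in listing.select("a.read-more")  # placeholder selector
                 if a.get("href")]

# Step 2: fetch each article and pull out the title and body.
for link in article_links:
    page = BeautifulSoup(requests.get(link, timeout=10).text, "html.parser")
    title = page.find("h1")
    body = page.find("div", class_="article-body")  # placeholder class name
    if title and body:
        print(title.get_text(strip=True))
        print(body.get_text(strip=True)[:200], "...")
```

Is this the right general approach, or is there a better tool for this kind of job?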
I need this script to automatically download news from several websites once every day and store it in a file with fields such as heading, date, content, etc. I will be putting this script on an Apache2 server. Any suggestions?
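For the storage side, I am picturing something like appending one JSON record per article, and then triggering the script daily from cron on the server. Again, just a sketch; the file path and field names are made up:

```python
import json
import datetime

def save_article(path, heading, content, url):
    # Append one article per line as JSON with the fields I care about.
    record = {
        "heading": heading,
        "date": datetime.date.today().isoformat(),
        "content": content,
        "url": url,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Example usage:
# save_article("news_dump.jsonl", "Some headline", "Full article text...",
#              "http://example.com/articles/1")
```

A cron entry like `0 6 * * * python /path/to/fetch_news.py` would then run it once a day. Is that the sensible way to schedule this, or should I be doing it differently?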