I am currently working on a project in which I want to analyze articles from different blogs, magazines, etc. that are published on their websites.
I have already built a web crawler in Python that fetches every new article as HTML.
Here is the problem: I want to analyze the pure content (only the article itself, without comments, recommendations, etc.), but I cannot get at this content without defining a regular expression that extracts it from the HTML response. Writing a regular expression for each source is not an option, because I have around 100 different sources for the articles.
I have tried the html2text library to extract the content, but it only converts the raw HTML to Markdown, so the output still contains things like comments and recommendations, which I have to remove manually.
Any thoughts on how I can tackle this problem?
In order to get the main article text and ignore extraneous text, you'd have to write code for specific webpages or devise some heuristics to identify and extract article content.
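To make the "heuristics" option concrete, here is a minimal sketch that uses only the standard library. It assumes the article is the container element holding the most paragraph text, which is a simplification and will not hold for every site. The names ParagraphCollector and extract_main_text, as well as the choice of container tags, are made up for this illustration.

```python
from html.parser import HTMLParser

# Elements treated as candidate "blocks" (an assumption of this sketch).
CONTAINERS = {"article", "div", "section", "main", "td", "body"}
# Elements whose text should never count as content.
SKIPPED = {"script", "style"}


class ParagraphCollector(HTMLParser):
    """Group the text of <p> elements by their nearest enclosing container."""

    def __init__(self):
        super().__init__()
        self.open_containers = []  # indices into self.blocks, innermost last
        self.blocks = []           # one list of paragraph strings per container
        self.buffer = None         # text pieces of the current <p>, else None
        self.skip_depth = 0

    def _flush(self):
        # Store the current paragraph in the innermost open container.
        if self.buffer is not None and self.open_containers:
            text = " ".join("".join(self.buffer).split())
            if text:
                self.blocks[self.open_containers[-1]].append(text)
        self.buffer = None

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif tag in CONTAINERS:
            self._flush()
            self.blocks.append([])
            self.open_containers.append(len(self.blocks) - 1)
        elif tag == "p":
            self._flush()  # also handles <p> tags that were never closed
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in CONTAINERS:
            self._flush()
            if self.open_containers:
                self.open_containers.pop()
        elif tag == "p":
            self._flush()

    def handle_data(self, data):
        if self.buffer is not None and not self.skip_depth:
            self.buffer.append(data)


def extract_main_text(html):
    """Return the paragraphs of the container with the most paragraph text."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep a trailing paragraph that was never closed
    best = max(parser.blocks, key=lambda ps: sum(len(p) for p in ps), default=[])
    return "\n\n".join(best)
```

Calling extract_main_text(html) on a crawled page returns the text of the densest block. A heuristic like this breaks easily, for example when a long comment thread outweighs a short article, which is why a dedicated library is usually the better choice.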
Luckily there are existing libraries that address this problem.
Newspaper (published on PyPI as newspaper3k) is a Python 3 library that extracts the article text and metadata from a web page.
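Since your crawler already has the HTML, a minimal sketch of using Newspaper could look like the following. It assumes newspaper3k is installed; the function name extract_with_newspaper is only illustrative.

```python
def extract_with_newspaper(url, html):
    """Return (title, text) for an article whose HTML is already downloaded."""
    # Imported here so the crawler still loads if newspaper3k is not installed.
    from newspaper import Article

    article = Article(url)
    # Reuse the HTML fetched by the crawler instead of downloading it again.
    article.download(input_html=html)
    article.parse()
    return article.title, article.text
```

If you would rather let Newspaper fetch the page itself, call article.download() without arguments before article.parse().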
You may also want to check out similar libraries such as python-readability or python-goose.
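As a rough sketch for python-readability (installed as readability-lxml, which is an assumption about your setup), the following reduces a page to its main content and then strips the remaining markup with lxml. The function name extract_with_readability is hypothetical.

```python
def extract_with_readability(html):
    """Return (title, text) of the main content found by python-readability."""
    # Imported here so the crawler still loads if the library is not installed.
    import lxml.html
    from readability import Document

    doc = Document(html)
    # summary() returns cleaned HTML that contains only the main content.
    summary_html = doc.summary()
    # Drop the remaining tags to get plain text for analysis.
    text = lxml.html.fromstring(summary_html).text_content()
    return doc.short_title(), text.strip()
```

With around 100 sources it is worth running two of these extractors on a sample of pages from each source and comparing the results, since none of them is right for every layout.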