I am currently working on a project in which I want to analyze articles published online by various blogs, magazines, and similar sources.
To do this, I have already built a web crawler in Python that fetches every new article as HTML.
The problem is this: I want to analyze only the pure content (just the article, without comments, recommendations, etc.), but I can't get at that content without defining a regular expression to extract it from the HTML response. Writing a regular expression for each source is not an option, because I have around 100 different sources.
I have tried the html2text library, but it only converts the raw HTML to Markdown, so the output still contains things like comments and recommendations, which I would have to remove manually.
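To illustrate the limitation, here is a minimal sketch. It does not use html2text itself; it uses the standard library's `html.parser` as a stand-in, on a made-up page (the tag and class names in `SAMPLE_HTML` are assumptions, and real sources differ). It shows that a plain HTML-to-text conversion keeps every text node, so the comments and recommendations end up in the output next to the article:

```python
from html.parser import HTMLParser

# Hypothetical page; real sources use different tags and class names.
SAMPLE_HTML = """
<html><body>
  <article><h1>Title</h1><p>The actual article text.</p></article>
  <aside class="recommendations"><p>You might also like ...</p></aside>
  <section class="comments"><p>First comment!</p></section>
</body></html>
"""


class TextCollector(HTMLParser):
    """Collects every text node, regardless of where it sits in the page."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only nodes
            self.chunks.append(text)


def html_to_plain_text(html):
    """Naive conversion: strip the tags, keep all text."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return "\n".join(collector.chunks)


text = html_to_plain_text(SAMPLE_HTML)
print(text)
```

The output contains the article text, but also the recommendation and the comment, which is exactly the part I want to get rid of without writing per-source rules.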
Any thoughts on how I can tackle this problem?