Crawl only the article content from multiple different websites

Posted 2019-07-30 23:13

I am currently working on a project in which I want to analyze articles from different blogs, magazines, etc. that are published online on their websites.

I have already built a web crawler in Python, which fetches every new article as HTML.

Here is the problem: I want to analyze the pure content (only the article, without comments, recommendations, etc.), but I can't access this content without defining a regular expression to extract it from the HTML response I get. Writing a regular expression for each source is not an option, because I have around 100 different sources for the articles.

I have tried to use the library html2text to extract the content, but it only converts the raw HTML to Markdown, so things like comments and recommendations are still there and I would have to remove them manually.

Any thoughts on how I can tackle this problem?

Tags: python web-scraping web-crawler data-analysis

1 answer

傲

#2 · 2019-07-30 23:49

In order to get the main article text and ignore extraneous text, you'd have to write code for specific webpages or devise some heuristics to identify and extract article content.
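To give a feel for what such a heuristic could look like, here is a minimal sketch that uses only the standard library. It keeps `<p>` blocks that are reasonably long and contain little link text, and it ignores anything nested in tags such as `<nav>` or `<footer>`. The names `ParagraphCollector` and `extract_main_text`, the tag list, and both thresholds are assumptions made for illustration, not something taken from a particular library. Real pages will need tuning, and comment sections marked up as plain `<div>`/`<p>` can still slip through.

```python
from html.parser import HTMLParser

# Tags whose contents are rarely article text (an assumed list; tune per corpus).
SKIP_TAGS = {"script", "style", "nav", "aside", "footer", "header", "form"}


class ParagraphCollector(HTMLParser):
    """Collects the text of every <p> that is not nested inside a skipped tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside any SKIP_TAGS element
        self.in_p = False
        self.in_link = False
        self.chunks = []      # text pieces of the paragraph being read
        self.link_chars = 0   # characters of that paragraph that sit inside <a>
        self.paragraphs = []  # finished paragraphs as (text, link_chars)

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.flush_paragraph()  # an unclosed <p> ends where the next begins
            self.in_p = True
        elif tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "p":
            self.flush_paragraph()
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_p and self.skip_depth == 0:
            self.chunks.append(data)
            if self.in_link:
                self.link_chars += len(data)

    def flush_paragraph(self):
        text = " ".join("".join(self.chunks).split())  # normalize whitespace
        if text:
            self.paragraphs.append((text, self.link_chars))
        self.chunks, self.link_chars, self.in_p = [], 0, False


def extract_main_text(html, min_chars=80, max_link_ratio=0.3):
    """Return the paragraphs that look like article text, joined by blank lines."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser.flush_paragraph()  # pick up a trailing, unclosed <p>
    kept = [
        text
        for text, link_chars in parser.paragraphs
        if len(text) >= min_chars and link_chars / len(text) <= max_link_ratio
    ]
    return "\n\n".join(kept)
```

Calling `extract_main_text(html)` on the HTML your crawler already downloads returns a plain string. Short teaser lines and link-heavy "recommended" blocks tend to be dropped by the two thresholds.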

Luckily there are existing libraries that address this problem.

Newspaper is a Python 3 library:

from newspaper import Article

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
article.download()  # fetch the HTML
article.parse()     # extract the article; required before reading article.text
print(article.text)

You may also want to check out similar libraries such as python-readability or python-goose:

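# python-goose targets Python 2; on Python 3, the goose3 fork is imported
# with `from goose3 import Goose` instead.
from goose import Goose

url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'
g = Goose()
article = g.extract(url=url)
print(article.cleaned_text)  # main article text with boilerplate removed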
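With around 100 sources, no single extractor is likely to work well on every site, so one option is to try several in turn and keep the first result that looks like a real article. The sketch below assumes each extractor is wrapped as a plain function that takes a URL and returns text. `extract_with_fallback` and the `min_chars` threshold are hypothetical names and values chosen for illustration.

```python
from typing import Callable, Iterable, Optional

# An extractor takes a URL and returns the article text (assumed convention).
Extractor = Callable[[str], str]


def extract_with_fallback(
    url: str,
    extractors: Iterable[Extractor],
    min_chars: int = 200,
) -> Optional[str]:
    """Return the first extraction with at least min_chars characters, else None."""
    for extract in extractors:
        try:
            text = extract(url)
        except Exception:
            # A library or network error just means "try the next extractor".
            continue
        if text and len(text.strip()) >= min_chars:
            return text.strip()
    return None
```

Each library call shown above can be wrapped in a small function of that shape and passed in as a list. A `None` result flags a URL that probably needs a site-specific rule.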
