I need to grab some data from websites in my Django website. Now I am confused whether I should use Python parsing libraries or web-crawling libraries. Do search-engine libraries also fall into the same category?
I want to know how big the difference between the two is, and if I want to use these functions inside my website, which should I use?
If you can get away with background web crawling, use Scrapy. If you need to immediately grab something, use html5lib (more robust) or lxml (faster). If you are going to be doing the latter, use the awesome requests library. I would avoid BeautifulSoup, mechanize, urllib2, and httplib.
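As a sketch of the "grab something immediately" path, requests and lxml might be combined like this (the URL and the XPath are placeholders for illustration, not part of the original answer):

```python
import requests
from lxml import html

# Fetch the page synchronously; requests handles redirects and encodings.
response = requests.get("https://example.com/page", timeout=10)
response.raise_for_status()

# Parse the HTML into an element tree with lxml.
tree = html.fromstring(response.content)

# Extract data with XPath -- here, the text of every <h2> heading.
headings = [h.text_content().strip() for h in tree.xpath("//h2")]
print(headings)
```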
An HTML parser will parse the page, and you can collect the links present in it. Add those links to a queue and visit those pages in turn. Combine these steps in a loop and you have made a basic crawler (see the sketch below).
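A minimal version of that loop, assuming requests and lxml are available (the seed URL and page limit are illustrative):

```python
from collections import deque
from urllib.parse import urljoin

import requests
from lxml import html

def crawl(seed_url, max_pages=10):
    """Basic crawler: parse a page, queue its links, repeat."""
    queue = deque([seed_url])
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
        except requests.RequestException:
            continue  # skip unreachable pages
        tree = html.fromstring(response.content)
        # Collect every href, resolve it against the current URL, and queue it.
        for href in tree.xpath("//a/@href"):
            queue.append(urljoin(url, href))
    return visited
```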
Crawling libraries are ready-to-use solutions that do the crawling for you. They provide extra features such as detecting recursive links, cycles, etc. Many of the features you would otherwise have to write yourself have already been done within these libraries.
However, the first option is preferable if you have special requirements that the libraries do not satisfy.
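To illustrate the library route, a minimal Scrapy spider might look like this (the spider name, domain, and CSS selectors are placeholders):

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    """Minimal Scrapy spider: follows links and yields page titles."""
    name = "example"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # Yield an item for the current page.
        yield {"url": response.url, "title": response.css("title::text").get()}
        # Scrapy handles the queue, deduplication, and cycle detection itself.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

You can run a standalone spider like this with `scrapy runspider spider.py -o items.json`.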
I've done similar things previously. Web crawlers were not useful to me when I wanted the parsing to be done immediately, in order to fetch something and present it to the user; they are more appropriate for batch-job work. I found BeautifulSoup, lxml, and mechanize quite useful.
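Since the question is about doing this inside a Django site, here is a rough sketch of an on-request fetch-and-parse view using requests and BeautifulSoup (the target URL, view name, and JSON shape are all made up for illustration):

```python
import requests
from bs4 import BeautifulSoup
from django.http import JsonResponse

def scrape_title(request):
    """Fetch a page when the user hits this view and return its <title>."""
    response = requests.get("https://example.com/", timeout=10)  # placeholder URL
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.title.string if soup.title else ""
    return JsonResponse({"title": title})
```

Note that fetching synchronously like this blocks the worker for the duration of the request; for anything slow or large-scale, the batch-job approach above is the better fit.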