I want to be able to get the list of all URLs that a browser will issue GET requests for when it opens a page. For example, if we open cnn.com, there are multiple URLs within the first HTTP response which the browser then requests recursively.
I'm not trying to render the page; I'm trying to obtain a list of all the URLs that are requested when the page is rendered. A simple scan of the HTTP response content wouldn't be sufficient, as there could be images referenced in the CSS which are also downloaded. Is there any way I can do this in Python?
It's likely that you'll have to render the page (not necessarily display it, though) to be sure you're getting a complete list of all resources. I've used PyQt and QtWebKit in similar situations. Especially once you start counting resources included dynamically with JavaScript, trying to parse and load pages recursively with BeautifulSoup just isn't going to work. Ghost.py is an excellent client to get you started with PyQt. Also, check out the QWebView docs and the QNetworkAccessManager docs.
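For instance, here is a minimal sketch of the QNetworkAccessManager route, assuming PyQt4 with QtWebKit; the LoggingNetworkAccessManager class and the cnn.com URL are illustrative only. Every request the page triggers while rendering (scripts, stylesheets, images pulled in from CSS, XHR) passes through createRequest():

```python
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtNetwork import QNetworkAccessManager
from PyQt4.QtWebKit import QWebView


class LoggingNetworkAccessManager(QNetworkAccessManager):
    """Records the URL of every request the page makes."""

    def __init__(self, parent=None):
        QNetworkAccessManager.__init__(self, parent)
        self.requested_urls = []

    def createRequest(self, operation, request, device=None):
        # Log the URL, then let the default implementation handle the request.
        self.requested_urls.append(str(request.url().toString()))
        return QNetworkAccessManager.createRequest(
            self, operation, request, device)


app = QApplication([])
manager = LoggingNetworkAccessManager()
view = QWebView()
view.page().setNetworkAccessManager(manager)
view.loadFinished.connect(app.quit)  # quit the event loop once the page loads
view.load(QUrl('http://cnn.com'))
app.exec_()

for url in manager.requested_urls:
    print(url)
```

One caveat: loadFinished can fire before late AJAX requests complete, so for heavily dynamic pages you may want to keep the event loop alive a little longer before collecting the list.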
Ghost.py returns a tuple of (page, resources) when opening a page: resources includes all of the resources loaded by the original URL as HttpResource objects. You can retrieve the URL for a loaded resource with resource.url.
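A minimal sketch of that, based on the (page, resources) behaviour described above; the exact call signature has varied across Ghost.py releases, so treat this as illustrative:

```python
from ghost import Ghost

ghost = Ghost()
# open() returns (page, resources); resources covers everything the page
# pulled in while rendering, not just the initial HTML.
page, resources = ghost.open('http://cnn.com')

# Each entry is an HttpResource; .url is the URL that was requested.
for resource in resources:
    print(resource.url)
```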
I guess you will have to create a list of all known file extensions that you do NOT want, and then scan the content of the HTTP response, checking with something like "if substring not in nono-list".
The problem is all the hrefs ending with TLDs, forward slashes, URL-delivered variables and so on, so I think it would be easier to check for the stuff you know you don't want.
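If you go that route, here is a rough sketch of the idea; the regex and the nono_list are illustrative only, and note that this still misses URLs that are built dynamically in JavaScript or referenced from inside CSS files:

```python
import re

# Extensions we do NOT want (illustrative, not exhaustive).
nono_list = ('.png', '.jpg', '.jpeg', '.gif', '.css', '.ico')

html = '<a href="http://example.com/page">x</a> <img src="/logo.png">'

# Grab anything that looks like an href or src attribute value.
candidates = re.findall(r'(?:href|src)\s*=\s*["\']([^"\']+)["\']', html)

# Keep only the URLs whose extension is not on the blacklist.
wanted = [url for url in candidates
          if not url.lower().endswith(nono_list)]
print(wanted)  # ['http://example.com/page']
```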