Can I extract comments of any page from https://ww

I am writing a web crawler. I extracted the heading and main discussion from this link, but I cannot find any of the comments in the page source (Ctrl+U, then Ctrl+F for the comment text). I think the comments are loaded by JavaScript. Can I extract them?

Tags: javascript python beautifulsoup web-crawler

2 Answers

成全新的幸福

Reply #2 · 2019-09-17 22:07

RT uses a service from spot.im for its comments.

You need to make two POST requests: first to https://api.spot.im/me/network-token/spotim to get a token, then to https://api.spot.im/conversation-read/spot/sp_6phY2k0C/post/353493/get to get the comments as JSON.

I wrote a quick script to do this:

import re
import requests

def get_rt_comments(article_url):
    spotim_spot_id = 'sp_6phY2k0C'  # spot.im ID for RT

    # The post ID is the first run of digits in the article URL.
    match = re.search(r'[0-9]+', article_url)
    if match is None:
        raise ValueError('no post ID found in ' + article_url)
    post_id = match.group(0)

    # First request: obtain a token.
    r1 = requests.post('https://api.spot.im/me/network-token/spotim')
    r1.raise_for_status()
    spotim_token = r1.json()['token']

    payload = {
        "count": 25,  # number of comments to fetch
        "sort_by": "best",
        "cursor": {"offset": 0, "comments_read": 0},
        "host_url": article_url,
        "canonical_url": article_url
    }

    # Second request: fetch the comments for this post as JSON.
    r2_url = 'https://api.spot.im/conversation-read/spot/' + spotim_spot_id + '/post/' + post_id + '/get'
    # json= serializes the payload and sets the Content-Type header.
    r2 = requests.post(r2_url, json=payload, headers={'X-Spotim-Token': spotim_token})
    r2.raise_for_status()

    return r2.json()

if __name__ == '__main__':
    url = 'https://www.rt.com/usa/353493-clinton-speech-affairs-silence/'
    comments = get_rt_comments(url)
    print(comments)
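
The script takes the first run of digits anywhere in the URL as the post ID, which would pick up the wrong number if the host name or an earlier path segment ever contained digits. A slightly more defensive sketch is below. It assumes article URLs always look like the example above, with the numeric ID at the start of the last path segment followed by a hyphen. The helper names extract_post_id and build_comments_url are made up for this illustration.

```python
import re
from urllib.parse import urlparse


def extract_post_id(article_url):
    """Return the numeric ID that starts the last path segment, or None."""
    path = urlparse(article_url).path
    # Drop empty segments produced by leading and trailing slashes.
    segments = [s for s in path.split('/') if s]
    if not segments:
        return None
    # Assumption: the last segment looks like "<digits>-<slug>".
    match = re.match(r'(\d+)-', segments[-1])
    return match.group(1) if match else None


def build_comments_url(spot_id, post_id):
    """Assemble the conversation-read endpoint used in the script above."""
    return ('https://api.spot.im/conversation-read/spot/'
            + spot_id + '/post/' + post_id + '/get')
```

Because both helpers are pure functions, they can be checked without touching the network, and get_rt_comments can bail out early when extract_post_id returns None.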


贼婆χ

Reply #3 · 2019-09-17 22:24

Yes, if it can be viewed with a web browser, you can extract it.

If you look at the source, the comment section is really an iframe that loads a piece of JavaScript. That script creates a new script tag in the document whose source points to bundle.js, which contains the actual commenting software. This in turn fetches the actual comments.
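
To see the first link of that chain from a crawler, you can list the iframe and script sources present in the raw HTML. The sketch below uses only the standard library. The sample markup is invented for illustration and is not copied from the real page, and the names ExternalSourceFinder and find_external_sources are hypothetical.

```python
from html.parser import HTMLParser


class ExternalSourceFinder(HTMLParser):
    """Collect the src attribute of every <iframe> and <script> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []  # (tag, src) tuples in document order

    def handle_starttag(self, tag, attrs):
        if tag in ('iframe', 'script'):
            src = dict(attrs).get('src')
            if src:  # inline scripts have no src and are skipped
                self.sources.append((tag, src))


def find_external_sources(html):
    """Return the (tag, src) pairs found in an HTML string."""
    finder = ExternalSourceFinder()
    finder.feed(html)
    return finder.sources


# Hypothetical markup, made up for illustration only.
SAMPLE_HTML = """
<html><body>
  <h1>Headline</h1>
  <iframe src="https://example.com/comments-frame.html"></iframe>
  <script src="https://example.com/bundle.js"></script>
  <script>var inline = true;</script>
</body></html>
"""

if __name__ == '__main__':
    for tag, src in find_external_sources(SAMPLE_HTML):
        print(tag, src)
```

This only finds tags that exist in the static HTML. Tags that a script creates at runtime never appear in the downloaded source, so each discovered resource would have to be fetched and inspected in turn.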

Instead of working through this chain manually, you could use, for example, WebKit to drive a headless browser that executes the JavaScript like an ordinary browser. You can then scrape the rendered page instead of making your crawler fetch each external resource itself.

Examples of such headless browsers include Spynner, dryscrape, and the PhantomJS-derived PhantomPy (the latter seems to be an abandoned project now).
