Can I extract comments of any page from https://ww

Posted 2019-09-17 21:50

I am writing a web crawler. I extracted the heading and main discussion from this link, but I cannot find any of the comments in the page source (Ctrl+U, then Ctrl+F for the comment text). I think the comments are generated by JavaScript. Can I extract them?

2 answers
成全新的幸福
Reply #2 · 2019-09-17 22:07

RT uses a service from spot.im for its comments.

You need to make two POST requests: first to https://api.spot.im/me/network-token/spotim to get a token, then to https://api.spot.im/conversation-read/spot/sp_6phY2k0C/post/353493/get to get the comments as JSON.

I wrote a quick script to do this:

import json
import re

import requests


def get_rt_comments(article_url):
    spotim_spotId = 'sp_6phY2k0C'  # spot.im ID for RT
    # The first run of digits in the URL is used as the post ID.
    post_id = re.search(r'([0-9]+)', article_url).group(0)

    # First request: get a token.
    r1 = requests.post('https://api.spot.im/me/network-token/spotim')
    r1.raise_for_status()
    spotim_token = r1.json()['token']

    payload = {
        "count": 25,  # number of comments to fetch
        "sort_by": "best",
        "cursor": {"offset": 0, "comments_read": 0},
        "host_url": article_url,
        "canonical_url": article_url,
    }

    # Second request: send the token in a header and read the comments as JSON.
    r2_url = ('https://api.spot.im/conversation-read/spot/'
              + spotim_spotId + '/post/' + post_id + '/get')
    headers = {
        'X-Spotim-Token': spotim_token,
        'Content-Type': 'application/json',
    }
    r2 = requests.post(r2_url, data=json.dumps(payload), headers=headers)
    r2.raise_for_status()

    return r2.json()


if __name__ == '__main__':
    url = 'https://www.rt.com/usa/353493-clinton-speech-affairs-silence/'
    comments = get_rt_comments(url)
    print(comments)
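
One fragile spot in the script is the post ID lookup: it takes the first run of digits anywhere in the URL, so digits in a host name would be picked up by mistake. A minimal sketch of a stricter lookup follows, assuming article URLs end in a path segment of the form ID-slug, as the example URL does. The helper name extract_post_id is hypothetical and not part of the original answer.

```python
import re
from urllib.parse import urlparse


def extract_post_id(article_url):
    """Return the digits that start the last path segment, or None.

    Assumes URLs shaped like https://host/section/<id>-<slug>/ (an
    assumption based on the single example URL in the answer).
    """
    path = urlparse(article_url).path
    # Drop empty pieces produced by leading and trailing slashes.
    segments = [segment for segment in path.split('/') if segment]
    if not segments:
        return None
    match = re.match(r'(\d+)', segments[-1])
    return match.group(1) if match else None


if __name__ == '__main__':
    url = 'https://www.rt.com/usa/353493-clinton-speech-affairs-silence/'
    print(extract_post_id(url))
```

Returning None instead of raising lets the caller decide how to handle a URL that does not match the assumed shape.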
贼婆χ
Reply #3 · 2019-09-17 22:24

Yes, if it can be viewed with a web browser, you can extract it.

If you look at the source, the comment section is really an iframe that loads a piece of JavaScript. That script creates a new script tag in the document whose source is bundle.js, which contains the actual commenting software. This in turn fetches the actual comments.
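
To follow that chain by hand, a crawler first needs the external resources referenced in the page. The sketch below uses only the Python standard library to list the src attributes of iframe and script tags. The sample HTML and its URLs are made up for illustration and do not come from the actual page.

```python
from html.parser import HTMLParser


class ResourceFinder(HTMLParser):
    """Collect (tag, src) pairs for iframe and script tags that have a src."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in ('iframe', 'script'):
            src = dict(attrs).get('src')
            if src:
                self.sources.append((tag, src))


def find_embedded_sources(html):
    """Return the external iframe and script sources found in an HTML string."""
    finder = ResourceFinder()
    finder.feed(html)
    finder.close()
    return finder.sources


# Hypothetical page, only to show the shape of the result.
sample = '''
<html><body>
  <h1>Article</h1>
  <iframe src="https://comments.example.com/frame.html"></iframe>
  <script src="/static/bundle.js"></script>
  <script>var inline = true;</script>
</body></html>
'''

if __name__ == '__main__':
    for tag, src in find_embedded_sources(sample):
        print(tag, src)
```

This only sees tags present in the static HTML. Tags that a script creates at runtime never appear in the downloaded source, so each fetched resource would have to be inspected again in the same way.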

Instead of going through this manually, you could consider using, for example, WebKit to create a headless browser that executes the JavaScript like an ordinary browser. You can then scrape the rendered page instead of making your crawler fetch the external resources itself.

Examples of such headless browsers include Spynner, dryscrape, and the PhantomJS-derived PhantomPy (the latter seems to be an abandoned project now).
