Cannot locate displayed data in source code when S

Posted 2019-03-01 04:55

I am using the Python.org build of Python 2.7 (64-bit) on Windows Vista 64-bit. I am using a combination of Scrapy and regex to extract information from a JavaScript item called 'DataStore.Prime' at the following page:

http://www.whoscored.com/Regions/252/Tournaments/26/Seasons/4057/Stages/8273

The crawler I am using is this:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time
import re
import json


class ExampleSpider(CrawlSpider):
    name = "goal4"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/Regions/252/Tournaments/26"]
    download_delay = 1

    #rules = [Rule(SgmlLinkExtractor(allow=('/Seasons',)), follow=True, callback='parse_item')]
    rules = [Rule(SgmlLinkExtractor(allow=('/Tournaments/26'),deny=('/News', '/Fixtures'),), follow=False, callback='parse_item')]

    def parse_item(self, response):

        regex = re.compile('DataStore\.prime\(\'ws-stage-stat\', { stageId: \d+, type: \d+, teamId: -?\d+, against: \d+, field: \d+ }, \[\[\[.*?\]\]', re.S)

        match2h = re.search(regex, response.body)

        if match2h is not None:
            match3h = match2h.group()

            match3h = str(match3h)
            match3h = match3h \
                 .replace('title=', '').replace('"', '').replace("'", '').replace('[', '').replace(']', '') \
                 .replace(' ', ',').replace(',,', ',') \
                 .replace('[', '') \
                 .replace(']','') \
                 .replace("DataStore.prime", '') \
                 .replace('(', ''). replace('-', '').replace('wsstagestat,', '')

            match3h = re.sub("{.*?},", '', match3h)

I am after the fixtures and scores that are displayed under the title 'FA Cup Fixtures'. You can select the game week you want using the calendar on the page itself. If you look at the source code though, it only contains the most recent game week (as this is last season now, that is the FA Cup Final).

The data for the previous weeks is not in the source code of this page. The calendar that you use seems to generate an item within the code called:

stageFixtures.load(calendarParameter)

This (if I have understood correctly) seems to control which game week is selected for display. What I want to know is:

1) Is that assumption correct?
2) Is there somewhere within the source code that points to other URLs storing the scores by week (I'm pretty sure there isn't, but I'm really new to JavaScript)?

Thanks

1 Answer
我命由我不由天
Reply #2 · 2019-03-01 05:09

The fixtures are loaded by a separate XHR request. Simulate that request and read the data from its response.

For example, fixtures for Jan 2014:

from ast import literal_eval
from datetime import datetime
import requests

date = datetime(year=2014, month=1, day=1)
url = 'http://www.whoscored.com/tournamentsfeed/8273/Fixtures/'

params = {'d': date.strftime('%Y%m'), 'isAggregate': 'false'}
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'}

response = requests.get(url, params=params, headers=headers)

fixtures = literal_eval(response.content)
print fixtures

Prints:

[
    [789692, 1, 'Saturday, Jan 4 2014', '12:45', 158, 'Blackburn', 0, 167, 'Manchester City', 1, '1 : 1', '0 : 1', 1, 1, 'FT', '0', 0, 0, 4, 1], 
    [789693, 1, 'Saturday, Jan 4 2014', '15:00', 31, 'Everton', 0, 171, 'Queens Park Rangers', 0, '4 : 0', '2 : 0', 1, 0, 'FT', '1', 0, 0, 1, 0],
    ...
]
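
The request above covers a single month. To collect several months, the same parameters could be generated for each month in turn. This is only a sketch: it assumes the feed accepts any 'd' value in the same %Y%m format, the helper name month_params and the date range are made up for illustration, and no request is sent here.

```python
from datetime import date


def month_params(start, end):
    """Yield one query-parameter dict per month from start to end, inclusive.

    start and end are (year, month) tuples. The 'd' and 'isAggregate' keys
    mirror the request above; that every month is served this way is an
    assumption.
    """
    year, month = start
    while (year, month) <= end:
        yield {'d': date(year, month, 1).strftime('%Y%m'), 'isAggregate': 'false'}
        # Step to the next month, rolling over into the next year after December.
        month += 1
        if month > 12:
            year, month = year + 1, 1


# Hypothetical range: November 2013 through February 2014.
all_params = list(month_params((2013, 11), (2014, 2)))
```

Each dict could then be passed as params to requests.get() in place of the single-month dict, ideally with a short pause between requests.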

Note that the response is not JSON but essentially a dump of a Python list of lists, which you can load with ast.literal_eval(). From the documentation:

Safely evaluate an expression node or a Unicode or Latin-1 encoded string containing a Python expression. The string or node provided may only consist of the following Python literal structures: strings, numbers, tuples, lists, dicts, booleans, and None.
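
As a rough illustration of working with the parsed result, the sketch below runs literal_eval() on two rows copied from the output above and picks out the readable fields. The meaning of each index is a guess based on those sample rows, not something the site documents, and the summarize helper is hypothetical.

```python
from ast import literal_eval

# Two rows copied from the output above, as the raw text the server might send.
raw = (
    "[[789692, 1, 'Saturday, Jan 4 2014', '12:45', 158, 'Blackburn', 0, "
    "167, 'Manchester City', 1, '1 : 1', '0 : 1', 1, 1, 'FT', '0', 0, 0, 4, 1], "
    "[789693, 1, 'Saturday, Jan 4 2014', '15:00', 31, 'Everton', 0, "
    "171, 'Queens Park Rangers', 0, '4 : 0', '2 : 0', 1, 0, 'FT', '1', 0, 0, 1, 0]]"
)

# literal_eval only builds literals (lists, strings, numbers, ...); it does not run code.
fixtures = literal_eval(raw)


def summarize(row):
    """Pick out the readable fields of one fixture row.

    The index meanings are guesses based on the sample rows, not documented.
    """
    return {
        'date': row[2],
        'kickoff': row[3],
        'home': row[5],
        'away': row[8],
        'score': row[10],
    }


summaries = [summarize(row) for row in fixtures]
```

If the layout of the rows ever changes, only the indices in summarize would need adjusting.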
