I am using Python.org version 2.7 64 bit on Windows Vista 64 bit. I am using a combination of Scrapy and regex to extract information from a JavaScript item called 'DataStore.prime' at the following page:
http://www.whoscored.com/Regions/252/Tournaments/26/Seasons/4057/Stages/8273
The crawler I am using is this:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time
import re
import json


class ExampleSpider(CrawlSpider):
    name = "goal4"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/Regions/252/Tournaments/26"]
    download_delay = 1

    #rules = [Rule(SgmlLinkExtractor(allow=('/Seasons',)), follow=True, callback='parse_item')]
    rules = [Rule(SgmlLinkExtractor(allow=('/Tournaments/26'),deny=('/News', '/Fixtures'),), follow=False, callback='parse_item')]

    def parse_item(self, response):
        regex = re.compile('DataStore\.prime\(\'ws-stage-stat\', { stageId: \d+, type: \d+, teamId: -?\d+, against: \d+, field: \d+ }, \[\[\[.*?\]\]', re.S)
        match2h = re.search(regex, response.body)

        if match2h is not None:
            match3h = match2h.group()
            match3h = str(match3h)
            match3h = match3h \
                .replace('title=', '').replace('"', '').replace("'", '').replace('[', '').replace(']', '') \
                .replace(' ', ',').replace(',,', ',') \
                .replace('[', '') \
                .replace(']','') \
                .replace("DataStore.prime", '') \
                .replace('(', '').replace('-', '').replace('wsstagestat,', '')
            match3h = re.sub("{.*?},", '', match3h)

I am after the fixtures and scores that are displayed under the title 'FA Cup Fixtures'. You can select the game week you want using the calendar on the page itself. If you look at the source code, though, it only contains the most recent game week (as this is last season now, that is the FA Cup Final).
The data for all previous weeks is not in the source code for this page. The calendar seems to be generating an item within the code called:
stageFixtures.load(calendarParameter)
This (if I have understood correctly) seems to control which game week is selected for display. What I want to know is:
1) Is that assumption correct?
2) Is there somewhere within the source code that points to other URLs storing the scores by week (I'm pretty sure there isn't, but I'm really new to JavaScript)?
Thanks
There is an XHR request going out to load the fixtures. Simulate it and get the data. For example, fixtures for Jan 2014:
Prints:
Note that the response is not JSON but basically a dump of a Python list of lists; you can load it with ast.literal_eval():
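A minimal sketch of both steps (simulating the XHR request and loading the body), under some loud assumptions: `FIXTURES_URL` is a hypothetical placeholder, not the site's real endpoint, so copy the actual request URL and query string from the network tab of your browser's developer tools. The helper names `build_xhr_request` and `parse_fixtures`, the extra headers, and `SAMPLE_BODY` are likewise made up for illustration; the sample only imitates the "list of lists" shape described above.

```python
import ast

try:
    from urllib.request import Request  # Python 3
except ImportError:
    from urllib2 import Request  # Python 2.7

# ASSUMPTION: placeholder URL. Replace it with the request URL that the
# calendar triggers, as seen in the browser's network tab.
FIXTURES_URL = "http://www.example.com/fixtures-feed"

# ASSUMPTION: the page the calendar lives on, sent as the Referer.
PAGE_URL = ("http://www.whoscored.com/Regions/252/Tournaments/26/"
            "Seasons/4057/Stages/8273")


def build_xhr_request(url, referer):
    """Build a GET request that resembles a browser-issued XHR call.

    Whether the server checks any of these headers is an assumption;
    they are included because some sites reject requests without them.
    """
    headers = {
        "X-Requested-With": "XMLHttpRequest",
        "Referer": referer,
        "User-Agent": "Mozilla/5.0",
    }
    return Request(url, headers=headers)


def parse_fixtures(body):
    """Turn a Python-literal style body (list of lists) into real lists.

    ast.literal_eval only evaluates literals, so it does not execute
    arbitrary code the way eval() would.
    """
    return ast.literal_eval(body)


# Made-up sample in the "list of lists" shape; not real site data.
SAMPLE_BODY = "[[1, 'Team A', 'Team B', '2 : 1'], [2, 'Team C', 'Team D', '0 : 0']]"

request = build_xhr_request(FIXTURES_URL, PAGE_URL)
fixtures = parse_fixtures(SAMPLE_BODY)

for row in fixtures:
    # Each row is an ordinary Python list after parsing.
    print(row)
```

To fetch live data you would pass `request` to `urlopen` (from `urllib.request` or `urllib2`) and hand the decoded response text to `parse_fixtures`. If the body contains anything that is not a valid Python literal, `ast.literal_eval` raises an error rather than guessing, so it may need cleaning first.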