Python BeautifulSoup not scraping this url

Posted 2019-02-20 05:31

I am trying to scrape some rows of player data (tr) from a URL, but nothing appears to happen when I run my code. I am fairly sure the code itself is fine because it works with other statistics websites that contain tables. Can anyone tell me why nothing is happening? Thanks in advance.

import urllib.request
from bs4 import BeautifulSoup

def make_soup(url):
    # Download the page and parse the returned HTML.
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

soup = make_soup("https://www.whoscored.com/Regions/252/Tournaments/7/Seasons/6365/Stages/13832/PlayerStatistics/England-Championship-2016-2017")
# Print the text of every table row found in the fetched HTML.
for record in soup.find_all('tr'):
    print(record.text)

3 answers
forever°为你锁心
#2 · 2019-02-20 05:46

This happens because the website does not want you to scrape it.

[Screenshot: Incapsula Protection]

I used Selenium to send the request and took a screenshot of the simulated browser it created.

The site uses Incapsula, which is a security service (they even have some information about scraping on their website; check it out, it's interesting).

  • This might be helpful
狗以群分
#3 · 2019-02-20 05:54

This page uses JavaScript to fetch its data. You can find the raw data at this link:

https://www.whoscored.com/StatisticsFeed/1/GetPlayerStatistics?category=summary&subcategory=all&statsAccumulationType=0&isCurrent=true&playerId=&teamIds=&matchId=&stageId=13832&tournamentOptions=7&sortBy=Rating&sortAscending=&age=&ageComparisonType=&appearances=&appearancesComparisonType=&field=Overall&nationality=&positionOptions=&timeOfTheGameEnd=&timeOfTheGameStart=&isMinApp=true&page=&includeZeroValues=&numberOfPlayersToPick=10

Each field in the URL's query string can be changed to fetch the data you need.
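
As a rough sketch of changing those fields, the query string can be rebuilt from a dictionary with the standard library. The names `build_feed_url` and `DEFAULT_PARAMS` are introduced here for illustration, and the idea that `page` selects a results page is an assumption based only on the parameter name.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.whoscored.com/StatisticsFeed/1/GetPlayerStatistics"

# Parameters copied from the link above, in the same order.
# Empty strings are kept so the rebuilt URL matches the original.
DEFAULT_PARAMS = {
    "category": "summary",
    "subcategory": "all",
    "statsAccumulationType": "0",
    "isCurrent": "true",
    "playerId": "",
    "teamIds": "",
    "matchId": "",
    "stageId": "13832",
    "tournamentOptions": "7",
    "sortBy": "Rating",
    "sortAscending": "",
    "age": "",
    "ageComparisonType": "",
    "appearances": "",
    "appearancesComparisonType": "",
    "field": "Overall",
    "nationality": "",
    "positionOptions": "",
    "timeOfTheGameEnd": "",
    "timeOfTheGameStart": "",
    "isMinApp": "true",
    "page": "",
    "includeZeroValues": "",
    "numberOfPlayersToPick": "10",
}


def build_feed_url(**overrides):
    """Return the feed URL with selected query fields replaced (hypothetical helper)."""
    params = dict(DEFAULT_PARAMS)
    params.update({key: str(value) for key, value in overrides.items()})
    return BASE_URL + "?" + urlencode(params)


# Assumption: "page" picks the results page and "sortBy" the sort column.
print(build_feed_url(page=2))
```

Calling `build_feed_url()` with no arguments reproduces the link above, which is a quick way to check that no field was mistyped.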

放我归山
#4 · 2019-02-20 05:58

Short answer: The player data you are looking for is NOT in the response from that URL.

Then you might ask: why? I've seen the stats on that page, so how come they're not there?

So I'll try to explain what happens when you open that URL in a modern browser such as Chrome.

You: Type in the url and hit enter.

Chrome: Gotcha. I'll get that page for you asap, just a second. (fetching content from that URL) Great, now I have it! But wait, let me read/parse it first before I show it to you. (reading what's inside the content) Oh crap, this JavaScript tells me to get additional information from another URL. OK, I'll do it. Oh wait, here's another one telling me to load an ad in the header. Well, I don't like it, but I'm just gonna do what I'm told. Just a second, this CSS tells me to display player names in bold. OK, not bad. Oh, here's another photo from URL xxx I need to load, no problem... Oh man, how much stuff is there for me to process? I'm not happy with this website... (working on a bunch of other stuff...) Finally everything's ready! Now check it out!

You: Player xxx is actually quite good, I'll check it out. (click player xxx)

Chrome: ......

As you can see, every time you open a web page, the browser does a lot of "behind the scenes" work to display it. So basically: URL entered >> content from URL fetched >> content parsed >> additional content fetched >> all stuff rendered >> page displayed (one or more steps might happen simultaneously).

Your code only does the "content from URL fetched" step, and the stats you want happen to be "additional content" that is loaded from elsewhere. That's why you got nothing.
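
To make that concrete, here is a small sketch using only the standard library. The HTML string is made up for illustration (it is not the real page source); it stands in for a page whose table is filled in later by a script. Parsing it without running the script finds no rows at all.

```python
from html.parser import HTMLParser

# Made-up stand-in for the HTML a server sends before any JavaScript runs.
RAW_HTML = """
<html><body>
  <div id="statistics-table"></div>
  <script>/* a script here would request the stats and build the table */</script>
</body></html>
"""


class RowCounter(HTMLParser):
    """Counts <tr> start tags in the markup it is fed."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1


counter = RowCounter()
counter.feed(RAW_HTML)
print(counter.rows)  # 0: the rows only exist after a browser runs the script
```

A parser such as BeautifulSoup is in the same position: it only sees the markup it is handed and never executes the scripts inside it.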

How do I get those stats then? Once you know the URLs responsible for loading them, simply go after those. How do I find those URLs? Well, you can always read the JavaScript... if you are patient enough...

The easiest way to get what you want is to analyze the traffic while the page is loading and pick out the requests made behind the scenes. I would recommend Fiddler, but you can use any tool you see fit.

Now let's see what happens when you load that page: [screenshot: traffic analytics]

There are actually hundreds of requests made to fully render the page you visit, and all you need to do is find out which one feeds the "actual" or "real" stats. There's one URL that even has "StatisticsFeed" in it. Could it be the one? Let's take a look:

https://www.whoscored.com/StatisticsFeed/1/GetPlayerStatistics?category=summary&subcategory=all&statsAccumulationType=0&isCurrent=true&playerId=&teamIds=&matchId=&stageId=13832&tournamentOptions=7&sortBy=Rating&sortAscending=&age=&ageComparisonType=&appearances=&appearancesComparisonType=&field=Overall&nationality=&positionOptions=&timeOfTheGameEnd=&timeOfTheGameStart=&isMinApp=true&page=&includeZeroValues=&numberOfPlayersToPick=10

{
    "playerTableStats": [{
        "name": "Conor Hourihane",
        "firstName": "Conor",
        "lastName": "Hourihane",
        "playerId": 134172,
        "height": 181,
        "weight": 62,
        "age": 25,
        "isManOfTheMatch": false,
        "isActive": true,
        "isOpta": true,
        "playedPositions": "-MC-",
        "positionText": "Midfielder",
        "playedPositionsShort": "M(C)",
        "teamId": 142,
        "teamName": "Barnsley",
        "seasonId": 6365,
        "seasonName": "2016/2017",
        "tournamentId": 7,
        "tournamentRegionId": 252,
        "tournamentRegionCode": "gb-eng",
        "regionCode": "ie",
        "tournamentName": "Championship",
        "tournamentShortName": "EC",
        "rating": 7.8705882352941181,
        "ranking": 1,
        "apps": 17,
        "subOn": 0,
        "minsPlayed": 1530,
        "manOfTheMatch": 4,
        "yellowCard": 5.0,
        "redCard": 0.0,
        "goal": 3,
        "assistTotal": 8,
        "shotsPerGame": 2.2352941176470589,
        "aerialWonPerGame": 0.6470588235294118,
        "passSuccess": 81.370449678800867
    },
    {
        "name": "Anthony Knockaert",
        "firstName": "Anthony",
        "lastName": "Knockaert",
        "playerId": 86794,
        "height": 172,
        "weight": 69,
        "age": 25,
        "isManOfTheMatch": false,
        "isActive": true,
        "isOpta": true,
        "playedPositions": "-AML-AMR-",
        "positionText": "Midfielder",
        "playedPositionsShort": "AM(LR)",
        "teamId": 211,
        "teamName": "Brighton",
        "seasonId": 6365,
        "seasonName": "2016/2017",
        "tournamentId": 7,
        "tournamentRegionId": 252,
        "tournamentRegionCode": "gb-eng",
        "regionCode": "fr",
        "tournamentName": "Championship",
        "tournamentShortName": "EC",
        "rating": 7.6722222222222216,
        "ranking": 2,
        "apps": 18,
        "subOn": 1,
        "minsPlayed": 1471,
        "manOfTheMatch": 5,
        "yellowCard": 4.0,
        "redCard": 0.0,
        "goal": 6,
        "assistTotal": 0,
        "shotsPerGame": 2.3888888888888888,
        "aerialWonPerGame": 0.22222222222222221,
        "passSuccess": 83.420593368237348
    },
    {
        "name": "Lewis Dunk",
        "firstName": "Lewis",
        "lastName": "Dunk",
        "playerId": 86441,
        "height": 192,
        "weight": 88,
        "age": 25,
        "isManOfTheMatch": false,
        "isActive": true,
        "isOpta": true,
        "playedPositions": "-DC-",
        "positionText": "Defender",
        "playedPositionsShort": "D(C)",
        "teamId": 211,
        "teamName": "Brighton",
        "seasonId": 6365,
        "seasonName": "2016/2017",
        "tournamentId": 7,
        "tournamentRegionId": 252,
        "tournamentRegionCode": "gb-eng",
        "regionCode": "gb-eng",
        "tournamentName": "Championship",
        "tournamentShortName": "EC",
        "rating": 7.660000000000001,
        "ranking": 3,
        "apps": 18,
        "subOn": 0,
        "minsPlayed": 1620,
        "manOfTheMatch": 3,
        "yellowCard": 8.0,
        "redCard": 0.0,
        "goal": 1,
        "assistTotal": 1,
        "shotsPerGame": 0.61111111111111116,
        "aerialWonPerGame": 3.5,
        "passSuccess": 79.72251867662753
    },
    {
        "name": "Tom Clarke",
        "firstName": "Tom",
        "lastName": "Clarke",
        "playerId": 133974,
        "height": 180,
        "weight": 77,
        "age": 28,
        "isManOfTheMatch": false,
        "isActive": true,
        "isOpta": true,
        "playedPositions": "-DC-",
        "positionText": "Defender",
        "playedPositionsShort": "D(C)",
        "teamId": 181,
        "teamName": "Preston",
        "seasonId": 6365,
        "seasonName": "2016/2017",
        "tournamentId": 7,
        "tournamentRegionId": 252,
        "tournamentRegionCode": "gb-eng",
        "regionCode": "gb-eng",
        "tournamentName": "Championship",
        "tournamentShortName": "EC",
        "rating": 7.6126315789473677,
        "ranking": 4,
        "apps": 19,
        "subOn": 0,
        "minsPlayed": 1692,
        "manOfTheMatch": 4,
        "yellowCard": 0.0,
        "redCard": 0.0,
        "goal": 2,
        "assistTotal": 0,
        "shotsPerGame": 0.89473684210526316,
        "aerialWonPerGame": 5.4736842105263159,
        "passSuccess": 66.666666666666657
    },
    {
        "name": "Pontus Jansson",
        "firstName": "Pontus",
        "lastName": "Jansson",
        "playerId": 121123,
        "height": 194,
        "weight": 89,
        "age": 25,
        "isManOfTheMatch": false,
        "isActive": true,
        "isOpta": true,
        "playedPositions": "-DC-",
        "positionText": "Defender",
        "playedPositionsShort": "D(C)",
        "teamId": 19,
        "teamName": "Leeds",
        "seasonId": 6365,
        "seasonName": "2016/2017",
        "tournamentId": 7,
        "tournamentRegionId": 252,
        "tournamentRegionCode": "gb-eng",
        "regionCode": "se",
        "tournamentName": "Championship",
        "tournamentShortName": "EC",
        "rating": 7.5976923076923066,
        "ranking": 5,
        "apps": 13,
        "subOn": 0,
        "minsPlayed": 1126,
        "manOfTheMatch": 1,
        "yellowCard": 6.0,
        "redCard": 0.0,
        "goal": 1,
        "assistTotal": 0,
        "shotsPerGame": 0.53846153846153844,
        "aerialWonPerGame": 3.5384615384615383,
        "passSuccess": 86.336633663366342
    },
    {
        "name": "Angus MacDonald",
        "firstName": "Angus",
        "lastName": "MacDonald",
        "playerId": 110825,
        "height": 184,
        "weight": 70,
        "age": 24,
        "isManOfTheMatch": false,
        "isActive": true,
        "isOpta": true,
        "playedPositions": "-DC-",
        "positionText": "Defender",
        "playedPositionsShort": "D(C)",
        "teamId": 142,
        "teamName": "Barnsley",
        "seasonId": 6365,
        "seasonName": "2016/2017",
        "tournamentId": 7,
        "tournamentRegionId": 252,
        "tournamentRegionCode": "gb-eng",
        "regionCode": "gb-eng",
        "tournamentName": "Championship",
        "tournamentShortName": "EC",
        "rating": 7.5066666666666677,
        "ranking": 6,
        "apps": 12,
        "subOn": 0,
        "minsPlayed": 1080,
        "manOfTheMatch": 0,
        "yellowCard": 3.0,
        "redCard": 0.0,
        "goal": 0,
        "assistTotal": 0,
        "shotsPerGame": 0.33333333333333331,
        "aerialWonPerGame": 4.833333333333333,
        "passSuccess": 72.147651006711413
    },
    {
        "name": "Marc Roberts",
        "firstName": "Marc",
        "lastName": "Roberts",
        "playerId": 138949,
        "height": 183,
        "weight": 81,
        "age": 26,
        "isManOfTheMatch": false,
        "isActive": true,
        "isOpta": true,
        "playedPositions": "-DC-",
        "positionText": "Defender",
        "playedPositionsShort": "D(C)",
        "teamId": 142,
        "teamName": "Barnsley",
        "seasonId": 6365,
        "seasonName": "2016/2017",
        "tournamentId": 7,
        "tournamentRegionId": 252,
        "tournamentRegionCode": "gb-eng",
        "regionCode": "gb-eng",
        "tournamentName": "Championship",
        "tournamentShortName": "EC",
        "rating": 7.503125,
        "ranking": 7,
        "apps": 16,
        "subOn": 0,
        "minsPlayed": 1440,
        "manOfTheMatch": 1,
        "yellowCard": 3.0,
        "redCard": 0.0,
        "goal": 2,
        "assistTotal": 2,
        "shotsPerGame": 0.625,
        "aerialWonPerGame": 7.0625,
        "passSuccess": 61.595547309833023
    },
    {
        "name": "Bradley Johnson",
        "firstName": "Bradley",
        "lastName": "Johnson",
        "playerId": 12490,
        "height": 178,
        "weight": 68,
        "age": 29,
        "isManOfTheMatch": false,
        "isActive": true,
        "isOpta": true,
        "playedPositions": "-MC-ML-",
        "positionText": "Midfielder",
        "playedPositionsShort": "M(CL)",
        "teamId": 20,
        "teamName": "Derby",
        "seasonId": 6365,
        "seasonName": "2016/2017",
        "tournamentId": 7,
        "tournamentRegionId": 252,
        "tournamentRegionCode": "gb-eng",
        "regionCode": "gb-eng",
        "tournamentName": "Championship",
        "tournamentShortName": "EC",
        "rating": 7.4954545454545443,
        "ranking": 8,
        "apps": 11,
        "subOn": 0,
        "minsPlayed": 952,
        "manOfTheMatch": 1,
        "yellowCard": 4.0,
        "redCard": 0.0,
        "goal": 2,
        "assistTotal": 1,
        "shotsPerGame": 1.3636363636363635,
        "aerialWonPerGame": 4.0909090909090908,
        "passSuccess": 71.908127208480565
    },
    {
        "name": "Christophe Berra",
        "firstName": "Christophe",
        "lastName": "Berra",
        "playerId": 8287,
        "height": 186,
        "weight": 81,
        "age": 31,
        "isManOfTheMatch": false,
        "isActive": true,
        "isOpta": true,
        "playedPositions": "-DC-",
        "positionText": "Defender",
        "playedPositionsShort": "D(C)",
        "teamId": 165,
        "teamName": "Ipswich",
        "seasonId": 6365,
        "seasonName": "2016/2017",
        "tournamentId": 7,
        "tournamentRegionId": 252,
        "tournamentRegionCode": "gb-eng",
        "regionCode": "gb-sct",
        "tournamentName": "Championship",
        "tournamentShortName": "EC",
        "rating": 7.4789473684210526,
        "ranking": 9,
        "apps": 19,
        "subOn": 0,
        "minsPlayed": 1710,
        "manOfTheMatch": 3,
        "yellowCard": 4.0,
        "redCard": 0.0,
        "goal": 0,
        "assistTotal": 1,
        "shotsPerGame": 0.94736842105263153,
        "aerialWonPerGame": 6.2105263157894735,
        "passSuccess": 58.636363636363633
    },
    {
        "name": "Adam Webster",
        "firstName": "Adam",
        "lastName": "Webster",
        "playerId": 109922,
        "height": 191,
        "weight": 0,
        "age": 21,
        "isManOfTheMatch": false,
        "isActive": true,
        "isOpta": true,
        "playedPositions": "-DC-",
        "positionText": "Defender",
        "playedPositionsShort": "D(C)",
        "teamId": 165,
        "teamName": "Ipswich",
        "seasonId": 6365,
        "seasonName": "2016/2017",
        "tournamentId": 7,
        "tournamentRegionId": 252,
        "tournamentRegionCode": "gb-eng",
        "regionCode": "gb-eng",
        "tournamentName": "Championship",
        "tournamentShortName": "EC",
        "rating": 7.4780000000000006,
        "ranking": 10,
        "apps": 15,
        "subOn": 1,
        "minsPlayed": 1227,
        "manOfTheMatch": 2,
        "yellowCard": 1.0,
        "redCard": 0.0,
        "goal": 0,
        "assistTotal": 0,
        "shotsPerGame": 0.2,
        "aerialWonPerGame": 5.0666666666666664,
        "passSuccess": 58.256029684601117
    }],
    "paging": {
        "currentPage": 1,
        "totalPages": 34,
        "resultsPerPage": 10,
        "totalResults": 338,
        "firstRecordIndex": 1,
        "lastRecordIndex": 10
    },
    "statColumns": ["apps",
    "subOn",
    "minsPlayed",
    "goal",
    "assistTotal",
    "yellowCard",
    "redCard",
    "shotsPerGame",
    "passSuccess",
    "aerialWonPerGame",
    "manOfTheMatch"]
}

Exactly! So now what? Simulate this request and parse the content. Since it's already JSON-formatted, the built-in json module will do the job easily; you don't even have to use BeautifulSoup.
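
A minimal sketch of the parsing step, assuming the response text has the shape shown above. `SAMPLE` is a trimmed copy of that response (two players, a few fields each), and `summarize` is a name introduced here for illustration.

```python
import json

# Trimmed copy of the response shown above.
SAMPLE = """
{
    "playerTableStats": [
        {"name": "Conor Hourihane", "teamName": "Barnsley",
         "rating": 7.8705882352941181, "goal": 3},
        {"name": "Anthony Knockaert", "teamName": "Brighton",
         "rating": 7.6722222222222216, "goal": 6}
    ],
    "paging": {"currentPage": 1, "totalPages": 34,
               "resultsPerPage": 10, "totalResults": 338}
}
"""


def summarize(raw_text):
    """Parse the feed text and return (rows, total_pages)."""
    data = json.loads(raw_text)
    rows = [
        (player["name"], player["teamName"], round(player["rating"], 2), player["goal"])
        for player in data["playerTableStats"]
    ]
    return rows, data["paging"]["totalPages"]


rows, total_pages = summarize(SAMPLE)
for name, team, rating, goals in rows:
    print(f"{name:20} {team:10} {rating:5.2f} {goals:3d}")
print("pages available:", total_pages)
```

The "paging" block reports how many pages the feed has, so the same function could be applied to each page's response in turn.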

You might ask: how come I get nothing when I open this link directly? That's because the server is set up so that only requests with valid headers get the feed. So how do I get around that? Simulate the browser's request faithfully, with the correct parameters (mostly headers), so that the server accepts it.
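
One way to sketch this with the standard library is below. The header set is an assumption: it mimics what a browser's background request typically sends, and the answer does not say which headers this server actually checks, so copy the real ones from your traffic capture. `build_request` and `fetch_json` are names introduced here for illustration.

```python
import json
import urllib.request


def build_request(feed_url, referer):
    """Build a GET request that carries browser-like headers."""
    # Assumed headers; replace them with the values seen in the captured traffic.
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "X-Requested-With": "XMLHttpRequest",
        "Referer": referer,
    }
    return urllib.request.Request(feed_url, headers=headers)


def fetch_json(request, timeout=10):
    """Send the request and decode the JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Usage would look like `fetch_json(build_request(feed_url, page_url))`, where `feed_url` is the StatisticsFeed link and `page_url` is the player statistics page it belongs to. The site's protection may still refuse the request, in which case compare your headers and cookies against the captured ones.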
