I am trying to scrape some rows of player data (tr) from a URL, but nothing appears to happen when I run my code. I am fairly sure the code itself is fine because it works with other statistics websites that contain tables. Can anyone tell me why nothing is happening? Thanks in advance.
import urllib.request
from bs4 import BeautifulSoup

def make_soup(url):
    # Fetch the page and parse the returned HTML
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

soup = make_soup("https://www.whoscored.com/Regions/252/Tournaments/7/Seasons/6365/Stages/13832/PlayerStatistics/England-Championship-2016-2017")

# Print the text of every table row found in the parsed HTML
for record in soup.findAll('tr'):
    print(record.text)
It happens because the website doesn't want you to scrape it.
It is protected by Incapsula, a security service that detects and blocks automated requests (they even publish some information about scraping on their own website; check it out, it's interesting).
This page uses JavaScript to fetch its data; you can find the raw data at this link:
Each field of the URL can be changed to fetch the data you need.
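For illustration, here is one way such a feed URL could be assembled programmatically. The base endpoint and parameter names are taken from the GetPlayerStatistics URL quoted in the answer below, and the values shown are only examples; swap them to fetch a different slice of the data.

from urllib.parse import urlencode

# Assumed endpoint (see the GetPlayerStatistics URL quoted further down).
base = "https://www.whoscored.com/StatisticsFeed/1/GetPlayerStatistics"

# Example query fields; the names mirror the fields of that URL.
params = {
    "category": "summary",
    "subcategory": "all",
    "stageId": 13832,            # England Championship 2016-2017 stage
    "tournamentOptions": 7,
    "sortBy": "Rating",
    "field": "Overall",
    "isMinApp": "true",
    "numberOfPlayersToPick": 10,
}

feed_url = base + "?" + urlencode(params)
print(feed_url)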
Short answer: The player data you are looking for is NOT in that URL.
You might then ask why: "I've seen the data on that page, so how come it's not there?"
So I'll try to explain what happens when you browse that URL with a modern browser such as Chrome.
Every time you browse a web page, the browser does a lot of "behind the scenes" work to display it. Roughly: URL entered >> content fetched from the URL >> content parsed >> additional content fetched >> everything rendered >> page displayed (some steps may happen simultaneously).
Your code only gets as far as "content fetched from the URL". The stats you want happen to be "additional content" that is loaded from elsewhere, and that is why you got nothing.
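You can check this yourself: fetch the raw HTML and count the table rows it actually contains. This is only a quick diagnostic sketch; depending on how aggressively the site blocks non-browser clients, the request may be rejected before you even get HTML back.

import urllib.request
from bs4 import BeautifulSoup

url = ("https://www.whoscored.com/Regions/252/Tournaments/7/Seasons/6365/"
       "Stages/13832/PlayerStatistics/England-Championship-2016-2017")
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# The player rows are injected later by JavaScript, so the raw HTML
# delivered by the server contains few or no <tr> elements.
print(len(soup.find_all("tr")))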
How do you get those stats then? Once you know the URLs responsible for loading them, simply request those URLs directly. How do you find them? You can always read the JavaScript... if you are patient enough...
The easiest way is to analyze the network traffic while the page loads and see all that behind-the-scenes activity. I would recommend Fiddler, but you can use any tool you see fit (the Network panel in your browser's developer tools works too).
Now let's see what happens when you load that page: ![traffic analytics](https://i.stack.imgur.com/MMgFJ.png)
Hundreds of requests are actually made to fully render that page, and all you need to do is find the one that feeds the "actual" or "real" stats. There is one URL with "StatisticsFeed" right in it; could that be the one? Let's take a look:
https://www.whoscored.com/StatisticsFeed/1/GetPlayerStatistics?category=summary&subcategory=all&statsAccumulationType=0&isCurrent=true&playerId=&teamIds=&matchId=&stageId=13832&tournamentOptions=7&sortBy=Rating&sortAscending=&age=&ageComparisonType=&appearances=&appearancesComparisonType=&field=Overall&nationality=&positionOptions=&timeOfTheGameEnd=&timeOfTheGameStart=&isMinApp=true&page=&includeZeroValues=&numberOfPlayersToPick=10
Exactly! So now what? Simulate this request and parse the content. Since the response is already JSON formatted, the built-in json module will do the job easily; you don't even need BeautifulSoup.
You might ask: how come I get nothing when I browse this link directly? That's because the server only serves the feed to requests with valid headers. So how do you get around that? Simulate the request convincingly, with the correct parameters (mostly headers), so the server believes you are a browser.
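Here is a minimal sketch of what that could look like with urllib and the built-in json module. The feed URL is the one above, shortened to its non-empty fields, and the headers (User-Agent, Referer, X-Requested-With) are assumptions about what the server checks; if the response still comes back empty or blocked, copy the exact headers from a real browser request in your traffic analyzer.

import json
import urllib.request

# Feed URL from above, keeping only the non-empty query fields.
feed_url = (
    "https://www.whoscored.com/StatisticsFeed/1/GetPlayerStatistics"
    "?category=summary&subcategory=all&statsAccumulationType=0&isCurrent=true"
    "&stageId=13832&tournamentOptions=7&sortBy=Rating&field=Overall"
    "&isMinApp=true&numberOfPlayersToPick=10"
)

request = urllib.request.Request(
    feed_url,
    headers={
        # Assumed headers: pretend to be a browser making an AJAX call
        # from the player statistics page.
        "User-Agent": "Mozilla/5.0",
        "X-Requested-With": "XMLHttpRequest",
        "Referer": "https://www.whoscored.com/Regions/252/Tournaments/7/"
                   "Seasons/6365/Stages/13832/PlayerStatistics/"
                   "England-Championship-2016-2017",
    },
)

with urllib.request.urlopen(request) as response:
    data = json.loads(response.read().decode("utf-8"))

# The exact JSON structure is not documented here, so inspect the top
# level first and drill down to the player rows from there.
print(data if not isinstance(data, dict) else list(data.keys()))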