I am trying to scrape some rows of player data (tr) from a URL, but nothing appears to happen when I run my code. I am fairly sure the code itself is fine because it works with other statistics websites that contain tables. Can anyone tell me why nothing is happening? Thanks in advance.
import urllib.request
from bs4 import BeautifulSoup

def make_soup(url):
    # Fetch the page and parse the returned HTML
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

soup = make_soup("https://www.whoscored.com/Regions/252/Tournaments/7/Seasons/6365/Stages/13832/PlayerStatistics/England-Championship-2016-2017")

# Print the text of every table row found in the parsed HTML
for record in soup.findAll('tr'):
    print(record.text)
It happens because the website doesn't want you to scrape it.
It is protected by Incapsula, a security service that detects and blocks automated requests (they even publish some information about scraping on their own website; check it out, it's interesting).
This page uses JavaScript to fetch its data; you can find the raw data at this link:
Each field of the URL can be changed to fetch the data you need.
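For illustration, here is one way such a feed URL could be assembled programmatically. The base endpoint and parameter names are taken from the GetPlayerStatistics URL quoted in the answer below, and the values shown are only examples; swap them to fetch a different slice of the data.

from urllib.parse import urlencode

# Assumed endpoint (see the GetPlayerStatistics URL quoted further down).
base = "https://www.whoscored.com/StatisticsFeed/1/GetPlayerStatistics"

# Example query fields; the names mirror the fields of that URL.
params = {
    "category": "summary",
    "subcategory": "all",
    "stageId": 13832,            # England Championship 2016-2017 stage
    "tournamentOptions": 7,
    "sortBy": "Rating",
    "field": "Overall",
    "isMinApp": "true",
    "numberOfPlayersToPick": 10,
}

feed_url = base + "?" + urlencode(params)
print(feed_url)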
Short answer: The player data you are looking for is NOT in that URL.
You might then ask why: "I've seen the data on that page, so how come it's not there?"
So I'll try to explain what happens when you browse that URL with a modern browser such as Chrome.
Every time you browse a web page, the browser does a lot of "behind the scenes" work to display it. Roughly: URL entered >> content fetched from the URL >> content parsed >> additional content fetched >> everything rendered >> page displayed (some steps may happen simultaneously).
Your code only gets as far as "content fetched from the URL". The stats you want happen to be "additional content" that is loaded from elsewhere, and that is why you got nothing.
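You can check this yourself: fetch the raw HTML and count the table rows it actually contains. This is only a quick diagnostic sketch; depending on how aggressively the site blocks non-browser clients, the request may be rejected before you even get HTML back.

import urllib.request
from bs4 import BeautifulSoup

url = ("https://www.whoscored.com/Regions/252/Tournaments/7/Seasons/6365/"
       "Stages/13832/PlayerStatistics/England-Championship-2016-2017")
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# The player rows are injected later by JavaScript, so the raw HTML
# delivered by the server contains few or no <tr> elements.
print(len(soup.find_all("tr")))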
How do you get those stats then? Once you know the URLs responsible for loading them, simply request those URLs directly. How do you find them? You can always read the JavaScript... if you are patient enough...
The easiest way is to analyze the network traffic while the page loads and see all that behind-the-scenes activity. I would recommend Fiddler, but you can use any tool you see fit (the Network panel in your browser's developer tools works too).
Now let's see what happens when you load that page: ![traffic analytics](https://i.stack.imgur.com/MMgFJ.png)
Hundreds of requests are actually made to fully render that page, and all you need to do is find the one that feeds the "actual" or "real" stats. There is one URL with "StatisticsFeed" right in it; could that be the one? Let's take a look:
https://www.whoscored.com/StatisticsFeed/1/GetPlayerStatistics?category=summary&subcategory=all&statsAccumulationType=0&isCurrent=true&playerId=&teamIds=&matchId=&stageId=13832&tournamentOptions=7&sortBy=Rating&sortAscending=&age=&ageComparisonType=&appearances=&appearancesComparisonType=&field=Overall&nationality=&positionOptions=&timeOfTheGameEnd=&timeOfTheGameStart=&isMinApp=true&page=&includeZeroValues=&numberOfPlayersToPick=10
Exactly! So now what? Simulate this request and parse the content. Since the response is already JSON formatted, the built-in json module will do the job easily; you don't even need BeautifulSoup.
You might ask: how come I get nothing when I browse this link directly? That's because the server only serves the feed to requests with valid headers. So how do you get around that? Simulate the request convincingly, with the correct parameters (mostly headers), so the server believes you are a browser.
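Here is a minimal sketch of what that could look like with urllib and the built-in json module. The feed URL is the one above, shortened to its non-empty fields, and the headers (User-Agent, Referer, X-Requested-With) are assumptions about what the server checks; if the response still comes back empty or blocked, copy the exact headers from a real browser request in your traffic analyzer.

import json
import urllib.request

# Feed URL from above, keeping only the non-empty query fields.
feed_url = (
    "https://www.whoscored.com/StatisticsFeed/1/GetPlayerStatistics"
    "?category=summary&subcategory=all&statsAccumulationType=0&isCurrent=true"
    "&stageId=13832&tournamentOptions=7&sortBy=Rating&field=Overall"
    "&isMinApp=true&numberOfPlayersToPick=10"
)

request = urllib.request.Request(
    feed_url,
    headers={
        # Assumed headers: pretend to be a browser making an AJAX call
        # from the player statistics page.
        "User-Agent": "Mozilla/5.0",
        "X-Requested-With": "XMLHttpRequest",
        "Referer": "https://www.whoscored.com/Regions/252/Tournaments/7/"
                   "Seasons/6365/Stages/13832/PlayerStatistics/"
                   "England-Championship-2016-2017",
    },
)

with urllib.request.urlopen(request) as response:
    data = json.loads(response.read().decode("utf-8"))

# The exact JSON structure is not documented here, so inspect the top
# level first and drill down to the player rows from there.
print(data if not isinstance(data, dict) else list(data.keys()))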