I am trying to parse this link to search the results. Please select:
- School = All
- Sport = Football
- Conference = All
- Year = 2005-2006
- State = All
The search returns 226 entries, and I would like to parse all 226 of them into a pandas DataFrame with the columns "School", "Conference", "GSR", "FGR", and "State". So far I have been able to parse the table headers, but I cannot get the data rows out of the table. Please advise with code and an explanation.
Note: I am new to Python and BeautifulSoup.
Code I have tried so far:
import requests
import lxml.html as lh
import pandas as pd

url = 'https://web3.ncaa.org/aprsearch/gsrsearch'
# Create a handle, page, to handle the contents of the website
page = requests.get(url)
# Store the contents of the website under doc
doc = lh.fromstring(page.content)
# Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')

# Create empty list
col = []
i = 0
# For each header cell in the first row, store the header text and an empty list
for t in tr_elements[0]:
    i += 1
    name = t.text_content()
    print('%d:"%s"' % (i, name))
    col.append((name, []))

# Since our first row is the header, data is stored from the second row onwards
for j in range(1, len(tr_elements)):
    # T is our j'th row
    T = tr_elements[j]
    # If the row is not of size 10, the //tr data is not from our table
    if len(T) != 10:
        break
    # i is the index of our column
    i = 0
    # Iterate through each element of the row
    for t in T.iterchildren():
        data = t.text_content()
        # The first column is text; for the rest, convert numeric values to integers
        if i > 0:
            try:
                data = int(data)
            except ValueError:
                pass
        # Append the data to the list of the i'th column
        col[i][1].append(data)
        # Increment i for the next column
        i += 1

Dict = {title: column for (title, column) in col}
df = pd.DataFrame(Dict)
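To show what I mean, here is a minimal check (just my own way of inspecting what requests.get returns, using the same url as above); only the header cells come back, which matches what I described:

import requests
import lxml.html as lh

page = requests.get('https://web3.ncaa.org/aprsearch/gsrsearch')
doc = lh.fromstring(page.content)
rows = doc.xpath('//tr')

# Count the <tr> elements found in the static HTML and print the first row's cells
print(len(rows))
print([cell.text_content() for cell in rows[0]])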
You can paste in the headers and payload, then use requests.post(). I'm still learning how to use this properly and I'm not quite sure what's EXACTLY needed (or what counts as "sensitive info", which is why I blacked out some of it... like I said, I'm still learning), but I managed to have it return the JSON. Once you have the JSON, just convert it to a DataFrame.

You can get the headers and payload by doing an "Inspect" of the page, then clicking on the XHR tab (you might need to refresh the page so gsrsearch appears). Click on it and scroll to find the headers and payload. You'll have to add the quotes yourself when you paste them into Python.

Code:
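Something along these lines; this is only a sketch. The header values and the payload keys below (school, sport, conference, year, state) are my guesses based on the search form, so replace them with exactly what the captured gsrsearch request shows:

import requests
import pandas as pd

url = 'https://web3.ncaa.org/aprsearch/gsrsearch'

# Headers copied from DevTools -> Network -> XHR -> gsrsearch.
# The values below are generic placeholders, not the real captured ones.
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'X-Requested-With': 'XMLHttpRequest',
}

# Payload keys are my guesses mirroring the search form (School/Sport/Conference/
# Year/State); use the exact names and values shown in the captured request.
payload = {
    'school': '',          # All
    'sport': 'Football',
    'conference': '',      # All
    'year': '2005-2006',
    'state': '',           # All
}

# Use json=payload instead of data=payload if the captured request sends a JSON body.
response = requests.post(url, headers=headers, data=payload)
response.raise_for_status()

records = response.json()
# If the rows are nested under a key in the JSON, pass that list instead,
# e.g. pd.json_normalize(records['results'])  (the key name here is a guess).
df = pd.json_normalize(records)

# Keep only the columns the question asks for, if the JSON uses these names.
wanted = ['School', 'Conference', 'State', 'GSR', 'FGR']
df = df[[c for c in wanted if c in df.columns]]
print(df.shape)

If response.json() raises an error, compare your headers and payload side by side with the captured XHR request; a missing header or a misnamed field is usually the cause.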
Output: a DataFrame built from the returned JSON, containing the 226 entries described in the question.