Extracting an HTML table from a website in R

Posted 2019-01-27 07:00

Hi, I am trying to extract a table from the Fantasy Premier League website.

I am using the rvest package, and my initial code is as follows:

library(rvest)
library(magrittr)
premierleague <- read_html("https://fantasy.premierleague.com/a/entry/767830/history")
premierleague %>% html_nodes("ism-table")

I couldn't find an HTML tag or selector that html_nodes() from rvest could use to extract the table.

I used a similar approach to extract data from "http://admissions.calpoly.edu/prospective/profile.html" and it worked. The code I used for Cal Poly is as follows:

library(rvest)
library(magrittr)
CPadmissions <- read_html("http://admissions.calpoly.edu/prospective/profile.html")

CPadmissions %>% html_nodes("table") %>%
  .[[1]] %>%
  html_table()

I got the code above from this YouTube video: https://www.youtube.com/watch?v=gSbuwYdNYLM&ab_channel=EvanO%27Brien

Any help getting data from fantasy.premierleague.com is highly appreciated. Do I need to use some kind of API?

2 answers
啃猪蹄的小仙女
#2 · 2019-01-27 07:47

Since the data is loaded with JavaScript, grabbing the HTML with rvest will not get you what you want, but if you use PhantomJS as a headless browser within RSelenium, it's not all that complicated (by RSelenium standards):

library(RSelenium)
library(rvest)

# initialize browser and driver with RSelenium
ptm <- phantom()
rd <- remoteDriver(browserName = 'phantomjs')
rd$open()

# grab source for page
rd$navigate('https://fantasy.premierleague.com/a/entry/767830/history')
html <- rd$getPageSource()[[1]]

# clean up
rd$close()
ptm$stop()

# parse with rvest
df <- html %>% read_html() %>% 
    html_node('#ismr-event-history table.ism-table') %>% 
    html_table() %>% 
    setNames(gsub('\\S+\\s+(\\S+)', '\\1', names(.))) %>%    # clean column names
    setNames(gsub('\\s', '_', names(.)))

str(df)
## 'data.frame':    20 obs. of  10 variables:
##  $ Gameweek                : chr  "GW1" "GW2" "GW3" "GW4" ...
##  $ Gameweek_Points         : int  34 47 53 51 66 66 65 63 48 90 ...
##  $ Points_Bench            : int  1 6 9 7 14 2 9 3 8 2 ...
##  $ Gameweek_Rank           : chr  "2,406,373" "2,659,789" "541,258" "905,524" ...
##  $ Transfers_Made          : int  0 0 2 0 3 2 2 0 2 0 ...
##  $ Transfers_Cost          : int  0 0 0 0 4 4 4 0 0 0 ...
##  $ Overall_Points          : chr  "34" "81" "134" "185" ...
##  $ Overall_Rank            : chr  "2,406,373" "2,448,674" "1,914,025" "1,461,665" ...
##  $ Value                   : chr  "£100.0" "£100.0" "£99.9" "£100.0" ...
##  $ Change_Previous_Gameweek: logi  NA NA NA NA NA NA ...

As always, more cleaning is necessary, but overall it's in pretty good shape without too much work. (If you're using the tidyverse, df %>% mutate_if(is.character, parse_number) will do pretty well.) The arrows in the last column are images rather than text, which is why that column is all NA, but you can calculate the changes yourself from the other columns.
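For anyone who prefers base R, here is a minimal sketch of that cleaning step. It uses only four sample rows copied from the str() output above, not the scraped table itself. The helper to_number and the Rank_Change column are names I made up for illustration, and I'm assuming the arrows on the site reflect the change in overall rank, which I haven't verified.

```r
# Sample rows copied from the str() output above; the real table has 20 rows.
df <- data.frame(
  Gameweek       = c("GW1", "GW2", "GW3", "GW4"),
  Overall_Points = c("34", "81", "134", "185"),
  Overall_Rank   = c("2,406,373", "2,448,674", "1,914,025", "1,461,665"),
  Value          = c("£100.0", "£100.0", "£99.9", "£100.0"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop everything except digits and the decimal point
# (thousands separators, the currency sign), then convert to numeric.
to_number <- function(x) as.numeric(gsub("[^0-9.]", "", x))

num_cols <- c("Overall_Points", "Overall_Rank", "Value")
df[num_cols] <- lapply(df[num_cols], to_number)

# Assumed meaning of the arrow column: change in overall rank versus the
# previous gameweek. A negative value means the rank number got smaller.
df$Rank_Change <- c(NA, diff(df$Overall_Rank))

str(df)
```

Unlike parse_number applied to every character column, this leaves Gameweek as text ("GW1", "GW2", ...) and only converts the columns listed in num_cols.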

女痞
#3 · 2019-01-27 08:01

This solution uses RSelenium along with the XML package. It assumes you have a working installation of RSelenium that can drive Firefox, so make sure the directory containing the Firefox starter script is on your PATH.

If you are using OS X, you will need to add /Applications/Firefox.app/Contents/MacOS/ to your PATH. If you're on an Ubuntu machine, it's likely /usr/lib/firefox/. Once you're sure this is working, you can move on to R with the following:

# Install RSelenium and XML for R
#install.packages("RSelenium")
#install.packages("XML")

# Import packages
library(RSelenium)
library(XML)

# Check and start servers for Selenium
checkForServer()
startServer()

# Use firefox as a browser and a port that's not used
remote_driver <- remoteDriver(browserName = "firefox", port = 4444)
remote_driver$open(silent = TRUE)

# Use RSelenium to browse the site
epl_link <- "https://fantasy.premierleague.com/a/entry/767830/history"
remote_driver$navigate(epl_link)
elem <- remote_driver$findElement(using = "class name", value = "ism-table")

# Get the HTML source
elemtxt <- elem$getElementAttribute("outerHTML")

# Use the XML package to work with the HTML source
elem_html <- htmlTreeParse(elemtxt, useInternalNodes = TRUE, asText = TRUE)

# Convert the table into a dataframe
games_table <- readHTMLTable(elem_html, header = TRUE, stringsAsFactors = FALSE)[[1]]

# Change the column names into something legible
names(games_table) <- unlist(lapply(strsplit(names(games_table), split = "\\n\\s+"), function(x) x[2]))
names(games_table) <- gsub("£", "Value", gsub("#", "CPW", gsub("Â","",names(games_table))))

# Convert the fields into numeric values
games_table <- transform(games_table, GR = as.numeric(gsub(",","",GR)),
                    OP = as.numeric(gsub(",","",OP)),
                    OR = as.numeric(gsub(",","",OR)),
                    Value = as.numeric(gsub("£","",Value)))

This should yield:

GW   GP PB      GR TM TC   OP      OR Value CPW
GW1  34  1 2406373  0  0   34 2406373 100.0
GW2  47  6 2659789  0  0   81 2448674 100.0
GW3  53  9  541258  2  0  134 1914025  99.9
GW4  51  7  905524  0  0  185 1461665 100.0
GW5  66 14  379438  3  4  247  958889 100.1
GW6  66  2  303704  2  4  309  510376  99.9
GW7  65  9  138792  2  4  370  232474  99.8
GW8  63  3  108363  0  0  433   87967 100.4
GW9  48  8 1114609  2  0  481   75385 100.9
GW10 90  2   71210  0  0  571   27716 101.1
GW11 71  2  421706  3  4  638   16083 100.9
GW12 35  9 2798661  2  4  669   31820 101.2
GW13 41  8 2738535  1  0  710   53487 101.1
GW14 82 15  308725  0  0  792   29436 100.2
GW15 55  9 1048808  2  4  843   29399 100.6
GW16 49  8 1801549  0  0  892   35142 100.7
GW17 48  4 2116706  2  0  940   40857 100.7
GW18 42  2 3315031  0  0  982   78136 100.8
GW19 41  9 2600618  0  0 1023   99048 100.6
GW20 53  0 1644385  0  0 1076  113148 100.8

Please note that the column CPW (change from previous week) is a vector of empty strings.

I hope this helps.
