Loop to scrape data from Wikipedia in R

Posted 2020-04-11 18:17

I am trying to extract data about celebrity/notable deaths for analysis. Wikipedia uses a very regular URL structure for its pages listing notable deaths by month. It looks like:

https://en.wikipedia.org/wiki/Deaths_in_"MONTH"_"YEAR"

For example, this link leads to the notable deaths in March, 2014.

https://en.wikipedia.org/wiki/Deaths_in_March_2014
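As a minimal sketch of that pattern, the URLs can be generated with paste0(). The helper name death_url is hypothetical and not part of any package. Note that paste() separates its arguments with a space by default, and collapse only controls how the elements of a vector result are joined, so paste0() (or sep = "") is the more predictable choice when building a URL.

```r
# Hypothetical helper: build the page URL for a given month name and year.
# paste0() concatenates its arguments with no separator.
death_url <- function(month, year) {
  paste0("https://en.wikipedia.org/wiki/Deaths_in_", month, "_", year)
}

# One page
death_url("March", 2014)

# All twelve pages for 2015; month.name is R's built-in vector of month names
urls_2015 <- death_url(month.name, 2015)

# For comparison: paste() with its default sep = " " leaves spaces in the result
spaced <- paste("Deaths_in_", "March", "_", 2014, collapse = "")
```

Here death_url("March", 2014) reproduces the example link above, while spaced contains spaces around the month name.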

I have located the CSS selector for the lists I need, "#mw-content-text h3+ ul li", and extracted them successfully for a specific link. Now I'm trying to write a loop that goes through the months of any years I choose. I think it's a pretty straightforward nested loop, but I'm getting errors when testing it on 2015 alone.

library(rvest)
data = data.frame()
 mlist = c("January","February","March","April","May","June","July","August",
              "September","October","November","December")

for (y in 2015:2015){
  for (m in 1:12){
    site = read_html(paste("https://en.wikipedia.org/wiki/Deaths_in_",mlist[m],
           "_",y,collapse=""))
    fnames = html_nodes(site,"#mw-content-text h3+ ul li")
    text = html_text(fnames)
    data = rbind(data,text,stringsAsFactors=FALSE)
      }
 }

When I comment out the line:

data = rbind(data,text,stringsAsFactors=FALSE)

no errors are returned so it's clearly related to this bit. I am posting my whole code for other comments as well. The goal here is to loop through many years and then focus on the distribution over the years and months. For this I just need to keep the age, month, and year of death.

Thank you!

EDIT: Sorry, they are technically warnings, not errors. I get over 50 of them and when I try to look at "data" it is a giant mess.

When I run this code not as a loop on one specific URL, it works fine and returns a readable output.

site = read_html("https://en.wikipedia.org/wiki/Deaths_in_January_2015")
fnames = html_nodes(site,"#mw-content-text h3+ ul li")
text = html_text(fnames)

Here are a couple of rows from that data set:

text[1:5]
[1] "Barbara Atkinson, 88, British actress (Z-Cars).[1]"                                         
[2] "Staryl C. Austin, 94, American air force brigadier general.[2]"                             
[3] "Ulrich Beck, 70, German sociologist, heart attack.[3]"                                      
[4] "Fiona Cumming, 77, British television director (Doctor Who).[4]"                            
[5] "Eric Cunningham, 65, Canadian politician, Ontario MPP for Wentworth North (1975–1984).[5]"
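Since the goal is to keep only the age, month, and year, here is a rough sketch of pulling the age out of entries shaped like the rows above. It assumes each entry follows a "Name, age, description" layout with no comma inside the name; entries that do not fit (for example names containing "Jr.," or people with no listed age) simply get NA. The fourth entry and the variable names are made up for illustration.

```r
# First three rows are copied from the output above; the last one is a
# made-up entry with no age, to show the fallback to NA.
text <- c(
  "Barbara Atkinson, 88, British actress (Z-Cars).[1]",
  "Staryl C. Austin, 94, American air force brigadier general.[2]",
  "Ulrich Beck, 70, German sociologist, heart attack.[3]",
  "Unknown Person, British actor."
)

# Name: everything before the first comma
name <- sub(",.*$", "", text)

# Age: 1-3 digits sitting between the first and second comma
pattern <- "^[^,]*, *([0-9]{1,3}),.*$"
hit <- grepl(pattern, text)
age <- rep(NA_integer_, length(text))
age[hit] <- as.integer(sub(pattern, "\\1", text[hit]))

# Month and year come from the page that was scraped, not from the text
deaths <- data.frame(name = name, age = age, month = "January", year = 2015L,
                     stringsAsFactors = FALSE)
```

Only entries that match the pattern are converted, so no coercion warnings are raised for the ones that do not.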

2 Answers
唯我独甜
#2 · 2020-04-11 19:00

I wasn't able to get the same error that you got, but I think I know what you want to do.

I have a feeling this has something to do with the unequal number of deaths in each month.

I'd suggest doing it this way:

mlist = c("January","February","March","April","May","June","July","August",
      "September","October","November","December")

for (y in 2015:2015){
  for (m in 1:12){
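    # paste0() joins the pieces without the spaces that paste() inserts by default
    site = read_html(paste0("https://en.wikipedia.org/wiki/Deaths_in_",mlist[m],
                       "_",y))
    fnames = html_nodes(site,"#mw-content-text h3+ ul li")
    text = html_text(fnames)
    # store this month's entries in a variable named after the month
    assign(mlist[m],text)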
  }
}

This creates one character vector per month, each stored in a variable named after that month (January, February, and so on).

An alternative (for easier use later in a loop to join them) is to use a list:

data = vector("list",12)
mlist = c("January","February","March","April","May","June","July","August",
      "September","October","November","December")

for (y in 2015:2015){
  for (m in 1:12){
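    # paste0() joins the pieces without the spaces that paste() inserts by default
    site = read_html(paste0("https://en.wikipedia.org/wiki/Deaths_in_",mlist[m],
                       "_",y))
    fnames = html_nodes(site,"#mw-content-text h3+ ul li")
    text = html_text(fnames)
    # slot m of the list holds that month's entries
    data[[m]] = text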
  }
}

Personally, I don't like dealing with lists in R, but this seems to be the best workaround.
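If the list form is used, it can later be flattened into a single data frame with a month column. The sketch below uses a small made-up list in place of the scraped one, so the entries and the names data and deaths are purely illustrative.

```r
# Toy stand-in for the list built by the loop (made-up entries, not scraped data)
data <- list(
  c("Person A, 88, actress.", "Person B, 70, sociologist."),
  c("Person C, 65, politician.")
)
names(data) <- c("January", "February")

# Repeat each month name once per entry, then unlist the entries themselves
deaths <- data.frame(
  month = rep(names(data), lengths(data)),
  text  = unlist(data, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Number of entries per month
counts <- table(deaths$month)
```

Because the loop indexes the list by m alone, running it over more than one year would overwrite earlier years. Indexing by a combined key such as paste(y, mlist[m]) is one way around that.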

成全新的幸福
#3 · 2020-04-11 19:06

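html_text(fnames) returns a character vector. Your problem is that you are trying to append that vector directly onto a data frame: rbind() treats a bare vector as a single row, and each month has a different number of entries.
Try converting your variable text to a data frame before appending: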

for (y in 2015:2015){
  for (m in 1:12){
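    # paste0() joins the pieces without the spaces that paste() inserts by default
    site = read_html(paste0("https://en.wikipedia.org/wiki/Deaths_in_",mlist[m],
           "_",y))
    fnames = html_nodes(site,"#mw-content-text h3+ ul li")
    text = html_text(fnames)

    # wrap the character vector in a one-column data frame so rbind() appends rows
    temp<-data.frame(text, stringsAsFactors = FALSE)

    data = rbind(data,temp)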
    }
 } 

This is not the best technique for performance reasons. Each time through the loop, the memory for the data frame is reallocated, which slows things down. Since this is a one-time job with a limited number of requests, it should be manageable in this case.
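One common way to avoid growing a data frame inside the loop is to collect one small data frame per page in a preallocated list and bind them once at the end. The following is only a sketch under that assumption: fetch_month and collect_deaths are hypothetical helper names, and the fetcher is passed in as an argument so the logic can be tried offline with a stand-in function.

```r
# Hypothetical fetcher: download one page and return its list items as text.
# Uses the same rvest calls and CSS selector as the code above.
fetch_month <- function(month, year) {
  url  <- paste0("https://en.wikipedia.org/wiki/Deaths_in_", month, "_", year)
  site <- rvest::read_html(url)
  rvest::html_text(rvest::html_nodes(site, "#mw-content-text h3+ ul li"))
}

# Hypothetical collector: one data frame per page, bound together once.
collect_deaths <- function(years, months = month.name, fetch = fetch_month) {
  grid   <- expand.grid(month = months, year = years, stringsAsFactors = FALSE)
  pieces <- vector("list", nrow(grid))   # preallocate one slot per page
  for (i in seq_len(nrow(grid))) {
    text <- fetch(grid$month[i], grid$year[i])
    pieces[[i]] <- data.frame(year  = rep(grid$year[i], length(text)),
                              month = rep(grid$month[i], length(text)),
                              text  = text,
                              stringsAsFactors = FALSE)
  }
  do.call(rbind, pieces)                 # single bind at the end
}

# Real use (performs 12 HTTP requests):
# deaths <- collect_deaths(2015)

# Offline check with a stand-in fetcher that returns two fake entries per page
fake_fetch <- function(month, year) paste(month, year, c("entry 1", "entry 2"))
demo <- collect_deaths(2014:2015, months = c("January", "February"),
                       fetch = fake_fetch)
```

Keeping year and month as columns also makes the later analysis of the distribution over months and years more direct.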
