Using rvest to grab data returns No matches

Posted 2019-01-29 11:22

I'm trying to grab some election results from Politico's website using rvest.

http://www.politico.com/2016-election/results/map/president/wisconsin/

I couldn't pull all the data on the page at once, so I went for a county-level approach. Each county has a unique CSS selector (e.g., Adams County's is '#countyAdams .results-table'). So I grabbed all the county names from elsewhere and set up a quick loop (yes, I know loops are bad practice in R, but I anticipated this method taking me about 3 minutes).

Grab the URL

library(rvest)

wiscoSixteen <- read_html("http://www.politico.com/2016-election/results/map/president/wisconsin")

Create an empty data.frame (and no I didn't pre-define the columns)

stateDf <- NULL

Get the list of counties (this isn't complete, but we don't need all 72 counties to reach the point where the routine breaks)

wiscoCounties <- c("Adams", "Ashland", "Barron", "Bayfield", "Brown", "Buffalo", "Burnett", "Calumet", "Chippewa", "Clark", "Columbia", "Crawford", "Dane", "Dodge", "Door", "Douglas", "Dunn", "Eau Claire", "Florence", "Fond du Lac", "Forest", "Grant", "Green", "Green Lake", "Iowa", "Iron", "Jackson", "Jefferson", "Juneau")

My 'for' loop:

for (i in seq_along(wiscoCounties)){

    #Build the selector for the i'th county, e.g. "#countyAdams .results-table"
    countySelector <- paste0("#county", wiscoCounties[i], " .results-table")
    wiscoResult <- wiscoSixteen %>% html_node(countySelector) %>% html_table()

    #add a column for the county name so I can ID later
    wiscoResult[,4] <- wiscoCounties[i]

    #then rbind 
    stateDf <- rbind(stateDf, wiscoResult)
}

When it gets through the 10th county it stops and returns 'Error: No matches'.

Can't find anything unique about 'Columbia', the 11th county. At a loss for what's happening. I'm sure it's something stupid as that's usually the case. Any help is appreciated.
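
One way to narrow this down is to check which selectors match anything before calling html_table(). The sketch below runs against a small made-up HTML fragment (not Politico's real markup, and the county list is cut down for illustration). It relies on html_nodes(), which returns an empty node set instead of raising an error when nothing matches.

```r
library(rvest)

# Hypothetical stand-in for the real page: only two counties have a table.
page <- read_html('
  <div id="countyAdams"><table class="results-table">
    <tr><th>Candidate</th><th>Votes</th></tr>
    <tr><td>A</td><td>10</td></tr>
  </table></div>
  <div id="countyAshland"><table class="results-table">
    <tr><th>Candidate</th><th>Votes</th></tr>
    <tr><td>A</td><td>20</td></tr>
  </table></div>')

counties  <- c("Adams", "Ashland", "Columbia")
selectors <- paste0("#county", counties, " .results-table")

# html_nodes() yields a zero-length node set when a selector has no match
found <- vapply(selectors,
                function(s) length(html_nodes(page, s)) > 0,
                logical(1))
names(found) <- counties

# Counties whose selector matched nothing in the parsed document
missing_counties <- counties[!found]
```

If a county shows up in missing_counties for the real page, its table is probably not in the HTML that read_html() received. County names containing spaces are worth checking separately, since an id attribute cannot contain whitespace.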

1 answer
冷血范
#2 · 2019-01-29 12:08

So, why not just use the XHR requests that end up populating those tables (I'm kinda surprised you're getting any data at all from them since they get generated from a separate data request):

library(httr)
library(stringi)
library(purrr)
library(dplyr)

res <- GET("http://s3.amazonaws.com/origin-east-elections.politico.com/mapdata/2016/WI_20161108.xml")
dat <- readLines(textConnection(content(res, as="text")))

stri_split_fixed(dat[2], "|")[[1]] %>%
  stri_replace_last_fixed(";", "") %>% 
  stri_split_fixed(";", 3) %>% 
  map_df(~setNames(as.list(.), c("rep_id", "first", "last"))) -> candidates

dat[stri_detect_regex(dat, "^WI;P;G")] %>% 
  stri_replace_first_regex("^WI;P;G;", "") %>% 
  map_df(function(x) {

    county_results <- stri_split_fixed(x, "||", 2)[[1]]

    stri_replace_last_fixed(county_results[1], ";;", "") %>% 
      stri_split_fixed(";") %>% 
      map_df(~setNames(as.list(.), c("fips", "name", "x1", "reporting", "x2", "x3", "x4"))) -> county_prefix

    stri_split_fixed(county_results[2], "|")[[1]] %>% 
      stri_split_fixed(";") %>% 
      map_df(~setNames(as.list(.), c("rep_id", "party", "count", "pct", "x5", "x6", "x7", "x8", "candidate_idx"))) %>% 
      left_join(candidates, by="rep_id") -> df

    df$fips <- county_prefix$fips
    df$name <- county_prefix$name
    df$reporting <- county_prefix$reporting

    select(df, -starts_with("x"))

  }) -> results

It seems to be complete data:

glimpse(results)
## Observations: 511
## Variables: 10
## $ rep_id        <chr> "WI270631108", "WI270621108", "WI270691108", "WI270711108", "WI270701108", "WI270731108", "WI270721108",...
## $ party         <chr> "Dem", "GOP", "Lib", "CST", "ADP", "WW", "Grn", "Dem", "GOP", "Lib", "CST", "ADP", "WW", "Grn", "Dem", "...
## $ count         <chr> "1382210", "1409467", "106442", "12179", "1561", "1781", "30980", "3780", "5983", "207", "44", "4", "9",...
## $ pct           <chr> "46.9", "47.9", "3.6", "0.4", "0.1", "0.1", "1.1", "37.4", "59.2", "2.0", "0.4", "0.0", "0.1", "0.8", "5...
## $ candidate_idx <chr> "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7",...
## $ first         <chr> "Clinton", "Trump", "Johnson", "Castle", "De La Fuente", "Moorehead", "Stein", "Clinton", "Trump", "John...
## $ last          <chr> "Hillary", "Donald", "Gary", "Darrell", "Rocky", "Monica", "Jill", "Hillary", "Donald", "Gary", "Darrell...
## $ fips          <chr> "0", "0", "0", "0", "0", "0", "0", "55001", "55001", "55001", "55001", "55001", "55001", "55001", "55003...
## $ name          <chr> "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin", "Adams", "Ada...
## $ reporting     <chr> "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100....
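
Everything comes back as character, so you'll probably want to convert the numeric columns before doing any math. A minimal sketch on a handful of rows copied from the output above (results_sample and results_typed are just illustrative names; with the real data you'd apply the same mutate() to results):

```r
library(dplyr)

# A few rows copied from the glimpse() output (statewide and Adams County)
results_sample <- tibble(
  party = c("Dem", "GOP", "Lib", "Dem", "GOP"),
  count = c("1382210", "1409467", "106442", "3780", "5983"),
  pct   = c("46.9", "47.9", "3.6", "37.4", "59.2"),
  fips  = c("0", "0", "0", "55001", "55001"),
  name  = c("Wisconsin", "Wisconsin", "Wisconsin", "Adams", "Adams")
)

# Convert vote counts and percentages from character to numeric types
results_typed <- results_sample %>%
  mutate(count = as.integer(count),
         pct   = as.numeric(pct))

# Rows with fips "0" are the statewide totals; drop them to keep counties only
county_only <- filter(results_typed, fips != "0")
```

Dropping the fips "0" rows matters if you plan to sum the counts, since otherwise the statewide total gets counted on top of the counties.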

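Despite the ".xml" extension on the URL, it's not XML data. I also don't know what some of the columns actually are, but you can dig into that. Note that `first` and `last` come out swapped in the output above (`first` holds the surname), so flip those two names in the `setNames()` call for `candidates` if that matters to you. Also, there's a whole other section of data: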

WI;S;G;0;Wisconsin;X;100.0;X;;50885;;||WI269201108;Dem;1380496;46.8;;X;;;1|WI267231108;GOP;1479262;50.2;X;X;X;;2|WI270541108;Lib;87291;3.0;;X;;;3
WI;S;G;55001;Adams;X;100.0;X;;50885;;||WI269201108;Dem;4093;41.2;;X;;;1|WI267231108;GOP;5346;53.9;X;X;X;;2|WI270541108;Lib;486;4.9;;X;;;3
WI;S;G;55003;Ashland;X;100.0;X;;50885;;||WI269201108;Dem;4349;55.1;;X;;;1|WI267231108;GOP;3337;42.2;X;X;X;;2|WI270541108;Lib;214;2.7;;X;;;3
WI;S;G;55005;Barron;X;100.0;X;;50885;;||WI269201108;Dem;8691;38.8;;X;;;1|WI267231108;GOP;12863;57.4;X;X;X;;2|WI270541108;Lib;853;3.8;;X;;;3
WI;S;G;55007;Bayfield;X;100.0;X;;50885;;||WI269201108;Dem;5161;54.6;;X;;;1|WI267231108;GOP;4022;42.6;X;X;X;;2|WI270541108;Lib;263;2.8;;X;;;3
WI;S;G;55009;Brown;X;100.0;X;;50885;;||WI269201108;Dem;51004;40.0;;X;;;1|WI267231108;GOP;71750;56.3;X;X;X;;2|WI270541108;Lib;4615;3.6;;X;;;3
WI;S;G;55011;Buffalo;X;100.0;X;;50885;;||WI269201108;Dem;2746;39.9;;X;;;1|WI267231108;GOP;3850;56.0;X;X;X;;2|WI270541108;Lib;285;4.1;;X;;;3
WI;S;G;55013;Burnett;X;100.0;X;;50885;;||WI269201108;Dem;3143;37.4;;X;;;1|WI267231108;GOP;4998;59.5;X;X;X;;2|WI270541108;Lib;258;3.1;;X;;;3

which obviously means something for that page (the `S` where the presidential rows have `P` suggests it's the Senate race, but I'm so weary from the election that I'm kinda done with the data), and you can process it in a similar fashion to what is above.
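
As a rough sketch of that, here is one of the `S` lines from above split apart with base R. It assumes the fields are laid out the same way as in the `P` rows (that's a guess, not something I've verified), and the column names are the same made-up labels used earlier.

```r
# One line copied verbatim from the "S" section above
line <- "WI;S;G;55001;Adams;X;100.0;X;;50885;;||WI269201108;Dem;4093;41.2;;X;;;1|WI267231108;GOP;5346;53.9;X;X;X;;2|WI270541108;Lib;486;4.9;;X;;;3"

# "||" separates the county prefix from the per-candidate records
parts  <- strsplit(line, "||", fixed = TRUE)[[1]]
prefix <- strsplit(parts[1], ";", fixed = TRUE)[[1]]

# "|" separates candidates; ";" separates the fields within each one
cands  <- strsplit(parts[2], "|", fixed = TRUE)[[1]]
fields <- strsplit(cands, ";", fixed = TRUE)

# Helper to pull the n-th field out of every candidate record
pick <- function(n) vapply(fields, `[`, character(1), n)

senate <- data.frame(
  fips      = prefix[4],
  name      = prefix[5],
  reporting = prefix[7],
  rep_id    = pick(1),
  party     = pick(2),
  count     = as.integer(pick(3)),
  pct       = as.numeric(pick(4)),
  stringsAsFactors = FALSE
)
```

To handle the whole section you'd wrap this in a function and map it over every line that starts with "WI;S;G", same as the presidential rows.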
