I'm using rvest to extract the table in the following page:
https://en.wikipedia.org/wiki/List_of_United_States_presidential_elections_by_popular_vote_margin
The following code works:
library(rvest)
URL <- 'https://en.wikipedia.org/wiki/List_of_United_States_presidential_elections_by_popular_vote_margin'
table <- URL %>%
read_html %>%
html_nodes("table") %>%
.[[2]] %>%
html_table(trim=TRUE)
but the margin and president-name columns have some strange values. The reason is that the page source contains the following:
<td><span style="display:none">00.001</span>−10.44%</td>
so instead of getting -10.44% I get 00.001−10.44%
How could I fix this?
One option is to target and replace the problem columns individually.
The margin columns can be targeted with XPath:
# get the html
html <- URL %>%
read_html()
# Example using the first margin column (column # 6)
html %>%
html_nodes(xpath = '//table[2]') %>% # get table 2
html_nodes(xpath = '//td[6]/text()') %>% # get column 6 using text()
iconv("UTF-8", "UTF-8") # re-encode as UTF-8; the minus sign stays as "−" (U+2212)
# [1] "−10.44%" "−3.00%" "−0.83%" "−0.51%" "0.09%" "0.17%" "0.57%"
# [8] "0.70%" "1.45%" "2.06%" "2.46%" "3.01%" "3.12%" "3.86%"
#[15] "4.31%" "4.48%" "4.79%" "5.32%" "5.56%" "6.05%" "6.12%"
#[22] "6.95%" "7.27%" "7.50%" "7.72%" "8.51%" "8.53%" "9.74%"
#[29] "9.96%" "10.08%" "10.13%" "10.85%" "11.80%" "12.20%" "12.25%"
#[36] "14.20%" "14.44%" "15.40%" "17.41%" "17.76%" "17.81%" "18.21%"
#[43] "18.83%" "22.58%" "23.15%" "24.26%" "25.22%" "26.17%"
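A side note on the XPath used above: in xml2/rvest, an expression starting with `//` searches the whole document even when it is applied to an already selected node, whereas `.//` searches only beneath that node. The sketch below illustrates the difference on a small inline HTML snippet (a made-up stand-in, not the Wikipedia page); `(//table)[2]` is used here to pick the second table in document order.

```r
library(rvest)

# Hypothetical two-table document, used only for illustration
doc <- read_html('<div>
<table><tr><td>a1</td><td>a2</td></tr></table>
<table><tr><td>b1</td><td>b2</td></tr></table>
</div>')

# Select the second table in document order
second <- doc %>% html_nodes(xpath = '(//table)[2]')

# "//" ignores the current node and matches cells in every table
abs_cells <- second %>% html_nodes(xpath = '//td[2]') %>% html_text()
abs_cells
# [1] "a2" "b2"

# ".//" is relative to the selected table only
rel_cells <- second %>% html_nodes(xpath = './/td[2]') %>% html_text()
rel_cells
# [1] "b2"
```

If other tables on a page also have six or more columns, the relative form keeps the result limited to the table you selected.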
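Do the same for the other margin column. I used iconv to sort out the encoding of the − sign, but note that the result still contains the Unicode minus (U+2212), not an ASCII -. If you need a plain hyphen (for example before converting to numbers), use a substitution-based solution instead (e.g. using sub).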
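As a minimal sketch of the substitution-based route, the snippet below replaces the Unicode minus with an ASCII hyphen and converts the result to numbers. The sample vector is copied from the output above; the variable names are only illustrative.

```r
# A few values from the output above; "\u2212" is the Unicode minus sign
margins <- c("\u221210.44%", "\u22123.00%", "0.09%", "26.17%")

# Replace the Unicode minus with an ASCII hyphen (fixed = TRUE: literal match)
ascii <- sub("\u2212", "-", margins, fixed = TRUE)

# Drop the percent sign and convert to numeric
margin_num <- as.numeric(sub("%", "", ascii, fixed = TRUE))
margin_num
# numeric values: -10.44, -3, 0.09, 26.17
```

Without the first substitution, as.numeric would return NA for the negative entries, because it does not recognise U+2212 as a sign.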
To target the column with president names, you can use XPath again:
html %>%
html_nodes(xpath = '//table[2]') %>%
html_nodes(xpath = '//td[3]/a/text()') %>%
html_text()
# [1] "John Quincy Adams" "Rutherford Hayes" "Benjamin Harrison"
# [4] "George W. Bush" "James Garfield" "John Kennedy"
# [7] "Grover Cleveland" "Richard Nixon" "James Polk"
#[10] "Jimmy Carter" "George W. Bush" "Grover Cleveland"
#[13] "Woodrow Wilson" "Barack Obama" "William McKinley"
#[16] "Harry Truman" "Zachary Taylor" "Ulysses Grant"
#[19] "Bill Clinton" "William Henry Harrison" "William McKinley"
#[22] "Franklin Pierce" "Barack Obama" "Franklin Roosevelt"
#[25] "George H. W. Bush" "Bill Clinton" "William Taft"
#[28] "Ronald Reagan" "Franklin Roosevelt" "Abraham Lincoln"
#[31] "Abraham Lincoln" "Dwight Eisenhower" "Ulysses Grant"
#[34] "James Buchanan" "Andrew Jackson" "Martin Van Buren"
#[37] "Woodrow Wilson" "Dwight Eisenhower" "Herbert Hoover"
#[40] "Franklin Roosevelt" "Andrew Jackson" "Ronald Reagan"
#[43] "Theodore Roosevelt" "Lyndon Johnson" "Richard Nixon"
#[46] "Franklin Roosevelt" "Calvin Coolidge" "Warren Harding"
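Another option, sketched here under the assumption that the unwanted values always sit in spans whose style contains display:none (as in the snippet from the question), is to delete those hidden spans from the parsed document before calling html_table. This uses xml_remove from xml2, the package rvest is built on. The inline HTML is a made-up stand-in for the real page, with &#8722; as the entity for the minus sign.

```r
library(rvest)
library(xml2)

# Hypothetical miniature of the Wikipedia table, for illustration only
page <- read_html('<table>
<tr><th>Year</th><th>Margin</th></tr>
<tr><td>1824</td><td><span style="display:none">00.001</span>&#8722;10.44%</td></tr>
<tr><td>1876</td><td><span style="display:none">00.002</span>&#8722;3.00%</td></tr>
</table>')

# Find the hidden sort-key spans and remove them from the document in place
hidden <- xml_find_all(page, '//span[contains(@style, "display:none")]')
xml_remove(hidden)

# The table now parses without the hidden prefixes
tbl <- page %>% html_node("table") %>% html_table(trim = TRUE)
tbl$Margin
```

On the real page you would run the same two removal lines on the result of read_html(URL) and then extract the second table as in the question. This cleans every affected column in one pass, at the cost of modifying the parsed document.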