Using rvest to scrape a website - Selecting html n

Posted 2019-06-09 04:28

Question:

I have a question about my latest rvest scrape.

I want to scrape this page (and some other stocks as well): http://www.finviz.com/quote.ashx?t=AA&ty=c&p=d&b=1

I need a list of Market Cap values. Market Cap is the first box in the second row. The list should cover approximately 50-100 stocks.

I am using rvest for that.

library(rvest)

html = read_html("http://www.finviz.com/quote.ashx?t=A")

cast = html_nodes(html, "table-dark-row")

The problem is that I cannot work out what to pass to html_nodes. Any idea how to find the correct node for html_nodes?

I am using firebug/firefinder to check out the webpage.

Answer 1:

Not sure if this is what you want, because I cannot find a list with approx. 50-100 stocks.

But for what it's worth, using SelectorGadget I came up with the selector .table-dark-row:nth-child(2) .snapshot-td2:nth-child(2) to pick out the Market Cap (the first box in the second row of this page: http://www.finviz.com/quote.ashx?t=AA&ty=c&p=d&b=1).

> library(rvest)
> 
> html = read_html("http://www.finviz.com/quote.ashx?t=AA&ty=c&p=d&b=1")
> 
> cast = html_nodes(html, ".table-dark-row:nth-child(2) .snapshot-td2:nth-child(2)")
> cast
{xml_nodeset (1)}
[1] <td width="8%" class="snapshot-td2" align="left">\n  <b>11.58B</b>\n</td>
>
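As a side note on why the selector in the question matched nothing: in CSS, a bare name such as table-dark-row is a type selector (it looks for a <table-dark-row> element), while a leading dot makes it a class selector. The sketch below illustrates this on a small inline HTML snippet rather than the live page; the snippet is made up for illustration and only mimics the class names seen above.

```r
library(rvest)

# Made-up HTML fragment that mimics the class names from the page (illustration only).
page <- read_html(
  '<table><tr class="table-dark-row"><td class="snapshot-td2"><b>11.58B</b></td></tr></table>'
)

# Without the dot, the selector looks for an element NAMED table-dark-row.
n_type <- length(html_nodes(page, "table-dark-row"))

# With the dot, the selector looks for elements whose CLASS is table-dark-row.
n_class <- length(html_nodes(page, ".table-dark-row"))

print(c(type_selector = n_type, class_selector = n_class))
```

The first count is zero and the second is one, which is consistent with the empty result in the question.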

If this is not exactly what you want, just use SelectorGadget to locate what you want.
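If the goal is the same field for many tickers, one possible approach is to wrap the steps above in a function and apply it to a vector of symbols. This is only a sketch: quote_url(), get_market_cap() and scrape_market_caps() are hypothetical helper names, the URL pattern is copied from the question, and the selector is assumed to work on every quote page, which may not hold if the site layout differs or changes.

```r
# Hypothetical helper: build the quote URL for one ticker (pattern taken from the question).
quote_url <- function(ticker) {
  paste0("http://www.finviz.com/quote.ashx?t=", ticker, "&ty=c&p=d&b=1")
}

# Hypothetical helper: return the Market Cap text for one ticker,
# or NA if the page cannot be read or the selector matches nothing.
get_market_cap <- function(ticker,
                           selector = ".table-dark-row:nth-child(2) .snapshot-td2:nth-child(2)") {
  tryCatch({
    page <- rvest::read_html(quote_url(ticker))
    node <- rvest::html_nodes(page, selector)
    if (length(node) == 0) NA_character_ else trimws(rvest::html_text(node)[1])
  }, error = function(e) NA_character_)
}

# Hypothetical helper: scrape several tickers, pausing between requests.
scrape_market_caps <- function(tickers, delay = 1) {
  vapply(tickers, function(ticker) {
    Sys.sleep(delay)  # be polite to the server
    get_market_cap(ticker)
  }, character(1))
}
```

Calling scrape_market_caps(c("AA", "A")) would then return a named character vector with one entry per ticker; replace the two placeholder symbols with your own list.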

Hope this helps.

EDIT :

Here is the complete solution:

library(rvest)

html = read_html("http://www.finviz.com/quote.ashx?t=AA&ty=c&p=d&b=1")

cast = html_nodes(html, ".table-dark-row:nth-child(2) .snapshot-td2:nth-child(2)")

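# Extract the text, drop the "B" (billions) suffix, and convert to a number.
# The result is expressed in billions, e.g. 11.58 for "11.58B".
html_text(cast) %>%
    gsub(pattern = "B", replacement = "") %>%
    as.numeric()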
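Note that removing only "B" leaves the result in billions, and a value with any other suffix would turn into NA. When scraping many stocks, a slightly more general conversion may be useful. The following is a minimal sketch, not part of the original answer: parse_market_cap() is a hypothetical helper, and the K/M/T suffixes are an assumption, since only "B" appears in the output above.

```r
# Hypothetical helper: convert strings such as "11.58B" to plain numbers.
# Assumed suffixes: K (thousand), M (million), B (billion), T (trillion).
parse_market_cap <- function(x) {
  multipliers <- c(K = 1e3, M = 1e6, B = 1e9, T = 1e12)
  x <- trimws(x)
  suffix <- toupper(substr(x, nchar(x), nchar(x)))
  has_suffix <- suffix %in% names(multipliers)
  # Strip the suffix where there is one; anything non-numeric becomes NA.
  digits <- ifelse(has_suffix, substr(x, 1, nchar(x) - 1), x)
  number <- suppressWarnings(as.numeric(digits))
  scale <- ifelse(has_suffix, multipliers[suffix], 1)
  unname(number * scale)
}

caps <- parse_market_cap(c("11.58B", "950.2M", "-"))
print(caps)
```

With this helper, the last step of the solution could become parse_market_cap(html_text(cast)), so every stock ends up in the same unit.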