I'm trying to scrape website data with Excel V

Posted 2019-08-20 17:15

I will start by saying that I am very new to VBA. I am trying to extract data from the table on this page. I haven't accomplished much as far as the code goes, so take it easy on me. I am looking for some direction on how to approach this and whether it can be done, which I believe it can. If anyone is able to point me in the right direction, that would be much appreciated.

Sub rgnbateamstats()

    Dim appIE As Object
    Dim allRowOfData As Object

    Set appIE = CreateObject("internetexplorer.application")

    With appIE
        .navigate "https://rotogrinders.com/team-stats/nba-earned?site=draftkings"
        .Visible = True
    End With

    Do While appIE.Busy
        DoEvents
    Loop

    Set allRowOfData = appIE.document.getElementById("proj-stats")

End Sub

Not really sure where to go from here or if I'm even on the right track.

3 answers
来,给爷笑一个
#2 · 2019-08-20 17:24

Whilst the table layout may appear a little odd, it just needs a 180-degree change in thinking. You can grab the columns by class name and then simply loop the rows within each column, rather than the usual approach of looping rows and then columns.

I use a CSS class selector with querySelectorAll to grab the columns, i.e. I target the columns by their class name. This returns a NodeList containing each of the columns. Below is an example of the first two columns (admittedly, abbr is not visible). Each row within a column is a div, so as I loop the columns I get the rows within each column by grabbing the associated div tag collection. Then I simply loop those to write the values out.

[Image: example of the first two columns]

Option Explicit

Public Sub GetInfo()
    ' Requires a reference to Microsoft Internet Controls (early binding).
    Dim IE As New InternetExplorer, iColumns As Object, iRow As Object
    Dim i As Long, j As Long, r As Long, c As Long

    Application.ScreenUpdating = False
    With IE
        .Visible = True
        .navigate "https://rotogrinders.com/team-stats/nba-earned?site=draftkings"

        ' Wait until the page has finished loading (4 = READYSTATE_COMPLETE).
        While .Busy Or .readyState < 4: DoEvents: Wend

        ' Each element with the class "rgt-col" is one column of the table.
        Set iColumns = .document.querySelectorAll(".rgt-col")

        With ThisWorkbook.Worksheets("Sheet1")
            For i = 0 To iColumns.Length - 1
                c = c + 1: r = 0
                ' Each div inside a column is one row of that column.
                Set iRow = iColumns.item(i).getElementsByTagName("div")
                For j = 0 To iRow.Length - 1
                    r = r + 1
                    .Cells(r, c) = iRow(j).innerText
                Next
            Next
        End With
        Application.ScreenUpdating = True
        .Quit
    End With
End Sub

References:

VBE > Tools > References > Microsoft Internet Controls

Or change to late bound with:

Dim ie As Object
Set ie = CreateObject("InternetExplorer.Application")
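
If you go late bound, keep in mind that named constants such as READYSTATE_COMPLETE come from the referenced library, so without the reference you have to supply the numeric value (4) yourself. Below is a minimal sketch of the same routine written late bound. The procedure name GetInfoLateBound and the constant READY_COMPLETE are just illustrative names, and the sketch assumes the page structure described above.

```vba
Option Explicit

' Late-bound sketch: no reference to Microsoft Internet Controls is needed.
' GetInfoLateBound and READY_COMPLETE are illustrative names, not part of the original answer.
Public Sub GetInfoLateBound()
    Const READY_COMPLETE As Long = 4   ' numeric value of READYSTATE_COMPLETE
    Dim ie As Object, iColumns As Object, iRow As Object
    Dim i As Long, j As Long

    Set ie = CreateObject("InternetExplorer.Application")
    With ie
        .Visible = True
        .navigate "https://rotogrinders.com/team-stats/nba-earned?site=draftkings"

        ' Wait until the page has finished loading.
        Do While .Busy Or .readyState < READY_COMPLETE
            DoEvents
        Loop

        ' One element per column; each div inside it is one row.
        Set iColumns = .document.querySelectorAll(".rgt-col")
        For i = 0 To iColumns.Length - 1
            Set iRow = iColumns.item(i).getElementsByTagName("div")
            For j = 0 To iRow.Length - 1
                ThisWorkbook.Worksheets("Sheet1").Cells(j + 1, i + 1).Value = iRow.item(j).innerText
            Next j
        Next i
        .Quit
    End With
End Sub
```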
别忘想泡老子
#3 · 2019-08-20 17:28

This will grab the entire table on that page.

This project uses early-binding. You will need to set references to:

  • Microsoft Internet Controls
  • Microsoft HTML Object Library

You can accomplish this within the VBE > Tools > References.

I will say, this site uses a very strange method of setting up its tables, and it was interesting to figure out a decent way to accomplish this.

Another thing that you may or may not be okay with: there are hidden columns in this table that don't show on the site but will show in your Excel document. If you are not okay with this, you can simply remove or hide them after this code has run. Or, if you're up to modifying the code so they are skipped during execution, more power to you.
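
As a rough sketch of that clean-up step (to be run after the main routine), the helper below deletes columns by their header text. It assumes the first row written to the sheet holds the column headers. The name DeleteColumnsByHeader is hypothetical, and you would need to check the sheet to find out which headers belong to the hidden columns.

```vba
' Hypothetical clean-up helper: deletes every column whose row-1 header
' matches one of the names in unwanted (a one-dimensional array of strings).
' Assumes row 1 of the sheet contains the column headers.
Public Sub DeleteColumnsByHeader(ByVal ws As Worksheet, ByVal unwanted As Variant)
    Dim c As Long, lastCol As Long

    lastCol = ws.Cells(1, ws.Columns.Count).End(xlToLeft).Column

    ' Work right to left so a deletion never shifts a column not yet checked.
    For c = lastCol To 1 Step -1
        If Not IsError(Application.Match(CStr(ws.Cells(1, c).Value), unwanted, 0)) Then
            ws.Columns(c).Delete
        End If
    Next c
End Sub
```

It could be called as DeleteColumnsByHeader ThisWorkbook.Worksheets(1), Array("abbr"), where "abbr" is only a guess at one hidden column's header.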

Option Explicit

Sub rgnbateamstats()

    Const url$ = "https://rotogrinders.com/team-stats/nba-earned?site=draftkings"

    Dim IE As New InternetExplorer, doc As HTMLDocument
    Dim ws As Worksheet
    Set ws = ThisWorkbook.Worksheets(1)

    With IE
        .Navigate url
        .Visible = True
        ieBusy IE
        Set doc = .Document
    End With

    Dim r As Long, c As Long, tCol As HTMLDivElement
    Dim subTbls(): subTbls = Array("rgt-bdy left", "rgt-bdy mid", "rgt-bdy right")

    Dim subTbl As Long
    For subTbl = 0 To 2
        For Each tCol In getSubTblCols(doc, subTbls(subTbl)).getElementsByClassName("rgt-col")
            c = c + 1
            For r = 1 To tCol.getElementsByTagName("div").Length
                ws.Cells(r, c) = tCol.getElementsByTagName("div")(r - 1).innerText
            Next
        Next tCol
    Next subTbl

End Sub

Private Function getSubTblCols(doc As HTMLDocument, ByVal className$) As HTMLDivElement
    Dim tbl As HTMLTable
    Set tbl = doc.getElementById("proj-stats")
    Set getSubTblCols = tbl.getElementsByClassName(className)(0).Children(0). _
            Children(1)
End Function

Private Sub ieBusy(ieObj As InternetExplorer)
    With ieObj
        Do While .Busy Or .ReadyState < READYSTATE_COMPLETE
            DoEvents
        Loop
    End With
End Sub

Okay, time to attempt to explain what's going on here.

There are three sub-tables in your table. That is the best way I can describe it, and it means you first loop through each sub-table with this line:

For subTbl = 0 To 2

Within that loop, you will loop that sub-table's columns with this line:

For Each tCol In getSubTblCols(doc, subTbls(subTbl)).getElementsByClassName("rgt-col")

rgt-col is the class name for the column in each table, so at least that part was easy. The function getSubTblCols returns the element that holds the columns of one sub-table, which it finds using one of the three sub-table class names stored in the array subTbls().

c is your Excel column number and r is the row number. You also use r to index the HTML rows, but that collection is zero-based, so you have to subtract 1.

Then get the cell's value using the innerText property of the cell, place that into your spreadsheet, then rinse and repeat.

I moved your busy-webpage loop to a new sub, ieBusy. I also added the .readyState check because, as I stated in my comment, .busy by itself is unreliable at best.
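
One caveat with a wait loop like ieBusy is that it never gives up if the page stalls. Below is a minimal sketch of a variant with a timeout. The name WaitForIE and the 30-second default are assumptions for illustration only, and 4 is the numeric value of READYSTATE_COMPLETE.

```vba
' Hypothetical variant of ieBusy: returns True once the page has loaded,
' or False if timeoutSeconds elapse first.
Private Function WaitForIE(ByVal ieObj As Object, _
                           Optional ByVal timeoutSeconds As Long = 30) As Boolean
    Dim startTime As Date
    startTime = Now

    Do While ieObj.Busy Or ieObj.readyState < 4   ' 4 = READYSTATE_COMPLETE
        DoEvents
        ' Give up (function result stays False) once the timeout has passed.
        If DateDiff("s", startTime, Now) > timeoutSeconds Then Exit Function
    Loop

    WaitForIE = True
End Function
```

The caller can then bail out cleanly, for example: If Not WaitForIE(IE) Then IE.Quit: Exit Sub.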

Lonely孤独者°
#4 · 2019-08-20 17:35

Try this for extracting the first column:

Set allrowofdata = appIE.document.getElementById("proj-stats")

Set newobj = allrowofdata.getElementsByClassName("rgt-col")(0)

For Each x In newobj.Children
    r = r + 1
    Cells(r, 1).Value = x.innerText
Next x