Scraping Wikipedia for Season and Episodes (VBA EX

2020-06-29 04:33发布

问题:

I'm trying to create a program using Excel 2013 and Visual Basic. As part of that program, I'm going to have a list of TV shows in a list box. I'd like to be able to double-click on one of those and have it open to another form with a list box containing all the seasons and episodes in those season, for that show.

The best way I've found of doing that, would be to scraping Wikipedia.org. I think it'll be one of the only websites that will have this information in roughly the same format. I also plan to do this with books as well.

I originally read about scraping at this website: http://www.wiseowl.co.uk/blog/s393/scrape-website-html.htm

However, I'd never done anything with the getelementby*, so I wasn't sure how they worked. Any help there would be appreciated. After scouring the internet, the following is the best code I could piece together:

 Private Sub cmdTest_Click()
'to refer to the running copy of Internet Explorer
Dim ie As InternetExplorer
'to refer to the HTML document returned
Dim html As HTMLDocument
'for iteration
Dim i As Integer
Dim j As Integer

'open Internet Explorer in memory, and go to website
Set ie = New InternetExplorer
ie.Visible = False
ie.navigate "http://en.wikipedia.org/wiki/List_of_Archer_episodes"
'ie.navigate "http://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes"

'Wait until IE is done loading page
Do While ie.READYSTATE <> READYSTATE_COMPLETE
Application.StatusBar = "Trying to go to Episodes ..."
DoEvents
Loop

'show text of HTML document returned
Set html = ie.document

'close down IE and reset status bar
Set ie = Nothing
Application.StatusBar = ""

'clear old data out and put titles in
Cells.Clear

'put heading across the top of row 3
Range("A3").Value = "Season"
Range("B3").Value = "Episode"

i = 4

For Each ele In html.getElementsByClassName("summary")
    Sheets("Wiki2").Range("B" & i).Value = ele.innerText
    i = i + 1
Next

i = 4

For Each ele In html.getElementsByClassName("mw-headline")
     Sheets("Wiki2").Range("A" & i).Value = Left(ele.innerText, 8)
     i = i + 1
 Next
End Sub

The first part seems to be a pretty generic way to get the source code for a given page. I have it currently set to pull the episodes from the tv show "Archer". The following code:

i = 4

    For Each ele In html.getElementsByClassName("summary")
        Sheets("Wiki2").Range("B" & i).Value = ele.innerText
        i = i + 1
    Next

    i = 4

    For Each ele In html.getElementsByClassName("mw-headline")
         Sheets("Wiki2").Range("A" & i).Value = Left(ele.innerText, 8)
         i = i + 1
     Next

is what I'm using to pull the text I'm looking for. What I need help with is combining these together. I need it to loop through and find each instance of "mw-headline" and every time if finds that, to look for the class name summary. If it finds summary, then it should display the headlines (aka Season) inner text in the cell of column A and the summary (aka episode name) in the adjacent column B cell.

These currently work independently. If you run this code, you get all the episodes of the tv show starting in cell b4 and you get a list of everything with the "mw-headline" listed in a4. The problem is that "mw-headline" is not only applied to the season, but to a couple of other things as well, hence the need to check to see if it has a "summary" class under it. This will also get rid of the instances where it says a tv show has a new season, but under it, it only says that a new season is forth coming. Without the "summary" tag, it shouldn't list it. I'd like the season to be displayed in the A column beside every episode in the B column list, so if there are 10 episodes in each season, then column A would have 10 instances of "Season 1", then 10 instances of "Season 2" and so on.

Thanks for the help and for anyone who doesn't know of stumbles on this question in the future, you need to put the following code at the top of your coding window:

Enum READYSTATE

READYSTATE_UNINITIALIZED = 0

READYSTATE_LOADING = 1

READYSTATE_LOADED = 2

READYSTATE_INTERACTIVE = 3

READYSTATE_COMPLETE = 4

End Enum

PS - In the code "For Each ele in *", is ele an undefined variable or is it a vba recognized word that stands for element? I got this from a copy and paste job and I don't understand it. Thanks.

回答1:

Here's a possible solution. I looked at the html for that particular page and it presents quite a challenge for correlating season with episode. I took a step back and thought that since the seasons are presumably in numerical order, we don't need to scrape anything for the season number. On the page you presented, the episodes for each particular season are located in the same table, so i just grabbed each episode from one table and assumed it to be season 1, all the episodes from the next table are season 2, ...

Private Sub cmdTest_Click()
'to refer to the running copy of Internet Explorer
Dim ie As InternetExplorer
'to refer to the HTML document returned
Dim html As HTMLDocument
'for iteration
Dim i As Integer
Dim j As Integer

'open Internet Explorer in memory, and go to website
Set ie = New InternetExplorer
ie.Visible = False
ie.navigate "http://en.wikipedia.org/wiki/List_of_Archer_episodes"
'ie.navigate "http://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes"

'Wait until IE is done loading page
Do While ie.READYSTATE <> READYSTATE_COMPLETE
Application.StatusBar = "Trying to go to Episodes ..."
DoEvents
Loop

'show text of HTML document returned
Set html = ie.document

'close down IE and reset status bar
Set ie = Nothing
Application.StatusBar = ""

'clear old data out and put titles in
Cells.Clear

'put heading across the top of row 3
Range("A3").Value = "Season"
Range("B3").Value = "Episode"

i = 4
Dim season As Integer: season = 1

For Each tableTag In html.getElementsByTagName("table") 'look through each table for "summary" (you could change this to be something a bit more discriminating!)

    If (InStr(1, tableTag.innerHTML, "summary")) Then
        Sheets(1).Cells(i, 1) = "Season " & season

        For Each objEpisode In tableTag.getElementsByClassName("summary")
            Sheets(1).Range("B" & i).Value = objEpisode.innerText
            i = i + 1
        Next
    season = season + 1
    End If

Next

End Sub


回答2:

I suggest to try IMDb also. Here is the code that shows how to scrape seasons and episodes via HTTP requests, from both IMDb and Wikipedia.

Option Explicit

Sub ExtractDataWikipedia()
    Dim y, sUrl, sRespText, arrMatchSeasons, arrSeason, arrMatchEpisodes, arrEpisode

    sUrl = "https://en.wikipedia.org/wiki/List_of_Archer_episodes"
    ' sUrl = "https://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes"
    ' sUrl = "https://en.wikipedia.org/wiki/List_of_DuckTales_episodes"

    XmlHttpRequest "GET", sUrl, "", "", "", sRespText
    ParseToArray "<span class=""mw-headline"" id=""Season[\s\S]*?>.*?(Season.*?)<[\s\S]*?(<table[\s\S]*?</table>)", sRespText, arrMatchSeasons
    y = 1
    For Each arrSeason In arrMatchSeasons
        ParseToArray "(<td class=""summary""[\s\S]*?</td>)", arrSeason(1), arrMatchEpisodes
        For Each arrEpisode In arrMatchEpisodes
            Cells(y, 1).Value = arrSeason(0)
            Cells(y, 2).Value = GetInnerText(arrEpisode(0))
            y = y + 1
        Next
    Next
End Sub

Sub ExtractDataIMDb()
    Dim y, sUrl, sRespText, arrData, arrMatchSeasons, arrSeason, sUrlEp, arrMatchEpisodes, arrEpisode

    sUrl = "http://www.imdb.com/title/tt1486217/episodes" ' Archer
    ' sUrl = "http://www.imdb.com/title/tt0096697/episodes" ' The Simpsons
    ' sUrl = "http://www.imdb.com/title/tt0092345/episodes" ' DuckTales

    XmlHttpRequest "GET", sUrl, "", "", "", sRespText
    ParseToArray "(<select id=""bySeason""[\s\S]*?</select>)", sRespText, arrData
    ParseToArray "<option[\s\S]*?value=""([\d]*)"">", arrData(0)(0), arrMatchSeasons
    y = 1
    For Each arrSeason In arrMatchSeasons
        DoEvents
        sUrlEp = sUrl & "?season=" & arrSeason(0)
        XmlHttpRequest "GET", sUrlEp, "", "", "", sRespText
        ParseToArray "itemprop=""episodes""[\s\S]*?itemprop=""name""[\s\S]*?>([\s\S]*?)</a>", sRespText, arrMatchEpisodes
        For Each arrEpisode In arrMatchEpisodes
            Cells(y, 3).Value = "Season " & arrSeason(0)
            Cells(y, 4).Value = arrEpisode(0)
            y = y + 1
        Next
    Next
    MsgBox "Completed"
End Sub

Sub XmlHttpRequest(sMethod, sUrl, arrSetHeaders, sFormData, sRespHeaders, sRespText)
    Dim arrHeader
    With CreateObject("Msxml2.ServerXMLHTTP.3.0")
        .SetOption 2, 13056 ' SXH_SERVER_CERT_IGNORE_ALL_SERVER_ERRORS
        .Open sMethod, sUrl, False
        If IsArray(arrSetHeaders) Then
            For Each arrHeader In arrSetHeaders
                .SetRequestHeader arrHeader(0), arrHeader(1)
            Next
        End If
        .Send sFormData
        sRespHeaders = .GetAllResponseHeaders
        sRespText = .ResponseText
    End With
End Sub

Sub ParseToArray(sPattern, sResponse, arrMatches)
    Dim oMatch, arrSMatches, sSubMatch
    arrMatches = Array()
    With CreateObject("VBScript.RegExp")
        .Global = True
        .MultiLine = True
        .Pattern = sPattern
        For Each oMatch In .Execute(sResponse)
            arrSMatches = Array()
            For Each sSubMatch In oMatch.SubMatches
                PushItem arrSMatches, sSubMatch
            Next
            PushItem arrMatches, arrSMatches
        Next
    End With
End Sub

Sub PushItem(arrList, varItem)
    ReDim Preserve arrList(UBound(arrList) + 1)
    arrList(UBound(arrList)) = varItem
End Sub

Function GetInnerText(sText)
    With CreateObject("htmlfile")
        .Write ("<body>" & sText & "</body>")
        GetInnerText = .DocumentElement.Document.GetElementsByTagName("body")(0).InnerText
    End With
End Function

Regarding HTML parsing with RegExp: here are disclaimer and alternative.

UPDATE

Just for example, there is the below code which retrieves a table containing Season, Episode, Title and Air date from IMDb:

Option Explicit

Sub ExtractDataIMDB()

    Dim i As Long
    Dim sURL As String
    Dim sRespText As String
    Dim aData
    Dim aMatchSeasons
    Dim aSeason
    Dim sUrlEp As String
    Dim aMatchEpisodes
    Dim aEpisode
    Dim aResult() As String
    Dim aCells

    ReDim aResult(1 To 4, 1 To 1)
    aResult(1, 1) = "Season"
    aResult(2, 1) = "Episode"
    aResult(3, 1) = "Title"
    aResult(4, 1) = "Air date"

    sURL = "http://www.imdb.com/title/tt1486217/episodes" ' Archer
    ' sUrl = "http://www.imdb.com/title/tt0096697/episodes" ' The Simpsons
    ' sUrl = "http://www.imdb.com/title/tt0092345/episodes" ' DuckTales

    XmlHttpRequest "GET", sURL, "", "", "", sRespText
    ParseToArray "(<select id=""bySeason""[\s\S]*?</select>)", sRespText, aData
    ParseToArray "<option[\s\S]*?value=""([\d]*)"">", aData(0)(0), aMatchSeasons
    i = 2
    For Each aSeason In aMatchSeasons
        DoEvents
        sUrlEp = sURL & "?season=" & aSeason(0)
        XmlHttpRequest "GET", sUrlEp, "", "", "", sRespText
        ParseToArray "itemprop=""episodes""[\s\S]*?itemprop=""episodeNumber"" content=""(.*?)""[\s\S]*?<div class=""airdate"">[\r\n\s]*([\s\S]*?)[\r\n\s]*</div>[\s\S]*?itemprop=""name""[\s\S]*?>([\s\S]*?)</a>", sRespText, aMatchEpisodes
        For Each aEpisode In aMatchEpisodes
            ReDim Preserve aResult(1 To 4, 1 To i)
            aResult(1, i) = aSeason(0)
            aResult(2, i) = aEpisode(0)
            aResult(3, i) = aEpisode(2)
            aResult(4, i) = aEpisode(1)
            i = i + 1
        Next
    Next
    aCells = WorksheetFunction.Transpose(aResult)
    Cells.Delete
    Output Cells(1, 1), aCells

    MsgBox "Completed"
End Sub

Sub XmlHttpRequest(sMethod, sURL, aSetHeaders, sFormData, sRespHeaders, sRespText)
    Dim aHeader
    With CreateObject("Msxml2.ServerXMLHTTP.3.0")
        .SetOption 2, 13056 ' SXH_SERVER_CERT_IGNORE_ALL_SERVER_ERRORS
        .Open sMethod, sURL, False
        If IsArray(aSetHeaders) Then
            For Each aHeader In aSetHeaders
                .SetRequestHeader aHeader(0), aHeader(1)
            Next
        End If
        .Send sFormData
        sRespHeaders = .GetAllResponseHeaders
        sRespText = .ResponseText
    End With
End Sub

Sub ParseToArray(sPattern, sResponse, aMatches)
    Dim oMatch, aSubMatches, sSubMatch
    aMatches = Array()
    With CreateObject("VBScript.RegExp")
        .Global = True
        .MultiLine = True
        .Pattern = sPattern
        For Each oMatch In .Execute(sResponse)
            aSubMatches = Array()
            For Each sSubMatch In oMatch.SubMatches
                PushItem aSubMatches, sSubMatch
            Next
            PushItem aMatches, aSubMatches
        Next
    End With
End Sub

Sub PushItem(aArray, vElement)
    ReDim Preserve aArray(UBound(aArray) + 1)
    aArray(UBound(aArray)) = vElement
End Sub

Sub Output(oDstRng As Range, aCells As Variant)
    With oDstRng
        .Parent.Select
        With .Resize( _
            UBound(aCells, 1) - LBound(aCells, 1) + 1, _
            UBound(aCells, 2) - LBound(aCells, 2) + 1 _
        )
            .NumberFormat = "@"
            .Value = aCells
            .Columns.AutoFit
        End With
    End With
End Sub