I'm trying to create a program using Excel 2013 and Visual Basic. As part of that program, I'm going to have a list of TV shows in a list box. I'd like to be able to double-click on one of those and have it open to another form with a list box containing all the seasons and episodes in those season, for that show.
The best way I've found of doing that, would be to scraping Wikipedia.org. I think it'll be one of the only websites that will have this information in roughly the same format. I also plan to do this with books as well.
I originally read about scraping at this website: http://www.wiseowl.co.uk/blog/s393/scrape-website-html.htm
However, I'd never done anything with the getelementby*, so I wasn't sure how they worked. Any help there would be appreciated. After scouring the internet, the following is the best code I could piece together:
Private Sub cmdTest_Click()
'to refer to the running copy of Internet Explorer
Dim ie As InternetExplorer
'to refer to the HTML document returned
Dim html As HTMLDocument
'for iteration
Dim i As Integer
Dim j As Integer
'open Internet Explorer in memory, and go to website
Set ie = New InternetExplorer
ie.Visible = False
ie.navigate "http://en.wikipedia.org/wiki/List_of_Archer_episodes"
'ie.navigate "http://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes"
'Wait until IE is done loading page
Do While ie.READYSTATE <> READYSTATE_COMPLETE
Application.StatusBar = "Trying to go to Episodes ..."
DoEvents
Loop
'show text of HTML document returned
Set html = ie.document
'close down IE and reset status bar
Set ie = Nothing
Application.StatusBar = ""
'clear old data out and put titles in
Cells.Clear
'put heading across the top of row 3
Range("A3").Value = "Season"
Range("B3").Value = "Episode"
i = 4
For Each ele In html.getElementsByClassName("summary")
Sheets("Wiki2").Range("B" & i).Value = ele.innerText
i = i + 1
Next
i = 4
For Each ele In html.getElementsByClassName("mw-headline")
Sheets("Wiki2").Range("A" & i).Value = Left(ele.innerText, 8)
i = i + 1
Next
End Sub
The first part seems to be a pretty generic way to get the source code for a given page. I have it currently set to pull the episodes from the tv show "Archer". The following code:
i = 4
For Each ele In html.getElementsByClassName("summary")
Sheets("Wiki2").Range("B" & i).Value = ele.innerText
i = i + 1
Next
i = 4
For Each ele In html.getElementsByClassName("mw-headline")
Sheets("Wiki2").Range("A" & i).Value = Left(ele.innerText, 8)
i = i + 1
Next
is what I'm using to pull the text I'm looking for. What I need help with is combining these together. I need it to loop through and find each instance of "mw-headline" and every time if finds that, to look for the class name summary. If it finds summary, then it should display the headlines (aka Season) inner text in the cell of column A and the summary (aka episode name) in the adjacent column B cell.
These currently work independently. If you run this code, you get all the episodes of the tv show starting in cell b4 and you get a list of everything with the "mw-headline" listed in a4. The problem is that "mw-headline" is not only applied to the season, but to a couple of other things as well, hence the need to check to see if it has a "summary" class under it. This will also get rid of the instances where it says a tv show has a new season, but under it, it only says that a new season is forth coming. Without the "summary" tag, it shouldn't list it. I'd like the season to be displayed in the A column beside every episode in the B column list, so if there are 10 episodes in each season, then column A would have 10 instances of "Season 1", then 10 instances of "Season 2" and so on.
Thanks for the help and for anyone who doesn't know of stumbles on this question in the future, you need to put the following code at the top of your coding window:
Enum READYSTATE
READYSTATE_UNINITIALIZED = 0
READYSTATE_LOADING = 1
READYSTATE_LOADED = 2
READYSTATE_INTERACTIVE = 3
READYSTATE_COMPLETE = 4
End Enum
PS - In the code "For Each ele in *", is ele an undefined variable or is it a vba recognized word that stands for element? I got this from a copy and paste job and I don't understand it. Thanks.
Here's a possible solution. I looked at the html for that particular page and it presents quite a challenge for correlating season with episode. I took a step back and thought that since the seasons are presumably in numerical order, we don't need to scrape anything for the season number. On the page you presented, the episodes for each particular season are located in the same table, so i just grabbed each episode from one table and assumed it to be season 1, all the episodes from the next table are season 2, ...
Private Sub cmdTest_Click()
'to refer to the running copy of Internet Explorer
Dim ie As InternetExplorer
'to refer to the HTML document returned
Dim html As HTMLDocument
'for iteration
Dim i As Integer
Dim j As Integer
'open Internet Explorer in memory, and go to website
Set ie = New InternetExplorer
ie.Visible = False
ie.navigate "http://en.wikipedia.org/wiki/List_of_Archer_episodes"
'ie.navigate "http://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes"
'Wait until IE is done loading page
Do While ie.READYSTATE <> READYSTATE_COMPLETE
Application.StatusBar = "Trying to go to Episodes ..."
DoEvents
Loop
'show text of HTML document returned
Set html = ie.document
'close down IE and reset status bar
Set ie = Nothing
Application.StatusBar = ""
'clear old data out and put titles in
Cells.Clear
'put heading across the top of row 3
Range("A3").Value = "Season"
Range("B3").Value = "Episode"
i = 4
Dim season As Integer: season = 1
For Each tableTag In html.getElementsByTagName("table") 'look through each table for "summary" (you could change this to be something a bit more discriminating!)
If (InStr(1, tableTag.innerHTML, "summary")) Then
Sheets(1).Cells(i, 1) = "Season " & season
For Each objEpisode In tableTag.getElementsByClassName("summary")
Sheets(1).Range("B" & i).Value = objEpisode.innerText
i = i + 1
Next
season = season + 1
End If
Next
End Sub
I suggest to try IMDb also. Here is the code that shows how to scrape seasons and episodes via HTTP requests, from both IMDb and Wikipedia.
Option Explicit
Sub ExtractDataWikipedia()
Dim y, sUrl, sRespText, arrMatchSeasons, arrSeason, arrMatchEpisodes, arrEpisode
sUrl = "https://en.wikipedia.org/wiki/List_of_Archer_episodes"
' sUrl = "https://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes"
' sUrl = "https://en.wikipedia.org/wiki/List_of_DuckTales_episodes"
XmlHttpRequest "GET", sUrl, "", "", "", sRespText
ParseToArray "<span class=""mw-headline"" id=""Season[\s\S]*?>.*?(Season.*?)<[\s\S]*?(<table[\s\S]*?</table>)", sRespText, arrMatchSeasons
y = 1
For Each arrSeason In arrMatchSeasons
ParseToArray "(<td class=""summary""[\s\S]*?</td>)", arrSeason(1), arrMatchEpisodes
For Each arrEpisode In arrMatchEpisodes
Cells(y, 1).Value = arrSeason(0)
Cells(y, 2).Value = GetInnerText(arrEpisode(0))
y = y + 1
Next
Next
End Sub
Sub ExtractDataIMDb()
Dim y, sUrl, sRespText, arrData, arrMatchSeasons, arrSeason, sUrlEp, arrMatchEpisodes, arrEpisode
sUrl = "http://www.imdb.com/title/tt1486217/episodes" ' Archer
' sUrl = "http://www.imdb.com/title/tt0096697/episodes" ' The Simpsons
' sUrl = "http://www.imdb.com/title/tt0092345/episodes" ' DuckTales
XmlHttpRequest "GET", sUrl, "", "", "", sRespText
ParseToArray "(<select id=""bySeason""[\s\S]*?</select>)", sRespText, arrData
ParseToArray "<option[\s\S]*?value=""([\d]*)"">", arrData(0)(0), arrMatchSeasons
y = 1
For Each arrSeason In arrMatchSeasons
DoEvents
sUrlEp = sUrl & "?season=" & arrSeason(0)
XmlHttpRequest "GET", sUrlEp, "", "", "", sRespText
ParseToArray "itemprop=""episodes""[\s\S]*?itemprop=""name""[\s\S]*?>([\s\S]*?)</a>", sRespText, arrMatchEpisodes
For Each arrEpisode In arrMatchEpisodes
Cells(y, 3).Value = "Season " & arrSeason(0)
Cells(y, 4).Value = arrEpisode(0)
y = y + 1
Next
Next
MsgBox "Completed"
End Sub
Sub XmlHttpRequest(sMethod, sUrl, arrSetHeaders, sFormData, sRespHeaders, sRespText)
Dim arrHeader
With CreateObject("Msxml2.ServerXMLHTTP.3.0")
.SetOption 2, 13056 ' SXH_SERVER_CERT_IGNORE_ALL_SERVER_ERRORS
.Open sMethod, sUrl, False
If IsArray(arrSetHeaders) Then
For Each arrHeader In arrSetHeaders
.SetRequestHeader arrHeader(0), arrHeader(1)
Next
End If
.Send sFormData
sRespHeaders = .GetAllResponseHeaders
sRespText = .ResponseText
End With
End Sub
Sub ParseToArray(sPattern, sResponse, arrMatches)
Dim oMatch, arrSMatches, sSubMatch
arrMatches = Array()
With CreateObject("VBScript.RegExp")
.Global = True
.MultiLine = True
.Pattern = sPattern
For Each oMatch In .Execute(sResponse)
arrSMatches = Array()
For Each sSubMatch In oMatch.SubMatches
PushItem arrSMatches, sSubMatch
Next
PushItem arrMatches, arrSMatches
Next
End With
End Sub
Sub PushItem(arrList, varItem)
ReDim Preserve arrList(UBound(arrList) + 1)
arrList(UBound(arrList)) = varItem
End Sub
Function GetInnerText(sText)
With CreateObject("htmlfile")
.Write ("<body>" & sText & "</body>")
GetInnerText = .DocumentElement.Document.GetElementsByTagName("body")(0).InnerText
End With
End Function
Regarding HTML parsing with RegExp: here are disclaimer and alternative.
UPDATE
Just for example, there is the below code which retrieves a table containing Season, Episode, Title and Air date from IMDb:
Option Explicit
Sub ExtractDataIMDB()
Dim i As Long
Dim sURL As String
Dim sRespText As String
Dim aData
Dim aMatchSeasons
Dim aSeason
Dim sUrlEp As String
Dim aMatchEpisodes
Dim aEpisode
Dim aResult() As String
Dim aCells
ReDim aResult(1 To 4, 1 To 1)
aResult(1, 1) = "Season"
aResult(2, 1) = "Episode"
aResult(3, 1) = "Title"
aResult(4, 1) = "Air date"
sURL = "http://www.imdb.com/title/tt1486217/episodes" ' Archer
' sUrl = "http://www.imdb.com/title/tt0096697/episodes" ' The Simpsons
' sUrl = "http://www.imdb.com/title/tt0092345/episodes" ' DuckTales
XmlHttpRequest "GET", sURL, "", "", "", sRespText
ParseToArray "(<select id=""bySeason""[\s\S]*?</select>)", sRespText, aData
ParseToArray "<option[\s\S]*?value=""([\d]*)"">", aData(0)(0), aMatchSeasons
i = 2
For Each aSeason In aMatchSeasons
DoEvents
sUrlEp = sURL & "?season=" & aSeason(0)
XmlHttpRequest "GET", sUrlEp, "", "", "", sRespText
ParseToArray "itemprop=""episodes""[\s\S]*?itemprop=""episodeNumber"" content=""(.*?)""[\s\S]*?<div class=""airdate"">[\r\n\s]*([\s\S]*?)[\r\n\s]*</div>[\s\S]*?itemprop=""name""[\s\S]*?>([\s\S]*?)</a>", sRespText, aMatchEpisodes
For Each aEpisode In aMatchEpisodes
ReDim Preserve aResult(1 To 4, 1 To i)
aResult(1, i) = aSeason(0)
aResult(2, i) = aEpisode(0)
aResult(3, i) = aEpisode(2)
aResult(4, i) = aEpisode(1)
i = i + 1
Next
Next
aCells = WorksheetFunction.Transpose(aResult)
Cells.Delete
Output Cells(1, 1), aCells
MsgBox "Completed"
End Sub
Sub XmlHttpRequest(sMethod, sURL, aSetHeaders, sFormData, sRespHeaders, sRespText)
Dim aHeader
With CreateObject("Msxml2.ServerXMLHTTP.3.0")
.SetOption 2, 13056 ' SXH_SERVER_CERT_IGNORE_ALL_SERVER_ERRORS
.Open sMethod, sURL, False
If IsArray(aSetHeaders) Then
For Each aHeader In aSetHeaders
.SetRequestHeader aHeader(0), aHeader(1)
Next
End If
.Send sFormData
sRespHeaders = .GetAllResponseHeaders
sRespText = .ResponseText
End With
End Sub
Sub ParseToArray(sPattern, sResponse, aMatches)
Dim oMatch, aSubMatches, sSubMatch
aMatches = Array()
With CreateObject("VBScript.RegExp")
.Global = True
.MultiLine = True
.Pattern = sPattern
For Each oMatch In .Execute(sResponse)
aSubMatches = Array()
For Each sSubMatch In oMatch.SubMatches
PushItem aSubMatches, sSubMatch
Next
PushItem aMatches, aSubMatches
Next
End With
End Sub
Sub PushItem(aArray, vElement)
ReDim Preserve aArray(UBound(aArray) + 1)
aArray(UBound(aArray)) = vElement
End Sub
Sub Output(oDstRng As Range, aCells As Variant)
With oDstRng
.Parent.Select
With .Resize( _
UBound(aCells, 1) - LBound(aCells, 1) + 1, _
UBound(aCells, 2) - LBound(aCells, 2) + 1 _
)
.NumberFormat = "@"
.Value = aCells
.Columns.AutoFit
End With
End With
End Sub