My scraper throws errors instead of quitting the b

2019-01-12 06:39发布

问题:

I've written a scraper in vba to parse some movie information from a torrent site. I used IE and queryselector to accomplish the task. When I execute my code it does parse everything along with an error popping up. It seems the error appears out of nowhere instead of continuing on. If I cancel the error box then I can see the results. I have uploaded two images below to show you the errors I'm having. How can I execute the code successfully without having any errors? Thanks in advance.

Here is the full code:

Sub Torrent_Data()
    Dim IE As New InternetExplorer, html As HTMLDocument
    Dim post As Object

    With IE
        .Visible = False
        .navigate "https://yts.am/browse-movies"
        Do While .readyState <> READYSTATE_COMPLETE: Loop
        Set html = .Document
    End With

    For Each post In html.querySelectorAll(".browse-movie-bottom")
        Row = Row + 1: Cells(Row, 1) = post.queryselector(".browse-movie-title").innerText
        Cells(Row, 2) = post.queryselector(".browse-movie-year").innerText
    Next post
    IE.Quit
End Sub

The errors I'm having:

Both of the errors are appearing at the same time. I'm using Internet Explorer 11.

On the other hand, If I try like below it brings the results successfully with no issues.

Sub Torrent_Data()
    Dim IE As New InternetExplorer, html As HTMLDocument
    Dim post As Object

    With IE
        .Visible = False
        .navigate "https://yts.am/browse-movies"
        Do While .readyState <> READYSTATE_COMPLETE: Loop
        Set html = .Document
    End With

    For Each post In html.getElementsByClassName("browse-movie-bottom")
        Row = Row + 1: Cells(Row, 1) = post.queryselector(".browse-movie-title").innerText
        Cells(Row, 2) = post.queryselector(".browse-movie-year").innerText
    Next post
    IE.Quit
End Sub

Reference I've added to the library:

1. Microsoft Internet Controls
2. Microsoft HTML Object Library

So, what is wrong with queryselector or what I'm missing here to make a go successfully? Is there any reference to add to the library to shake off errors?

回答1:

Ok, so there is something seriously unfriendly about that webpage. It kept crashing for me. So I have resorted to running a javascript program within scripting engine/scripting control and it works.

I hope you can follow it. The logic is in the javascript added to the ScriptEngine. I get two lists of nodes, one list of films and one list of years; then I step through each array in sync and add them as key value pair to a Microsoft Scripting Dictionary.

Option Explicit

'*Tools->References
'*    Microsoft Scripting Runtime
'*    Microsoft Scripting Control
'*    Microsoft Internet Controls
'*    Microsoft HTML Object Library

Sub Torrent_Data()
    Dim row As Long
    Dim IE As New InternetExplorer, html As HTMLDocument
    Dim post As Object

    With IE
        .Visible = True
        .navigate "https://yts.am/browse-movies"
        Do While .readyState <> READYSTATE_COMPLETE:
            DoEvents
        Loop
        Set html = .document
    End With

    Dim dicFilms As Scripting.Dictionary
    Set dicFilms = New Scripting.Dictionary

    Call GetScriptEngine.Run("getMovies", html, dicFilms)

    Dim vFilms As Variant
    vFilms = dicFilms.Keys

    Dim vYears As Variant
    vYears = dicFilms.Items

    Dim lRowLoop As Long
    For lRowLoop = 0 To dicFilms.Count - 1

        Cells(lRowLoop + 1, 1) = vFilms(lRowLoop)
        Cells(lRowLoop + 1, 2) = vYears(lRowLoop)

    Next lRowLoop

    Stop

    IE.Quit
End Sub

Private Function GetScriptEngine() As ScriptControl
    '* see code from this SO Q & A
    ' https://stackoverflow.com/questions/37711073/in-excel-vba-on-windows-how-to-get-stringified-json-respresentation-instead-of
    Static soScriptEngine As ScriptControl
    If soScriptEngine Is Nothing Then
        Set soScriptEngine = New ScriptControl
        soScriptEngine.Language = "JScript"

        soScriptEngine.AddCode "function getMovies(htmlDocument, microsoftDict) { " & _
                                    "var titles = htmlDocument.querySelectorAll('a.browse-movie-title'), i;" & _
                                    "var years = htmlDocument.querySelectorAll('div.browse-movie-year'), j;" & _
                                    "if ( years.length === years.length) {" & _
                                    "for (i=0; i< years.length; ++i) {" & _
                                    "   var film = titles[i].innerText;" & _
                                    "   var year = years[i].innerText;" & _
                                    "   microsoftDict.Add(film, year);" & _
                                    "}}}"

    End If
    Set GetScriptEngine = soScriptEngine
End Function


回答2:

The website has an API. Check e. g. result from the URL https://yts.am/api/v2/list_movies.json?page=1&limit=50, which actually represents 50 movies from first page of latest movies category, in JSON format.

Take a look at the below example. Import JSON.bas module into the VBA project for JSON processing.

Option Explicit

Sub Test()

    Dim sJSONString As String
    Dim vJSON
    Dim sState As String
    Dim lPage As Long
    Dim aRes()
    Dim i As Long
    Dim aData()
    Dim aHeader()

    With Sheets(1)
        .Cells.Delete
        .Cells.WrapText = False
    End With
    lPage = 1
    aRes = Array()
    Do
        With CreateObject("MSXML2.XMLHTTP")
            .Open "GET", "https://yts.am/api/v2/list_movies.json?page=" & lPage & "&limit=50", False
            .send
            sJSONString = .responseText
        End With
        JSON.Parse sJSONString, vJSON, sState
        If Not vJSON("data").Exists("movies") Then Exit Do
        vJSON = vJSON("data")("movies")
        ReDim Preserve aRes(UBound(aRes) + UBound(vJSON) + 1)
        For i = 0 To UBound(vJSON)
            Set aRes(UBound(aRes) - UBound(vJSON) + i) = vJSON(i)
        Next
        lPage = lPage + 1
        Debug.Print "Parsed " & (UBound(aRes) + 1)
        DoEvents
    Loop
    JSON.ToArray aRes, aData, aHeader
    With Sheets(1)
        OutputArray .Cells(1, 1), aHeader
        Output2DArray .Cells(2, 1), aData
        .Columns.AutoFit
    End With
    MsgBox "Completed"

End Sub

Sub OutputArray(oDstRng As Range, aCells As Variant)

    With oDstRng
        .Parent.Select
        With .Resize(1, UBound(aCells) - LBound(aCells) + 1)
            .NumberFormat = "@"
            .Value = aCells
        End With
    End With

End Sub

Sub Output2DArray(oDstRng As Range, aCells As Variant)

    With oDstRng
        .Parent.Select
        With .Resize( _
                UBound(aCells, 1) - LBound(aCells, 1) + 1, _
                UBound(aCells, 2) - LBound(aCells, 2) + 1)
            .NumberFormat = "@"
            .Value = aCells
        End With
    End With

End Sub

The output for me as follows, at the moment there are 7182 movies total:

BTW, the similar approach applied in the following answers: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 and 15.