Unable to shake off hardcoded delay from my script

2019-03-31 23:00发布

I've written a script in vba in combination with selenium to parse all the company names available in a webpage. The webpage has got lazyloading method active so there are only 20 links become visible in each scroll. If I scroll 2 times then the number of links visible are 40 and so on. There are 1000 links available in that webpage. My below script can reach the bottom of that page handling all the scroll and fetch all the names available in that webpage.

However, it is necessary to wait a certain time after each scroll for that webpage to update the content. This is where I've used hardcoded delay but the process of hardcoding thing is very inconsistent and sometimes it makes the browser quit before the completion of the whole operation.

How can I modify this portion .Wait 6000 to make it Explicit Wait instead of Hardcoded Wait.

This is what I've written so far:

Sub Getlinks()
    Dim driver As New ChromeDriver, prevlen&, curlen&
    Dim posts As Object, post As Object

    With driver
        .get "http://fortune.com/fortune500/list/"
        prevlen = .FindElementsByClass("company-title").Count

        Do
            prevlen = curlen
            .ExecuteScript ("window.scrollTo(0, document.body.scrollHeight);")

            .Wait 6000  ''I like to kick out this hardcoded delay and use explicit wait in place

            Set posts = .FindElementsByClass("company-title")
            curlen = posts.Count
            If prevlen = curlen Then Exit Do
        Loop

        For Each post In posts
            R = R + 1: Cells(R, 1) = post.Text
        Next post
    End With
End Sub

6条回答
贼婆χ
2楼-- · 2019-03-31 23:09

I think you are almost there.

Although I don't think you can avoid waiting, the work around is to keep a number of times checking for new posts as you scroll down with a shorter wait.

Example below is to check for new posts 5 times each with 2 seconds wait, so a total of 10 seconds before declaring end of the page. Adjust these 2 parameters to suit.

Sub Getlinks()
    Dim driver As New ChromeDriver, prevlen&, curlen&
    Dim posts As Object, post As Object
    ' Counter for number of times when there are NO NEW POSTS
    Dim NoIncreaseCount As Integer
    Const MaxNoIncreaseCount As Integer = 5
    Const WaitTime As Integer = 2000 ' 2 seconds wait time each scroll down

    With driver
        .get "http://fortune.com/fortune500/list/"
        prevlen = .FindElementsByClass("company-title").Count
        NoIncreaseCount = 0
        Do Until NoIncreaseCount = MaxNoIncreaseCount
            .ExecuteScript ("window.scrollTo(0, document.body.scrollHeight);")
            .Wait WaitTime
            Set posts = .FindElementsByClass("company-title")
            curlen = posts.Count
            If prevlen < curlen Then
                ' There are new Posts
                prevlen = curlen
                NoIncreaseCount = 0
            Else
                ' No new Posts
                NoIncreaseCount = NoIncreaseCount + 1
            End If
        Loop

        For Each post In posts
            R = R + 1: Cells(R, 1) = post.Text
        Next post
    End With
End Sub
查看更多
劫难
3楼-- · 2019-03-31 23:12

Here is a completely different approach that doesn't require using a browser, instead it submits a series of web requests. With this approach, waiting for a page to load isn't a concern.

Typically, with lazy loading pages, it will submit a new request to load up the data for the page as you scroll. If you monitor the web traffic you can spot the requests made and emulate those, I have done that below.

The result should be a list of company names, in ascending order in whatever the first sheet of Excel is.

Things you'll need:

Add References to:

  • Microsoft Scripting Runtime
  • Microsoft XML v6.0
  • Add the VBA-JSON code to your project. You can find that here

Edit

Changed the code to keep pulling data from the site, until there is no more items in the list. Thanks @Qharr for pointing this out.

Code


Public Sub SubmitRequest()
    Const baseURL As String = "http://fortune.com/api/v2/list/2358051/expand/item/ranking/asc/"

    Dim Url            As String
    Dim startingNumber As Long
    Dim j              As Long
    Dim getRequest     As MSXML2.XMLHTTP60
    Dim Json           As Object
    Dim Companies      As Object
    Dim Company        As Variant
    Dim CompanyArray   As Variant

    'Create an array to hold each company
    ReDim CompanyArray(0 To 50000)
    'Create a new XMLHTTP object so we can place a get request
    Set getRequest = New MSXML2.XMLHTTP60

    'The api seems to only support returning 100 records at a time
    'So do in batches of 100
    Do
        'Build the url, the format is something like
        '0/100, where 0 is the starting position, and 100 is the ending position
        Url = baseURL & startingNumber & "/" & startingNumber + 100

        With getRequest
            .Open "GET", Url
            .send

            'The response is a JSON object, for this code to work -
            'You'll need this code https://github.com/VBA-tools/VBA-JSON
            'What is returned is a dictionary
            Set Json = JsonConverter.ParseJson(.responseText)
            Set Companies = Json("list-items")

            'Keep checking in batches of 100 until there are no more
            If Companies.Count = 0 Then Exit Do

            'Iterate the dictionary and return the title (which is the name)
            For Each Company In Companies
                CompanyArray(j) = Company("title")
                j = j + 1
            Next

        End With
        startingNumber = startingNumber + 100
   Loop

    ReDim Preserve CompanyArray(j - 1)

    'Dump the data to the first sheet
    ThisWorkbook.Sheets(1).Range("A1:A" & j) = WorksheetFunction.Transpose(CompanyArray)

End Sub

查看更多
劫难
4楼-- · 2019-03-31 23:14

I don't know if this will help as it's still a 'hard-coded' solution but you could try a delay function rather than the wait function and see if that helps with the program exiting issue.

Function Delay(Seconds As Single)
    Dim StopTime As Single: StopTime = Timer + Seconds
    Do While Timer < StopTime
        DoEvents
    Loop
End Function
查看更多
倾城 Initia
5楼-- · 2019-03-31 23:17

Define a timeout (specified period of time that will be allowed to elapse) to get rid of the hardcoded delay. The timeout needs to be hardcoded.

The differences between this and your original code are:

  • The loop itself is running over and over (doesn't wait 6 s on each iteration) and checks for new content until new content is found or the timeout is reached.
  • If the lazy loading takes more time than expected for instance when loading number 21 to 50 the loop "waits" and tries to get new content for the maximum time defined in timeout.
  • Downside: On the last step when all content is loaded the loop will take as many seconds as the timeout is set to.

Code:

Sub Getlinks()
    Dim driver As New ChromeDriver, prevlen&, curlen&
    Dim posts As Object, post As Object
    Dim timeout As Integer, startTime As Double

    timeout = 10 ' set the timeout to 10 seconds

    With driver
        .get "http://fortune.com/fortune500/list/"
        prevlen = .FindElementsByClass("company-title").Count

        startTime = Timer ' set the initial starting time

        Do
            .ExecuteScript ("window.scrollTo(0, document.body.scrollHeight);")
            Set posts = .FindElementsByClass("company-title")
            curlen = posts.Count
            If curlen > prevlen Then
                startTime = Timer ' reset start time if new elements found
                prevlen = curlen ' set new prevlen
            End If
        Loop While Round(Timer - startTime, 2) <= timeout ' check if timeout is reached

        For Each post In posts
            R = R + 1: Cells(R, 1) = post.Text
        Next post
    End With
End Sub
查看更多
别忘想泡老子
6楼-- · 2019-03-31 23:24

There you go:

Sub Getlinks()
    Dim driver As New ChromeDriver
    Dim pcount As Long, R as long
    Dim posts As Object, post As Object

    With driver
        .get "http://fortune.com/fortune500/list/"
        Do
            .ExecuteScript ("window.scrollTo(0, document.body.scrollHeight);")
            Set posts = .FindElementsByClass("company-title")
            pcount = posts.Count
        Loop Until pcount = 1000

        For Each post In posts
            R = R + 1: Cells(R, 1) = post.Text
        Next post
    End With
End Sub

Or even better, print as you go:

Sub Getlinksasyougo()
    Dim driver As New ChromeDriver
    Dim pcount As Long, R As Long, i As Long
    Dim posts As Object, post As Object


    With driver
        .get "http://fortune.com/fortune500/list/"
        i = 1
        Do
            .ExecuteScript ("window.scrollTo(0, document.body.scrollHeight);")
            Set posts = .FindElementsByClass("company-title")
            pcount = posts.Count
            If i <> pcount Then
                For R = i To pcount - 1
                    Cells(R, 1) = posts(R + 1).Text
                Next R
                i = pcount
            End If
        Loop Until pcount = 1000

    End With
End Sub
查看更多
聊天终结者
7楼-- · 2019-03-31 23:30

Here's a way to approach it using the "look for the spinner element" method discussed in one of the comments, which helps you avoid having to specify the number of elements you're expecting the page to load. The class name of the spinner actually changes depending on whether or not it's visible, which makes it pretty easy to just wait for the spinner to become visible + disappear again before getting the page elements.

This method still involves some waiting; by default, it waits 1/10th of a second after each attempt to find the spinner, either until the spinner is found or for some maximum number of attempts. But that's much faster than waiting 5 seconds every time.

Also, unrelated, but don't write stuff to cells one at a time, it's really slow. It's much faster to write it to an array first + write the entire array at once.

Sub getLinks()

    Dim bot As New ChromeDriver
    bot.Get "http://fortune.com/fortune500/list/"

    Dim posts As WebElements
    Dim numPosts As Long
    Dim finishedScrolling As Boolean
    finishedScrolling = False
    Do Until finishedScrolling
        'Set beginning post count and scroll down
        Dim startPosts As Long
        startPosts = numPosts
        bot.ExecuteScript "window.scrollTo(0, document.body.scrollHeight);"

        'Wait for spinner to become visible, then wait for up to 5 seconds for rehide
        Call waitForElements(bot, "div[class^='F500-spinner  ']", 50)
        Call waitForElements(bot, "div[class^='F500-spinner hide']", 50)

        'See if any new posts have loaded
        Set posts = bot.FindElementsByClass("company-title")
        numPosts = posts.Count
        If numPosts = startPosts Then
            finishedScrolling = True
        End If
    Loop

    'Write text to results array
    Dim post As WebElement
    ReDim resultsArr(1 To posts.Count, 1 To 1) As String
    Dim i As Long
    i = 1
    For Each post In posts
        resultsArr(i, 1) = post.Text
        i = i + 1
    Next

    'Write array to sheet
    With ActiveSheet
        .Range(.Cells(1, 1), .Cells(UBound(resultsArr, 1), 1)).Value = resultsArr
    End With

End Sub
Sub waitForElements(bot As WebDriver, css As String, maxAttempts As Long, Optional waitTimeMS As Long = 100)
'Use a CSS selector string to wait for element(s) to appear on a page or to reach max number of attempts
'By default, bot waits 0.1 second after each attempt

    Dim i As Long
    Dim foundElem As Boolean
    foundElem = False
    Do Until foundElem
        i = i + 1
        If bot.FindElementsByCss(css).Count > 0 Then
            foundElem = True
        ElseIf i = maxAttempts Then
            foundElem = True
        Else
            bot.Wait waitTimeMS
        End If
    Loop

End Sub
查看更多
登录 后发表回答