PhantomJS unexpected load behavior with multiple p

Posted 2019-04-29 22:14

Question:

I have a script (below) that scrapes a site in a 3-step process. It works great when limited to 1 page at a time. However, when I increase that to 2 at a time, things start to break: onLoadFinished fires earlier than I would expect, before the page is completely loaded, and the rest of my script fails as a result. Any idea why this might be happening? I should add that I'm using the newest version (1.5).

MAX_PAGES = 1
### 
Changing MAX_PAGES to >1 causes some pages' onLoadFinished event to fire before
the page is fully rendered. This is evident from the fact that more than one image
is rendered for some pages. I haven't been able to reproduce this using microsoft.com,
but on some pages I was working on, the first onLoadFinished seemed to be called
before the page was actually fully loaded, judging by the rendered images.
###

newPage = (id) ->
    # Per-page state: id labels the output files, step counts loads started.
    context = {}
    context.id = id
    context.step = 0
    context.page = require('webpage').create()
    context.page.onLoadStarted = ->
        context.step++
    context.page.onLoadFinished = (status) ->
        console.log status
        if status is 'success'
            # Render one image per finished load, e.g. 1_1.png
            context.page.render("#{context.id}_#{context.step}.png")
        else
            context.page.release()
    # Start the load only after the handlers are attached.
    context.page.open('http://www.microsoft.com')
    console.log 'started loading'

newPage id for id in [1..MAX_PAGES]

Answer 1:

I think the problem is that every webpage object within PhantomJS uses the same QNetworkAccessManager, so the finished() signal fires whenever any webpage object finishes loading. PhantomJS's own code might need to be modified to fix this. I have noticed the same behavior before when trying to load multiple pages in parallel in PhantomJS.

An application I'm working on uses QtWebKit and loads multiple pages simultaneously, so I have to make sure that each webpage gets its own QNetworkAccessManager so that the finished() signals don't interfere with each other.



Answer 2:

To crawl multiple pages, see the follow.js example that is bundled with PhantomJS: https://github.com/ariya/phantomjs/blob/master/examples/follow.js

You need to use recursion to wait for the current page to load before loading the next page.
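
A minimal sketch of that recursive pattern, written in JavaScript to match follow.js. This is not the code from follow.js itself: the function name crawlSequentially and its parameters (urls, createPage, onPage, onDone) are hypothetical names chosen for illustration. The only PhantomJS calls it assumes are page.open(url, callback) and page.release(), the latter matching the 1.5 API used in the question (later versions name it close()).

```javascript
// Load URLs strictly one at a time: the next page.open() is issued only
// from inside the previous page's load callback.
// All names here are illustrative assumptions, not PhantomJS APIs,
// except page.open(url, callback) and page.release().
function crawlSequentially(urls, createPage, onPage, onDone) {
    var index = 0;

    function next() {
        if (index >= urls.length) {
            onDone();               // nothing left to load
            return;
        }
        var url = urls[index++];
        var page = createPage();    // one fresh page object per URL

        page.open(url, function (status) {
            onPage(url, status, page);  // e.g. render or scrape here
            page.release();             // free the page before moving on
            next();                     // recurse: start the next load
        });
    }

    next();
}
```

Under PhantomJS, createPage would presumably be something like function () { return require('webpage').create(); }, and onDone would call phantom.exit(). Because only one page is loading at any moment, load events from different pages cannot overlap, at the cost of giving up parallelism.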