HTML page vastly different when using a headless w

Posted 2019-04-29 08:46

I was under the impression that a headless WebKit browser driven through PyQt would automatically give me the full HTML for each URL, even for pages with heavy JavaScript. But I am only getting part of it, compared with the page I get when I save it from a Firefox window.

I am using the following code:

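import sys

# Assumption: PyQt4 module layout; under PyQt5, QWebPage lives in
# PyQt5.QtWebKitWidgets and QApplication in PyQt5.QtWidgets.
from PyQt4.QtCore import QTimer, QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage

SEC = 1000  # assumed value: QTimer.singleShot takes milliseconds
USER_AGENT = 'Mozilla/5.0'  # placeholder; the original value is not shown


class JabbaWebkit(QWebPage):
    # 'html' is a class variable, so it outlives the page object
    html = ''

    def __init__(self, url, wait, app, parent=None):
        super(JabbaWebkit, self).__init__(parent)
        JabbaWebkit.html = ''

        if wait:
            # quit after a fixed delay, whether or not loading has finished
            QTimer.singleShot(wait * SEC, app.quit)
        else:
            # quit as soon as the initial load completes
            self.loadFinished.connect(app.quit)

        self.mainFrame().load(QUrl(url))

    def save(self):
        JabbaWebkit.html = self.mainFrame().toHtml()

    def userAgentForUrl(self, url):
        return USER_AGENT


def get_page(url, wait=None):
    # reuse an existing QApplication so get_page can be called several times
    app = QApplication.instance()
    if not app:
        app = QApplication(sys.argv)

    form = JabbaWebkit(url, wait, app)
    app.aboutToQuit.connect(form.save)  # grab the HTML just before quitting
    app.exec_()
    return JabbaWebkit.html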

Can someone see anything obviously wrong with the code?

After running the code through a few URLs, here is one that shows the problem quite clearly: http://www.chilis.com/EN/Pages/menu.aspx

Thanks for any pointers.

1 answer
女痞
#2 · 2019-04-29 09:06

The page contains Ajax code. When the initial load finishes, the page still needs some time to update itself via Ajax, but your code quits as soon as the load finishes (when no wait is given).

You should add some code like this to wait for a while and let WebKit process events in the meantime:

import time

for i in range(200):  # wait about 2 seconds (200 x 0.01 s)
    app.processEvents()
    time.sleep(0.01)
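
The loop above could also be wrapped in a small helper so the wait time is explicit and not tied to an iteration count. This is only a sketch: the name pump_events and its parameters are hypothetical, it assumes Python 3 (for time.monotonic), and it takes the event-processing function as an argument so it does not depend on a particular PyQt version.

```python
import time


def pump_events(process_events, seconds=2.0, interval=0.01):
    """Call process_events() repeatedly for roughly `seconds` seconds.

    process_events is any zero-argument callable; with Qt it would be
    app.processEvents (assumption: `app` is the running QApplication).
    Returns the number of times process_events was called.
    """
    deadline = time.monotonic() + seconds
    calls = 0
    while time.monotonic() < deadline:
        process_events()      # let pending timers, network replies and JS run
        calls += 1
        time.sleep(interval)  # yield the CPU between passes
    return calls
```

With the code from the question, one way to use it would be to connect loadFinished to a slot that calls pump_events(app.processEvents, 2.0) and only then calls app.quit, so the Ajax updates have a chance to run before toHtml() is read. The 2 second figure is a guess taken from the loop above; a slow page may need longer.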