Preferred technique for 'pacing' HTTP requ

Posted 2019-09-02 00:25

Question:

I'm trying to "spider" a small set of data from a single site using TamperMonkey/Javascript/jQuery and collate it on to a single page.

I've written a TM script (which fires when I open a target page) to do the following:

  • Search the page for links of a certain type (typically around 8 links)
  • "Follow" each link found to a new page, locate and follow a single link from there
  • Extract the data I'm interested in and "incorporate" it into the original page I opened.

Iterating through these actions typically results in 16 (8 * 2 links) HTTP requests being fired at the site. The code I've written works fine if I call it manually (via the console) and perform the actions one step at a time for all 16 pieces of data.

However, if I set up a loop and let the code just "do its thing", I get HTML back saying "The page you requested isn't responding" (Status=OK) after about 4 iterations. I'm guessing the site is protecting itself against some sort of XSRF attack, or is it just genuinely slow?

My question is: what would be the preferred technique to lower the rate at which I'm requesting data from the site? I've considered building an array of HTTP function calls or URLs to process, but this seems clunky. Is there anything more idiomatic available to me?

I'm guessing this must be such a common problem and solid solutions exist for it, but I just don't have a good enough grip on terminology to search properly for it.

Answer 1:

This is similar to an answer I posted on another question: Browser stops working for a while after synchronous ajax call in a for loop

You can use a "recursive" function to help you control flow with asynchronous calls. Instead of running them synchronously, you can run them all asynchronously and call the function again when it is time for the next one.

Something like:

function doCall() {
    // Wait 1 second, then fire the next request
    setTimeout(function() {
        $.ajax({
            //...
            success: function(data) {
                //...
                // time to start the next one
                // (skip this call once there is nothing left to request)
                doCall();
            },
            error: function() {
                // call the next one on error?
                doCall();
            }
        });
    }, 1000); // 1 second wait before each run
}

This way they run async and don't block everything while they are in flight, but they still run in series. You can even put a small delay within the doCall function, as the setTimeout does here, so there is some space between requests.
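To connect this with the array of URLs mentioned in the question, here is a minimal sketch of the same recursive pattern driven by a queue. The names paceRequests, fetchOne and schedule are made up for illustration and are not part of jQuery or TamperMonkey. fetchOne is assumed to be a callback-style wrapper around whatever request function you use, and schedule is an optional stand-in for setTimeout.

```javascript
// Process URLs one at a time, pausing delayMs between consecutive requests.
// fetchOne(url, done) is a hypothetical requester that must call
// done(err, data) exactly once when the request has finished.
function paceRequests(urls, fetchOne, delayMs, onFinish, schedule) {
    var queue = urls.slice();           // copy so the caller's array is untouched
    var results = [];
    schedule = schedule || setTimeout;  // injectable timer, defaults to setTimeout

    function next() {
        var url = queue.shift();
        fetchOne(url, function (err, data) {
            // Record failures too, so one bad page does not stop the run
            results.push({ url: url, error: err || null, data: err ? null : data });
            if (queue.length === 0) {
                onFinish(results);      // nothing left: report everything collected
            } else {
                schedule(next, delayMs); // wait, then start the next request
            }
        });
    }

    if (queue.length === 0) {
        onFinish(results);
    } else {
        next();
    }
}
```

In a userscript, fetchOne could be a small function that calls $.ajax with the given URL and invokes done(null, data) from success and done(someError) from error. Since the requests still run strictly in series, the delay is the only knob you need to tune if the site keeps returning its "not responding" page.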