Screen scraping web page after delay

2019-01-27 06:41发布

问题:

I'm trying to scrape a web page using C#, however after the page loads, it executes some javascript which loads more elements into the DOM which I need to scrape. A standard scraper simply grabs the html of the page on load and doesn't pick up the DOM changes made via javascript. How do I put in some sort of functionality to wait for a second or two and then grab the source?

Here is my current code:

private string ScrapeWebpage(string url, DateTime? updateDate)
        {
            HttpWebRequest request = null;
            HttpWebResponse response = null;
            Stream responseStream = null;
            StreamReader reader = null;
            string html = null;

            try
            {
                //create request (which supports http compression)
                request = (HttpWebRequest)WebRequest.Create(url);
                request.Pipelined = true;
                request.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate");
                if (updateDate != null)
                    request.IfModifiedSince = updateDate.Value;

                //get response.
                response = (HttpWebResponse)request.GetResponse();
                responseStream = response.GetResponseStream();
                if (response.ContentEncoding.ToLower().Contains("gzip"))
                    responseStream = new GZipStream(responseStream, CompressionMode.Decompress);
                else if (response.ContentEncoding.ToLower().Contains("deflate"))
                    responseStream = new DeflateStream(responseStream, CompressionMode.Decompress);

                //read html.
                reader = new StreamReader(responseStream, Encoding.Default);
                html = reader.ReadToEnd();
            }
            catch
            {
                throw;
            }
            finally
            {//dispose of objects.
                request = null;
                if (response != null)
                {
                    response.Close();
                    response = null;
                }
                if (responseStream != null)
                {
                    responseStream.Close();
                    responseStream.Dispose();
                }
                if (reader != null)
                {
                    reader.Close();
                    reader.Dispose();
                }
            }
            return html;
        }

Here is a sample url:

http://www.realtor.com/realestateandhomes-search/geneva_ny#listingType-any/pg-4

You'll see when the page first loads it says 134 listings found, then after a second it says 187 properties found.

回答1:

To execute the JavaScript I use webkit to render the page, which is the engine used by Chrome and Safari. Here is an example using its Python bindings.

Webkit also has .NET bindings but I haven't used them.



回答2:

The approach you have will not work regardless how long you wait, you need a browser to execute the javascript (or something that understands javascript).

Try this question: What's a good tool to screen-scrape with Javascript support?



回答3:

You would need to execute the javascript yourself to get this functionality. Currently, your code only receives whatever the server replies with at the URL you request. The rest of the listings are "showing up" because the browser downloads, parses, and executes the accompanying javascript.



回答4:

The answer to this similar question says to use a web browser control to read the page in and process it before scraping it. Perhaps with some kind of timer delay to give the javascript some time to execute and return results.