Get HTML that is generated via AJAX in WebClient

Published 2019-01-28 09:42

Question:

I often go to a site to look stuff up. I thought to myself: "Hold on. I can program. Why am I going to this site manually when I can write a piece of software that does it for me?"

And so I started. I'm using C#, so I found WebClient and Uri.

I've managed to get the source code for the site, but the specific data I'm looking for is generated via AJAX after the source code has loaded.

So that's my problem. How can I get that code, if it needs to be requested via an AJAX call first?

Answer 1:

The general approach is this:

  1. Using a tool like Fiddler, find out which HTTP requests the browser makes in order to fetch the data you're looking for.
  2. Use WebClient to issue the same HTTP request(s) yourself.
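
As a rough illustration of step 2, here is a minimal sketch that sends a single AJAX-style request with WebClient. The URL and the header values are placeholders, not taken from any real site; substitute whatever Fiddler shows for the request you identified in step 1.

```csharp
using System;
using System.Net;

static class AjaxFetch
{
    // Hypothetical endpoint; replace it with the request URL that Fiddler shows.
    const string DataUrl = "https://example.com/api/data?id=42";

    // Builds a WebClient whose request resembles a browser's AJAX call.
    public static WebClient CreateClient()
    {
        var client = new WebClient();
        // Assumption: the endpoint expects these headers. Copy the real values from Fiddler.
        client.Headers["X-Requested-With"] = "XMLHttpRequest";
        client.Headers[HttpRequestHeader.Accept] = "application/json, text/javascript, */*";
        return client;
    }

    static void Main()
    {
        using (var client = CreateClient())
        {
            try
            {
                // Returns the raw response body (often JSON or an HTML fragment).
                string body = client.DownloadString(DataUrl);
                Console.WriteLine(body);
            }
            catch (WebException ex)
            {
                // The placeholder URL above will most likely end up here.
                Console.WriteLine("Request failed: " + ex.Message);
            }
        }
    }
}
```

The response is usually JSON or a small HTML fragment rather than a full page, so it tends to be easier to parse than the original page source.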

Take a look at my answer to this question for more details about HTML screen scraping and how to work around various issues you may run across.

For #1 above, here's how to use Fiddler to understand how a specific request is being made:

First, find the request you care about (the request which contains the data you want in its response). You can do this by inspecting each request: double-click it in the left pane in Fiddler and look inside the "TextView" tab in the lower-right pane. You can also use CTRL+F to find content across multiple requests, but some responses are compressed, so make sure the "autodecode" button is selected in the toolbar before making your requests if you want to be able to text-search across all of them.

Once you've found the request you want, double-click it in Fiddler and select the "headers" tab in the upper-right pane. Those are the headers being sent. If your client sends exactly these headers to the server, you should get back the same data. But usually not all the headers are needed, so you'll want to figure out which ones are. You do this using Fiddler's Request Builder tab in the upper-right pane. Select that tab and drag your data request over from the left pane onto the request builder. Then submit the request to validate that it returns the correct results. Then delete headers one at a time, resubmitting after each deletion. If the request stops working, you know that header was required, so put it back. Repeat for each header until you know exactly which ones are required.

Then, you'll need to write code to generate the right headers. Don't worry about the Host: header; that's generated automatically for you. For the Cookie: header, you'll need to generate it using the CookieContainer class. For the other headers (e.g. User-Agent:, Accept:, etc.), you can generally copy them and add them to your request as-is.
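
WebClient does not expose a CookieContainer itself, so one common workaround is to subclass it and attach the container to the underlying HttpWebRequest. The sketch below assumes that approach; the class name `CookieAwareWebClient`, the URLs, the cookie name and value, and the User-Agent string are all made up for illustration and should be replaced with what Fiddler shows for your site.

```csharp
using System;
using System.Net;

// WebClient subclass that shares one CookieContainer across all of its requests.
class CookieAwareWebClient : WebClient
{
    public CookieContainer Cookies { get; } = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        var http = request as HttpWebRequest;
        if (http != null)
        {
            // The Cookie: header is generated from this container.
            http.CookieContainer = Cookies;
            // Placeholder value; copy the real User-Agent from Fiddler if the site requires one.
            http.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";
            // Ask for, and transparently decode, compressed responses.
            http.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
        }
        return request;
    }
}

static class CookieDemo
{
    static void Main()
    {
        // Hypothetical URLs: the normal page, and the AJAX endpoint found with Fiddler.
        var pageUri = new Uri("https://example.com/lookup");
        var dataUri = new Uri("https://example.com/api/data?id=42");

        using (var client = new CookieAwareWebClient())
        {
            try
            {
                // Loading the page first lets the server set its cookies in the container.
                client.DownloadString(pageUri);

                // Alternatively, add a known cookie by hand (hypothetical name and value).
                client.Cookies.Add(pageUri, new Cookie("session", "abc123"));

                Console.WriteLine(client.DownloadString(dataUri));
            }
            catch (WebException ex)
            {
                // Expected with the placeholder URLs above.
                Console.WriteLine("Request failed: " + ex.Message);
            }
        }
    }
}
```

Setting the headers inside `GetWebRequest` rather than on `client.Headers` is a deliberate choice here: it applies them to every request the client makes, including the priming page load.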