Web scraping attempt at website with flash plugin

Posted 2019-09-20 03:14

Question:

I am attempting to scrape a website that has some kind of Flash plugin which loads data after I retrieve the HTML. The following object is received in the page:

<OBJECT classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0" WIDTH="250" HEIGHT="20" id="Preloader"><PARAM NAME="movie" VALUE="/images/preloader.swf">
      <PARAM NAME="quality" VALUE="high">
      <PARAM NAME="bgcolor" VALUE="#FFFFFF"><EMBED src="/images/preloader.swf" quality="high" bgcolor="#FFFFFF" WIDTH="250" HEIGHT="20" NAME="Preloader" ALIGN="" TYPE="application/x-shockwave-flash" PLUGINSPAGE="http://www.macromedia.com/go/getflashplayer"></EMBED></OBJECT>

I've attempted to locate the data being received in Wireshark, but with no luck. My knowledge of this Flash plugin and how it works is nil. I'm guessing that in the worst case I will not be able to do this. This is the code I currently use to fetch the page:

// Requires: using System.IO; using System.Net; using System.Text;
HttpWebRequest mainRequest = (HttpWebRequest)WebRequest.Create(URL);
mainRequest.Method = "GET";
mainRequest.Proxy = null; // skip automatic proxy detection
// The using blocks close the response and the reader even if reading throws.
using (WebResponse mainResponse = mainRequest.GetResponse())
using (StreamReader dataReader = new StreamReader(mainResponse.GetResponseStream(), Encoding.UTF8))
{
    return dataReader.ReadToEnd();
}

Does anyone know a way I can receive this data, or make the web response wait until the data has been injected into the HTML before it is received? Any help would be greatly appreciated.

UPDATE: It seems I may have jumped the gun a little with the Flash object. I think it is just a loading animation shown while the table populates. I've been using Fiddler to see what is going on. After a request, the page is returned with a loading div that contains the Flash object. A few seconds later, when the data is ready, another page is returned with the data. From what I can remember (I'm not at home, so I cannot confirm right now), the new page has the same request header as the original. There's no JSON or AJAX data in Fiddler, and there's no script on the client that I can see that would cause a refresh. I do not understand what is causing the page to update.

I've briefly looked at the WebBrowser object, but I imagine it would be quite a performance hit when I'm scraping about 200 pages, which currently takes a minute or so. I will try the AMF viewer later to confirm that the Flash object is not the source of the update.

I'm guessing that the server causes the page to be resent once it has the table ready. If the server finds the loading div and replaces it with the table of data, would that cause the whole page to be resent? Or wouldn't that show up as AJAX/JSON data? If it is the server resending the data, how can I keep the response open until it is ready to send the new page?

Thanks. JM.

Answer 1:

If the content is being loaded dynamically into the Flash movie, it's very likely happening over a standard HTTP request. Wireshark may be a little overkill for detecting something like this. I'd recommend using a utility that captures HTTP traffic, such as Charles, HttpFox, or screen-scraper. Using one of those tools, watch the HTTP requests that occur while the content is loading. Once you determine which request it is, you can likely just replicate it in your code.

That said, I've also seen cases (though not very common) where the data is loaded into the Flash movie over a binary protocol, which makes things a little more difficult. AMF is often the protocol used in these cases. Charles proxy will detect this protocol, so that may be the tool to use in this case. A while back I wrote a blog post on extracting data that's delivered via AMF. It deals with a Java library, but you may be able to find something equivalent in .NET.
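As a rough way to tell the two cases apart in code, you could look at the Content-Type of the response you captured. The sketch below assumes the server labels AMF bodies with the conventional `application/x-amf` media type, which not every server does; the class and method names are made up for illustration.

```csharp
using System;

public static class ResponseKind
{
    // Media type conventionally used for AMF payloads (assumption: the server sets it).
    private const string AmfContentType = "application/x-amf";

    // Returns true when a Content-Type header value indicates an AMF body.
    public static bool IsAmf(string contentType)
    {
        if (string.IsNullOrEmpty(contentType))
        {
            return false;
        }
        // Drop parameters such as "; charset=utf-8" and surrounding whitespace.
        string mediaType = contentType.Split(';')[0].Trim();
        return string.Equals(mediaType, AmfContentType, StringComparison.OrdinalIgnoreCase);
    }
}
```

You would pass it the ContentType property of the response. If it returns true, read the body as raw bytes and hand it to an AMF decoder instead of reading it as text with a StreamReader.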



Answer 2:

You won't be able to do that with a plain HttpWebRequest because the Flash content isn't running. The response you get back is just the HTML. It takes a browser (or a browser-like object) to actually execute the page, load that object, and pull down the content. I know there are libraries for executing JavaScript, but I don't know of anything that will let you run a Flash plugin outside of a browser.

You might be better off using a WebBrowser object. But even if it will execute the Flash content (I honestly don't know if it will), you might not be able to access it. You'll have to look at the DOM and see.



Answer 3:

Use Firebug and/or TamperData: load your page with Flash as usual, and wait until Flash makes the HTTP GET or POST that fetches the data.

Flash has three options to get data:

  • Sockets
  • HTTP GET
  • HTTP POST

You can fool this thing any day. You just have to make sure your request contains all of these little things:

  • Method (GET or POST)
  • Cookies
  • Form Values (why? session state, for example)
  • URL Referrer
  • User Agent
  • Custom HTTP-Headers? (some guys might put this in the HTTP request so no one can "fool" the server)

These can make the difference between getting a response with data and getting a default HTML error page.
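The checklist above could be sketched with HttpWebRequest roughly like this. It only builds the request and encodes the form values; nothing is sent. The class name, method names, and every value you pass in are placeholders for whatever your capture tool shows, not anything the site is known to require.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

public static class CapturedRequest
{
    // Builds (but does not send) a request that mirrors one seen in a capture tool.
    public static HttpWebRequest Build(
        string url, string method, string referer, string userAgent,
        CookieContainer cookies, IDictionary<string, string> customHeaders)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = method;            // GET or POST
        request.CookieContainer = cookies;  // reuse the container from the first page load
        request.Referer = referer;          // URL referrer
        request.UserAgent = userAgent;      // browser identification string
        foreach (KeyValuePair<string, string> header in customHeaders)
        {
            // Custom headers only; restricted ones such as Referer or
            // User-Agent have to be set through the properties above.
            request.Headers[header.Key] = header.Value;
        }
        return request;
    }

    // URL-encodes form values for an application/x-www-form-urlencoded POST body.
    public static string EncodeForm(IDictionary<string, string> fields)
    {
        List<string> pairs = new List<string>();
        foreach (KeyValuePair<string, string> field in fields)
        {
            pairs.Add(Uri.EscapeDataString(field.Key) + "=" + Uri.EscapeDataString(field.Value));
        }
        return string.Join("&", pairs);
    }
}
```

For a POST, set ContentType to "application/x-www-form-urlencoded" and write the output of EncodeForm to the stream from GetRequestStream() before calling GetResponse(). Sharing one CookieContainer between the initial page request and the follow-up request is what carries the session across.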

One last thing: if the content is delivered via HTTPS, don't worry. It's just an extra layer, and replicating the request is still possible.

If the content is delivered via sockets, then forget it.