I am attempting to scrape a website that has some kind of Flash plugin which loads data after I retrieve the HTML. The following object is received in the page:
<OBJECT classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0" WIDTH="250" HEIGHT="20" id="Preloader"><PARAM NAME="movie" VALUE="/images/preloader.swf">
<PARAM NAME="quality" VALUE="high">
<PARAM NAME="bgcolor" VALUE="#FFFFFF"><EMBED src="/images/preloader.swf" quality="high" bgcolor="#FFFFFF" WIDTH="250" HEIGHT="20" NAME="Preloader" ALIGN="" TYPE="application/x-shockwave-flash" PLUGINSPAGE="http://www.macromedia.com/go/getflashplayer"></EMBED></OBJECT>
I've attempted to locate the data being received using Wireshark, but with no luck. I know nothing about this Flash plugin or how it works. I'm guessing that, in the worst case, I will not be able to do this at all. This is the code I currently use to fetch the page:
// Fetch the page at URL and return the raw HTML as a string.
HttpWebRequest mainRequest = (HttpWebRequest)WebRequest.Create(URL);
mainRequest.Method = "GET";
mainRequest.Proxy = null; // no proxy is used for this request
// The using blocks close the response and reader even if reading throws.
using (WebResponse mainResponse = mainRequest.GetResponse())
using (StreamReader dataReader = new StreamReader(mainResponse.GetResponseStream(), System.Text.Encoding.UTF8))
{
    return dataReader.ReadToEnd();
}
Does anyone know a way I can receive this data, or make the web response wait until the data has been injected into the HTML before it is returned? Any help would be greatly appreciated.
UPDATE: It seems I may have jumped the gun a little with the Flash object. I think it is just a loading animation shown while the table populates. I've been using Fiddler to see what is going on. After a request, the page is returned with a loading div that contains the Flash object. A few seconds later, when the data is ready, another page is returned with the data. From what I can remember (I'm not at home, so I cannot confirm right now), the request for the new page has the same headers as the original. There's no JSON or AJAX traffic in Fiddler, and there's no script on the client that I can see that would cause a refresh. I do not understand what is causing the page to update.
I've briefly looked at the WebBrowser control, but I imagine it would be quite a performance hit when I'm scraping about 200 pages, which currently takes a minute or so. I will try the AMF viewer later to confirm that the Flash object is not the source of the update.
I'm guessing that the server causes the page to be resent once it has the table ready. If the server is finding the loading div and replacing it with the table of data, would that cause the whole page to be resent? Or would that show up as AJAX/JSON traffic? If it is the server resending the data, how can I keep the response open until it is ready to send the new page?
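To make the question concrete, here is a rough sketch of the kind of workaround I have in mind: simply re-request the page until the loading object is gone. It rests on assumptions I have not confirmed: that the interim page can be recognised by the `id="Preloader"` object shown above, that a repeated GET eventually returns the finished page, and that the server may tie the pending result to a session cookie (hence the shared `CookieContainer`). The names `LoadingPagePoller`, `IsStillLoading`, `PollUntilReady` and `Fetch` are made up for illustration.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;
using System.Threading;

static class LoadingPagePoller
{
    // Assumption: the interim page is the one that still contains the
    // Preloader Flash object; the finished page does not.
    public static bool IsStillLoading(string html)
    {
        return html.IndexOf("id=\"Preloader\"", StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Calls fetch until the result no longer looks like the loading page,
    // or until maxAttempts fetches have been made. Returns the last page
    // fetched either way, so the caller should re-check IsStillLoading.
    public static string PollUntilReady(Func<string> fetch, int maxAttempts, TimeSpan delay)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException("maxAttempts");

        string html = fetch();
        for (int attempt = 1; attempt < maxAttempts && IsStillLoading(html); attempt++)
        {
            Thread.Sleep(delay);
            html = fetch();
        }
        return html;
    }

    // Same GET as in my code above, but with a shared CookieContainer so
    // that any session cookie set by the first response is sent on retries.
    public static string Fetch(string url, CookieContainer cookies)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.Proxy = null;
        request.CookieContainer = cookies;

        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

I would call it with something like `var cookies = new CookieContainer();` followed by `LoadingPagePoller.PollUntilReady(() => LoadingPagePoller.Fetch(URL, cookies), 10, TimeSpan.FromSeconds(1))`. If the update is actually triggered by something other than a plain repeated request, this will just return the loading page every time.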
Thanks. JM.