How to save complete html page with frames/iframes

Posted 2019-08-09 08:18

Question:

While web scraping, I want to save the current page's HTML to a file for later debugging. browser.html helps in most cases, but when the page contains an iframe/frame, its content is not returned by browser.html; I have to get it separately with something like browser.iframe.html. There are also cases where an iframe contains another iframe. I can find every frame recursively and save its content, but separate files won't be very useful because I won't know the exact structure of the page.

For example I have the following page:

<!DOCTYPE html>
<html>
<head>
</head>
  <frameset cols="50%,20%,30%">
     <frame name="left" src="/html/left_frame.htm" />
     <frame name="right" src="/html/right_frame.htm" />
     <noframes>
       <body>
          Your browser does not support frames.
       </body>
     </noframes>
     <frame src="http://example.com"/>
  </frameset>
</html>

I want to save it to a file using Watir. Any ideas?

Answer 1:

Frames act much like completely separate web pages. While you can see a frame's content in the rendered document and in the DOM, that content is not technically part of the page's HTML. You can see this in the browser: right-click the main document and view its source, then compare that with what you get when you right-click content inside a frame and view the frame's source.

To write all the html out to files, you are likely going to need to make a method that writes out html of a frame, looks for other frames, and calls the same method recursively on any frames found inside.
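A minimal sketch of such a recursive method follows. The method name dump_with_frames, the BEGIN/END comment markers and the path labels are made up for illustration. It assumes each object it visits responds to html, frames and iframes, which a Watir browser and Watir frame elements are expected to do. Writing everything into one list with nested markers is one possible way to keep the page structure visible in a single file.

```ruby
# Recursively collect the HTML of a page and of every frame/iframe inside it.
# `context` is assumed to respond to #html, #frames and #iframes
# (e.g. a Watir browser or a Watir frame element; this is an assumption).
# The BEGIN/END markers are a hypothetical convention that records nesting.
def dump_with_frames(context, path = 'top', out = [])
  out << "<!-- BEGIN #{path} -->"
  out << context.html

  # Treat <frame> and <iframe> children the same way.
  children = context.frames.to_a + context.iframes.to_a
  children.each_with_index do |child, index|
    dump_with_frames(child, "#{path}/frame[#{index}]", out)
  end

  out << "<!-- END #{path} -->"
  out
end

# Possible usage with Watir (not executed here; file name is arbitrary):
#   require 'watir'
#   browser = Watir::Browser.new
#   browser.goto 'http://example.com'
#   File.write('page_with_frames.html', dump_with_frames(browser).join("\n"))
```

Each frame's HTML ends up between its own BEGIN and END markers, nested inside the markers of its parent, so the saved file shows which frame contained which.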

Alternatively, look at a gem like Nokogiri, which is designed to parse HTML. It might have better methods for this sort of thing, or existing examples of how to do what you want.