How to automate saving webpages?


I need to archive several hundred webpages in the style of what browsers call "Save as, complete", meaning they save an HTML file for the page itself along with a folder full of other files needed to render the page correctly, such as CSS, JS, and image files. This allows the pages to be viewed offline looking the same as when displayed online.

Here are the methods I've tried and the problems with each:

  • Manual process in Firefox:

    • Right-click the link for the next page. Press "A" for "Copy Link Location" to copy the destination URL to the clipboard.
    • Click the link to go to the page.
    • Type "Alt-F-A" for "Save page as". If not already selected, set "Save as type" to "Web page, complete".
    • If not already there, put the cursor in "File name". Press "Ctrl-V" (or "Shift-Insert") to paste the clipboard, which contains the URL of the current page.
    • Move the cursor to the end of the URL, then move it back until it reaches the last "/". Select from there to the left to select the path part of the URL.
    • Press "Delete" to remove the path from the URL, leaving just the filename.
    • Press "Enter" on the keyboard or "Ok" in the dialog box.
    • The page is now saved. Repeat the process by clicking the link for the next page. (This assumes each page has a "Next" link, which is true for the pages I'm archiving. If that were not true, then there would be an additional step of going back to the page with a list of all links and clicking the next one from there.)

    That's pretty tedious to do over and over again. It is the process I want to automate.

  • iMacros. This kind of repetitive task is exactly what macros are for. I've used iMacros before for similar tasks in a Web browser, but hadn't used it in a long time. I reinstalled it, figured out how to use it again, and wrote a one- or two-line macro to save the current page with the filename of its URL. Then, when I attempted to run it, iMacros informed me that the SaveAs command is not available in the free version and I need to step up to the $100 version (with a 30-day free trial) to get that capability. I was not impressed with what I'd seen in the current version of the software, and found it to be clunky and poorly documented. So I preferred to look for another solution.

  • Wget. This is very cool. Wikipedia describes it as "a computer program that retrieves content from web servers." It was new to me and took a while to figure out. Mainly billed as a Unix program, it's also available for Windows, and it's just a small executable file that requires no installation. I learned it well enough to download a few test pages, but when I pointed it at the pages I need to archive, it didn't work on them. I've sent an e-mail to the Wget mailing list about the problem and am waiting to see if I can figure it out with some help. (The linked e-mail has the Wget command line I used, including the URL of a page I want to archive, with attached image files of what the page looks like online and after being saved by Wget. A typical invocation for this kind of job is sketched right after this list of methods.)

    Edit: As of more than a week later, there has been no reply on the Wget mailing list.

  • Selenium. Although it describes itself as a tool for building "test cases" rather than macros, it looks like a much higher-quality macro system than iMacros. So I tried it out. But I found that it doesn't record everything I need to do in the procedure under "Manual process in Firefox" above. For example, when I right-clicked the link and pressed "A" to copy the linked URL, Selenium added nothing to the script it was recording. After following the link, when I saved the page, Selenium again recorded nothing. So, while it looks like quality software, it doesn't seem to have the capability I need, unless I'm misunderstanding something.
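
For reference, a Wget invocation for saving a single page "complete" generally looks something like the line below. The URL here is only a placeholder, not the page from my e-mail, and the exact flags may need adjusting per site:

wget --page-requisites --convert-links --adjust-extension --span-hosts --no-directories --directory-prefix=saved-page "https://example.com/book/chapter1.html"

Here --page-requisites pulls in the CSS, JS, and images the page needs, --convert-links rewrites the references so they work offline, --adjust-extension adds .html where needed, and --span-hosts allows requisites hosted on other domains. Note that Wget doesn't execute JavaScript, which is one common reason a page saved this way can still look different from the live page.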

So I'm stumped. I'm not going to do that manual process several hundred times. So I need to find a way to automate it. How can I do that?

1 Answer

I've accepted Tim Vanderzeil's answer because he directed me to the tool that I needed for this. Now I want to share what I've done with what he gave me. The solution is only semi-automated because of a problem with Kantu, but it's far and away better than trying to do it all manually. I'm posting this here both to share what I've learned and to see if anyone can offer improvements, including a solution to the problem that is preventing full automation.

First, let me mention some background on the technology, which is interesting. Kantu, and especially its extension XModules (which is what I needed for this project), is pretty new. The company that makes them was founded in 2016, and Kantu was announced in September 2017. But the history goes deeper than that, since the company's founders include Mathias Roth, the original developer of iMacros. Kantu is also a different implementation of another tool I mentioned in my question, Selenium. So there's a lot of cross-pollination in this esoteric field of browser automation.

Many people have been asking on Stack Overflow for a long time how to automate saving webpages, such as 1, 2, 3, 4, 5, and 6. None of the answers appear to me to be all that helpful. That's a bit strange, because every browser has this capability, so there must be modules for it floating around somewhere; I don't know why I can't just call a function for it from PHP. The question linked as #5 above says the capability comes into browsers through "WebKit", but knowing that hasn't led me anywhere useful yet.

So, in the meantime, until I find that PHP function, I have to do it by turning my Web browser into a robot. I developed the code below for a few e-books behind a paywall that I have a legitimate account for and want to preserve for offline use, and that are not offered as pdfs. I determined two ways I could download the pages with Kantu:

  • I massaged the HTML of the table-of-contents pages to extract the needed URLs and put them into CSV files. These can be read with Kantu's csvRead command. The URL is passed to the open command to open the page, then the XType command sends Ctrl-S (or Alt-F-A) to tell the browser to save the page. XType is used again to enter the filename to save as (the part of the URL after the last "/"), and a final XType sends Enter to conclude the browser's Save-As dialog. Loop this, and the book is saved. The looping can be done either inside the macro using a label and the gotoLabel command, or the macro can be written to do one page and the looping can be done from Kantu's GUI. (A rough sketch of this CSV-driven variant appears after this list.)

  • Alternatively, I can use the links on each page to go to the next page. This is the process I described in my question. I first used Kantu's recording feature to get the identification of the next-page link, and used that as data in the code of the macro below (specifically as the "target" of the XClick and click commands). I start Kantu on the first webpage, and the macro uses the XClick command to right-click the next-page link, then XType to send "A" to the browser, telling it to copy the linked URL to the clipboard. Then the click command clicks the link to open the page, and the rest is the same as the previous method. Here, I'm using the next-page links to get the URLs instead of a CSV file.
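
For completeness, here is a rough, untested sketch of the first (CSV-driven) variant. It assumes a hypothetical file urls.csv with the full URL in column 1 and the bare filename in column 2 (so the macro doesn't have to trim the URL itself), and it's meant to be run once per row, e.g. with Kantu's loop feature in the GUI:

{"Name": "SavePageFromCsv",
 "Commands":
  [{"Command": "comment",
    "Target":  "Sketch only: assumes urls.csv has the full URL in column 1 and the bare filename in column 2. Run once per CSV row, e.g. via Kantu's looping in the GUI.",
    "Value":   ""
    },
   {"Command": "csvRead",
    "Target":  "urls.csv",
    "Value":   ""
    },
   {"Command": "open",
    "Target":  "${!COL1}",
    "Value":   ""
    },
   {"Command": "bringBrowserToForeground",
    "Target":  "",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Open the Save-As dialog and wait for it to appear.",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_CTRL+KEY_S}",
    "Value":   ""
    },
   {"Command": "pause",
    "Target":  "2000",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Type the filename from the CSV (replacing the browser's suggestion) and confirm.",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${!COL2}",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_ENTER}",
    "Value":   ""
    }
   ]
 }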

Now, I mentioned that there is a problem in Kantu that prevents this from being fully automated. The last step of the process, sending Enter to the browser to conclude the Save-As dialog, is flaky for unknown reasons. Sometimes it works, and sometimes the dialog box just sits there, requiring me to press Enter myself to allow the process to move on to the next webpage. This is tedious and means that I need to participate in the process instead of leaving it running on its own. So, not perfect, but a whole lot better than having to do all the rest of the procedure manually as well, which would be out of the question for several hundred pages.

The free version of XModules has a limit of 25 commands per run. To pass that limit there is a one-time charge of $50. That would probably be well worth it if I could let the process run on its own. But since I have to babysit it anyway, I'm currently running the macro by clicking on Kantu's Play macro button for each page as well as watching for when I need to press Enter.

I've posted about the Enter problem and some other issues on Kantu's forum. Their team has been very responsive and helpful. I hope that I or they or someone reading this can figure out a solution. In the meantime, the semi-automated process is better than nothing.

Between the two methods described above, it's only the second one, using the next-page links to get the URLs, that can run without a loop, i.e., with a manual press of Play macro for each page. So that's the one I've been using for now. The code has a rather inelegant repetition of 25 Ctrl-Lefts as a workaround for the surprising absence of the Home key in XType's vocabulary, as well as the absence (as far as I've found) of a command for repeating a key-press.

Here is the Kantu code, in JSON:

{"Name": "SavePageAsComplete",
 "CreationDate": "2019-01-03",
 "Commands":
  [{"Command": "comment",
    "Target":  "Macro for Kantu with XModules. Based on demo macros DemoXClick and 
         DemoXType and docs https://a9t9.com/kantu/docs/xclick and https://a9t9.com/kantu/docs/xtype. 
         The target in the XClick and click commands are what was obtained from 
         attempting to record this macro on the website, which resulted in only an open 
         command and two identical click commands with that target.",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Set play speed to 0.3 seconds. (See Kantu manual section 'Setting the right macro replay speed'.)",
    "Value":   ""
    },
   {"Command": "store",
    "Target":  "medium",
    "Value":   "!replayspeed"
    },
   {"Command": "bringBrowserToForeground",
    "Target":  "",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Right-click the link for the next page and copy its URL to the clipboard.",
    "Value":   ""
    },
   {"Command": "XClick",
    "Target":  "//*[@id=\"container\"]/div[2]/section/div[2]/a/div",
    "Value":   "#right"
    },
   {"Command": "XType",
    "Target":  "A",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Click the link for the next page. (Tried with 'clickAndWait' instead in 
         order to wait for the page to load, but that yielded error 'No page load 
         event detected after 10 seconds.')",
    "Value":   ""
    },
   {"Command": "click",
    "Target":  "//*[@id=\"container\"]/div[2]/section/div[2]/a/div",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Open the Save-as dialog.",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_CTRL+KEY_S}",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Wait for the dialog to appear.",
    "Value":   ""
    },
   {"Command": "pause",
    "Target":  "2000",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Paste the clipboard (URL of now-current page) into Filename text box.",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_CTRL+KEY_V}",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Move the cursor to the beginning of the URL. (There is no Home key!)",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Select from the beginning of the URL to the end of its path part.",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Delete the selection, leaving just the filename.",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_DEL}",
    "Value":   ""
    },
   {"Command": "pause",
    "Target":  "500",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Save the page.",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_ENTER}",
    "Value":   ""
    }
   ]
 }

Maybe this will be of some help to other people who've been wanting to automate saving of pages. And if anyone can improve on this, maybe you could say how in a comment or another answer. Especially if you know why the Save-As dialog box doesn't close reliably, and know how to fix that.
