How do I completely mirror a web page?

I have several web pages on several different sites that I want to mirror completely. This means that I need the images, CSS, etc., and the links need to be converted. The result should be similar to using Firefox's "Save Page As" with "Web Page, complete" selected. I'd like to give the files and their corresponding directories sensible names (e.g. myfavpage1.html, myfavpage1.dir).

I do not have access to the servers, and they are not my pages. Here is one sample link: Click Me!

A little more clarification: I have about 100 pages that I want to mirror (many from slow servers). I will be cron'ing the job on Solaris 10 and dumping the results every hour to a samba mount for people to view. And, yes, I have obviously tried wget with several different flags, but I haven't gotten the results I am looking for, so pointing to the GNU wget page is not really helpful. Let me start with where I am with a simple example.

 wget --mirror -w 2 -p --html-extension --tries=3 -k -P stackperl.html "https://stackoverflow.com/tags/perl"

From this, I should see the https://stackoverflow.com/tags/perl page in the stackperl.html file, if I had the flags correct.

Tags: perl wget mirror

9 answers

ゆ、 Hurt°

Reply #2 · 2019-03-17 02:00

Personally, the last time I had the urge to do this, I wrote a Python script that made a copy of my browser cache, then manually visited all the pages I wished to mirror. A very ugly solution, but it has the nice advantage of not triggering any "don't scrape my page" alarms. Thanks to Opera's links tab bar, "manually" downloading tens of thousands of pages wasn't nearly as hard as you'd think.

Root（大扎）

Reply #3 · 2019-03-17 02:08

For a static HTML version of your sites, you could use WinHTTrack, a free, open-source program released under the GPL. It pulls down pre-rendered versions of your pages, graphics, documents, zip files, movies, etc. Of course, since this is a mirrored copy, any dynamic backend code such as database calls won't be dynamic anymore.

http://www.httrack.com/

看我几分像从前

Reply #4 · 2019-03-17 02:10

You can use GNU wget to grab an entire site like this:

wget -r -p -np -k URL

Or, if you use Perl, try these modules:

LWP::Simple
WWW::Mechanize
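
For the single-page case in the question, a narrower set of flags may be closer to what is wanted. In wget, `--mirror` turns on recursion with unlimited depth, and `-P` sets a directory prefix rather than an output file name, which may explain the result the question describes. Below is a minimal sketch, not a definitive solution: it fetches each page together with its images and CSS into its own directory and rewrites the links. The `DEST` and `WGET` variables, the function names, and the "name url" list format are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: fetch single pages (not whole sites) together with their requisites.
# DEST, WGET, and the "name url" list format are hypothetical choices.

DEST="${DEST:-./mirror}"   # assumed output root, e.g. the samba mount
WGET="${WGET:-wget}"       # overridable so the call can be stubbed for a dry run

# mirror_page NAME URL
# Saves URL plus its images/CSS under $DEST/NAME.dir with links rewritten.
#   -p  page requisites      -k  convert links for local viewing
#   -E  add .html extension  -H  allow requisites hosted on other domains
#   -nd flatten everything into one directory
#   -w 2 waits two seconds between retrievals, as in the question
mirror_page() {
    name=$1
    url=$2
    "$WGET" -p -k -E -H -nd \
        --tries=3 --timeout=30 -w 2 \
        -P "$DEST/$name.dir" \
        "$url"
}

# mirror_list FILE
# FILE holds one "name url" pair per line; blank lines and # comments are skipped.
mirror_list() {
    while read -r name url; do
        case "$name" in ''|'#'*) continue ;; esac
        mirror_page "$name" "$url" || echo "failed: $url" >&2
    done < "$1"
}
```

An hourly cron job could then source this file and call `mirror_list pages.txt`, where `pages.txt` (a hypothetical file) contains lines such as `stackperl https://stackoverflow.com/tags/perl`. Note that the HTML file keeps the name derived from its URL inside `stackperl.dir`; this sketch does not rename it to `stackperl.html`.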

ゆ、 Hurt°

Reply #5 · 2019-03-17 02:10

If wget is too complicated or you don't have a Linux box, you could always use WebZip.

Anthone

Reply #6 · 2019-03-17 02:16

It sounds like you want the caching functionality provided by a good proxy server.

Maybe look into something like Squid? I'm pretty sure it can do it.

This is more of a sysadmin question than a programming one, though.

疯言疯语

Reply #7 · 2019-03-17 02:17

In most modern websites, the front end tells only a small part of the story. Regardless of the tools used to strip out the HTML, CSS, and JavaScript, you will still be missing the core functionality that lives on the server.

Or maybe you were meaning something else.
