How to 'Grab' content from another website

Posted 2019-03-04 23:25

A friend has asked me this, and I couldn't answer.

He asked: I am making this site where you can archive your site...

It works like this: you enter your site, e.g. something.com, and our site grabs the content on that website (images and so on) and uploads it to our site. Then people can view an exact copy of the site at oursite.com/something.com even if the server hosting something.com is down.

How could he do this (PHP?), and what would the requirements be?

3 Answers
男人必须洒脱
#2 · 2019-03-04 23:51

Use wget, either the Linux version or the Windows version from the GnuWin32 package; get it here.
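
As a minimal sketch, assuming the example domain from the question and a made-up output path, a mirroring invocation could look like this:

```bash
# Mirror something.com into a local directory, fetching page assets
# (images, CSS, JS) and rewriting links so the copy works offline.
# The output path is hypothetical; point it at wherever your site
# serves archived copies from.
wget --mirror \
     --page-requisites \
     --convert-links \
     --adjust-extension \
     --no-parent \
     --directory-prefix=/var/www/archive/something.com \
     https://something.com/
```

The --convert-links flag is what makes the saved copy viewable on its own: it rewrites links in the downloaded pages to point at the local files instead of the original server.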

SAY GOODBYE
#3 · 2019-03-05 00:03

It sounds like you need to create a web crawler. Web crawlers can be written in any language, although I would recommend C++ (using cURL), Java (using URLConnection), or Python (with urllib2). You could probably also hack something together quickly with the curl or wget commands and BASH, although that is probably not the best long-term solution. Also, don't forget that you should download, parse, and respect the "robots.txt" file, if it is present, whenever you crawl someone's website.
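
To illustrate the "curl and BASH" route, here is a rough sketch that fetches a single page and naively downloads the assets it references. The domain, output directory, and grep-based link extraction are all assumptions for illustration; a real crawler needs an HTML parser, recursion, and robots.txt handling:

```bash
#!/usr/bin/env bash
# Quick-and-dirty single-page grab with curl -- a sketch, not a real
# crawler: no recursion, and no robots.txt parsing (check
# https://something.com/robots.txt before doing this for real).
set -eu

site="https://something.com"
outdir="archive/something.com"   # hypothetical output location
mkdir -p "$outdir"

# Fetch the page itself.
curl -sL "$site/" -o "$outdir/index.html"

# Naively pull src= / href= attribute values out of the HTML.
grep -Eo '(src|href)="[^"]+"' "$outdir/index.html" \
  | sed -E 's/^(src|href)="([^"]+)"$/\2/' \
  | while read -r url; do
      case "$url" in
        http*) (cd "$outdir" && curl -sLO "$url") ;;        # absolute URL
        /*)    (cd "$outdir" && curl -sLO "$site$url") ;;   # site-relative
      esac
    done
```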

我想做一个坏孩纸
#4 · 2019-03-05 00:06
  1. Fetch the HTML using curl.
  2. Change all the image, CSS, and JavaScript URLs from relative to absolute so they still point at the original host (this is a bit unethical). Better, you can fetch all of those assets and host them from your own site; a crude sketch follows this list.
  3. Respect the "robots.txt" of all the sites; read here.
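
A crude sketch of steps 1 and 2 combined, assuming (as in the question) that the target is something.com: fetch the page with curl, then rewrite root-relative src/href attributes into absolute URLs. The sed pass and output filename are illustration only; a real implementation should use an HTML parser.

```bash
# Fetch the HTML, then rewrite root-relative asset URLs (src="/..."
# and href="/...") so they point back at the original host. This
# crude sed pass will miss relative paths like "img/x.png" and
# anything split across lines.
site="https://something.com"
curl -sL "$site/" \
  | sed -E "s#(src|href)=\"/#\1=\"$site/#g" \
  > archived_something.html   # hypothetical output file
```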