httrack wget curl scrape & fetch

Posted 2020-05-21 04:35

Question:

There are a number of tools on the internet for downloading a static copy of a website, such as HTTrack. There are also many tools, some commercial, for “scraping” content from a website, such as Mozenda. Then there are tools which are apparently built into programs like PHP and *nix, where you can “file_get_contents” or “wget” or “cURL” or just “file()”.

I am thoroughly confused by all of this, and I think the main reason is that none of the descriptions I have come across use the same vocabulary. On the surface, at least, it seems like they are all doing the same thing, but maybe not.

That is my question. What are these tools doing, exactly? Are they doing the same thing? Are they doing the same thing via different technology? If they aren’t doing the same thing, how are they different?

Answer 1:

First, let me clarify the difference between "mirroring" and "scraping".

Mirroring refers to downloading the entire contents of a website, or some prominent section(s) of it (including HTML, images, scripts, CSS stylesheets, etc). This is often done to preserve and expand access to a valuable (and often limited) internet resource, or to add additional fail-over redundancy. For example, many universities and IT companies mirror various Linux vendors' release archives. Mirroring may imply that you plan on hosting a copy of the website on your own server (with the original content owner's permission).

Scraping refers to copying and extracting some interesting data from a website. Unlike mirroring, scraping targets a particular dataset (names, phone numbers, stock quotes, etc.) rather than the entire contents of the site. For example, you could "scrape" average income data from the US Census Bureau or stock quotes from Google Finance. This is sometimes done against the terms and conditions of the host, which can make it unlawful.

The two can be combined in order to separate the data-copying concern (mirroring) from the information-extraction concern (scraping). For example, you may find that it's quicker to mirror a site and then scrape your local copy if extraction and analysis of the data is slow or process-intensive.

To answer the rest of your question...

The PHP functions file_get_contents and file are for reading a file from a local or remote machine. The file may be an HTML file, or it could be something else, like a text file or a spreadsheet. This is not what either "mirroring" or "scraping" usually refers to, although you could write your own PHP-based mirror/scraper using these.
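
For example, a quick way to see what these two functions return from the command line (a minimal sketch, assuming the PHP CLI is installed, allow_url_fopen is enabled, and using example.com as a placeholder URL):

    # file_get_contents() returns the whole response body as a single string:
    php -r 'echo file_get_contents("https://example.com/");'

    # file() returns the same content as an array of lines instead:
    php -r 'print_r(file("https://example.com/"));'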

wget and curl are command-line stand-alone programs for downloading one or more files from remote servers, using a variety of options, conditions and protocols. Both are incredibly powerful and popular tools, the main difference being that wget has rich built-in features for mirroring entire websites.
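
To make the distinction concrete, here is a rough sketch of typical invocations (example.com is a placeholder):

    # curl: fetch a single URL and save the response to a local file
    curl -o page.html https://example.com/

    # wget: mirror a whole site, rewriting links and pulling in page assets
    wget --mirror --convert-links --adjust-extension --page-requisites \
         --no-parent https://example.com/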

HTTrack is similar to wget in its intent, but is typically driven through a GUI (WinHTTrack on Windows, WebHTTrack on Linux/Unix) rather than the command line. This makes it easier to use for those not comfortable running commands from a terminal, at the cost of some of the power and flexibility that wget provides.
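
For completeness, HTTrack also ships a command-line binary, httrack; a minimal mirror invocation looks roughly like this (example.com and the output directory are placeholders):

    # Mirror a site into a local directory with HTTrack's command-line tool
    httrack https://example.com/ -O ./example-mirror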

You can use HTTrack and wget for mirroring, but you will have to run your own programs on the resulting downloaded data to extract (scrape) information, if that's your ultimate goal.
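
As a toy illustration of that second step (assuming a mirror already sits in ./example-mirror, a placeholder directory), you could pull all the link targets out of the downloaded HTML with standard shell tools; a real scraper would use a proper HTML parser:

    # Crude "scrape" of a local mirror: list every href target found in the HTML
    grep -rhoE 'href="[^"]+"' ./example-mirror | sort -u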

Mozenda is a scraper which, unlike HTTrack, wget or curl, allows you to target specific data to be extracted rather than blindly copying all contents. I have little experience with it, however.

P.S. I usually use wget to mirror the HTML pages I'm interested in, and then run a combination of Ruby and R scripts to extract and analyze data.



Answer 2:

One more, rather technical, point should be made: during downloading (mirroring), HTTrack and wget (with the appropriate options) not only download many HTML files, but also rewrite the internal links in those files so that the links still work when the downloaded pages are opened from their new location (on a new domain, or locally on your machine). These internal links include links from one downloaded page to another downloaded page, links to embedded images and other media, and links to "auxiliary" files such as JavaScript, CSS, and so on. (curl, by contrast, simply fetches the URLs you give it and does not rewrite anything.)
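
As a small sketch of that rewriting step (example.com is again a placeholder), wget performs it when given --convert-links; the pass runs after the download finishes and turns absolute internal URLs into relative, local ones:

    # After downloading, wget rewrites internal links so the copy works offline, e.g.
    # <a href="https://example.com/contact.html"> becomes <a href="contact.html">
    wget --recursive --convert-links --page-requisites https://example.com/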