Is there a way to download partial part of a webpa

Posted 2019-05-10 02:43

Question:

We only want a particular element from the HTML document at nytimes.com/technology. The page contains many articles, but we only want each article's title, which is in a . If we use wget, cURL, or another tool, or a package such as requests in Python, the whole HTML document is returned. Can we limit the returned data to a specific element, such as the 's?

Answer 1:

HTTP knows nothing about HTML or the DOM. With HTTP you can fetch part of a document from a server that supports it by sending a Range request header (the server answers with 206 Partial Content and a Content-Range header), but you need to know the byte offsets of the data you want.

The short answer is that the web service itself must support what you're requesting. It is not something that can be provided at the HTTP layer.
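As a rough illustration of the byte-range mechanism described above, here is a minimal Python sketch using only the standard library. The helper names build_range_request and fetch_range are made up for this example, and the byte offsets are arbitrary. Whether a given server honours the header is an assumption you have to check per site.

```python
import urllib.request


def build_range_request(url: str, start: int, end: int) -> urllib.request.Request:
    """Build a GET request asking only for bytes start..end (inclusive)."""
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    # Range is the request header; Content-Range appears in the response.
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


def fetch_range(url: str, start: int, end: int) -> tuple:
    """Send the request and return (status code, body bytes)."""
    with urllib.request.urlopen(build_range_request(url, start, end)) as resp:
        # 206 means the server returned only the requested bytes;
        # 200 means it ignored the header and sent the whole document.
        return resp.status, resp.read()


# Hypothetical usage (performs a network call):
# status, body = fetch_range("https://example.com/", 0, 1023)
```

Note that the range is expressed in bytes, not in HTML elements, so this only helps if you already know where in the file the element sits.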



Answer 2:

If you specifically want to process parts of the HTML document at the NYTimes URL you give, you are probably going about it the wrong way. If you just want a list of the articles, by headline for instance, then what you want is the web feed. In this case, the Times publishes an RSS feed for that very category of articles. Note that if you open the feed in a browser, the browser will recognize it as a feed and handle it at a higher level, i.e. ask whether you want to subscribe. But you can fetch it with curl and see the unparsed XML. Each item in the feed represents an article and contains metadata such as the title and a URL to the full article.

Also note that there are probably feed-specific packages for whatever language or platform you are using that give you high-level access to the feed data. That lets you write code like:

foreach ( article in feed )
    title = article.getTitle();

rather than parsing the XML yourself.
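If no feed package is at hand, pulling the titles out of an RSS document is also short with Python's standard library. The sketch below assumes a plain RSS 2.0 layout (item elements each containing a title). SAMPLE_FEED and extract_titles are invented for illustration and are not the real NYTimes feed or any library's API.

```python
import xml.etree.ElementTree as ET

# Made-up feed content standing in for whatever the real feed returns.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Technology Feed</title>
    <item>
      <title>First headline</title>
      <link>https://example.com/articles/1</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/articles/2</link>
    </item>
  </channel>
</rss>"""


def extract_titles(rss_xml):
    """Return the title of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    titles = []
    for item in root.iter("item"):
        # findtext returns None when the item has no <title> child.
        title = item.findtext("title")
        if title is not None:
            titles.append(title.strip())
    return titles


for headline in extract_titles(SAMPLE_FEED):
    print(headline)
```

The channel's own title is skipped because only title elements nested inside an item are read.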



Answer 3:

Yes, cURL can download only the HTTP response headers and not the rest of the content. Use the -I switch to issue an HTTP HEAD request.

From the man page:

-I, --head

(HTTP/FTP/FILE) Fetch the HTTP-header only! HTTP-servers feature the command HEAD which this uses to get nothing but the header of a document. When used on a FTP or FILE file, curl displays the file size and last modification time only.
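For comparison, the same HEAD request can be sketched in Python with the standard library. The names build_head_request and fetch_headers are hypothetical helpers for this example. Keep in mind that a HEAD response carries HTTP headers such as Content-Type and Content-Length, not any part of the HTML body.

```python
import urllib.request


def build_head_request(url: str) -> urllib.request.Request:
    """Build a request that uses the HEAD method instead of GET."""
    return urllib.request.Request(url, method="HEAD")


def fetch_headers(url: str) -> dict:
    """Send the HEAD request and return the response headers as a dict."""
    with urllib.request.urlopen(build_head_request(url)) as resp:
        # The body of a HEAD response is empty; only headers are useful.
        return dict(resp.headers.items())


# Hypothetical usage (performs a network call):
# for name, value in fetch_headers("https://example.com/").items():
#     print(f"{name}: {value}")
```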