Use HttpWebRequest to get a remote page's title

Posted 2019-05-31 18:29

Question:

I have a web service that acts as an interface between a farm of websites and some analytics software. Part of the analytics tracking requires harvesting the page title. Rather than passing it from the webpage to the web service, I would like to use HttpWebRequest to request the page.

I have code that gets the entire page and parses the HTML to grab the title tag, but I don't want to download the entire page just to get information that's in the head.

I've started with

HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create("url");  
request.Method = "HEAD";

Answer 1:

Great idea, but a HEAD request returns only the HTTP headers of the response. The title element is not among them: it sits in the HTML head, which is part of the HTTP message body.



Answer 2:

Try this:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Net;
using System.IO;
using System.Text.RegularExpressions;

namespace ConsoleApplication2
{
    class Program
    {
        static void Main(string[] args)
        {
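            string page = @"http://stackoverflow.com/";
            HttpWebRequest req = (HttpWebRequest)WebRequest.Create(page);

            // Dispose the response and reader so the connection is released
            // as soon as the title has been found.
            using (WebResponse resp = req.GetResponse())
            using (StreamReader sr = new StreamReader(resp.GetResponseStream()))
            {
                // Accumulate the chunks so a title split across two reads still matches.
                StringBuilder seen = new StringBuilder();
                char[] buf = new char[256];
                int count;
                while ((count = sr.Read(buf, 0, buf.Length)) > 0)
                {
                    seen.Append(buf, 0, count);
                    // Require the closing tag so a half-read title is never printed.
                    Match match = Regex.Match(seen.ToString(), @"<title[^>]*>([^<]+)</title", RegexOptions.IgnoreCase);
                    if (match.Success)
                    {
                        Console.WriteLine(match.Groups[1].Value.Trim());
                        break; // stop reading; the rest of the page is not needed
                    }
                }
            }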
        }

    }
}


Answer 3:

If you don't want to request the entire page, you can request it in pieces. The HTTP specification defines a request header called Range. You would use it like this:

Range: bytes=0-100

You can look through the returned content for the title. If it is not there, request Range: bytes=101-200, and so on until you get what you need.

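Obviously, the web server needs to support range requests, so this may be hit or miss. A server that honors the header answers with status 206 (Partial Content); one that does not may ignore the header and send the whole page with status 200.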
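
As a rough illustration of the Range approach, here is a minimal sketch using HttpWebRequest.AddRange. The class name RangeTitleSketch and the helpers ExtractTitle and FetchPrefix are made up for this example, and the code assumes the page is UTF-8 encoded; treat it as a starting point rather than a finished implementation.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of any library.
static class RangeTitleSketch
{
    // Returns the title text if a complete <title>...</title> pair is present, else null.
    public static string ExtractTitle(string html)
    {
        Match m = Regex.Match(html, @"<title[^>]*>(.*?)</title\s*>",
                              RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value.Trim() : null;
    }

    // Asks the server for only the first byteCount bytes of the page.
    public static string FetchPrefix(string url, int byteCount)
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
        req.AddRange(0, byteCount - 1); // sends "Range: bytes=0-(byteCount-1)"

        using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
        using (Stream body = resp.GetResponseStream())
        {
            // A server that ignores Range replies with the full page,
            // so the read is capped at byteCount either way.
            byte[] buf = new byte[byteCount];
            int total = 0, n;
            while (total < buf.Length &&
                   (n = body.Read(buf, total, buf.Length - total)) > 0)
            {
                total += n;
            }
            // Assumption: UTF-8 content; a real client would inspect resp.CharacterSet.
            return Encoding.UTF8.GetString(buf, 0, total);
        }
    }
}
```

A call such as RangeTitleSketch.ExtractTitle(RangeTitleSketch.FetchPrefix("http://stackoverflow.com/", 1024)) returns null when the title is not within the first kilobyte; the caller can then retry with a larger prefix. Because the read is capped locally, the sketch also avoids downloading the whole body when the server ignores the Range header.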



Answer 4:

So I would have to go with something like...

// URL is a string holding the address of the page to fetch.
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(URL);
using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
using (StreamReader sr = new StreamReader(resp.GetResponseStream()))
{
    string buffer = sr.ReadToEnd();
    const string openTag = "<title>";
    // Ordinal comparison: tag names are ASCII, so culture rules should not apply.
    int startPos = buffer.IndexOf(openTag, StringComparison.OrdinalIgnoreCase);
    int endPos = startPos < 0
        ? -1
        : buffer.IndexOf("</title>", startPos, StringComparison.OrdinalIgnoreCase);
    Console.WriteLine("Response code from {0}: {1}", URL, resp.StatusCode);
    // Only slice when both tags were found; otherwise Substring would throw.
    if (endPos >= 0)
    {
        startPos += openTag.Length;
        Console.WriteLine("Page title: {0}", buffer.Substring(startPos, endPos - startPos));
    }
}