HTML Agility Pack: Get Content Of <p> Elements

Posted 2019-09-01 07:04

I'm trying to get the content of the <p itemprop="articleBody"> elements using HTML Agility Pack. Here's a sample of the HTML I'm trying to parse:

         <p itemprop="articleBody">
    Hundreds of thousands of Ukrainians filled the streets of Kiev on Sunday, first to hear speeches and music and then to fan out and erect barricades in the district where government institutions have their headquarters.</p><p itemprop="articleBody">
    Carrying blue-and-yellow Ukrainian and European Union flags, the teeming crowd filled 
Independence Square, where protests have steadily gained momentum since Mr. Yanukovich refused on Nov. 21 to sign trade and political agreements with the European Union. The square has been transformed by a vast and growing tent encampment, and demonstrators have occupied City Hall and other public buildings nearby. Thousands more people gathered in other cities across the country.        </p><p itemprop="articleBody">
    “Resignation! Resignation!” people in the Kiev crowd chanted on Sunday, demanding that Mr. Yanukovich and the government led by Prime Minister Mykola Azarov leave office.        </p>

I'm trying to parse the HTML above using the following code:

HtmlAgilityPack.HtmlWeb nytArticlePage = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument nytArticleDoc = new HtmlAgilityPack.HtmlDocument();

System.Diagnostics.Debug.WriteLine(articleUrl);
nytArticleDoc = nytArticlePage.Load(articleUrl);
var articleBodyScope = 
        nytArticleDoc.DocumentNode.SelectNodes("//p[@itemprop='articleBody']");

EDIT:

But it seems that articleBodyScope is null, because:

if (articleBodyScope != null)
{
    System.Diagnostics.Debug.WriteLine("CONTENT NOT NULL");
    foreach (var node in articleBodyScope)
    {
        articleBodyText += node.InnerText;
    }
}

It does not print "CONTENT NOT NULL", and articleBodyText remains empty. If anyone could point me to the solution, I'd be grateful. Thanks in advance!

1 answer
手持菜刀,她持情操
#2 · 2019-09-01 07:23

It seems that the New York Times detects that you are not accepting its cookies. Instead of the article, it serves a cookie warning and a login box, so the XPath query matches nothing and SelectNodes returns null. If you provide a CookieContainer, .NET handles the cookies under the hood and the NYT serves the actual article content:

using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;

namespace UnitTestProject3
{
    using System.Net;
    using System.Runtime;

    using HtmlAgilityPack;

    [TestClass]
    public class UnitTest1
    {
        [TestMethod]
        public void WhenProvidingCookiesYouSeeContent()
        {
            HtmlDocument doc = new HtmlDocument();

            WebClient wc = new WebClientEx(new CookieContainer());

            string contents = wc.DownloadString(
                "http://www.nytimes.com/2013/12/10/world/asia/thailand-protests.html?partner=rss&emc=rss&_r=1&");
            doc.LoadHtml(contents);

            var nodes = doc.DocumentNode.SelectNodes(@"//p[@itemprop='articleBody']");

            Assert.IsNotNull(nodes);
            Assert.IsTrue(nodes.Count > 0);
        }
    }

    public class WebClientEx : WebClient
    {
        public WebClientEx(CookieContainer container)
        {
            this.container = container;
        }

        // Assigned in the constructor; shared by every request this client makes.
        private readonly CookieContainer container;

        protected override WebRequest GetWebRequest(Uri address)
        {
            WebRequest r = base.GetWebRequest(address);
            var request = r as HttpWebRequest;
            if (request != null)
            {
                request.CookieContainer = container;
            }
            return r;
        }

        protected override WebResponse GetWebResponse(WebRequest request, IAsyncResult result)
        {
            WebResponse response = base.GetWebResponse(request, result);
            ReadCookies(response);
            return response;
        }

        protected override WebResponse GetWebResponse(WebRequest request)
        {
            WebResponse response = base.GetWebResponse(request);
            ReadCookies(response);
            return response;
        }

        private void ReadCookies(WebResponse r)
        {
            var response = r as HttpWebResponse;
            if (response != null)
            {
                CookieCollection cookies = response.Cookies;
                container.Add(cookies);
            }
        }
    }
}

With thanks to this answer for the extended WebClient class.
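
As an alternative sketch, not part of the original answer: on newer .NET versions the same idea can probably be expressed with HttpClient, since HttpClientHandler accepts a CookieContainer directly and no WebClient subclass is needed. The class and method names below (ArticleScraper, CreateCookieAwareClient, ExtractArticleBody, DownloadArticleBodyAsync) are hypothetical, and whether the site still serves the article to such a client is an assumption.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using HtmlAgilityPack;

public static class ArticleScraper
{
    // Builds an HttpClient whose handler stores cookies from responses
    // and sends them back on later requests, including redirects.
    public static HttpClient CreateCookieAwareClient(CookieContainer cookies)
    {
        var handler = new HttpClientHandler
        {
            CookieContainer = cookies,
            UseCookies = true,        // the default, stated explicitly
            AllowAutoRedirect = true  // the default, stated explicitly
        };
        return new HttpClient(handler);
    }

    // Pure parsing step: returns the text of every <p itemprop="articleBody">,
    // one paragraph per line, or an empty string when nothing matches.
    public static string ExtractArticleBody(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when no node matches.
        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//p[@itemprop='articleBody']");
        if (nodes == null)
        {
            return string.Empty;
        }

        var text = new StringBuilder();
        foreach (HtmlNode node in nodes)
        {
            // Decode entities such as &amp; and trim the surrounding whitespace.
            text.AppendLine(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }
        return text.ToString();
    }

    // Downloads the page with cookies enabled, then extracts the article text.
    public static async Task<string> DownloadArticleBodyAsync(string articleUrl)
    {
        using (HttpClient client = CreateCookieAwareClient(new CookieContainer()))
        {
            string html = await client.GetStringAsync(articleUrl);
            return ExtractArticleBody(html);
        }
    }
}
```

Keeping the parsing step separate from the download makes it easy to check the XPath against a saved copy of the page. If ExtractArticleBody returns text for the sample HTML but nothing for the downloaded page, the problem lies in what the server sent, not in the query.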

Note

Scraping news stories off the NYT website might be against its terms of use.
