I'm trying to get the article body text using HTML Agility Pack. Here's a sample of the HTML I'm trying to parse:
<p itemprop="articleBody">
Hundreds of thousands of Ukrainians filled the streets of Kiev on Sunday, first to hear speeches and music and then to fan out and erect barricades in the district where government institutions have their headquarters.</p><p itemprop="articleBody">
Carrying blue-and-yellow Ukrainian and European Union flags, the teeming crowd filled
Independence Square, where protests have steadily gained momentum since Mr. Yanukovich refused on Nov. 21 to sign trade and political agreements with the European Union. The square has been transformed by a vast and growing tent encampment, and demonstrators have occupied City Hall and other public buildings nearby. Thousands more people gathered in other cities across the country. </p><p itemprop="articleBody">
“Resignation! Resignation!” people in the Kiev crowd chanted on Sunday, demanding that Mr. Yanukovich and the government led by Prime Minister Mykola Azarov leave office. </p>
I'm trying to parse the HTML above using the following code:
HtmlAgilityPack.HtmlWeb nytArticlePage = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument nytArticleDoc = new HtmlAgilityPack.HtmlDocument();
System.Diagnostics.Debug.WriteLine(articleUrl);
nytArticleDoc = nytArticlePage.Load(articleUrl);
var articleBodyScope =
    nytArticleDoc.DocumentNode.SelectNodes("//p[@itemprop='articleBody']");
EDIT:
But it seems like articleBodyScope is null, because:
if (articleBodyScope != null)
{
    System.Diagnostics.Debug.WriteLine("CONTENT NOT NULL");
    foreach (var node in articleBodyScope)
    {
        articleBodyText += node.InnerText;
    }
}
does not print "CONTENT NOT NULL", and articleBodyText remains empty.
If anyone could point me to the solution I'd be grateful. Thanks in advance!
It seems that the New York Times actually detects that you're not accepting cookies from them. As such, they present you with a cookie warning and a logon box. By providing a CookieContainer you can have .NET handle the whole cookie business under the hood, and NYT will actually present you its contents. With thanks to this answer for the extended WebClient class.
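A minimal sketch of that approach, assuming a WebClient subclass that attaches one CookieContainer to every request it makes. The names CookieAwareWebClient, ArticleScraper, ExtractArticleBody and DownloadArticleBody are made up for illustration; only the XPath expression comes from the question. It also assumes the page still marks its paragraphs with itemprop="articleBody", which may no longer hold.

```csharp
using System;
using System.Net;
using System.Text;
using HtmlAgilityPack;

// Hypothetical helper: a WebClient that stores cookies and sends them back
// on later requests, including the redirects issued while cookies are set.
public class CookieAwareWebClient : WebClient
{
    private readonly CookieContainer cookies = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        var httpRequest = request as HttpWebRequest;
        if (httpRequest != null)
        {
            // Share one container so cookies persist across requests.
            httpRequest.CookieContainer = cookies;
        }
        return request;
    }
}

public static class ArticleScraper
{
    // Parses an HTML string and joins the text of every matching paragraph.
    public static string ExtractArticleBody(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty list) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//p[@itemprop='articleBody']");
        if (nodes == null)
        {
            return string.Empty;
        }

        var text = new StringBuilder();
        foreach (HtmlNode node in nodes)
        {
            // Decode entities such as &amp; and trim surrounding whitespace.
            text.AppendLine(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }
        return text.ToString();
    }

    // Downloads the page with cookies enabled, then extracts the body text.
    public static string DownloadArticleBody(string articleUrl)
    {
        using (var client = new CookieAwareWebClient())
        {
            client.Encoding = Encoding.UTF8;
            string html = client.DownloadString(articleUrl);
            return ExtractArticleBody(html);
        }
    }
}
```

With this in place, the call in the question would become something like articleBodyText = ArticleScraper.DownloadArticleBody(articleUrl). Keeping the parsing in its own method makes it possible to check the XPath against a saved copy of the page without touching the network.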
Note: It might be against the NYT terms of use to blatantly scrape the news stories off their website.