I'm working on a project that involves some basic web crawling. I've been using HttpWebRequest and HttpWebResponse quite successfully. For cookie handling I just have one CookieContainer that I assign to HttpWebRequest.CookieContainer each time. It automatically gets populated with the new cookies each time and requires no additional handling from me. This had all been working fine until a little while ago, when one of the web sites that used to work suddenly stopped working. I'm reasonably sure it's a problem with the cookies, but I didn't keep a record of the cookies from when it used to work, so I'm not 100% sure.
I've managed to simulate the issue as I see it with the following code:
CookieContainer cookieJar = new CookieContainer();
Uri uri1 = new Uri("http://www.somedomain.com/some/path/page1.html");
CookieCollection cookies1 = new CookieCollection();
cookies1.Add(new Cookie("NoPathCookie", "Page1Value"));
cookies1.Add(new Cookie("CookieWithPath", "Page1Value", "/some/path/"));
Uri uri2 = new Uri("http://www.somedomain.com/some/path/page2.html");
CookieCollection cookies2 = new CookieCollection();
cookies2.Add(new Cookie("NoPathCookie", "Page2Value"));
cookies2.Add(new Cookie("CookieWithPath", "Page2Value", "/some/path/"));
Uri uri3 = new Uri("http://www.somedomain.com/some/path/page3.html");
// Add the cookies from page1.html
cookieJar.Add(uri1, cookies1);
// Add the cookies from page2.html
cookieJar.Add(uri2, cookies2);
// We should now have 3 cookies
Console.WriteLine(string.Format("CookieJar contains {0} cookies", cookieJar.Count));
Console.WriteLine(string.Format("Cookies to send to page1.html: {0}", cookieJar.GetCookieHeader(uri1)));
Console.WriteLine(string.Format("Cookies to send to page2.html: {0}", cookieJar.GetCookieHeader(uri2)));
Console.WriteLine(string.Format("Cookies to send to page3.html: {0}", cookieJar.GetCookieHeader(uri3)));
This simulates visiting two pages, both of which set two cookies. It then checks which of those cookies would be sent to each of three pages.
Of the two cookies, one is set without specifying a path and the other has a path specified. When a path is not specified, I had assumed that the cookie would be sent back to any page in that domain, but it seems to only get sent back to that specific page. I'm now assuming that is correct as it is consistent.
The main problem for me is the handling of cookies with a path specified. Surely, if a path is specified then the cookie should be sent to any page contained within that path. So, in the code above, 'CookieWithPath' should be valid for any page within /some/path/, which includes page1.html, page2.html and page3.html. Certainly if you comment out the two 'NoPathCookie' instances, then the 'CookieWithPath' gets sent to all three pages as I would expect. However, with the inclusion of 'NoPathCookie' as above, then 'CookieWithPath' only gets sent to page2.html and page3.html, but not page1.html.
Why is this, and is it correct?
Searching for this issue I have come across discussion about a problem with domain handling in CookieContainer, but have not been able to find any discussion about path handling.
I'm using Visual Studio 2005 / .NET 2.0
Yep, that's correct. Whenever the domain or path is not specified, it's taken from the current URI.
OK, let's take a look at CookieContainer. The method in question is InternalGetCookies(Uri). Here's the interesting part:
enumerator2 here is a (sorted) list of cookies' paths. It is sorted in such a way that more specific paths (like /directory/subdirectory/) go before less specific ones (like /directory/), and otherwise in lexicographical order (/directory/page1 goes before /directory/page2).
The code actually does the following: it iterates over this list of cookies' paths until it finds the first path that is a prefix of the requested URI's path. Then it adds the cookies under that path to the output and sets flag2 to true, which means "OK, I finally found the place in the list that actually relates to the requested URI". After that, the first path encountered that is NOT a prefix of the requested URI's path is considered to be the end of the related paths, so the code stops searching for cookies by doing break.
Obviously, this is some kind of optimization to prevent scanning the whole list, and it apparently works if none of the paths leads to a concrete page. Now, for your case, the path list looks like this:
/some/path/page1.html
/some/path/page2.html
/some/path/
You can check that with a debugger, looking up ((System.Net.PathList)(cookieJar.m_domainTable["www.somedomain.com"])).m_list in the watch window.
So, for the 'page1.html' URI, the code breaks on the page2.html item, without having a chance to process the /some/path/ item as well.
In conclusion: this is obviously yet another bug in CookieContainer. I believe it should be reported on Connect.
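The scan described above can be reproduced in isolation with a small sketch. This is a simplified stand-in, not the actual CookieContainer source: MatchingPaths is a hypothetical helper, the prefix test is assumed to be a plain ordinal StartsWith, and the path order is the one implied by the behaviour described for your case.

```csharp
using System;
using System.Collections.Generic;

class PathScanSketch
{
    // Hypothetical helper mimicking the described scan: collect paths that are
    // prefixes of the request path, and stop at the first non-matching path
    // seen after at least one match.
    static List<string> MatchingPaths(IList<string> sortedPaths, string requestPath)
    {
        List<string> matches = new List<string>();
        bool foundFirstMatch = false; // plays the role of flag2

        foreach (string cookiePath in sortedPaths)
        {
            if (requestPath.StartsWith(cookiePath, StringComparison.Ordinal))
            {
                foundFirstMatch = true;
                matches.Add(cookiePath);
            }
            else if (foundFirstMatch)
            {
                // Early exit: everything after this point is never examined.
                break;
            }
        }
        return matches;
    }

    static void Main()
    {
        // Assumed order: page-specific paths first, the shared directory last.
        string[] sortedPaths =
        {
            "/some/path/page1.html",
            "/some/path/page2.html",
            "/some/path/"
        };

        foreach (string page in new string[] { "page1.html", "page2.html", "page3.html" })
        {
            string requestPath = "/some/path/" + page;
            List<string> matches = MatchingPaths(sortedPaths, requestPath);
            Console.WriteLine("{0}: {1}", page, string.Join(", ", matches.ToArray()));
        }
    }
}
```

Under these assumptions, page1.html matches only /some/path/page1.html, because the scan breaks on the page2.html entry before reaching /some/path/. page2.html matches both its own path and /some/path/, and page3.html matches /some/path/ alone, which lines up with what you observed for 'CookieWithPath'.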
PS: That's too many bugs per one class. I only hope the guy at MS who wrote tests for this class is already fired.