
Question:

Last night a customer called, frantic, because Google had cached versions of private employee information. The information is not available unless you login.

They had done a Google search for their domain, e.g.:

site:example.com

and noticed that Google had crawled, and cached, some internal pages.

Looking at the cached versions of the pages myself:

This is Google's cache of https://example.com/(F(NSvQJ0SS3gYRJB4UUcDa1z7JWp7Qy7Kb76XGu8riAA1idys-nfR1mid8Qw7sZH0DYcL64GGiB6FK_TLBy3yr0KnARauyjjDL3Wdf1QcS-ivVwWrq-htW_qIeViQlz6CHtm0faD8qVOmAzdArbgngDfMMSg_N4u45UysZxTnL3d6mCX7pe2Ezj0F21g4w9VP57ZlXQ_6Rf-HhK8kMBxEdtlrEm2gBwBhOCcf_f71GdkI1))/ViewTransaction.aspx?transactionNumber=12345. It is a snapshot of the page as it appeared on 15 Sep 2013 00:07:22 GMT

I was confused by the long url. Rather than:

https://example.com/ViewTransaction.aspx?transactionNumber=12345

there was a long string inserted:

https://example.com/[...snip...]/ViewTransaction.aspx?transactionNumber=12345

It took me a few minutes to remember: that might be a symptom of ASP.net's "cookie-less sessions". If your browser does not support Set-Cookie, the web-site will embed a cookie in the URL.

Except our site doesn't use that.

And even if our site did have cookie-less sessions auto-detected, and Google managed to cajole the web-server into handing it a session in the url, how did it take over another user's session?

Yes, ~~Google~~ a non-malicious bot hijacked a session

The site has been crawled by bots for years. And this past May 29 was no different.

Google usually starts its crawl by checking the robots.txt file (we don't have one). But nobody is allowed to read anything on the site (including robots.txt) without first being authenticated, so it fails:

Time      Uri                      Port  User Name         Status
========  =======================  ====  ================  ======
1:33:04   GET /robots.txt          80                      302    ;not authenticated, see /Account/Login.aspx
1:33:04   GET /Account/Login.aspx  80                      302    ;use https please
1:33:04   GET /Account/Login.aspx  443                     200    ;go ahead, try to login

All that time Google was looking for a robots.txt file. It never got one. Then it returns to try to crawl the root:

Time      Uri                      Port  User Name         Status
========  =======================  ====  ================  ======
1:33:04   GET /                    80                      302    ;not authenticated, see /Account/Login.aspx
1:33:04   GET /Account/Login.aspx  80                      302    ;use https please
1:33:04   GET /Account/Login.aspx  443                     200    ;go ahead, try to login

And another check of robots.txt on the secure site:

Time      Uri                      Port  User Name         Status
========  =======================  ====  ================  ======
1:33:04   GET /robots.txt          443                     302    ;not authenticated, see /Account/Login.aspx
1:33:04   GET /Account/Login.aspx  443                     200    ;go ahead, try to login

And then the stylesheet on the login page:

Time      Uri                      Port  User Name         Status
========  =======================  ====  ================  ======
1:33:04   GET /Styles/Site.css     443                     200

And that's how every crawl from GoogleBot, msnbot, and BingBot works. Robots, login, secure, login. Never getting anywhere, because it cannot get past WebForms Authentication. And all is well with the world.

Until one day; out of nowhere

Until one day, GoogleBot shows up, with a Session cookie in hand!

Time      Uri                        Port  User Name            Status
========  =========================  ====  ===================  ======
1:49:21   GET /                      443   jatwood@example.com  200    ;they showed up logged in!
1:57:35   GET /ControlPanel.aspx     443   jatwood@example.com  200    ;now they're crawling that user's stuff!
1:57:35   GET /Default.aspx          443   jatwood@example.com  200    ;back to the homepage
2:07:21   GET /ViewTransaction.aspx  443   jatwood@example.com  200    ;and here comes the private information

The user, jatwood@example.com, had not been logged in for over a day. (I was hoping that IIS had given the same session identifier to two simultaneous visitors, separated by an application recycle.) And our site (web.config) is not configured to enable cookieless sessions. And the server (machine.config) is not configured to enable cookieless sessions.

So:

how did Google get ahold of a cookieless session?
how did Google get ahold of a valid cookieless session?
how did Google get ahold of a valid cookieless session that belonged to another user?

As recently as October 1 (4 days ago), the GoogleBot was still showing up, cookie in hand, logging in as this user, crawling, caching, and publishing, some of their private details.
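Spotting these visits by eye in the logs is tedious. The following is a minimal sketch, assuming the logs are in the IIS W3C extended format with a `#Fields:` header that includes `cs-username` and `cs(User-Agent)`; the function name and the list of bot markers are illustrative, not part of IIS. It lists every request where a crawler's user agent shows up with an authenticated user name attached.

```python
def authenticated_bot_hits(lines, bot_markers=("googlebot", "bingbot", "msnbot")):
    """Return (time, uri, user) for crawler requests that carry a user name.

    Assumes W3C extended log lines: a '#Fields:' directive names the columns,
    and each data row is space-separated (IIS writes '+' for spaces inside
    the user agent, and '-' for an empty field).
    """
    fields = []
    hits = []
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # Column names for the rows that follow.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            continue
        row = dict(zip(fields, line.split()))
        agent = row.get("cs(User-Agent)", "").lower()
        user = row.get("cs-username", "-")
        # An anonymous request logs '-' as the user name.
        if user != "-" and any(marker in agent for marker in bot_markers):
            hits.append((row.get("time", ""), row.get("cs-uri-stem", ""), user))
    return hits
```

Feeding it the lines of a log file (for example `authenticated_bot_hits(open(path))`, with `path` being wherever your logs live) gives a quick list of which accounts a crawler has been riding on, and when.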

How is ~~Google~~ a non-malicious web-crawler bypassing WebForms authentication?

IIS7, Windows Server 2008 R2, single server.

Theories

The server is not configured to give out cookieless sessions. But ignoring that fact, how can Google bypass authentication?

GoogleBot is visiting the web-site, and attempting random usernames and passwords (not likely, the logs show no attempts to login)
GoogleBot decided to insert a random cookieless session into the url string, and it happened to match the session of an existing user (not likely)
The user managed to figure out how to make an IIS web-site return a cookieless url (not likely), then pasted that url onto another web-site (not likely), where Google found the cookieless url and crawled it
The user is running through a mobile proxy (which they're not). The proxy server doesn't support cookies, so IIS creates a cookieless session. That (e.g. Opera Mobile) caching server was breached (not likely) and all cached links posted on a hacker forum. GoogleBot crawled the hacker forum, and started following all links; including our jatwood@example.com cookieless session url.
The user has a virus, which manages to cajole any IIS web-server into handing back a cookieless url. That virus then reports back to headquarters. The urls are posted onto a publicly accessible resource that GoogleBot crawls. GoogleBot then shows up at our server with the cookieless url.

None of these are really plausible.

How can ~~Google~~ a non-malicious web-crawler bypass WebForms authentication, and hijack a user's existing session?

What are you asking?

I don't even know how an ASP.net web-site that is not configured to give out cookieless sessions could give out a cookieless session. Is it possible to back-convert a cookie-based session id into a cookieless session id? I could quote the relevant <sessionState> section of web.config and machine.config, and show there is no presence of

<sessionState cookieless="true">

How does the web-server decide that the browser doesn't support cookies? I tried blocking cookies in Chrome, and I was never given a cookie-less session identifier. Can I simulate a browser that doesn't support cookies, in order to verify that my server is not giving out cookieless sessions?

Does the server decide cookieless sessions by User-Agent string? If so, I could set Internet Explorer with a spoofed UA.
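One way to test this without touching browser settings is a small script that sends no cookies, uses whatever User-Agent you choose, and refuses to follow redirects, so you can see whether the server's `Location` header contains an embedded identifier. This is a rough sketch, not a definitive test: the `/(X(...))/` pattern is an assumption about how ASP.NET lays out cookieless URLs, and `probe` and `has_cookieless_segment` are hypothetical helper names.

```python
import re
import urllib.error
import urllib.request

# Assumed layout of an ASP.NET cookieless segment, e.g. /(S(...))/ or /(F(...))/
COOKIELESS = re.compile(r"/\((?:[A-Z]\([^()]*\))+\)/")


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Surface 3xx responses as errors instead of following them."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def has_cookieless_segment(url):
    """True if the URL (absolute or relative) embeds a cookieless segment."""
    return bool(COOKIELESS.search(url or ""))


def probe(url, user_agent):
    """Request url with no cookies; return (status, Location or final URL)."""
    # build_opener adds no cookie handling unless asked, so nothing is stored.
    opener = urllib.request.build_opener(NoRedirect)
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener.open(req, timeout=10) as resp:
            return resp.status, resp.geturl()
    except urllib.error.HTTPError as err:
        # Redirects land here because NoRedirect declined to follow them.
        return err.code, err.headers.get("Location")


# Usage (needs network access to your own server):
#   status, location = probe("https://example.com/", "Googlebot/2.1")
#   print(status, location, has_cookieless_segment(location))
```

Running it with several User-Agent strings (a desktop browser, a crawler, an old mobile browser) would show whether the answer changes with the agent.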

Does session identity in ASP.net depend solely on the cookie? Can anyone, from any IP, with the cookie-url, access that session? Does ASP.net not, by default, also take the client's IP address into account?

If ASP.net does tie IP address with the session, wouldn't that mean that the session couldn't have originated from the employee at their home computer? Because then when the GoogleBot crawler tried to use it from a Google IP, it would have failed?

Have there been any instances anywhere (besides the one I linked) of ASP.net giving out cookieless sessions when it's not configured to? Is there a Microsoft Connect issue on this?

Is WebForms authentication known to have issues, and should it not be used for security?

Bonus Reading

A guy on StackOverflow whose web-server is sometimes giving out cookieless urls when it's not configured to

Edit: Removed name of ~~Google~~ the bot that bypassed privilege, as people were confusing ~~Google~~ the name of the crawler for something else. I use ~~Google~~ the name of the crawler as a reminder that it was a non-malicious web-crawler that managed to crawl its way into another user's WebForms session. This is to contrast it with a malicious crawler, that was trying to break into another user's session. Nothing like a pedant to bring out the aggravation.

Answer 1:

Though the question mainly references session identifiers, the length of the identifier struck me as unusual.

There are at least two types of cookie/cookieless operations that can modify the query string to include an ID.

Cookieless sessions
Cookieless forms authentication tokens

They are completely independent of each other (as far as I can tell).
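The two can usually be told apart by looking at the URL itself. The sketch below assumes the layout commonly seen in ASP.NET cookieless URLs, where a session ID appears as `(S(...))` and a forms authentication ticket as `(F(...))` inside one parenthesised path segment; the function name and the label strings are my own.

```python
import re

# Assumed layout: /(<letter>(<value>)<letter>(<value>)...)/rest-of-path
_SEGMENT = re.compile(r"/\(((?:[A-Z]\([^()]*\))+)\)/")
_ENTRY = re.compile(r"([A-Z])\(([^()]*)\)")

# Labels for the prefixes I'm aware of; anything else is reported by its letter.
KINDS = {
    "S": "session id",
    "F": "forms authentication ticket",
    "A": "anonymous id",
}


def cookieless_tokens(url):
    """Map each embedded token's kind to its value; empty dict if none found."""
    match = _SEGMENT.search(url)
    if not match:
        return {}
    return {KINDS.get(kind, kind): value
            for kind, value in _ENTRY.findall(match.group(1))}
```

Applied to the cached URL in the question, which starts with `/(F(`, this would report a forms authentication ticket rather than a session ID.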

Session State

A cookieless session allows the server to access session state data based on a unique ID in the URL versus a unique ID in a cookie. This is usually considered a fine practice, though ASP.Net reuses session IDs which makes it more prone to session fixation attempts (separate topic but worth knowing about).

Does session identity in ASP.net depend solely on the cookie? Can anyone, from any IP, with the cookie-url, access that session? Does ASP.net not, by default, also take the client's IP address into account?

The session ID is all that is required.

General Session Security Reading

Forms Authentication

Based on the length of the example data, I'm guessing your URL actually contains a forms authentication value, not a session ID. The source code suggests that cookieless mode is not something you must explicitly enable.

/// <summary>ASP.NET determines whether to use cookies based on
/// <see cref="T:System.Web.HttpBrowserCapabilities" /> setting. 
/// If the setting indicates that the browser or device supports cookies, 
/// cookies are used; otherwise, an identifier is used in the query string.</summary>
UseDeviceProfile

Here's how the determination is made:

// System.Web.Security.CookielessHelperClass
internal static bool UseCookieless( HttpContext context, bool doRedirect, HttpCookieMode cookieMode )
{
    switch( cookieMode )
    {
        case HttpCookieMode.UseUri:
            return true;
        case HttpCookieMode.UseCookies:
            return false;
        case HttpCookieMode.AutoDetect:
            {
                // omitted for length
                return false;
            }
        case HttpCookieMode.UseDeviceProfile:
            if( context == null )
            {
                context = HttpContext.Current;
            }
            return context != null && ( !context.Request.Browser.Cookies || !context.Request.Browser.SupportsRedirectWithCookie );
        default:
            return false;
    }
}

Guess what the default is? HttpCookieMode.UseDeviceProfile. ASP.Net maintains a list of devices and capabilities. This list is generally a very bad thing; for example, IE11 gives a false positive for being a downlevel browser on par with Netscape 4.

Causes

I think Gene's explanation is very likely; Google found the URL from some user action and crawled it.

It's completely conceivable that the Google bot is deemed to not support cookies. But this doesn't explain the origin of the URL, i.e. what user action resulted in Google seeing a URL with an ID already in it? A simple explanation could be a user with a browser that was deemed to not support cookies. Depending on the browser, everything else could look fine to the user.

The timing, i.e. the duration of validity seems long, though I'm not that familiar with how long the authentication tickets are valid and under what circumstances they could be renewed. It's entirely possible ASP.Net continued to reissue/renew tickets as it would do for a continually active user.

Possible Solutions

I'm making a lot of assumptions here, but if I'm correct:

First, reproduce the behavior in your environment.

Explicitly disable cookieless behavior by using HttpCookieMode.UseCookies.

web.config:

 <authentication mode="Forms">
    <forms loginUrl="~/Account/Login.aspx" name=".ASPXFORMSAUTH" timeout="26297438"
           cookieless="UseCookies" />
 </authentication>

While this should resolve the behavior, you might investigate extending the forms authentication HTTP module and adding additional validation (or at least logging/diagnostics).
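One detail in the snippet above is worth checking against your own configuration: the `timeout` attribute on `<forms>` is expressed in minutes. A quick conversion, as a throwaway sketch (the helper name is made up), shows what the value in the snippet amounts to.

```python
def minutes_to_years(minutes):
    """Convert a forms timeout given in minutes to approximate years."""
    minutes_per_day = 60 * 24
    days_per_year = 365.25  # average, including leap years
    return minutes / minutes_per_day / days_per_year


# The timeout value from the web.config snippet above.
print(round(minutes_to_years(26297438), 1))  # → 50.0
```

If the real site uses a timeout of that order, a ticket that leaks into a URL would stay usable for decades, which would be consistent with the crawler still getting in months later.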

Answer 2:

You asked for thoughts, so I'll give some. No warranty expressed or implied.

Give up the idea that your site is configured not to encode session information in URIs. With very high probability it did so. Either you're wrong about the configuration or (more likely) there's a bug that caused it to do so.

That leaves the central question: how did Google obtain the session URI?

You didn't say anything about the customer base. Here's a guess:

A customer logged into the system in a way that produced a URI encoding of the session, then emailed this using a gmail account to someone else. Google scanned the email and provided the URI to the crawler bot.

There are other, similar ways that a customer whose client produced the URI could inadvertently surrender it to Google. Google Drive document. Google Plus posting. Etc.

Google may not be evil, but they're nonetheless everywhere. Their use agreement lets them move links across product boundaries, in this case mail (etc.) to search.

The real question you should be thinking about is why your site is not protected from cross-site request forgery. The Rails folks explain this pretty nicely. The Rails protect_from_forgery mechanism would have prevented the reported problem.

A related question is why the encoded cookie (apparently) never expires. It ought to be easy to make sessions contain timestamps to make this so.
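To illustrate the timestamp idea, here is a minimal sketch of a signed token that carries its issue time and is rejected once it is too old. This is not ASP.NET's ticket format; the `user|issued_at|signature` layout, the function names, and the key handling are all assumptions made for the example.

```python
import hashlib
import hmac


def _sign(payload, key):
    """Hex HMAC-SHA256 of the payload under the server-side key."""
    return hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(user, issued_at, key):
    """Build 'user|issued_at|signature' (hypothetical layout)."""
    payload = f"{user}|{int(issued_at)}"
    return f"{payload}|{_sign(payload, key)}"


def validate_token(token, now, key, max_age_seconds):
    """Return the user if the token is authentic and fresh, else None."""
    try:
        user, issued_at, signature = token.rsplit("|", 2)
    except ValueError:
        return None  # not enough parts to be one of our tokens
    expected = _sign(f"{user}|{issued_at}", key)
    # Constant-time comparison, so timing does not leak the signature.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    if not issued_at.isdigit() or now - int(issued_at) > max_age_seconds:
        return None  # expired: a leaked URL stops working after max_age
    return user
```

With a check like this, a URL that leaked on one day would be useless the next, whichever way it reached the crawler.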
