How to handle recaptcha on third-party site in my

Posted 2019-07-28 01:47

Question:

I was curious about how people build third-party apps for sites with NO public APIs, but I could not really find any tutorials on this topic. So I decided to just give it a try. I created a simple desktop application, which uses HttpClient to send GET requests to the site I frequently use, and then parses the response and displays the data in my WPF window. This approach worked pretty well (probably because the site is fairly simple).

However, today I tried to run my application from a different place, and I kept getting 403 errors in response to its requests. It turned out that the network I was using went through a VPN server, while the site I was trying to access used Cloudflare as a protection layer, which apparently forces VPN users to solve a reCAPTCHA in order to access the target site.

var baseAddress = new Uri("http://www.cloudflare.com");
using (var client = new HttpClient() { BaseAddress = baseAddress })
{
   var message = new HttpRequestMessage(HttpMethod.Get, "/");
   //this line returns the Cloudflare home page on a regular network, and the reCAPTCHA page when I use a VPN
   var result = await client.SendAsync(message);
   //this line throws if I use VPN (403 Forbidden)
   result.EnsureSuccessStatusCode();
}

Now the question is: what is the proper way to deal with Cloudflare protection in a client application? Do I have to display the reCAPTCHA in my application just like a web browser does? Do I have to set any particular headers in order to get a proper response instead of a 403? Any tips are welcome, as this is a completely new area to me.

P.S. I write in C# because this is the language I'm most comfortable with, but I don't mind answers in any other language as long as they answer the question.

Answer 1:

I guess one way to go about it is to handle the captcha in a web browser, outside the client application.

  1. Parse the response to see if it is a captcha page.
  2. If it is, open this page in a browser.
  3. Let the user solve the captcha there.
  4. Fetch the Cloudflare cookies from the browser's cookie storage. You will need __cfduid (user ID) and cf_clearance (proof of solving the captcha).
  5. Attach those cookies to the requests sent by the client application.
  6. Use the application as normal for the next 24 hours (until the Cloudflare cookies expire).
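
Steps 1 and 2 above could be sketched roughly as follows. This is only a minimal sketch under assumptions: the `ChallengeDetector` class and its method names are hypothetical, and the heuristic (a 403 or 503 status combined with a `Server` header mentioning Cloudflare) is a guess at what the challenge response looks like, not a documented contract, so check what the target site actually returns.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Net;
using System.Net.Http;

// Hypothetical helper, not part of any library.
public static class ChallengeDetector
{
    // Heuristic (an assumption): a challenge page arrives with a 403 or 503
    // status and a Server header that mentions "cloudflare".
    public static bool LooksLikeChallenge(HttpStatusCode status, string serverHeader)
    {
        bool blocked = status == HttpStatusCode.Forbidden
                    || status == HttpStatusCode.ServiceUnavailable;
        return blocked
            && serverHeader != null
            && serverHeader.IndexOf("cloudflare", StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Convenience overload that pulls the values out of a real response.
    public static bool LooksLikeChallenge(HttpResponseMessage response)
    {
        IEnumerable<string> values;
        string server = response.Headers.TryGetValues("Server", out values)
            ? string.Join(" ", values)
            : null;
        return LooksLikeChallenge(response.StatusCode, server);
    }

    // Step 2: hand the page to the user's default browser so they can solve the captcha.
    public static void OpenInDefaultBrowser(Uri uri)
    {
        Process.Start(new ProcessStartInfo(uri.AbsoluteUri) { UseShellExecute = true });
    }
}
```

In the snippet from the question, this check would replace the unconditional `result.EnsureSuccessStatusCode()` call: if `LooksLikeChallenge(result)` is true, open the page in the browser instead of throwing.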

Now the hard part here is (4). It's easy to manually copy-paste the cookies to make the code snippet in my question work with VPN:

var baseAddress = new Uri("http://www.cloudflare.com");
var cookieContainer = new CookieContainer();
using (var client = new HttpClient(new HttpClientHandler() { CookieContainer = cookieContainer } , true) { BaseAddress = baseAddress })
{
    var message = new HttpRequestMessage(HttpMethod.Get, "/");
    //I've also copy-pasted all the headers from browser
    //some of those might be optional
    message.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0");
    message.Headers.Add("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
    message.Headers.Add("Accept-Encoding", "gzip, deflate");
    message.Headers.Add("Accept-Language", "en-US;q=0.5,en;q=0.3");
    //adding CloudFlare cookies
    cookieContainer.Add(new Cookie("__cfduid", "copy-pasted-cookie-value", "/", "cloudflare.com"));
    cookieContainer.Add(new Cookie("cf_clearance", "copy-pasted-cookie-value", "/", "cloudflare.com"));
    var result = await client.SendAsync(message);
    result.EnsureSuccessStatusCode();
}

But I think it's going to be tricky to automate fetching the cookies, because different browsers store cookies in different places and/or formats. Not to mention that this approach needs an external browser to work, which is really annoying. Still, it's something to consider.



Answer 2:

The answer to "build third-party apps for sites with NO public APIs" is that even though some software vendors don't have public APIs, they do have partner programs.

A good example is Netflix, which used to have a public API. Some of the apps developed while the public API was available were allowed to continue using it.

In your scenario, your client app acts as a web crawler (downloading HTML content and trying to parse information out of it). What you are trying to do is crawl Cloudflare data that is not meant to be crawled by a third-party app (bot). From Cloudflare's side, they have done the correct thing by putting up a captcha that prevents automated requests.

Further, if you send requests at a high frequency (requests/sec) and Cloudflare has threat detection mechanisms, your IP address will be blocked. I assume they have already identified the IP address of the VPN server you are using and blacklisted it, which is why you are getting a 403.
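
If the client does keep polling the site, one way to keep the request rate low is to enforce a minimum gap between requests. The sketch below is illustrative only: `RequestThrottle` is a hypothetical class, and whatever interval you pick is your own choice, not a documented Cloudflare threshold.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper that spaces requests at least minInterval apart.
public sealed class RequestThrottle
{
    private readonly TimeSpan _minInterval;
    private DateTime _lastRequestUtc = DateTime.MinValue;

    public RequestThrottle(TimeSpan minInterval)
    {
        _minInterval = minInterval;
    }

    // Pure helper: how long to wait before the next request may be sent.
    public static TimeSpan RemainingDelay(DateTime lastRequestUtc, DateTime nowUtc, TimeSpan minInterval)
    {
        TimeSpan elapsed = nowUtc - lastRequestUtc;
        return elapsed >= minInterval ? TimeSpan.Zero : minInterval - elapsed;
    }

    // Call before each request. Not thread-safe: intended for a single sequential caller.
    public async Task WaitAsync()
    {
        TimeSpan delay = RemainingDelay(_lastRequestUtc, DateTime.UtcNow, _minInterval);
        if (delay > TimeSpan.Zero)
        {
            await Task.Delay(delay);
        }
        _lastRequestUtc = DateTime.UtcNow;
    }
}
```

A caller would create one `RequestThrottle` per site and `await throttle.WaitAsync()` immediately before each `client.SendAsync(...)`.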

Basically, you depend solely on security holes in the Cloudflare pages you try to access via the client app. This is a sort of hacking of Cloudflare (doing something Cloudflare has restricted), which I would not recommend.

If you have a cool idea, it is better to contact their developer team and discuss it with them.



Answer 3:

In case you still need it, I had the very same problem and came up with the following solution 2 years ago.

It opens the Cloudflare-protected web page with the C# WebBrowser class, waits about 6 seconds so that Cloudflare saves the clearance cookie, and then saves the cookie to disk.

You need a JavaScript-capable browser like the C# WebBrowser class, because the Cloudflare captcha page needs JavaScript to function and count down before it saves the cookie; any other attempt will fail.

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;
using System.Runtime.InteropServices;
using System.Net;
using System.Threading;

namespace kek
{
    public partial class Form1 : Form
    {
        [DllImport("wininet.dll", SetLastError = true)]
        public static extern bool InternetGetCookieEx(string url, string cookieName, StringBuilder cookieData, ref int size, Int32 dwFlags, IntPtr lpReserved);

        private Uri Uri = new Uri("http://www.my-cloudflare-protected-website.com");
        private const Int32 InternetCookieHttponly = 0x2000;
        private const Int32 ERROR_INSUFFICIENT_BUFFER = 0x7A;

        public Form1()
        {
            InitializeComponent();

            webBrowser1.DocumentCompleted += new System.Windows.Forms.WebBrowserDocumentCompletedEventHandler(this.webBrowser1_DocumentCompleted);

            webBrowser1.Navigate(Uri, null, null, "User-Agent: kappaxdkappa\r\n"); //user-agent needs to be set another way if that doesnt work
        }

        private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
        {
            int waitTime = 0;

            if(webBrowser1.DocumentTitle.Contains("We are under attack")) //check what string identifies the unique cloudflare captcha page and put it here
            {
                waitTime = 6000;
            }

            Task.Run(async () =>
            {
                await Task.Delay(waitTime); //cookie can be saved right away, but the waiting period might not have passed yet

                String cloudflareCookie = GetCookie(Uri, "cf_clearance");

                if (!String.IsNullOrEmpty(cloudflareCookie))
                {
                    //save to %appdata%\MyProgram\Cookies\clearence.blob in a real application
                    //File.WriteAllText closes the file even if the write throws
                    System.IO.File.WriteAllText("c:\\CFcookie.blob", cloudflareCookie);
                }
            });
        }

        String GetCookie(Uri uri, String cookieName)
        {
            int datasize = 0;
            StringBuilder cookieData = new StringBuilder(datasize);

            InternetGetCookieEx(uri.ToString(), cookieName, cookieData, ref datasize, InternetCookieHttponly, IntPtr.Zero);

            if (Marshal.GetLastWin32Error() == ERROR_INSUFFICIENT_BUFFER && datasize > 0)
            {
                cookieData = new StringBuilder(datasize);
                if (InternetGetCookieEx(uri.ToString(), cookieName, cookieData, ref datasize, InternetCookieHttponly, IntPtr.Zero))
                {
                    if (cookieData.Length > 0)
                    {
                        CookieContainer container = new CookieContainer();
                        container.SetCookies(uri, cookieData.ToString());

                        return container.GetCookieHeader(uri);
                    }
                }
            }

            return String.Empty;
        }
    }
}

Some notes:

  • Use a better user agent
  • The cookie is saved to disk as well because I needed it for something else. Not sure if the in-built browser saved the cookies for next time, but in case it does not, this way you can simply load it again.
  • Change the "We are under attack" phrase to the one that identifies the CF captcha page you are trying to bypass.
  • __cfduid cookie is not required afaik
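
To reuse the saved cookie from an HttpClient-based client like the one in the question, the file written above can be read back into a CookieContainer. This is a rough sketch under assumptions: `SavedCookies` and its methods are hypothetical names, and it assumes the file holds a plain `name=value; name2=value2` cookie header, which is what `GetCookieHeader` returns in the code above.

```csharp
using System;
using System.IO;
using System.Net;

// Hypothetical helper, not part of any library.
public static class SavedCookies
{
    // Splits a "name=value; name2=value2" header and adds each pair for the given URI.
    public static CookieContainer FromHeader(Uri uri, string cookieHeader)
    {
        var container = new CookieContainer();
        string[] parts = cookieHeader.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string part in parts)
        {
            int eq = part.IndexOf('=');
            if (eq <= 0)
            {
                continue; // skip fragments without a name
            }
            string name = part.Substring(0, eq).Trim();
            string value = part.Substring(eq + 1).Trim();
            // Domain and path are taken from the URI because the Cookie leaves them empty.
            container.Add(uri, new Cookie(name, value));
        }
        return container;
    }

    // Reads the header from disk, e.g. the "c:\\CFcookie.blob" file written above.
    public static CookieContainer FromFile(Uri uri, string path)
    {
        return FromHeader(uri, File.ReadAllText(path));
    }
}
```

The resulting container can then be passed to `new HttpClientHandler { CookieContainer = container }`, in the same way the first answer attaches its copy-pasted cookies.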

EDIT: Sorry, I was so focused on Cloudflare itself after reading the other answers here that I didn't notice you need to get past the reCAPTCHA that is sometimes shown on the Cloudflare page. My code can help you a bit with the browser and cookie part, but you will have a hard time solving reCAPTCHA, at least for now. A few weeks ago they made it even harder. I can recommend compiling your own version of Firefox and then automatically solving the captcha by clicking the checkbox. If you don't get that simple captcha, then you need to display it to the user. Mind that you also need to randomize the behaviour of your browser and how you click the checkbox, otherwise it will detect you as a bot.