Limiting the number of concurrent jobs on Azure Functions

Posted 2019-02-26 09:43

Question:

I have a Function app in Azure that is triggered when an item is put on a queue. It looks something like this (greatly simplified):

public static async Task Run(string myQueueItem, TraceWriter log)
{
    using (var client = new HttpClient())
    {
        client.BaseAddress = new Uri(Config.APIUri);
        client.DefaultRequestHeaders.Accept.Add(new MediaTypeWithQualityHeaderValue("application/json"));

        StringContent httpContent = new StringContent(myQueueItem, Encoding.UTF8, "application/json");
        HttpResponseMessage response = await client.PostAsync("/api/devices/data", httpContent);
        response.EnsureSuccessStatusCode();

        string json = await response.Content.ReadAsStringAsync();
        ApiResponse apiResponse = JsonConvert.DeserializeObject<ApiResponse>(json);

        log.Info($"Activity data successfully sent to platform in {apiResponse.elapsed}ms.  Tracking number: {apiResponse.tracking}");
    }
}

This all works great and runs pretty well. Every time an item is put on the queue, we send the data to some API on our side and log the response. Cool.

The problem happens when there's a big spike in "the thing that generates queue messages" and a lot of items are put on the queue at once. This tends to happen around 1,000 - 1,500 items in a minute. The error log will have something like this:

2017-02-14T01:45:31.692 mscorlib: Exception while executing function: Functions.SendToLimeade. f-SendToLimeade__-1078179529: An error occurred while sending the request. System: Unable to connect to the remote server. System: Only one usage of each socket address (protocol/network address/port) is normally permitted 123.123.123.123:443.

At first, I thought this was an issue with the Azure Function app running out of local sockets, as illustrated here. However, then I noticed the IP address. The IP address 123.123.123.123 (of course changed for this example) is our IP address, the one that the HttpClient is posting to. So, now I'm wondering if it is our servers running out of sockets to handle these requests.

Either way, we have a scaling issue going on here. I'm trying to figure out the best way to solve it.

Some ideas:

  1. If it's a local socket limitation, the article above has an example of increasing the local port range using Req.ServicePoint.BindIPEndPointDelegate. This seems promising, but what do you do when you truly need to scale? I don't want this problem coming back in 2 years.
  2. If it's a remote limitation, it looks like I can control how many messages the Functions runtime will process at once. There's an interesting article here that says you can set serviceBus.maxConcurrentCalls to 1 so that only a single message is processed at a time. Maybe I could set this to a relatively low number (a host.json sketch of this setting follows this list). At some point our queue will fill up faster than we can process it, but at that point the answer is adding more servers on our end.
  3. Multiple Azure Functions apps? What happens if I have more than one Azure Functions app and they all trigger on the same queue? Is Azure smart enough to divvy up the work among the Function apps, so I could have an army of machines processing my queue that could be scaled up or down as needed?
  4. I've also come across keep-alives. It seems to me that if I could somehow keep my socket open as queue messages flood in, it might help greatly. Is this possible, and any tips on how I'd go about doing it?
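
For idea 2, here's a minimal sketch of what that setting looks like in host.json. It assumes the Functions v1 schema (current when this question was asked) and a Service Bus queue trigger; the value of 16 is purely illustrative, not a recommendation. For an Azure Storage queue trigger, the analogous v1 setting is queues.batchSize.

{
  "serviceBus": {
    "maxConcurrentCalls": 16
  }
}

Each instance would then process at most 16 messages concurrently; total throughput still depends on how many instances the runtime scales out to.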

Any insight on a recommended (scalable!) design for this sort of system would be greatly appreciated!

Answer 1:

I think I've figured out a solution for this. I've been running these changes for the past 6 hours, and I've had zero socket errors. Before, I was getting these errors in large batches every 30 minutes or so.

First, I added a new class to manage the HttpClient.

public static class Connection
{
    public static HttpClient Client { get; private set; }

    static Connection()
    {
        Client = new HttpClient();

        Client.BaseAddress = new Uri(Config.APIUri);
        Client.DefaultRequestHeaders.Add("Connection", "Keep-Alive");
        Client.DefaultRequestHeaders.Add("Keep-Alive", "timeout=600");
        Client.DefaultRequestHeaders.Accept.Add(new MediaTypeWithQualityHeaderValue("application/json"));
    }
}

Now, we have a static instance of HttpClient that we use for every call to the function. From my research, keeping HttpClient instances around for as long as possible is highly recommended: the instance is thread safe, and HttpClient will queue up and optimize requests to the same host. Notice I also set the Keep-Alive headers (I think this is the default, but I figured I'd be explicit).

In my function, I just grab the static HttpClient instance like:

var client = Connection.Client;
StringContent httpContent = new StringContent(myQueueItem, Encoding.UTF8, "application/json");
HttpResponseMessage response = await client.PostAsync("/api/devices/data", httpContent);
response.EnsureSuccessStatusCode();

I haven't really done any in-depth analysis of what's happening at the socket level (I'll have to ask our IT guys if they can see this traffic on the load balancer), but I'm hoping it just keeps a single socket (or a small pool of them) open to our server and reuses it for the HTTP calls as the queue items are processed. Anyway, whatever it's doing seems to be working. Maybe someone has thoughts on how to improve this further.
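
For anyone who wants to dig into the socket-level behaviour, here is a minimal sketch of the knobs the .NET Framework exposes for connection reuse via ServicePointManager (requires using System.Net;). These are standard framework APIs, but the specific values, and whether tuning them is worthwhile inside an Azure Functions app, are assumptions on my part, not something I've verified:

// Run once per process, e.g. inside the static Connection constructor above.
// ServicePointManager governs the connection pool that HttpClient uses on the .NET Framework.
ServicePointManager.DefaultConnectionLimit = 50;          // allow more parallel connections per host
ServicePointManager.SetTcpKeepAlive(true, 60000, 10000);  // first keep-alive probe after 60s, then every 10s

// Optionally cap how long a pooled connection is reused, so DNS changes are eventually picked up.
var servicePoint = ServicePointManager.FindServicePoint(new Uri(Config.APIUri));
servicePoint.ConnectionLeaseTimeout = 10 * 60 * 1000;     // milliseconds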



Answer 2:

I think the error is caused by this line: using (var client = new HttpClient())

Quoted from the Improper Instantiation antipattern article:

this technique is not scalable. A new HttpClient object is created for each user request. Under heavy load, the web server may exhaust the number of available sockets.
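
To make the contrast concrete, here is a minimal sketch of the two approaches the article is describing; Example, AntiPattern, and Preferred are illustrative names, and Config.APIUri is reused from the question. The shared-instance version is essentially what Answer 1 above arrived at:

public static class Example
{
    // Preferred: one HttpClient for the lifetime of the process, so pooled connections are reused.
    private static readonly HttpClient SharedClient = new HttpClient { BaseAddress = new Uri(Config.APIUri) };

    public static async Task AntiPattern(StringContent httpContent)
    {
        // A new client per message: each disposed client can leave its socket in TIME_WAIT,
        // so under heavy load the available ports run out.
        using (var client = new HttpClient { BaseAddress = new Uri(Config.APIUri) })
        {
            await client.PostAsync("/api/devices/data", httpContent);
        }
    }

    public static async Task Preferred(StringContent httpContent)
    {
        await SharedClient.PostAsync("/api/devices/data", httpContent);
    }
}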



Answer 3:

If you use the Consumption plan instead of running Functions in a dedicated web app, #3 more or less happens out of the box: Functions will detect that you have a large queue of messages and will add instances until the queue length stabilizes.

maxConcurrentCalls only applies per instance, allowing you to limit per-instance concurrency. Basically, your processing rate is maxConcurrentCalls * instanceCount.
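
As a quick illustration of that formula (the numbers below are assumptions, not defaults):

// Hypothetical numbers, purely to illustrate the formula in this answer.
int maxConcurrentCalls = 16;  // per-instance limit configured in host.json
int instanceCount = 10;       // instances the scale controller has added
int totalInFlight = maxConcurrentCalls * instanceCount;  // up to 160 messages processed at once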

The only way to control global throughput would be to use Functions on dedicated web apps of the size you choose. Each app will poll the queue and grab work as necessary.

The best scaling solution would improve the load balancing on 123.123.123.123 so that it can handle any number of requests from Functions scaling up/down to meet queue pressure.

Keep-alive, as far as I know, is useful for persistent connections, but function executions aren't treated as persistent connections. In the future we are looking to add 'bring your own binding' to Functions, which would allow you to implement connection pooling if you liked.



Answer 4:

I know the question was answered long ago, but in the meantime Microsoft has documented the anti-pattern you were using:

Improper Instantiation antipattern