I am writing an event collector HTTP server which will be under heavy load. Hence, in the HTTP handler I just deserialise the event and then run the actual processing outside of the HTTP request-response cycle, in a goroutine.
With this, I see that when I hit the server at 400 requests per second, the 99th-percentile latency stays under 20ms. But as soon as I bump the request rate to 500 per second, the latency shoots up to over 800ms.
Could anyone please give me some ideas about what the reason could be, so that I can explore further?
package controller

import (
	"encoding/json"
	"net/http"
	"time"

	"event-server/service"
)

// CollectEvent decodes an incoming event and hands it off to the service
// layer for processing outside the request-response cycle.
// Note: stats is a package-level metrics client; its import is omitted here,
// as in the original snippet.
func CollectEvent() http.Handler {
	handleFunc := func(w http.ResponseWriter, r *http.Request) {
		startTime := time.Now()
		stats.Incr("TotalHttpRequests", nil, 1)

		decoder := json.NewDecoder(r.Body)
		var event service.Event
		err := decoder.Decode(&event)
		if err != nil {
			http.Error(w, "Invalid json: " + err.Error(), http.StatusBadRequest)
			return
		}

		// The actual processing happens in a separate goroutine.
		go service.Collect(&event)

		w.Write([]byte("Accepted"))
		stats.Timing("HttpResponseDuration", time.Since(startTime), nil, 1)
	}
	return http.HandlerFunc(handleFunc)
}
I ran a test at 1000 requests per second and profiled it. The results are below.
(pprof) top20
Showing nodes accounting for 3.97s, 90.85% of 4.37s total
Dropped 89 nodes (cum <= 0.02s)
Showing top 20 nodes out of 162
flat flat% sum% cum cum%
0.72s 16.48% 16.48% 0.72s 16.48% runtime.mach_semaphore_signal
0.65s 14.87% 31.35% 0.66s 15.10% syscall.Syscall
0.54s 12.36% 43.71% 0.54s 12.36% runtime.usleep
0.46s 10.53% 54.23% 0.46s 10.53% runtime.cgocall
0.34s 7.78% 62.01% 0.34s 7.78% runtime.mach_semaphore_wait
0.33s 7.55% 69.57% 0.33s 7.55% runtime.kevent
0.30s 6.86% 76.43% 0.30s 6.86% syscall.RawSyscall
0.10s 2.29% 78.72% 0.10s 2.29% runtime.mach_semaphore_timedwait
0.07s 1.60% 80.32% 1.25s 28.60% net.dialSingle
0.06s 1.37% 81.69% 0.11s 2.52% runtime.notetsleep
0.06s 1.37% 83.07% 0.06s 1.37% runtime.scanobject
0.06s 1.37% 84.44% 0.06s 1.37% syscall.Syscall6
0.05s 1.14% 85.58% 0.05s 1.14% internal/poll.convertErr
0.05s 1.14% 86.73% 0.05s 1.14% runtime.memmove
0.05s 1.14% 87.87% 0.05s 1.14% runtime.step
0.04s 0.92% 88.79% 0.09s 2.06% runtime.mallocgc
0.03s 0.69% 89.47% 0.58s 13.27% net.(*netFD).connect
0.02s 0.46% 89.93% 0.40s 9.15% net.sysSocket
0.02s 0.46% 90.39% 0.03s 0.69% net/http.(*Transport).getIdleConn
0.02s 0.46% 90.85% 0.13s 2.97% runtime.gentraceback
(pprof) top --cum
Showing nodes accounting for 70ms, 1.60% of 4370ms total
Dropped 89 nodes (cum <= 21.85ms)
Showing top 10 nodes out of 162
flat flat% sum% cum cum%
0 0% 0% 1320ms 30.21% net/http.(*Transport).getConn.func4
0 0% 0% 1310ms 29.98% net.(*Dialer).Dial
0 0% 0% 1310ms 29.98% net.(*Dialer).Dial-fm
0 0% 0% 1310ms 29.98% net.(*Dialer).DialContext
0 0% 0% 1310ms 29.98% net/http.(*Transport).dial
0 0% 0% 1310ms 29.98% net/http.(*Transport).dialConn
0 0% 0% 1250ms 28.60% net.dialSerial
70ms 1.60% 1.60% 1250ms 28.60% net.dialSingle
0 0% 1.60% 1170ms 26.77% net.dialTCP
0 0% 1.60% 1170ms 26.77% net.doDialTCP
(pprof)
The problem
"I am using another goroutine because I don't want the processing to happen in the HTTP request-response cycle."
That's a common fallacy (and hence a trap). The line of reasoning appears to be sound: you're trying to process requests "somewhere else" in an attempt to handle ingress HTTP requests as fast as possible.

The problem is that this "somewhere else" is still code which runs concurrently with the rest of your request-handling churn. Hence, if that code runs slower than the rate of ingress requests, your processing goroutines pile up, essentially draining one or more resources. Which one exactly depends on the actual processing:

- if it's CPU-bound, it creates natural contention for the CPU between all those GOMAXPROCS hardware threads of execution;
- if it's bound to network I/O, it puts load on the Go runtime scheduler, which has to divide the available execution quanta between all those goroutines waiting to be executed;
- if it's bound to disk I/O or other syscalls, you get a proliferation of OS threads, and so on and so on…
Essentially, you are queueing the work units converted from the ingress HTTP requests, but queues do not fix overload. They may absorb short spikes, but that only works when such spikes are "surrounded" by periods of load at least slightly below the maximum capacity your system provides.

The fact that you are queueing is not directly visible in your case, but it is there, and it shows up once you press your system past its natural capacity: your "queue" of in-flight goroutines starts to grow without bound.
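One cheap way to see that invisible queue is to watch the number of live goroutines while the load test is running. A minimal sketch, assuming nothing beyond the standard library (importing expvar also registers a handler at /debug/vars on the default mux):

package main

import (
	"expvar"
	"log"
	"net/http"
	"runtime"
)

func main() {
	// Publish the current goroutine count; it appears in the JSON served
	// by the expvar handler at /debug/vars.
	expvar.Publish("goroutines", expvar.Func(func() interface{} {
		return runtime.NumGoroutine()
	}))

	// In the real server this would be the existing mux with the
	// CollectEvent handler mounted on it.
	log.Fatal(http.ListenAndServe(":8080", nil))
}

If that number keeps climbing at a constant request rate, the processing is not keeping up with ingress, which is exactly the pile-up described above.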
Please carefully read the classic essay "Queues Don't Fix Overload" to understand why this approach is not going to work in a realistic production setting. Pay close attention to the pictures of the kitchen sinks.
What to do about it?
Unfortunately, it's almost impossible to give you a simple solution, as we're not working with your code, in your setting, with your workload. Still, here are a couple of directions to explore.
At the broadest scale, try to see whether there is some easily discernible bottleneck in your system which you presently cannot see. For instance, if all those concurrent worker goroutines eventually talk to an RDBMS instance, its disk I/O may quite easily serialize them all, leaving each one merely waiting for its turn to have its data accepted.
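If those workers do go through database/sql, the connection pool statistics make that kind of serialization visible. A sketch under the assumption that a Postgres database and the github.com/lib/pq driver are in play; neither appears in the question, so treat all the names as placeholders:

package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // placeholder driver choice
)

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/events?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	// Bound the pool so contention is explicit rather than unbounded.
	db.SetMaxOpenConns(16)

	// Periodically report how often goroutines had to wait for a connection.
	go func() {
		for range time.Tick(5 * time.Second) {
			s := db.Stats()
			log.Printf("in use=%d idle=%d waits=%d waited=%s",
				s.InUse, s.Idle, s.WaitCount, s.WaitDuration)
		}
	}()

	// ... hand db to the worker goroutines here.
	select {}
}

Steadily growing WaitCount and WaitDuration values under load mean the goroutines are queueing in front of the database rather than doing work.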
The bottleneck may be simpler: say, each worker goroutine carelessly executes some long-running operation while holding a lock contended on by all the other goroutines, which obviously serializes them all.
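A sketch of that anti-pattern and one way out of it; the Event type, the shared buffer, and the slow flush call are all hypothetical, not taken from the question:

package worker

import "sync"

// Event stands in for service.Event from the question.
type Event struct{ Payload []byte }

// flush is a stand-in for whatever slow, per-batch I/O the real worker does.
func flush(batch []Event) { /* e.g. an outbound HTTP call */ }

var (
	mu     sync.Mutex
	buffer []Event
)

// collectSlow shows the anti-pattern: the slow flush happens while the lock
// is held, so every other goroutine calling it queues up behind the mutex.
func collectSlow(e *Event) {
	mu.Lock()
	defer mu.Unlock()
	buffer = append(buffer, *e)
	flush(buffer)
	buffer = buffer[:0]
}

// collectFast keeps only the in-memory bookkeeping under the lock and runs
// the slow call on a private copy outside the critical section.
func collectFast(e *Event) {
	mu.Lock()
	buffer = append(buffer, *e)
	toFlush := append([]Event(nil), buffer...)
	buffer = buffer[:0]
	mu.Unlock()

	flush(toFlush)
}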
The next step would be to actually measure (I mean, by writing a benchmark) how much time it takes a single worker to complete its unit of work. Then measure how this number changes as you increase the concurrency factor. After collecting these data, you will be able to make educated projections about the realistic rate at which your system can handle requests.
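A minimal benchmark sketch, under the assumption that service.Collect from the question can be invoked in isolation and that a representative Event is cheap to construct; b.SetParallelism is what lets you raise the concurrency factor between runs:

package service_test

import (
	"testing"

	"event-server/service"
)

func BenchmarkCollect(b *testing.B) {
	// Goroutines used by RunParallel = SetParallelism value * GOMAXPROCS;
	// raise it between runs and watch ns/op degrade.
	b.SetParallelism(4)

	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			event := service.Event{ /* representative payload */ }
			service.Collect(&event)
		}
	})
}

Running go test -bench Collect with a few different SetParallelism values gives the per-unit-of-work numbers the projection needs.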
The next step is to think through a strategy for making your system meet those calculated expectations. Usually this means limiting the rate of ingress requests. There are different ways to achieve this. Look at golang.org/x/time/rate for a time-based rate limiter, but it's also possible to start with lower-tech approaches such as using a buffered channel as a counting semaphore. Requests which would overflow your capacity can be rejected, typically with HTTP status code 429 Too Many Requests.
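A sketch of the counting-semaphore variant built around the handler from the question; the capacity of 256 concurrent processing goroutines is an arbitrary placeholder that should come from the measurements above:

package controller

import (
	"encoding/json"
	"net/http"

	"event-server/service"
)

// sem is a counting semaphore: each in-flight processing goroutine holds one
// slot, and requests arriving when all slots are taken get rejected.
var sem = make(chan struct{}, 256)

func CollectEventLimited() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var event service.Event
		if err := json.NewDecoder(r.Body).Decode(&event); err != nil {
			http.Error(w, "Invalid json: " + err.Error(), http.StatusBadRequest)
			return
		}

		select {
		case sem <- struct{}{}:
			go func() {
				defer func() { <-sem }() // release the slot when done
				service.Collect(&event)
			}()
			w.Write([]byte("Accepted"))
		default:
			// Over capacity: shed load instead of queueing indefinitely.
			http.Error(w, "Too many requests", http.StatusTooManyRequests)
		}
	})
}

The same shape works with a *rate.Limiter from golang.org/x/time/rate: call its Allow method where the non-blocking channel send is made.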
You might also consider queueing them briefly, but I'd treat that only as the cherry on top, that is, something to add once you have everything else sorted out completely.
What to do with rejected requests depends on your setting. Typically you try to "scale horizontally" by deploying more than one service to process the requests and teaching your clients to switch between the available services. (I'd stress that this means several independent services: if they all share some target sink which collects their data, they may be limited by the ultimate capacity of that sink, and adding more front-ends won't gain you anything.)
Let me repeat that the general problem has no magic solution: if your complete system (with the HTTP service you're writing being merely its front-end, gateway part) is only able to handle N RPS of load, no amount of scattering go processRequest() around is going to make it handle requests at a higher pace. The easy concurrency Go offers is not a silver bullet; it's a machine gun.