runtime: epoll scalability problem with 192 core machine and 1k+ ready sockets #65064

prattmic · 2024-01-11T18:07:57Z

Split from #31908 (comment) and full write-up at https://jazco.dev/2024/01/10/golang-and-epoll/.

tl;dr is that a program on a 192 core machine with >2500 sockets and with >1k becoming ready at once results in huge costs in netpoll -> epoll_wait (~65% of total CPU).

Most interesting is that sharding these connections across 8 processes seems to solve the problem, implying some kind of super-linear scaling.

That the profile shows the time spent in epoll_wait itself, this may be a scalability problem in the kernel itself, but we may still be able to mitigate.

@ericvolp12, some questions if you don't mind answering:

Which version of Go are you using? And which kernel version?
Do you happen to have a reproducer for this problem that you could share? (Sounds like no?)
On a similar note, do you have a perf profile of this problem that shows where the time in the kernel is spent?
The 128 event buffer size is mentioned several times, but it is not obvious to me that increasing this size would actually solve the problem. Did you try increasing the size and see improved results?

cc @golang/runtime

The text was updated successfully, but these errors were encountered:

prattmic · 2024-01-11T18:11:03Z

There is a small chance that #56424 is related, though it seems unlikely as that was at a much smaller scale.

ericvolp12 · 2024-01-11T18:27:08Z

Which version of Go are you using? And which kernel version?

We're running on the golang:1.21-bullseye docker image base which is currently using: go version go1.21.6 linux/amd64, kernel version 5.15.0-91-generic on Ubuntu

Do you happen to have a reproducer for this problem that you could share? (Sounds like no?)

We don't have a reproducer for this problem right now unfortunately, but our suspicion is that it should be easy to replicate by serving or making hundreds of thousands of fast network requests in a go application using TCP.

On a similar note, do you have a perf profile of this problem that shows where the time in the kernel is spent?

We don't have a perf profile unfortunately, most of our discovery was done via pprof profiles from the running binary and testing different configurations (4, then 8 containers per host).

The 128 event buffer size is mentioned several times, but it is not obvious to me that increasing this size would actually solve the problem. Did you try increasing the size and see improved results?

We did not try increasing the buffer size, it wasn't apparent there was a way to do that without running a custom build of Go and at the time running more than one container was a more accessible solution for us.

Thanks for looking into this, it was definitely a interesting thing to find in the wild!

whyrusleeping · 2024-01-11T18:28:29Z

For some more context, the EpollWait time in the profile was 2800 seconds on a 30 second profile.

Also I don't necessarily think that the epoll buffer itself is the problem, rather just how epoll works under the hood with thousands of 'ready' sockets and hundreds of threads.

The application under load had around 3500 open sockets, http2 clients making requests to our grpc service on one end and us making requests to scyllaDB on the other.

prattmic · 2024-01-11T18:58:58Z

Thanks for the details! I'll try to write a reproducer when I have some free time, not sure when I'll get to it.

it wasn't apparent there was a way to do that without running a custom build of Go

Indeed, you'd need to manually modify the runtime. Note that is possible to simply edit the runtime source in GOROOT and rebuild your program (no special steps required for the runtime, it is treated like any other package). But if you build in a Docker container it is probably a pain to edit the runtime source.

prattmic · 2024-01-11T20:20:03Z

Some thoughts from brainstorming for posterity:

My best theory at the moment (though I'd really like to see perf to confirm) is that ~90 threads are calling epoll_wait at once (probably at this non-blocking netpoll: https://cs.opensource.google/go/go/+/master:src/runtime/proc.go;l=3230;drc=dcbe77246922fe7ef41f07df228f47a37803f360). The kernel has a mutex around the entire copy-out portion of epoll_wait, so there is probably a lot of time waiting for the mutex. If that is the case, some form of rate-limiting on how many threads make the syscall at once may be effective. N.B. that this non-blocking netpoll is not load-bearing for correctness, so occasionally skipping it would be OK.

whyrusleeping · 2024-01-11T23:56:56Z

Yeah, it was the netpoll call inside findRunnable (though i didnt have my source mapping set up at the time to confirm the exact line numbers).
I overwrite the profile i took from the degerate case unfortunately, if helpful we can probably reorient things back down to a single process per machine and run some tests with perf.

I've also got a spare test machine with the same CPU i can use to try out a repro test case as well.

sschepens · 2024-01-12T12:02:54Z

is go using the same epoll instance accross all threads? that might be the underlying problem, most high-throughput applications (nginx, envoy, netty) create several instances (usually one per thread together with an event loop) and connections get distributed to all epoll instances some way or another.

panjf2000 · 2024-01-12T13:02:48Z

is go using the same epoll instance accross all threads? that might be the underlying problem, most high-throughput applications (nginx, envoy, netty) create several instances (usually one per thread together with an event loop) and connections get distributed to all epoll instances some way or another.

Good point! And to answer your question, yes, Go has been using the single (and global) epoll/kqueue/poll instance internally since the day Go netpoll was introduced. I actually had this concern for a few years, but never got a chance to spot that kind of performance bottleneck emerge. What I had in mind is that we can make a transition from single epoll instance to per-P epoll instances, or just multiple global epoll instances simply, which could also help.

From where I stand, I reckon that refactoring the current epoll from a single instance to multiple instances would require much less work than introducing io_uring. What is more, given the current Go codebase, io_uring is better suited for file I/O than for network I/O. Oh boy, I can already imagine now how many obstacles we'll have to go through before io_uring is implemented for network I/O eventually, and also transparently.

To sum up, multiple epoll instances should be able to gain sufficient credits for the performance boost of network I/O, and in consideration of the complexity from introducing io_uring for network I/O, I think the former is more feasible at this stage.

sschepens · 2024-01-12T14:50:49Z

using multiple epoll instances would mean that connections or fds would now be bound to a single thread? does this means that it could be possible for connection imbalances to happen where some threads could be handling many long lived connections while others be mostly idle?

panjf2000 · 2024-01-13T02:32:47Z

using multiple epoll instances would mean that connections or fds would now be bound to a single thread? does this means that it could be possible for connection imbalances to happen where some threads could be handling many long lived connections while others be mostly idle?

This is one of the potential issues we may encounter and need to resolve if we decide to introduce multiple epoll instances for Go runtime. But I don't think it's going to be our big concern cuz there are ways for us to mitigate that, for instance, the work-stealing mechanism, or just to put surplus tasks in the global run queue.

I actually drafted a WIP implementation of multiple epoll/kqueue/poll instances a long time ago on my local computer, and I can take on this if we eventually decide to introduce multiple netpollers after the root cause of this issue has been revealed.

errantmind · 2024-01-13T03:25:13Z

A casual observation (not go specific): one reason epoll doesn't scale well when a single epoll instance is shared across threads is the file descriptor table, which is typically shared across the process. This is one of the reasons why, say, 8 separate processes usually performs better than a single process with 8 threads. The impact is present both with multiple epoll instances (per thread), or a single epoll instance shared across threads. The way to circumvent this is to unshare (syscall) the file descriptor table across threads upon thread creation, then create an epoll instance per thread. This yields similar performance to a multi process approach (within 1% in my experience). After that you can distribute the work however you want, maybe with SO_REUSEPORT. Also, be careful unsharing the file descriptor table, it is not appropriate for all situations.

Side note, if you are sharing an epoll instance across threads you should use edge triggered to avoid all threads from being woken up, most unnecessarily.

This is my experience anyway when using a thread per core model, although the principle would apply regardless of the number of threads. I don't know anything about go internals so I'll leave it there.

bwerthmann · 2024-01-15T16:28:39Z

I don't want to derail this issue, let me know if I should move this to a separate bug...

We are seeing a similar issue on a system with 128 cores, we're only reading from 96 Unix Sockets, 1 per goroutine. Go was spending much time in netpoll -> epoll_wait and perf top reported much of time in the kernel in osq_lock.

I'm looking for the profiles from the Go App, in the mean time I can share that we reproduced this issue with a simple socat invocation:

I wrote a workaround that does not invoke netpoll at all, instead it just makes raw syscalls and throughput improved by 5x (which is the next bottleneck in the App, also related to Mutex being slow). My microbenchmark with raw syscalls just reading from unix sockets performed >10x in terms of bandwidth with the 96 producers.

Let me know if there's anything I can do to help.

bwerthmann · 2024-01-15T17:01:19Z

These kernel patches may be of interest:
locking/osq_lock: Fix false sharing of optimistic_spin_node in osq_lock, may not be accepted yet?

https://lore.kernel.org/lkml/20230615120152.20836-1-guohui@uniontech.com/

panjf2000 · 2024-01-16T07:41:05Z

I wrote a workaround that does not invoke netpoll at all, instead it just makes raw syscalls and throughput improved by 5x

Just to make sure I don't misread what said, you achieved that by using raw syscalls of socket(), bind(), listen(), connect(), read() write(), etc. instead of the APIs provided by std net, right?
@bwerthmann

bwerthmann · 2024-01-16T15:29:33Z

I wrote a workaround that does not invoke netpoll at all, instead it just makes raw syscalls and throughput improved by 5x

Just to make sure I don't misread what said, you achieved that by using raw syscalls of socket(), bind(), listen(), connect(), read() write(), etc. instead of the APIs provided by std net, right?
@bwerthmann

Correct. I'll ask today if I can share an example.

valyala · 2024-01-17T20:23:42Z

I think it would be great if Go runtime could maintain a separate epoll file descriptor (epfd) per each P. Then every P could register file descriptors in its own local epfd and call epoll() on it when its local list of goroutines ready to run becomes empty and it needs to find runnable goroutine. This scheme has the following benefits:

Goroutines, which work with network, will tend to stay on the same P, since the file descriptors created by the goroutine are registered in P-local epfd. Even if the goroutine migrates to another P for some reason, it will migrate to the original P after the next network IO. This improves locality of data accessed by the goroutine, so it remains for longer in P-local CPU caches. This should improve the overall performance, since access to local CPU caches is usually faster than access to shared memory.
This should improve scalability of epoll() calls, since every P will poll its own epfd, thus removing bottlenecks related to access synchronization to shared epfd in kernel space.

Such a scheme may result in imbalance of goroutines among P workers, if a single goroutine creates many network connections (e.g. server accept loop). Then the Go scheduler will migrate all the goroutines, which make IO on these connections, to the original P where the original goroutine created all these network connections. This can be solved by periodic even re-distribution of the registered network connections among P-local epfds. For example, if P cannot find ready to run goroutines in local queue and in local epfd, then it can steal a few network connections from the busiest P, to de-register them from that P's epfd and then to register them in local epfd. The busiest P can be determined from some rough per-P CPU usage stats.

aclements · 2024-01-17T23:41:46Z

I agree that most likely we need multiple epoll FDs, with some sort of affinity.

@bwerthmann , since you're able to get perf profiles, could you get one with perf record -g? I'd love to see where the osq_lock call is coming from to confirm the hypothesis.

It would be really helpful if someone could create a benchmark that reproduces this issue. If it can be done with only 96 UNIX domain sockets, it may not even be especially hard.

aclements · 2024-01-17T23:45:39Z

If we want to go deep here, it might even be possible for the Go scheduler to become RX queue aware using sockopts like SO_INCOMING_CPU or SO_INCOMING_NAPI_ID. I suspect we can do a lot better without bringing in that complexity, but it's an interesting opportunity to consider.

bwerthmann · 2024-01-23T18:16:58Z

@aclements profile as requested. Taken with go1.21.5 on a 128 core machine:

bwerthmann · 2024-01-24T19:51:03Z

@aclements here's a stack sample with the main func at the top:

profile from FlameScope:

blanks on the left and right are when the profile stopped and started.

Flamegraph:

bwerthmann · 2024-01-29T16:33:03Z

I'd love to see where the osq_lock call is coming from to confirm the hypothesis.

@aclements what are your thoughts on the profiles?

prattmic · 2024-01-29T20:25:53Z

@bwerthmann Thanks for the detailed profile! This seems to confirm my suspicion in #65064 (comment).

The mutex being taken appears to be https://elixir.bootlin.com/linux/v5.10.209/source/fs/eventpoll.c#L696 [1]. This is held around a loop over the ready list (https://elixir.bootlin.com/linux/v5.10.209/source/fs/eventpoll.c#L1722), which double-checks that events are still ready (ep_item_poll) and then copies them to the userspace output buffer (__put_user). This loop exits when it reaches the end of the ready list or when it has written the max events that fit in the userspace output buffer, whichever is sooner.

With a very long ready list, we're probably hitting the 128 event limit specified by netpoll. It's possible shrinking this could actually help by making the critical section shorter, but probably not nearly as much as reducing concurrent calls to epoll_wait (either directly, or sharding across multiple epoll FDs).

As an aside, I also see a fair amount of contention on runtime locks in your profile (probably sched.lock), so that bottleneck would likely come up next.

[1] ep_scan_ready_list was removed completely in 2020 in Linux v5.11 in a fairly large epoll refactor: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1a825a6a0e7eb55c83c06f3c74631c2eeeb7d27f, but the lock simply moved to the caller and seems to protect a similar critical section. #65064 (comment) noted Linux v5.15, so presumably it has similar issues.

ianlancetaylor · 2024-02-15T01:54:57Z

It seems to me that we can partially mitigate the immediate issue by just limiting the number of P's that do a non-blocking epoll call in findRunnable. We get no advantage from having multiple P's call netpoll(0) simultaneously. And if we prevent that from happening, then we seem likely to avoid the contention in the kernel. We'll still have contention in userspace, but we have that anyhow, and that will continue until we are able to do a major overhaul of the scheduler for NUMA support.

If anybody who can easily recreate the issue has time for experimentation, it might be interesting to see whether https://go.dev/cl/564197 makes any difference. Thanks.

gopherbot · 2024-02-15T01:56:39Z

Change https://go.dev/cl/564197 mentions this issue: runtime: only poll network from one P at a time in findRunnable

ianlancetaylor · 2024-03-12T23:10:29Z

Is anybody interested in seeing whether https://go.dev/cl/564197 fixes the problem? To be clear, I'm not going to submit it unless I have some reason to think that it helps. Thanks.

panjf2000 · 2024-03-12T23:49:06Z

Is anybody interested in seeing whether go.dev/cl/564197 fixes the problem? To be clear, I'm not going to submit it unless I have some reason to think that it helps. Thanks.

Is there any chance you could apply CL 564197 in production? Or maybe do it in the dev/test environment to which you replay the live traffic using some tool like goreplay? @ericvolp12 @whyrusleeping

bwerthmann · 2024-03-13T03:09:06Z

I might have some cycles this week to test my reproducer.

panjf2000 · 2024-03-13T03:11:39Z

I might have some cycles this week to test my reproducer.

Great! Thanks! @bwerthmann

bwerthmann · 2024-03-13T03:16:46Z

I might have some cycles this week to test my reproducer.

Great! Thanks! @bwerthmann

Are there any instructions or easy buttons for checking out the changes needed in https://go-review.googlesource.com/c/go/+/564197/ to my GOROOT?

bwerthmann · 2024-04-10T20:01:19Z

I might have some cycles this week to test my reproducer.

Great! Thanks! @bwerthmann

I've had other priorities, I'd like to get back to this in a few weeks or so. Sorry for the delay.

harshavardhana · 2024-04-18T21:03:40Z

Looks like we have hit the same problem

ldemailly · 2024-06-01T18:08:04Z

For reproducing (I don't have a 192 core machine to check) you could probably use https://github.com/fortio/fortio

Happy to help using it/if it helps but something like a large -c and talking to itself; on my local Mac it does >90k qps easily over 32 to 128 sockets but you can make it do 10,000 sockets and it gets slower ~ 60k qps

sprappcom · 2024-06-17T14:46:07Z

will this be solved with io_uring?
epoll is obsolete with io_uring around.

ianlancetaylor · 2024-06-17T18:22:52Z

io_uring is #31908.

harshavardhana · 2024-07-26T14:40:28Z

Is anybody interested in seeing whether https://go.dev/cl/564197 fixes the problem? To be clear, I'm not going to submit it unless I have some reason to think that it helps. Thanks.

I have some cycles to test this out, as we have a reproducer of sort that generates random latencies.

epoll contention on TCP causes latency build-up when we have high volume ingress. This PR is an attempt to relieve this pressure. upstream issue golang/go#65064 seems to be a deeper problem, haven't yet tried the fix provide in this issue but however this change without changing the compiler helps. Of course this is a workaround.

epoll contention on TCP causes latency build-up when we have high volume ingress. This PR is an attempt to relieve this pressure. upstream issue golang/go#65064 It seems to be a deeper problem; haven't yet tried the fix provide in this issue, but however this change without changing the compiler helps. Of course, this is a workaround for now, hoping for a more comprehensive fix from Go runtime.

prattmic added Performance NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Jan 11, 2024

prattmic added this to the Backlog milestone Jan 11, 2024

gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jan 11, 2024

panjf2000 added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Jan 13, 2024

gopherbot removed the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 13, 2024

panjf2000 added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 13, 2024

gopherbot removed the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Jan 13, 2024

panjf2000 added the OS-Linux label Jan 13, 2024

mknyszek added this to Go Compiler / Runtime Jan 17, 2024

mknyszek moved this to Todo in Go Compiler / Runtime Jan 17, 2024

phuslu mentioned this issue Mar 29, 2024

tested with 192 cores, performance is not very good, suggestion on how to tweak for 192 or 448 cores server phuslu/lru#13

Open

This comment was marked as off-topic.

Sign in to view

ouvaa mentioned this issue Mar 29, 2024

[Question]: What is the maximum number of cores tested by anyone? I just tested 192 cores and it's better than evio etc but... panjf2000/gnet#558

Closed

3 tasks

phuslu mentioned this issue Apr 2, 2024

Thoughts about throughput benchmarks fairness phuslu/lru#14

Open

maypok86 mentioned this issue Apr 14, 2024

can someone please do a benchmark on core utilization. e.g. 48, 196 cpu core, 384 cpu core comparison with others maypok86/otter#72

Closed

sprappcom mentioned this issue Jun 16, 2024

how does it scale with multi core / threads? lni/dragonboat#346

Open

harshavardhana mentioned this issue Jul 29, 2024

separate lock from common grid to avoid epoll contention minio/minio#20180

Merged

8 tasks

panjf2000 mentioned this issue Aug 14, 2024

internal/poll: transparently support new linux io_uring interface #31908

Open

runtime: epoll scalability problem with 192 core machine and 1k+ ready sockets #65064

runtime: epoll scalability problem with 192 core machine and 1k+ ready sockets #65064

Comments

prattmic commented Jan 11, 2024 • edited Loading

prattmic commented Jan 11, 2024

ericvolp12 commented Jan 11, 2024 • edited Loading

whyrusleeping commented Jan 11, 2024 • edited Loading

prattmic commented Jan 11, 2024

prattmic commented Jan 11, 2024

whyrusleeping commented Jan 11, 2024

sschepens commented Jan 12, 2024

panjf2000 commented Jan 12, 2024 • edited Loading

sschepens commented Jan 12, 2024 • edited Loading

panjf2000 commented Jan 13, 2024 • edited Loading

errantmind commented Jan 13, 2024 • edited Loading

bwerthmann commented Jan 15, 2024 • edited Loading

bwerthmann commented Jan 15, 2024

panjf2000 commented Jan 16, 2024

bwerthmann commented Jan 16, 2024

valyala commented Jan 17, 2024 • edited Loading

aclements commented Jan 17, 2024

aclements commented Jan 17, 2024

bwerthmann commented Jan 23, 2024 • edited Loading

bwerthmann commented Jan 24, 2024 • edited Loading

bwerthmann commented Jan 29, 2024 • edited Loading

prattmic commented Jan 29, 2024

ianlancetaylor commented Feb 15, 2024

gopherbot commented Feb 15, 2024

ianlancetaylor commented Mar 12, 2024

panjf2000 commented Mar 12, 2024

bwerthmann commented Mar 13, 2024

panjf2000 commented Mar 13, 2024 • edited Loading

bwerthmann commented Mar 13, 2024 • edited Loading

This comment was marked as off-topic.

bwerthmann commented Apr 10, 2024

harshavardhana commented Apr 18, 2024

ldemailly commented Jun 1, 2024 • edited Loading

sprappcom commented Jun 17, 2024

ianlancetaylor commented Jun 17, 2024

harshavardhana commented Jul 26, 2024

prattmic commented Jan 11, 2024 •

edited

Loading

ericvolp12 commented Jan 11, 2024 •

edited

Loading

whyrusleeping commented Jan 11, 2024 •

edited

Loading

panjf2000 commented Jan 12, 2024 •

edited

Loading

sschepens commented Jan 12, 2024 •

edited

Loading

panjf2000 commented Jan 13, 2024 •

edited

Loading

errantmind commented Jan 13, 2024 •

edited

Loading

bwerthmann commented Jan 15, 2024 •

edited

Loading

valyala commented Jan 17, 2024 •

edited

Loading

bwerthmann commented Jan 23, 2024 •

edited

Loading

bwerthmann commented Jan 24, 2024 •

edited

Loading

bwerthmann commented Jan 29, 2024 •

edited

Loading

panjf2000 commented Mar 13, 2024 •

edited

Loading

bwerthmann commented Mar 13, 2024 •

edited

Loading

ldemailly commented Jun 1, 2024 •

edited

Loading