The Surprising gRPC Client Bottleneck in Low-Latency Networks — and How to Get Around It

Evgeniy Ivanov

"Improving anything but the bottleneck is an illusion." — Eliyahu M. Goldratt

At YDB, we use gRPC to expose our database API to clients, so all our load generators and benchmarks are gRPC clients. Recently, we discovered that the fewer nodes a cluster has, the harder it is for a benchmark to load it: as we shrink the cluster, more and more resources sit idle while client-side latency steadily grows. Fortunately, we identified the root cause: a bottleneck on the client side of gRPC.

In this post, we describe the issue and the steps to reproduce it with a gRPC server/client microbenchmark we provide. We then show a recipe for avoiding the bottleneck and achieving high throughput and low latency simultaneously, and present a comparative performance evaluation that illustrates how latency and throughput depend on the number of concurrent in-flight requests.

A Very Short gRPC Introduction

gRPC is usually described as "a performant, robust platform for inter-service communication". A gRPC client holds one or more gRPC channels, and each channel supports many RPCs (streams). gRPC is implemented on top of HTTP/2, and each gRPC stream corresponds to an HTTP/2 stream. Channels to different gRPC servers have their own TCP connections. Likewise, when you create a channel you can pass channel arguments (channel configuration), and channels created with different arguments get their own TCP connections. Otherwise, as we discovered, all channels share the same TCP connection regardless of traffic (which is quite unexpected), and gRPC relies on HTTP/2 to multiplex RPCs over it.

In gRPC's Performance Best Practices, it is stated that:

    (Special topic) Each gRPC channel uses 0 or more HTTP/2 connections and each connection usually has a limit on the number of concurrent streams. When the number of active RPCs on the connection reaches this limit, additional RPCs are queued in the client and must wait for active RPCs to finish before they are sent. Applications with high load or long-lived streaming RPCs might see performance issues because of this queueing. There are two possible solutions:

    1. Create a separate channel for each area of high load in the application.
    2. Use a pool of gRPC channels to distribute RPCs over multiple connections (channels must have different channel args to prevent re-use so define a use-specific channel arg such as channel number).

Our gRPC clients use the first solution. It is also worth noting that, by default, the limit on concurrent streams per connection is 100, and our number of in-flight requests per channel stays below that. In this post, we show that, at least in our case, these two solutions are not alternatives but two steps of the same fix.
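To make solution 2 concrete, here is a minimal sketch in C++ (not taken from our benchmark) of a channel pool where each channel carries a unique, otherwise meaningless channel argument; the argument name "benchmark.channel_id" is an arbitrary placeholder. Because the arguments differ, gRPC cannot collapse the pool onto a single TCP connection.

```cpp
#include <memory>
#include <string>
#include <vector>

#include <grpcpp/grpcpp.h>

// Build a pool of channels to the same endpoint. Each channel gets a distinct
// value of an otherwise unused channel argument, so every channel ends up
// with its own TCP connection instead of sharing one.
std::vector<std::shared_ptr<grpc::Channel>> MakeChannelPool(
    const std::string& target, int pool_size) {
  std::vector<std::shared_ptr<grpc::Channel>> pool;
  pool.reserve(pool_size);
  for (int i = 0; i < pool_size; ++i) {
    grpc::ChannelArguments args;
    args.SetInt("benchmark.channel_id", i);  // arbitrary, just has to differ
    pool.push_back(grpc::CreateCustomChannel(
        target, grpc::InsecureChannelCredentials(), args));
  }
  return pool;
}
```

Later in the post we get the same separation with a simpler knob, the GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL channel argument.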
A Simple gRPC Benchmark

To tackle the problem, we implemented a simple gRPC ping microbenchmark in C++ against the latest version of gRPC (v1.72.0, fetched via CMake). Note that we observed the same issue with clients written in Java, so these results likely apply to most gRPC implementations.

The benchmark has two parts. grpc_ping_server uses the gRPC async API with completion queues; the number of worker threads, completion queues, and callbacks per completion queue are specified at startup. grpc_ping_client starts a configurable number of parallel workers. Each worker performs RPC calls with an in-flight count of 1, using the gRPC sync API (results are the same with the async API) and its own gRPC channel; a simplified sketch of the worker loop is shown below, right after the launch commands. This is a closed-loop benchmark, so the total number of concurrent requests in the system equals the number of client workers. For simplicity, the ping message has no payload.

We run the client and the server on separate bare-metal machines, each with two Intel Xeon Gold 6338 CPUs at 2.00 GHz and hyper-threading enabled. Here is the topology:

    > numactl --hardware
    available: 2 nodes (0-1)
    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
    node 0 size: 257546 MB
    node 0 free: 51628 MB
    node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
    node 1 size: 258032 MB
    node 1 free: 82670 MB
    node distances:
    node   0   1
      0:  10  20
      1:  20  10

The machines are physically close to each other and connected by a 50 Gbps network:

    100 packets transmitted, 100 received, 0% packet loss, time 101367ms
    rtt min/avg/max/mdev = 0.031/0.043/0.085/0.012 ms

Just for comparison, here is a localhost ping:

    100 packets transmitted, 100 received, 0% packet loss, time 101387ms
    rtt min/avg/max/mdev = 0.013/0.034/0.047/0.009 ms

We run the gRPC server using the following command:

    taskset -c 32-63 ./grpc_ping_server --num-cqs 8 --workers-per-cq 2 --callbacks-per-cq 10

We use taskset to ensure that all threads always run within the same NUMA node. This is crucial for accurate performance results, and we recommend the same approach in production. According to the gRPC documentation, 2 worker threads per completion queue is optimal, so we expect a server with 32 CPU cores and 8 completion queues to handle a huge RPS with good latency.

Now, let's start the client and check the results for both regular RPCs and streaming RPCs, which should have better performance:

    echo "Single connection, no streaming"
    taskset -c 0-31 ./grpc_ping_client --host server-host --min-inflight 1 --max-inflight 48 --warmup 5 --interval 10 --with-csv

    echo "Single connection streaming"
    taskset -c 0-31 ./grpc_ping_client --host server-host --min-inflight 1 --max-inflight 48 --warmup 5 --interval 10 --with-csv --streaming
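Conceptually, each client worker runs a loop like the one below. This is a simplified sketch rather than the benchmark's actual code: the PingService, PingRequest, and PingResponse names and the generated header are placeholders for whatever the benchmark's proto defines.

```cpp
#include <memory>

#include <grpcpp/grpcpp.h>

// Placeholder for the header generated from the benchmark's proto file.
#include "ping.grpc.pb.h"

// One closed-loop worker: a single outstanding RPC at a time (in-flight = 1),
// issued over the worker's own channel via the synchronous API.
void WorkerLoop(const std::shared_ptr<grpc::Channel>& channel) {
  auto stub = PingService::NewStub(channel);
  while (true) {
    PingRequest request;      // empty payload
    PingResponse response;
    grpc::ClientContext ctx;  // a fresh ClientContext is required for each call

    // One blocking call at a time: in-flight is always exactly 1 per worker.
    grpc::Status status = stub->Ping(&ctx, request, &response);
    if (!status.ok()) break;

    // Record the call latency and bump the request counter here.
  }
}
```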
The plotted results:

[Plots: throughput and latency vs. the number of in-flight requests, for regular and streaming RPCs over a single connection]

"Regular RPC (theoretical)" is calculated using the formula rps(inflight) = measured_rps_inflight_1 * inflight and represents ideal linear scalability. Unfortunately, even with a small in-flight count, the results deviate significantly from this ideal straight line. For example, at in-flight 10 the ideal line sits at roughly 104,000 req/s (10,435 × 10), while the measured throughput is about 37,000 req/s.

Here is the tabular data for RPC without streaming:

    Benchmark Results Summary:
    In-flight | Throughput (req/s) | P50 (us) | P90 (us) | P99 (us) | P99.9 (us) | P100 (us)
    ---------+--------------------+----------+----------+----------+------------+----------
            1 |           10435.10 |       91 |       98 |      106 |        116 |       219
            2 |           17583.50 |      100 |      131 |      235 |        273 |       372
            3 |           21754.50 |      126 |      171 |      251 |        319 |       538
            4 |           25904.20 |      144 |      195 |      283 |        358 |       491
            5 |           27468.70 |      168 |      243 |      348 |        442 |       606
            6 |           30093.90 |      190 |      255 |      366 |        461 |       630
            7 |           32224.70 |      210 |      270 |      394 |        502 |       770
            8 |           33289.90 |      234 |      298 |      424 |        552 |       732
            9 |           36014.90 |      247 |      299 |      411 |        599 |      1282
           10 |           37169.70 |      266 |      325 |      454 |        699 |       942
           11 |           39452.10 |      283 |      336 |      433 |        653 |      1017
           12 |           40515.50 |      297 |      359 |      537 |        840 |      1133
           13 |           41839.70 |      312 |      374 |      543 |        902 |      1172
           14 |           43360.70 |      327 |      389 |      536 |        892 |      1231
           15 |           43196.50 |      344 |      426 |      660 |       1038 |      1335
           16 |           43854.60 |      359 |      445 |      847 |       1156 |      1453
           17 |           44044.30 |      375 |      473 |      875 |       1210 |      1493
           18 |           44025.60 |      390 |      508 |     1043 |       1316 |      1578
           19 |           42714.70 |      411 |      581 |     1176 |       1415 |      1651
           20 |           42050.10 |      429 |      705 |     1270 |       1497 |      1764
           21 |           40584.50 |      453 |      848 |     1371 |       1585 |      1866

As you can see, adding 10× clients yields only a 3.7× increase in throughput, and adding 20× yields just a 4× gain. Moreover, latency grows roughly linearly with each additional client, and even at small in-flight counts it is far higher than the network latency. Because this is a closed loop, throughput is latency-bound.

The performance was clearly poor, so we began investigating. First, we checked the number of TCP connections using lsof -i TCP:2137 and found that only a single TCP connection was used regardless of the in-flight count. Next, we captured a tcpdump and analyzed it in Wireshark:

- As expected, there were no congestion issues, because we have a solid network.
- Nagle's algorithm is off (TCP_NODELAY is set).
- The TCP window was also good: set to 64 KiB, while in-flight bytes usually stayed within 1 KiB.
- There were no delayed ACKs.
- The server was fast to respond.

Meanwhile, we noticed the following pattern:

1. The client sends an HTTP/2 request to the server, batching RPCs from different workers.
2. The server ACKs.
3. The server sends a batched response (multiple datagrams) containing the data for the different workers, i.e., also batched.
4. The client ACKs. Now the TCP connection has no bytes in flight.
5. Around 150–200 µs of inactivity pass before the subsequent request.

Thus, the major source of latency is the client. And because our microbenchmark logic is minimal, the issue most likely lies somewhere in the gRPC layer, for example contention inside gRPC or its batching.

We tried both per-worker gRPC channels and a pool of channels, all created with the same arguments and therefore sharing the same TCP connection; neither helped. Creating channels with different arguments works well. Alternatively, it is sufficient to set the GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL channel argument, as we do here. The best performance, in both throughput and latency, is achieved when each worker has its own channel and this option is set.
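For reference, here is a minimal sketch (not the benchmark code itself) of creating a per-worker channel with this option enabled; with a local subchannel pool, channels no longer share a TCP connection even when their other arguments are identical.

```cpp
#include <memory>
#include <string>

#include <grpcpp/grpcpp.h>

// Create a dedicated channel for one worker. GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL
// gives the channel its own subchannel pool, so it gets its own TCP connection
// instead of being multiplexed onto a shared one.
std::shared_ptr<grpc::Channel> MakeWorkerChannel(const std::string& target) {
  grpc::ChannelArguments args;
  args.SetInt(GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL, 1);
  return grpc::CreateCustomChannel(
      target, grpc::InsecureChannelCredentials(), args);
}
```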
Commands to run the clients:

    echo "Multi connection, no streaming"
    taskset -c 0-31 ./grpc_ping_client --host ydb-vla-dev04-000.search.yandex.net --min-inflight 1 --max-inflight 48 --warmup 5 --interval 10 --with-csv --local-pool

    echo "Multi connection streaming"
    taskset -c 0-31 ./grpc_ping_client --host ydb-vla-dev04-000.search.yandex.net --min-inflight 1 --max-inflight 48 --warmup 5 --interval 10 --with-csv --streaming --local-pool

Results:

[Plots: throughput and latency vs. the number of in-flight requests, multi-connection client]

For regular RPCs, we see nearly a 6× improvement in throughput, and 4.5× for streaming RPCs. And when in-flight is increased, latency grows really slowly. It's possible that another bottleneck or resource shortage exists, but we stopped the investigation at this point.

To better understand when this bottleneck might become an issue, we repeated the same measurements in a network with 5 ms latency:

    100 packets transmitted, 100 received, 0% packet loss, time 99108ms
    rtt min/avg/max/mdev = 5.100/5.246/5.336/0.048 ms

[Plots: throughput and latency vs. the number of in-flight requests, 5 ms network]

It's clear that in this case everything is OK, and the multichannel version is only slightly faster when in-flight is high.

Conclusions

In our case, the two gRPC client-side "solutions" described in the official best practices — per-channel separation and multi-channel pooling — turned out to be two steps of a single, unified fix. Creating per-worker channels with distinct arguments (or enabling GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL) resolves the bottleneck and delivers both high throughput and low latency.

However, there may be other optimizations we're unaware of. If you know ways to improve performance or want to contribute, let us know in the comments or by filing an issue/PR on the benchmark's GitHub repo.