Introduction

At Databricks, Kubernetes is at the heart of our internal systems. Within a single Kubernetes cluster, the default networking primitives like ClusterIP services, CoreDNS, and kube-proxy are often sufficient: they offer a simple abstraction for routing service traffic. But when performance and reliability matter, these defaults begin to show their limits. In this post, we'll share how we built an intelligent, client-side load balancing system to improve traffic distribution, reduce tail latencies, and make service-to-service communication more resilient.

If you are a Databricks user, you don't need to understand this blog to use the platform to its fullest. But if you're interested in taking a peek under the hood, read on to hear about some of the cool stuff we've been working on!

Problem statement

High-performance service-to-service communication in Kubernetes has several challenges, especially when using persistent HTTP/2 connections, as we do at Databricks with gRPC.

How Kubernetes Routes Requests by Default

1. The client resolves the service name (e.g., my-service.default.svc.cluster.local) via CoreDNS, which returns the service's ClusterIP (a virtual IP).
2. The client sends the request to the ClusterIP, assuming it's the destination.
3. On the node, iptables, IPVS, or eBPF rules (configured by kube-proxy) intercept the packet.
4. The kernel rewrites the destination IP to one of the backend Pod IPs based on basic load balancing, such as round-robin, and forwards the packet.
5. The selected pod handles the request, and the response is sent back to the client.

While this model generally works, it quickly breaks down in performance-sensitive environments, leading to significant limitations.

Limitations

At Databricks, we operate hundreds of stateless services communicating over gRPC within each Kubernetes cluster. These services are often high-throughput, latency-sensitive, and run at significant scale. The default load balancing model falls short in this environment for several reasons:

- High tail latency: gRPC uses HTTP/2, which maintains long-lived TCP connections between clients and services. Since Kubernetes load balancing happens at Layer 4, the backend pod is chosen only once per connection. This leads to traffic skew, where some pods receive significantly more load than others. As a result, tail latencies increase and performance becomes inconsistent under load (the simulation sketch after this list illustrates the skew).
- Inefficient resource usage: When traffic is not evenly spread, it becomes hard to predict capacity requirements. Some pods get CPU- or memory-starved while others sit idle. This leads to over-provisioning and waste.
- Limited load balancing strategies: kube-proxy supports only basic algorithms like round-robin or random selection. There's no support for strategies like:
  - Weighted round robin
  - Error-aware routing
  - Zone-aware traffic routing
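To make the traffic-skew limitation concrete, here is a small, self-contained Scala simulation. It is illustrative only: the pod count, connection count, and per-connection request volumes are made-up numbers, and it compares per-connection pod selection (what Layer 4 balancing effectively gives long-lived HTTP/2 connections) with per-request selection.

```scala
import scala.util.Random

// Illustrative simulation (not production code): compares choosing a pod
// once per connection with choosing a pod for every request.
object SkewSimulation {
  def main(args: Array[String]): Unit = {
    val rng         = new Random(42)
    val pods        = 20   // hypothetical backend pods
    val connections = 50   // hypothetical long-lived client connections
    // Each connection carries a different request volume.
    val requestsPerConnection = Vector.fill(connections)(100 + rng.nextInt(900))

    // Layer 4 behavior: the pod is picked once per connection, so all of
    // that connection's requests land on the same pod.
    val perConnection = Array.fill(pods)(0L)
    requestsPerConnection.foreach { reqs =>
      perConnection(rng.nextInt(pods)) += reqs
    }

    // Per-request behavior: every request independently picks a pod.
    val perRequest = Array.fill(pods)(0L)
    requestsPerConnection.foreach { reqs =>
      (1 to reqs).foreach(_ => perRequest(rng.nextInt(pods)) += 1)
    }

    def report(name: String, load: Array[Long]): Unit =
      println(s"$name: min=${load.min} max=${load.max}")

    report("per-connection", perConnection) // typically heavily skewed
    report("per-request", perRequest)       // close to uniform
  }
}
```

With these toy numbers, the per-connection split typically shows a multiple-fold gap between the busiest and idlest pod, while the per-request split stays close to uniform.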
These limitations pushed us to rethink how we handle service-to-service communication within a Kubernetes cluster.

Our Approach: Client-Side Load Balancing with Real-Time Service Discovery

To address the limitations of kube-proxy and default service routing in Kubernetes, we built a proxyless, fully client-driven load balancing system backed by a custom service discovery control plane. The fundamental requirement was to support load balancing at the application layer and to remove the dependency on DNS from the critical path. A Layer 4 load balancer like kube-proxy cannot make intelligent per-request decisions for Layer 7 protocols (such as gRPC) that use persistent connections. This architectural constraint creates bottlenecks, necessitating a more intelligent approach to traffic management. The following table summarizes the key differences and the advantages of a client-side approach:

Table 1: Default Kubernetes LB vs. Databricks' Client-Side LB

| Feature/Aspect | Default Kubernetes Load Balancing (kube-proxy) | Databricks' Client-Side Load Balancing |
|---|---|---|
| Load balancing layer | Layer 4 (TCP/IP) | Layer 7 (application/gRPC) |
| Decision frequency | Once per TCP connection | Per request |
| Service discovery | CoreDNS + kube-proxy (virtual IP) | xDS-based control plane + client library |
| Supported strategies | Basic (round-robin, random) | Advanced (P2C, zone affinity, pluggable) |
| Tail latency impact | High (traffic skew on persistent connections) | Reduced (even distribution, dynamic routing) |
| Resource utilization | Inefficient (over-provisioning) | Efficient (balanced load) |
| Dependency on DNS/proxy | High | Minimal, and not on the critical path |
| Operational control | Limited | Fine-grained |

This system enables intelligent, up-to-date request routing with minimal dependency on DNS or Layer 4 networking. It gives clients the ability to make informed decisions based on live topology and health data.

The figure shows our custom Endpoint Discovery Service in action. It reads service and endpoint data from the Kubernetes API and translates it into xDS responses. Both Armeria clients and API proxies stream requests to it and receive live endpoint metadata, which is then used by application servers for intelligent routing, with fallback clusters as backup.

Custom Control Plane (Endpoint discovery service)

We run a lightweight control plane that continuously monitors the Kubernetes API for changes to Services and EndpointSlices. It maintains an up-to-date view of all backend pods for every service, including metadata like zone, readiness, and shard labels.

RPC Client Integration

A strategic advantage for Databricks was the widespread adoption of a common framework for service communication across most of its internal services, which are predominantly written in Scala. This shared foundation allowed us to embed client-side service discovery and load balancing logic directly into the framework, making it easy to adopt across teams without requiring custom implementation effort.

Each service integrates with our custom client, which subscribes to updates from the control plane for the services it depends on during connection setup. The client maintains a dynamic list of healthy endpoints, including metadata like zone or shard, and updates automatically as the control plane pushes changes. Because the client bypasses both DNS resolution and kube-proxy entirely, it always has a live, accurate view of service topology. This allows us to implement consistent and efficient load balancing strategies across all internal services.
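To illustrate the kind of state such a client keeps, here is a minimal Scala sketch. The type and method names (Endpoint, EndpointCache, onUpdate, snapshot) and the field set are hypothetical stand-ins for the metadata described above, not our actual client library.

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical shapes for illustration only; the fields mirror the endpoint
// metadata described above (zone, shard, readiness).
final case class Endpoint(host: String, port: Int, zone: String, shard: String, ready: Boolean)

// Holds the latest endpoint set pushed by the control plane. Reads are
// lock-free, so the per-request load balancing path never blocks on updates.
final class EndpointCache(service: String) {
  private val endpoints = new AtomicReference[Vector[Endpoint]](Vector.empty)

  // Called whenever the control plane streams a new snapshot for `service`.
  def onUpdate(latest: Vector[Endpoint]): Unit =
    endpoints.set(latest.filter(_.ready))

  // Called by the load balancer for every outgoing request.
  def snapshot(): Vector[Endpoint] = endpoints.get()
}

object EndpointCacheExample extends App {
  val cache = new EndpointCache("my-service")
  cache.onUpdate(Vector(
    Endpoint("10.0.1.12", 8080, zone = "us-west-2a", shard = "0", ready = true),
    Endpoint("10.0.2.34", 8080, zone = "us-west-2b", shard = "0", ready = true),
    Endpoint("10.0.3.56", 8080, zone = "us-west-2c", shard = "0", ready = false)
  ))
  println(cache.snapshot().map(e => s"${e.host}:${e.port} (${e.zone})").mkString(", "))
}
```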
Advanced Load Balancing in Clients

The RPC client performs request-aware load balancing using strategies like:

- Power of Two Choices (P2C): For the majority of services, a simple Power of Two Choices (P2C) algorithm has proven remarkably effective. This strategy randomly selects two backend servers and then chooses the one with fewer active connections or lower load. Databricks' experience indicates that P2C strikes a strong balance between performance and implementation simplicity, consistently leading to uniform traffic distribution across endpoints (a minimal sketch appears after this section).
- Zone-affinity-based: The system also supports more advanced strategies, such as zone-affinity-based routing. This capability is vital for minimizing cross-zone network hops, which can significantly reduce network latency and associated data transfer costs, especially in geographically distributed Kubernetes clusters. The system also accounts for scenarios where a zone lacks sufficient capacity or becomes overloaded. In such cases, the routing algorithm intelligently spills traffic over to other healthy zones, balancing load while still preferring local affinity whenever possible. This ensures high availability and consistent performance, even under uneven capacity distribution across zones.
- Pluggable support: The architecture's flexibility allows additional load balancing strategies to be plugged in as needed.

More advanced strategies, like zone-aware routing, required careful tuning and deeper context about service topology, traffic patterns, and failure modes; that is a topic for a dedicated follow-up post. To ensure the effectiveness of our approach, we ran extensive simulations, experiments, and real-world metric analysis. We validated that load remained evenly distributed and that key metrics like tail latency, error rate, and cross-zone traffic cost stayed within target thresholds. The flexibility to adapt strategies per service has been valuable, but in practice, keeping it simple (and consistent) has worked best.
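Below is a minimal Scala sketch of the P2C selection described above. It is illustrative only: the names and the in-flight request counter used as the load signal are assumptions for the example, not our production implementation.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.util.Random

// Illustrative Power of Two Choices selection. "Load" here is an in-flight
// request counter; any cheap per-endpoint signal would work.
final case class Backend(name: String, inflight: AtomicInteger = new AtomicInteger(0))

object P2C {
  private val rng = new Random()

  // Pick two distinct backends at random and return the less loaded one.
  def choose(backends: IndexedSeq[Backend]): Backend = {
    require(backends.nonEmpty, "no healthy backends")
    if (backends.size == 1) backends.head
    else {
      val i = rng.nextInt(backends.size)
      var j = rng.nextInt(backends.size - 1)
      if (j >= i) j += 1 // ensure the two picks differ
      val (a, b) = (backends(i), backends(j))
      if (a.inflight.get() <= b.inflight.get()) a else b
    }
  }
}

object P2CExample extends App {
  val backends = Vector(Backend("pod-a"), Backend("pod-b"), Backend("pod-c"))
  val rng      = new Random(7)
  val served   = scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)

  (1 to 9000).foreach { _ =>
    val chosen = P2C.choose(backends)
    chosen.inflight.incrementAndGet()
    served(chosen.name) += 1
    // Randomly complete some outstanding requests so in-flight counts vary.
    backends.foreach(b => if (b.inflight.get() > 0 && rng.nextBoolean()) b.inflight.decrementAndGet())
  }
  println(served) // expect a roughly even split across pod-a, pod-b, pod-c
}
```

The appeal of P2C is that it needs only a local, cheap load signal and two random lookups per request, which is why it has been the default for most of our services.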
xDS Integration with Envoy

Our control plane extends its utility beyond internal service-to-service communication. It also plays a crucial role in managing external traffic by speaking the xDS API to Envoy; xDS is the discovery protocol that lets clients dynamically fetch up-to-date configuration such as clusters, endpoints, and routing rules. Specifically, it implements the Endpoint Discovery Service (EDS) to provide Envoy with consistent and up-to-date metadata about backend endpoints by programming ClusterLoadAssignment resources. This ensures that gateway-level routing (e.g., for ingress or public-facing traffic) aligns with the same source of truth used by internal clients.

Summary

This architecture gives us fine-grained control over routing behavior while decoupling service discovery from the limitations of DNS and kube-proxy. The key takeaways are:

- Default Kubernetes routing balances at Layer 4, once per connection, which skews traffic across pods for long-lived gRPC connections and inflates tail latency.
- A custom control plane watches Services and EndpointSlices and streams live endpoint metadata to clients, taking DNS and kube-proxy off the critical path.
- Clients balance per request at Layer 7 using strategies like P2C and zone-affinity routing, with pluggable support for more.
- The same control plane programs Envoy via EDS, so internal and gateway-level routing share one source of truth.