The Hardware Load Balancer Brick Wall
Last month at the Networked Systems Design and Implementation (NSDI) conference, Google lifted the covers off Maglev, its distributed software network load balancer (LB). Maglev has been handling traffic for core Google services like Search and Gmail since 2008. Not surprisingly, it is also the load balancer that powers Google Compute Engine and enables it to serve a million requests per second without any cache pre-warming. Impressive? Absolutely! If you have been following application delivery in the cloud era, say over the last 6 years, you will have noticed another significant announcement, made at SIGCOMM ’13 by the Microsoft Azure networking team. Azure runs critical services such as blob, table, and relational storage on Ananta, its home-grown cloud-scale software load balancer running on commodity x86, rather than on more traditional hardware load balancers.
Both Google and Microsoft ran headlong into what can best be described as “the hardware LB brick wall”, albeit at different times and along different paths in their cloud evolution. For Google, it started circa 2008, when the traffic and flexibility needs of its exponentially growing services and applications went beyond the capability of hardware LBs. For Azure, it was circa 2011, when the exponential growth of its public cloud led to the realization that hardware LBs do not scale, forcing the team to build its own software variant.
So, what is this “hardware LB brick wall” that these web-scale companies ran into?
Some common themes emerge from the papers and blogs that describe these implementations:
- A scale-up model, instead of a more natural scale-out model, which inherently limits system performance to that of a single device.
- 1-1 redundancy that does not meet the High Availability (HA) requirements of modern cloud services.
- A lack of flexibility/programmability that makes it hard to integrate with applications.
- A need for expensive forklift upgrades that require over-provisioning and advance capacity planning.
Requirements of Modern Cloud-Era Application Stacks
Interestingly, these same themes have come up in conversations with Avi Networks’ customers, whether they be large enterprises, e-commerce platforms, large financials, or cloud service providers. In addition, the broader market has undergone several major transformations that have fundamentally altered how applications are developed and deployed today as compared to 4-6 years ago. Some of these are:
- An evolution to self-service IT that manages its infrastructure, platforms, or apps as a multi-tenant cloud, with resource sharing and tenant isolation when required.
- Spread of the DevOps and CI/CD models of application development. Applications are deployed every day if not several times a day.
- Programmatic integration of the load balancer into application deployment workflows. Coupled with CI/CD, this implies that Virtual IPs (VIPs) and their configurations are provisioned at a rapid rate as new and existing services are deployed, A/B tested, upgraded, served from multiple datacenters, and migrated across datacenters.
- Multi-cloud hybrid deployments with applications provisioned across a variety of environments, such as in a private OpenStack Cloud or a private vCenter Cloud, or a private bare-metal Linux Cloud, or a private Mesos/Kubernetes/Docker Cloud with optional bursting to public cloud for additional capacity.
- A 24/7/365 uptime requirement for applications, with expectations of no performance degradation, no maintenance windows for upgrades, and no downtime for capacity overhauls. Thus, visibility and analytics into an application’s health, with proactive suggestions on how to keep it up and running well, are key requirements for modern application deployments.
- Application evolution from monolithic to microservice/service-oriented patterns. This requires load balancing, visibility, and service discovery not just at the traditional network perimeter, but also deep within the application network in every rack of the datacenter.
How would traditional hardware LBs handle these requirements of a modern application stack? Clearly, they cannot. Some band-aid remedies have been released, such as packaging the hardware appliance code as a 1-1 HA Virtual Machine (VM) pair and making it available as an image to be installed in the public cloud. How does this address any of the requirements described above? It doesn’t - eventually the customer is forced to upgrade to a hardware platform as soon as their traffic exceeds the capacity of a single VM.
A Bird’s-Eye View of The Avi Vantage Platform
Three years ago, when we were iterating on the earliest versions of the Avi Vantage platform, we noticed the same themes described above playing out, and they have only accelerated since then. At Avi, it was clear to us that the broader industry was rapidly hurtling towards the same “hardware LB brick wall” that the web-scale companies ran into - traditional hardware LBs simply do not work in an application environment that has embraced the simplicity, flexibility, and performance characteristics of the cloud. Once this was clear, we set our minds on building a new application delivery platform that would address all these requirements for our customers. We are proud to say that with Avi Vantage, we have built such a platform - a distributed software application delivery platform running on commodity x86 that can be summarized thus:
- A centralized clustered controller that is a single point of management and visibility into applications irrespective of the cloud they are deployed in.
- A distributed Layer-7 data path with SSL termination that is dynamically provisioned in the cloud, can be scaled out infinitely through BGP/ECMP, and can be auto-scaled in/out based on user-configurable policies.
- A Visibility/Analytics engine on the controller that gives a real-time, fine-grained view into application health, metrics, logs, and client/user experience, thus enabling auto-scale to meet application requirements.
- A RESTful interface into the controller that can enable integration with any external orchestration engine or application deployment workflows.
- A fast, responsive, and modern HTML5 UI that builds upon the same RESTful interface to provide full configuration capability and rich dashboards into application performance and health.
The picture below shows a high-level view of the Avi Vantage Platform Architecture.
In this section, I will highlight a few key design principles we employed when designing the Avi Vantage platform:
- Software Defined Approach: Avi Vantage is a true software defined architecture - a consensus based centralized software controller cluster manages an infinitely scalable distributed software datapath (our software datapath is called Service Engine aka SE). This simple yet powerful architectural construct enables us to handle all requirements of modern applications such as a single point of configuration and visibility, multi-cloud capability, and auto-scale of the datapath.
- Network-based Scaleout: To address the scalability, performance, and elasticity requirements of the data path, Avi Vantage uses the upstream network router’s ECMP (Equal Cost Multipath) capability to spread connections over all Service Engines (SEs) that handle traffic for a VIP. Routes are announced by Avi through BGP updates pushed to the upstream routers - scale-in and scale-out are handled seamlessly through route withdrawals and announcements respectively. A combination of consistent hashing, flow updates between SEs, and periodic health checks of backend servers ensures that traffic continues flowing smoothly even across rolling upgrades of the Avi Vantage platform as well as rolling upgrades of the backend services.
- UserSpace (full kernel bypass) datapath software stack: With network-based scaleout, incoming traffic is uniformly distributed across the entire cluster of Service Engines. The total capacity of the cluster thus becomes N * C, where C is the capacity of a single SE and N is the number of SEs in the cluster. The capacity/performance of a single SE is therefore a critical factor in the overall system efficiency. To maximize performance, our SE stack runs completely in user-space with full kernel bypass, all the way from Layer-2 to Layer-7. It thereby avoids the well-known overheads of traditional network stacks, such as interrupt processing, system calls, and data copies across user-kernel boundaries [4, 5, 6, 7]. Additionally, we apply the principles of non-blocking run-to-completion interfaces, zero-copy interfaces, and minimized serialization/de-serialization of data between packet buffers and application buffers to create a highly efficient and performant software data path.
- Scaleout and Streaming Metrics/Log Analytics: One of the unique features and a key differentiator of Avi Vantage is its rich support for metrics and log analytics. Metrics and log data include very fine-grained information about every HTTP transaction, from the latency of the client-side connection, to the first-byte response latency from the server, to SE and server CPU and memory utilization, along with hundreds of other such data points that are all collected, processed, and indexed in real-time. Given a deployment of thousands of VIPs, hundreds of SEs, and several thousand backend servers, how do you build a scalable system that can provide a pulse of the application’s health in real-time? We employ the principles of distributed and scaled-out log and metric collectors on the SEs, data reduction, sharding across the controller cluster, in-line streaming metrics computations, and incremental log indexing with indices across multiple dimensions.
- Fast and Flexible HTTP Routing: Unlike Maglev and Ananta, which are software Layer-4 network load balancers, Avi Vantage is both a Layer-4 and a Layer-7 LB, with support for SSL termination at the VIP and re-encryption of traffic to backend services if desired. A key requirement of a Layer-7 LB is the ability to route traffic based on a variety of application-level parameters, such as (i) the incoming URI, query params, HTTP request headers, and cookies, (ii) the current load on the system, which may require serving from a backup pool, (iii) session affinity, which may require routing to services in a remote datacenter, and many other parameters. Routing at Layer-7 is both CPU and memory intensive - it requires parsing the request, extracting fields and matching them against several match criteria, and then executing a series of actions that rewrite and route the request to the intended backend service. What makes this problem harder (and more interesting) is that, in addition to speed, flexibility is also a key requirement. To handle both, we designed our Layer-7 routing engine to have a dual personality: (i) a native rule engine implementation that provides very fast matches and actions through efficient algorithms over IP and URI datasets, and (ii) a scripting engine that runs user-defined Lua scripts with well-defined APIs to the request/connection/session/pool state, with common operations accelerated through native code.
- SSL Acceleration through distributed session resumption: One of the oft-cited reasons for using hardware load balancers has been that they are a must for SSL termination. It turns out this is no longer true. Here is Adam Langley on Google’s deployment of TLS on commodity x86: “On our production frontend machines, SSL/TLS accounts for less than 1% of the CPU load, less than 10 KB of memory per connection and less than 2% of network overhead. Many people believe that SSL/TLS takes a lot of CPU time and we hope the preceding numbers will help to dispel that”. So, the software performance of TLS/SSL on modern x86 CPUs is no longer an issue. Also, thanks to TLS/SSL session resumption and HTTP keepalives, the cost of full SSL handshake processing is paid for only a fraction of an application’s secure traffic. With network-based scale-out, connections from the same client can be routed to different SEs that terminate SSL traffic for a VIP. To reap the benefits of both network scale-out and session resumption, we use an in-memory distributed datastore to store the encrypted server-side SSL session state and turn it over daily. For client-side TLS tickets, the ticket keys are distributed to all the SEs and rotated daily to preserve the benefits of Perfect Forward Secrecy (PFS), as described in “How to botch TLS forward secrecy”.
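To make the consistent hashing mentioned under network-based scaleout a little more concrete, here is a minimal Python sketch of a hash ring. It is not Avi's implementation: the class name `ConsistentHashRing`, the SE names, the flow-key format, and the virtual-node count are all assumptions made for illustration. What it shows is the property that matters for scale-out: when an SE is added, only the flows that land on the new SE change owner, and every other flow stays where it was.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is randomized per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class ConsistentHashRing:
    """Hypothetical hash ring that maps flow keys to Service Engines (SEs)."""

    def __init__(self, vnodes: int = 100):
        self.vnodes = vnodes  # virtual nodes per SE, to even out the spread
        self._points = []     # sorted hash points on the ring
        self._owner = {}      # hash point -> SE name

    def add(self, se: str) -> None:
        for i in range(self.vnodes):
            point = _hash(f"{se}#{i}")
            self._owner[point] = se
            bisect.insort(self._points, point)

    def remove(self, se: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != se]
        self._owner = {p: s for p, s in self._owner.items() if s != se}

    def lookup(self, flow: str) -> str:
        # The first point clockwise from the flow's hash owns the flow.
        idx = bisect.bisect(self._points, _hash(flow)) % len(self._points)
        return self._owner[self._points[idx]]


ring = ConsistentHashRing()
for se in ("se-1", "se-2", "se-3"):
    ring.add(se)

# Illustrative flow keys; a real datapath would hash the packet 5-tuple.
flows = [f"10.0.0.{i}:{40000 + i}->vip:443" for i in range(1000)]
before = {f: ring.lookup(f) for f in flows}

ring.add("se-4")  # scale-out event
after = {f: ring.lookup(f) for f in flows}

moved = [f for f in flows if before[f] != after[f]]
print(f"{len(moved)} of {len(flows)} flows moved after adding se-4")
```

With a plain `hash(flow) % N` scheme, by contrast, changing N would remap most flows, which is why some form of consistent hashing is the usual choice when the set of SEs grows and shrinks.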
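The metrics bullet above mentions data reduction on the SEs and in-line streaming computations. One common way to get both, sketched here in Python under my own assumptions (the `LatencyStats` class and its fields are hypothetical, not the Avi Vantage data model), is to keep small mergeable aggregates per SE and combine them centrally, instead of shipping every raw sample.

```python
from dataclasses import dataclass


@dataclass
class LatencyStats:
    """Hypothetical mergeable aggregate of request latencies, in milliseconds."""
    count: int = 0
    total_ms: float = 0.0
    max_ms: float = 0.0

    def observe(self, ms: float) -> None:
        """Fold one sample into the aggregate in O(1) time and space."""
        self.count += 1
        self.total_ms += ms
        self.max_ms = max(self.max_ms, ms)

    def merge(self, other: "LatencyStats") -> "LatencyStats":
        """Combine two aggregates, e.g. from two SEs serving the same VIP."""
        return LatencyStats(
            self.count + other.count,
            self.total_ms + other.total_ms,
            max(self.max_ms, other.max_ms),
        )

    @property
    def mean_ms(self) -> float:
        return self.total_ms / self.count if self.count else 0.0


# Each SE reduces its own samples locally...
se1, se2 = LatencyStats(), LatencyStats()
for ms in (10.0, 20.0, 30.0):
    se1.observe(ms)
se2.observe(40.0)

# ...and only the small aggregates are merged on the controller side.
vip = se1.merge(se2)
print(vip.count, vip.mean_ms, vip.max_ms)  # 4 25.0 40.0
```

Count, sum, and max merge exactly; percentiles do not, and would need a sketch structure such as a histogram, which this example leaves out.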
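For the HTTP routing bullet, the match-then-act flow of a rule engine can be sketched in a few lines of Python. This is only an illustration of first-match-wins routing on URI, cookie, and header fields; the `Request` and `Rule` types, the rule names, and the pool names are made up for the example and say nothing about how the native engine or the Lua APIs are actually structured.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Request:
    path: str
    headers: Dict[str, str] = field(default_factory=dict)
    cookies: Dict[str, str] = field(default_factory=dict)


@dataclass
class Rule:
    name: str
    match: Callable[[Request], bool]  # predicate over the parsed request
    pool: str                         # action: send to this backend pool


def route(req: Request, rules: List[Rule], default_pool: str) -> str:
    """Return the pool of the first rule that matches, else the default pool."""
    for rule in rules:
        if rule.match(req):
            return rule.pool
    return default_pool


# Hypothetical policy: rules are evaluated in order, first match wins.
rules = [
    Rule("api", lambda r: r.path.startswith("/api/"), "api-pool"),
    Rule("beta-users", lambda r: r.cookies.get("beta") == "1", "beta-pool"),
    Rule("mobile", lambda r: "Mobile" in r.headers.get("User-Agent", ""), "mobile-pool"),
]

print(route(Request("/api/v1/items"), rules, "web-pool"))             # api-pool
print(route(Request("/home", cookies={"beta": "1"}), rules, "web-pool"))  # beta-pool
print(route(Request("/home"), rules, "web-pool"))                     # web-pool
```

A linear scan like this is the flexible but slow end of the trade-off described above; a native engine would typically replace the per-rule predicates with indexed lookups (for example, prefix trees over URIs).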
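The daily rotation of TLS ticket keys described in the SSL bullet can be pictured with a small key ring. The Python sketch below is an assumption-laden illustration, not Avi's code: the `TicketKeyRing` class is hypothetical, and keeping exactly one previous key (so that tickets issued just before a rotation still resume) is my assumption rather than something stated above. It models only key selection by key name; the actual ticket encryption is omitted.

```python
import secrets
from collections import deque
from typing import Optional


class TicketKeyRing:
    """Hypothetical ring of session-ticket keys, replicated to every SE."""

    def __init__(self, keep: int = 2):
        # Newest key first. keep=2 means "current key + previous key".
        self._keys = deque(maxlen=keep)
        self.rotate()

    def rotate(self) -> bytes:
        """Create a fresh key; the oldest key falls off the end of the ring."""
        name, key = secrets.token_bytes(16), secrets.token_bytes(32)
        self._keys.appendleft((name, key))
        return name

    def current_name(self) -> bytes:
        """Name of the key that new tickets should be issued under."""
        return self._keys[0][0]

    def find(self, name: bytes) -> Optional[bytes]:
        """Look up the key for a presented ticket, or None if it has expired."""
        for key_name, key in self._keys:
            if key_name == name:
                return key
        return None


ring = TicketKeyRing()
day0 = ring.current_name()

ring.rotate()  # day 1: day-0 tickets can still be resumed
print(ring.find(day0) is not None)  # True

ring.rotate()  # day 2: the day-0 key is gone, so those clients do a full handshake
print(ring.find(day0) is None)      # True
```

Discarding old keys is what bounds the exposure: once a key has been dropped from every SE, tickets encrypted under it can no longer be decrypted by anyone who later compromises an SE.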
Note that in the above section, I have highlighted only a few key design principles/choices that helped us meet the complex and diverse requirements of building a modern cloud-era application delivery platform. Over the next few months, we will write a series of technical blogs that will cover more areas and go in depth into the challenges we faced when building this system, and how we solved them. Please add https://blog.avinetworks.com/tech to your favorite RSS reader - we have some fun technical stuff coming up!
The goal of this blog was to motivate why in the modern cloud-era of applications, we need a fundamentally different architecture for an application delivery platform. Google’s NSDI paper and Microsoft’s Sigcomm paper laid down the reasons why hardware load balancers did not work in their environments. In addition, over the last 4 to 6 years, there have been fundamental transformations in the way applications are developed and deployed today in the broader industry across enterprise clouds, e-commerce platforms, large financials, and cloud service providers. This makes the “hardware LB brick wall” a universal problem that is not just limited to applications at these two companies. And so we set out to solve this challenging problem - and with Avi Vantage, we have built an application delivery platform that brings the benefits of distributed software LBs to everyone.
References
- Maglev: A Fast and Reliable Software Network Load Balancer
- Google shares software network load balancer design powering GCP networking
- Ananta: Cloud Scale Load Balancing
- Speeding up Networking
- netmap: A Novel Framework for Fast Packet I/O
- The Lua Programming Language
- Overclocking SSL
- Securing the Enterprise with Intel AES-NI
- How to botch TLS forward secrecy