Not very long ago, one of our co-founders wrote a post on the million-dollar question in the enterprise networking world. In that post, Ranga discussed how hardware load balancers cannot scale elastically, which is why even web-scale companies such as Facebook and Google leverage software load balancers for elastic autoscaling to match traffic requirements.
As Avi Networks set out to build the next generation of software load balancers, we wanted them to be optimized and smart. One important consideration was using multiple analyses to understand and automate critical decisions that are usually made manually, often without enough data.
The Hardware Load Balancer Brick Wall
Last month at the Networked Systems Design and Implementation (NSDI) conference, Google lifted the covers off Maglev, their distributed network software load balancer (LB). Since 2008, Maglev has been handling traffic for core Google services like Search and Gmail. Not surprisingly, it's also the load balancer that powers Google Compute Engine and enables it to serve a million requests per second without any cache pre-warming. Impressive? Absolutely!
If you have been following application delivery in the era of cloud, say over the last six years, you would have noticed another significant announcement at SIGCOMM '13 by the Microsoft Azure networking team. Azure runs critical services such as blob, table, and relational storage on Ananta, its home-grown, cloud-scale software load balancer on commodity x86, instead of running them on more traditional hardware load balancers.
Both Google and Microsoft ran headlong into what can best be described as “the hardware LB brick wall”, albeit at different times and along different paths in their cloud evolution. For Google, it started circa 2008, when the traffic and flexibility needs of their exponentially growing services and applications went beyond the capability of hardware LBs. For Azure, it was circa 2011, when the exponential growth of their public cloud led to the realization that hardware LBs do not scale and forced them to build their own software variant.
So, what is this “hardware LB brick wall” that these web-scale companies ran into?
Whether it is a water-cooler conversation about the latest wearable health monitor or the current cautions from the CDC about the Zika virus, health may easily rank as one of the most talked-about topics in our daily lives. As a technologist, I am part of a number of conversations about a different kind of health - application health - which is a top-of-mind concern for enterprise application developers and administrators. The discussion of human health always evokes passionate debates - it turns out that this was no different with application health.
This was the case at Avi Networks when we asked a simple question - how do admins know that applications are in "good" health? I don't believe we had as many meetings and debates about any other topic as we had about application health. In this blog post, I will take you through some of those passionate yet fascinating discussions that led to the creation of the Avi Health Score - a key capability of the Avi Vantage Platform.
The team had people with diverse backgrounds, so we asked everyone the same question - "What does application health mean to you?" Here is a sample of the responses we received:
"Health is how much throughput my application can deliver. If it is doing 10Gbps, that means it is good."
"Health is bad when CPU and memory are above 100%."
"Health is good when latency is below 100ms."
"Health is good if the application is up and responding to the health checks."
In the real world, if I ask you, "Do you believe I am in good health if I ran 3 miles today?", then depending upon who you are, you will likely respond with "it depends"; "of course!"; "did you run just today or do you run every day?"; or "what was your heart rate and vitals after the run?" You will have a whole lot of follow-up questions to dig into the details. To put this in perspective, tennis champ Roger Federer would likely win in straight sets against most people even if he were running a fever. Would that make him healthy? Of course not!
As you can see, a single data point such as a 3-mile run is not enough for a doctor to give a certificate of good health. Similarly, if you think you can determine a server's health based on the simple fact that it can handle a throughput of 10Gbps, you are probably wrong. It was hard for me to come to terms with this, especially since I had spent most of my career prior to Avi Networks at a hardware company, where it was normal to consider networking hardware healthy when a link is up and pumping at a bandwidth of 10Gbps.
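To make the point concrete, here is a minimal Python sketch of why a single data point can mislead. It is purely illustrative and is not how the Avi Health Score is computed: the `Snapshot` fields, the function names, and every threshold are assumptions, loosely borrowed from the team responses quoted above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    """One hypothetical set of measurements for a server (all fields are assumptions)."""
    throughput_gbps: float
    cpu_percent: float
    latency_ms: float
    health_check_ok: bool


def healthy_by_throughput(s: Snapshot) -> bool:
    # The single-data-point view: "it is doing 10Gbps, so it is good".
    return s.throughput_gbps >= 10.0


def failing_signals(s: Snapshot) -> List[str]:
    # Look at several signals together; the thresholds are illustrative only.
    problems = []
    if not s.health_check_ok:
        problems.append("health check failing")
    if s.cpu_percent >= 95.0:
        problems.append("CPU saturated")
    if s.latency_ms >= 100.0:
        problems.append("latency at or above 100 ms")
    return problems


# A server that is pushing 10Gbps while saturated and slow to respond.
busy = Snapshot(throughput_gbps=10.0, cpu_percent=99.0,
                latency_ms=450.0, health_check_ok=True)

print(healthy_by_throughput(busy))  # the throughput-only check passes
print(failing_signals(busy))        # the other signals disagree
```

The throughput-only check calls this server healthy, while the CPU and latency signals say otherwise, which is the same gap as judging a person's health from one 3-mile run.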