Whether it is a water-cooler conversation about the latest wearable health monitor or the current cautions from the CDC about the Zika virus, health may easily rank as one of the most talked-about topics in our daily lives. As a technologist, I am part of a number of conversations about a different kind of health - application health - which is a top-of-mind concern for enterprise application developers and administrators. The discussion of human health always evokes passionate debates - it turns out that application health is no different.
This was the case at Avi Networks when we asked a simple question - how do admins know that applications are in "good" health? I don't believe any other topic generated as many meetings and debates as application health did. In this blog post, I will take you through some of those passionate yet fascinating discussions that led to the creation of the Avi Health Score - a key capability of the Avi Vantage Platform.
The team had people with diverse backgrounds, so we asked everyone the same question - "What does application health mean to you?" Here is a sample of the responses we received:
"Health is how much throughput my application can deliver. If it is doing 10Gbps that means it is good"
"Health is bad when CPU and memory are above 100%."
"Health is good when latency is below 100ms."
"Health is good if the application is up and responding to the health checks."
In the real world, if I ask you, "Do you believe I am in good health if I ran 3 miles today?", depending upon who you are you will likely respond with "it depends"; "of course!"; "did you run just today or do you run every day?"; or "what was your heart rate and vitals after the run?" You will have a whole lot of follow-up questions to dig into the details. To put this in perspective, tennis champ Roger Federer would likely win in straight sets against most people even if he were running a fever. Would that make him healthy? Of course not!
As you can see, a single data point such as a 3-mile run is not enough for a doctor to give a certificate of good health. Similarly, if you think you can determine a server's health based on the simple fact that it can handle a throughput of 10Gbps, you are probably wrong. It was hard for me to come to terms with this, especially since I had spent most of my career prior to Avi Networks at a hardware company where it was normal to consider networking hardware healthy when a link is up and pumping at a bandwidth of 10Gbps.
Applying Lessons from Human Health
We found that there was a lot of conviction and passion in everyone's responses, but taken on their own the answers just didn't make sense. It was then that I looked to draw upon ideas from human health.
For a moment, let us think of the Golden State Warriors as a web app. Then star athlete Stephen Curry would likely be an important microservice in this app. How would we measure his health? Certainly, when Curry is healthy, we expect him to score about 30 points per game on average. We also don't expect him to run out of stamina during a game. In good health (measured using vitals such as heart rate, blood pressure, etc.), Curry would be expected to be consistent in his performance.
We applied a similar philosophy to measuring application health. Here is a definition of health from Merriam-Webster that we found relevant:
"the condition of an organism with respect to the performance of its vital functions especially as evaluated subjectively or nonprofessionally"
We slightly enhanced that definition for Avi's "Health Score" of applications:
"The Avi Health (Score) is a measure of the application's performance and its consistency and it reflects any risks to the application due to factors like resources and security."
So far so good. We defined the Avi Health Score. However, the elephant in the room remained - how do we define the terms "performance", "risks", etc.? Here is how we further qualified the definition of Health Score:
- Performance Score: The application performance score reflects the ability of an application to meet and exceed SLAs. Performance metrics should unambiguously reflect good vs. bad and whether the application meets performance SLAs. With this, metrics like max concurrent connections or transactions per second became inadmissible because these metrics could not be expressed as good or bad transactions. Therefore, we built the platform to use a multitude of performance metrics such as response quality, connection quality, and client experience quality to represent application performance.
- Resources Risk (Penalty): The next important set of metrics identifies whether applications have enough resources (or stamina, energy, etc.) to consistently meet performance requirements. Avi's Health Score uses several metrics such as CPU utilization, memory utilization, software queue usage, and license usage. Rather than making it a direct sliding scale, we borrowed from how human health treats stamina and levy penalties only when resource utilization exceeds a threshold (say, 80%).
- Security Risk (Penalty): Just as humans perform better when they are in a safe mental, physical, and social environment, we concluded that security vulnerabilities to an application should result in lower health. For example, the use of weak ciphers for SSL results in a lower Avi Health Score.
- Application Performance Consistency (Anomaly penalty): We believe that the consistency of performance is a very important measure for application health. It doesn't matter if performance is okay during periods of light traffic if it tanks during peak usage.
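The four components above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Avi's actual algorithm: the post does not say how penalties are scaled or combined, so the linear ramp above the 80% threshold, the `max_penalty` value, the subtraction of penalties from the performance score, and the function names `resource_penalty` and `health_score` are all hypothetical.

```python
def resource_penalty(utilization_pct, threshold_pct=80.0, max_penalty=20.0):
    """Return a penalty only when utilization exceeds the threshold.

    Assumption: the penalty ramps linearly from 0 at the threshold up to
    max_penalty at 100% utilization. The post only states that penalties
    apply above a threshold (say, 80%).
    """
    if utilization_pct <= threshold_pct:
        return 0.0
    over = min(utilization_pct, 100.0) - threshold_pct
    return max_penalty * over / (100.0 - threshold_pct)


def health_score(performance, resource_pen, security_pen, anomaly_pen):
    """Combine a 0-100 performance score with the three penalties.

    Assumption: penalties are subtracted from the performance score and
    the result is clamped to the 0-100 range.
    """
    score = performance - resource_pen - security_pen - anomaly_pen
    return max(0.0, min(100.0, score))


# Example: good performance, CPU at 90%, a weak-cipher penalty, no anomalies.
cpu_pen = resource_penalty(90.0)
score = health_score(95.0, cpu_pen, 5.0, 0.0)
```

The threshold keeps a moderately busy application from being marked unhealthy; only utilization that threatens future performance lowers the score.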
Now that we had the definition of the Avi Health Score, we still needed to work out details such as how to combine network quality with HTTP response quality and how to calculate the score when both memory and CPU are above 80%.
Refining the Algorithm
While our debate on how to define the Avi Health Score was not settled yet, we had another passionate debate on how the numbers ought to work. Avi's analytics team created two rules that would serve as guiding principles for combining application health related information:
- When two metrics or health factors are of a different kind, use the least healthy factor to represent the health. For example, a person may have a very strong heart but a tumor in the brain. We don't want to average the health of those two organs; we want to highlight that she is not in good health due to the tumor in the brain. For software, we expressed these situations as health = min(health_A, health_B) when the health of two factors A and B needs to be combined.
- When two metrics or health factors are of similar entities then average out health across all the similar factors. For example, if a virtual service has 100 servers then the health of the pool is determined via average health of all the 100 servers given all the servers are expected to be of a similar nature.
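The two rules above translate almost directly into code. Here is a minimal sketch; the function names `combine_different` and `combine_similar` and the sample numbers are made up for illustration, and only the min and average rules themselves come from the text.

```python
from statistics import mean


def combine_different(*healths):
    """Factors of different kinds: the least healthy one wins."""
    return min(healths)


def combine_similar(healths):
    """Factors of the same kind (e.g. servers in a pool): average them."""
    return mean(healths)


# Hypothetical pool of four similar servers, one of them struggling.
server_health = [100.0, 100.0, 40.0, 80.0]
pool_health = combine_similar(server_health)

# Pool health and a (hypothetical) network-quality score are different
# kinds of factors, so the lower of the two represents the application.
network_health = 90.0
app_health = combine_different(pool_health, network_health)
```

Averaging lets one weak server in a large pool register without dominating, while min ensures a problem in one kind of factor is never masked by strength in another.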
Another important consideration was whether the health of an application should be based on instantaneous metrics or should incorporate performance history. Most of our users and employees fell into one of two groups: 1) those with a hardware industry pedigree, who said that application health should be based on instantaneous information, and 2) those with a software/ops background, who sided with looking at trend and history. Again, we used decision strategies from human health to break the tie - a person is not yet considered to be in good health if he or she is still recovering from a recent illness.
We made a choice to look at the previous 6 hours of metrics to determine an application's Health Score. In the Avi Vantage Platform, when an application admin sees a perfect health score (100), she can safely assume that for the previous 6 hours, the application's health was perfect and the application was meeting its performance expectations.
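To make the 6-hour window concrete, here is a small sketch. It assumes that per-interval scores are averaged over the window; the post does not say how history is folded into the score, so the averaging and the name `windowed_score` are hypothetical. Averaging does preserve the property described above: the result is 100 only if every sample in the window was 100.

```python
from datetime import datetime, timedelta


def windowed_score(samples, now, window=timedelta(hours=6)):
    """Average the scores whose timestamps fall within the window.

    samples: iterable of (timestamp, score) pairs.
    Returns None when no sample falls inside the window.
    Assumption: a plain average; the real weighting is not described.
    """
    recent = [score for ts, score in samples if now - window <= ts <= now]
    if not recent:
        return None
    return sum(recent) / len(recent)


now = datetime(2024, 1, 1, 12, 0)
samples = [
    (now - timedelta(hours=7), 0.0),    # outside the window, ignored
    (now - timedelta(hours=5), 100.0),
    (now - timedelta(hours=1), 50.0),   # a recent dip still counts
]
current = windowed_score(samples, now)
```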
The Score is Core to Avi
The process of deriving an objective score by combining a series of subjective measures wasn't easy. The Avi Health Score, along with the metrics and components that determine it, is a topic that is very dear to our hearts at Avi. Once we nailed the definition, building the software to cement it was the simpler part. Later, one of our technical advisors gave us his own definition of application health (it should reflect how an application is performing, that it is not running out of resources, and that there are no anomalies), and it matched our proposal verbatim even though he had not been privy to it.
Every new Avi employee gives us an opportunity to discuss the topic of application health. We know it is very important to our customers and employees alike. We hope to keep improving its measurement and characterization, as has been the case for human health over time!