In an earlier blog, “Avi Vantage: A Cloud-Scale Distributed Software Load Balancer For Everyone”, we described the high-level architecture of the Avi Vantage platform. The platform consists of a clustered centralized Controller, a scale-out distributed Layer-7 Reverse Proxy data path called the Service Engine (SE), a Visibility/Analytics engine, a RESTful interface to the Controller that enables integration with external orchestration engines, and a fast and responsive HTML5 UI. It is designed from the ground up to be a modern cloud-scale Application Delivery Controller (ADC) that enables application deployment across any cloud - private, public, or hybrid.
In this blog, we take a deep dive into one of the critical components of the SE datapath: the Application Routing Engine.
A classic networking stack is divided into several protocol layers, as described by the so-called Open Systems Interconnection (OSI) model. Just as Switches and Routers switch/route packets at the Data Link and Network Layers respectively, an Application Routing Engine switches/routes connections and transactions at the Application Layer of the OSI stack. A classic example of this is an HTTP Reverse Proxy - it parses incoming HTTP requests, extracts HTTP attributes such as URIs, Headers, Cookies, and Query Parameters from them, and then uses application-level logic to forward each request, with any required modifications, to a server. Since HTTP is the de facto lingua franca for web applications, we will focus on HTTP Routing Engines in this blog.
Real-World Use-Cases for Application Routing
Modern web application deployments are full of examples of complex routing logic. API Gateways, social media applications, streaming media applications, finance applications, payment gateways, and e-commerce applications are a few examples. A desire for 24/7/365 uptime, continuous deployment, a multi-datacenter footprint for performance and availability, and the need for security are some of the reasons for these complex traffic routing requirements. Before we dive into the challenges of application-level routing, let’s first look at a few real-world use cases for it:
- Content Switching for static/dynamic content: Websites/Applications have a mix of static and dynamic content. Examples of static content are media such as pictures, audio, and video clips. An example of dynamic content is a comments section on a social media site that is updated in real-time by a service that is a part of the social media application. As a user navigates through the pages of an application, the browser sends out multiple HTTP requests for the static and dynamic resources that make up the page elements. An HTTP Routing Engine is used to parse the requests and steer them appropriately towards the pool of servers that handle the specific type of resources. For instance, a Routing Engine deployed at the network edge (in a PoP) close to the user/client can serve static resources from a local pool of servers, while requests for dynamic resources are proxied to a pool of servers hosted in a remote data center.
- X-Forwarded-For Header: This is an HTTP Header, commonly known as XFF, that is used to identify the originating IP address of a user/client connecting to an application, possibly through a chain of proxies. Since this header is easy to forge, applications usually have a non-trivial set of rules for reasoning about its authenticity. For instance, an application might want to match the client IP of the TCP connection against a well-known dataset of “trusted downstream proxies” as a litmus test. If that IP is trusted, the application parses the chain of IPs in the XFF Header and tries to verify the next hop, and so on. Verifying each hop requires parsing the XFF header, doing lookups against a large IP DataSet, and executing application-specific logic. At the end of this exercise, the application has determined the originating client IP address and can use it for logging, authentication, billing, etc.
- Rate Limiting Clients and/or URIs: This use case is interesting because it shows up in a variety of applications. An API gateway might want to rate limit to offer different SLAs for different customers. Clients can be rate limited based on geolocation, or even completely blocked in certain cases. Rate Limiting itself can vary from simple blacklist/whitelists at one end of the spectrum all the way to automatic learning of a client/URI behavior by tracking the completion status of the requests and using that information to adjust the limits dynamically.
- SSL Certificate based Admission Control: Public Key Infrastructure (PKI) is a critical component for securing Web Applications. It is commonly used by clients to authenticate the servers they are communicating with. Increasingly, modern applications use client certificates to authenticate clients connecting to them. Secure applications like payment gateways go beyond PKI. In addition to client authentication through PKI (verifying that the client certificate is valid, is not revoked, and has a chain of trust leading to a root trusted by the application), the application can have additional policies that dictate that only clients with certain subject names are allowed to connect on service endpoints. Thus, after PKI validation is successful, additional admission control is done to determine whether to accept or deny the client request.
The above examples give a flavor of the variety of use cases that are served by an Application Routing Engine. Essentially, it enables an application to encode fine-grained policies that control the flow of traffic as the application wishes, covering a gamut of features including security, logging, billing, disaster recovery, authentication, and continuous integration/continuous deployment.
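As one concrete illustration, the X-Forwarded-For verification described above can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not the Avi Vantage implementation: the TRUSTED_PROXIES dataset, the helper names, and the right-to-left walking order are all hypothetical choices made for the example, and the addresses come from the reserved documentation ranges.

```python
import ipaddress

# Hypothetical dataset of trusted downstream proxies (documentation-range addresses).
TRUSTED_PROXIES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.10/32"),
]


def is_trusted(ip_text):
    """Return True if ip_text parses as an IP inside a trusted proxy network."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return False  # malformed entries are never trusted
    return any(ip in net for net in TRUSTED_PROXIES)


def originating_client_ip(peer_ip, xff_header):
    """Walk the XFF chain right to left, starting from the TCP peer.

    A hop's report is believed only while the hop itself is trusted.
    The first untrusted address reached is treated as the originating client.
    """
    hops = [h.strip() for h in xff_header.split(",") if h.strip()]
    current = peer_ip
    while is_trusted(current) and hops:
        current = hops.pop()  # rightmost entry was appended by `current`
    return current
```

If the TCP peer itself is not in the trusted dataset, the header is ignored entirely and the peer address is returned, which is what makes a forged header from an arbitrary client harmless in this sketch.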
Requirements for an Application Routing Engine
When designing any system, the first step after listing the use cases is almost always to abstract out the details and quickly get down to the core set of requirements. After a few iterations, we narrowed the requirements down to:
- Fine Granular Control: Ability to attach to the “request/response” processing pipeline at every possible stage, for instance “on client TCP connection”, “on client SSL handshake”, “on parsing HTTP request headers”, "on parsing HTTP request payload", “on doing a persistence/load-balancing decision”, “on establishing server TCP connection”, “on establishing SSL handshake to the server”, “on parsing HTTP response headers”, “on receiving the HTTP response payload” etc.
- DataSets and Operators: Ability to match against large data sets of IPs, URIs, Headers, Payload, etc. This led to the requirement of having datasets with support for a variety of operators such as “begins_with”, “ends_with”, “contains”, “regex”, "equals" etc.
- Multiple Matches and Multiple Actions: Ability to match multiple parameters of an incoming request/transaction - for instance, client IP, incoming port, URI, HTTP Headers, query params, etc. Ability to execute multiple matching actions on a rule match - for instance, replace some HTTP headers from the client request, add some new HTTP headers with values parsed out from the incoming client stream or through table/dataset lookups, rewrite the URI by doing regex match and token replacements, and decide which pool of servers to forward the request to.
- Performance: Ability to run the Routing Engine on every request without a perceptible difference in application latency to the user/client. Performance is always a key requirement of a datapath. Every request, as it traverses the engine, goes through parsing, deserialization, lookup/matches, rewrites, and finally serialization and proxying to the server. Thus, efficient implementations of all these operations are important.
- Flexibility: Ability to execute arbitrary application logic on requests in both stateless and stateful ways. For stateful logic, the state can have a variety of scopes and lifetimes - for instance, the scope and lifetime could be that of an HTTP request, a TCP Connection, an SSL session, a Virtual Service, or even Global, and if required the state should be synced across SEs for a scaled-out Virtual Service.
- Usability: Ability to be used and deployed in a simple way. This implies the matches, actions, and application logic should be easy to specify. It should not require having to learn complicated syntax and semantics.
- Visibility/Debuggability: Ability to pinpoint and quickly zero in on reasons for changes in traffic routing. This lets application owners get visibility into and modify the routing logic quickly, correctly, and fearlessly.
- Logging: Ability to decide “when” and “how much” of a request/transaction to log for analysis. Like the Visibility/Debuggability requirement, this helps application owners to understand and better control traffic flows and patterns in their applications.
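To make the "DataSets and Operators" requirement above more concrete, here is a small Python sketch of matching a value against a dataset under a named operator. The operator names come from the list above; the OPERATORS table, the dataset_match helper, and the STATIC_SUFFIXES dataset are hypothetical names introduced for illustration. The linear scan is only for clarity; a production datapath would presumably use data structures tuned for large datasets.

```python
import re

# Operator names follow the ones listed above; the implementations are illustrative.
OPERATORS = {
    "equals": lambda value, entry: value == entry,
    "begins_with": lambda value, entry: value.startswith(entry),
    "ends_with": lambda value, entry: value.endswith(entry),
    "contains": lambda value, entry: entry in value,
    "regex": lambda value, entry: re.search(entry, value) is not None,
}


def dataset_match(dataset, operator, value):
    """Return True if value matches any entry of dataset under operator."""
    test = OPERATORS[operator]
    return any(test(value, entry) for entry in dataset)


# Hypothetical dataset of static-resource suffixes.
STATIC_SUFFIXES = [".png", ".jpg", ".mp4"]
```

With this shape, a content-switching decision reduces to a single call such as dataset_match(STATIC_SUFFIXES, "ends_with", path), and the same dataset can be reused by any rule that needs it.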
Design of an Application Routing Engine
Now that we had the core set of requirements understood and summarized, the next step was to distill them so the right abstractions could be built. The first thing that stood out was that two equally important requirements, "Performance" and "Flexibility", were somewhat at odds with each other. We realized that there was no one-size-fits-all solution. The right design for the Application Routing Engine needed to support a dual abstraction - we call them Policies (fast, flexible enough, can handle most common use-cases) and DataScripts (flexible, fast enough, can handle all use-cases). All other requirements such as Logging, Debuggability, Fine Granular Control, Usability, DataSets, and Multiple Matches/Multiple Actions were applicable to both Policies and DataScripts.
The Policies Abstraction
The Policy abstraction supports a majority of real-world use-cases through native code implementation. A Policy is an ordered collection of Rules. A Rule consists of matches across several parameters and datasets, and a series of actions. Execution begins at the first Rule in the Policy and continues until a Rule matches. At that point all actions for that Rule are run, and the execution engine moves on to the next Policy.
There are four types of Policies:
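The first-match semantics described above can be modeled in a short Python sketch. This is an illustration of the evaluation order only, not the native implementation: the Rule and Policy classes, the set_pool helper, and the "/static/" prefix test are assumptions made for the example, while the pool names echo the HTTP Request Policy example in this blog.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Rule:
    name: str
    match: Callable[[Dict], bool]          # predicate over the request
    actions: List[Callable[[Dict], None]]  # run in order on a match


@dataclass
class Policy:
    name: str
    rules: List[Rule] = field(default_factory=list)

    def evaluate(self, request):
        """Run the actions of the first matching rule; return its name or None."""
        for rule in self.rules:
            if rule.match(request):
                for action in rule.actions:
                    action(request)
                return rule.name  # stop: later rules in this policy are skipped
        return None


def run_policies(policies, request):
    """Evaluate policies in order; each contributes at most one matched rule."""
    return [(p.name, p.evaluate(request)) for p in policies]


def set_pool(name):
    """Build an action that records the selected pool on the request."""
    def action(request):
        request["pool"] = name
    return action


request_policy = Policy("http-request", [
    Rule("static", lambda r: r["path"].startswith("/static/"), [set_pool("Pool-Static")]),
    Rule("default", lambda r: True, [set_pool("Default-Pool")]),
])
```

Because evaluation stops at the first match, rule order carries meaning: the catch-all "default" rule must come last, or it would shadow every rule after it.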
- Network Security Policy: Rules from this policy run during the TCP connection establishment phase. Example: reject all connections from IP subnet X.
- HTTP Security Policy: Rules from this policy run after the HTTP Request URI and Headers are parsed. Example: reject all connections where the User-Agent does not belong to an allowed DataSet of User Agents.
- HTTP Request Policy: Rules from this policy run after the HTTP Request URI and Headers are parsed. Example: Modify the incoming XFF Header by appending the IP Address of the client to the current header value "and" Add an X-SSL-FingerPrint Header with the value being the client’s SSL certificate fingerprint "and" Pick Pool-Static if the URI belongs to the Static Resource DataSet, else go to the Default Pool.
- HTTP Response Policy: Rules from this policy run after the HTTP Response Status and Headers are parsed. Example: Remove X-Application-Internal Headers before forwarding the response to the client.
The Match does a Regex compare of the incoming Path to a DataSet of Regexes (Protected-URIs in the screenshot), and the actions insert an X-Forwarded-For Header with the value of the Client IP, and select the "protected-pool" for serving the request. In addition, an Application Log is generated for this request and all the HTTP Headers are logged.
The pictures below show a log of the request. The middle column of the log UI indicates that the request matched against the "add-xff-select-pool" Policy rule. The right column shows the server was selected from the "protected-pool". The "Full Headers" view of the request log shows the "X-Forwarded-For" header added to the request as it is proxied to the backend server.
The example above is intentionally kept simple, yet demonstrates the expressiveness and simplicity of the Policies abstraction. Policies are described in detail in the Avi Networks Knowledge Base here.
The DataScript Abstraction
The DataScript abstraction captures all real-world use cases by embedding a small footprint Lua virtual machine in the datapath to run application defined Lua scripts. Since the application logic can now be written in Lua, it can capture arbitrary processing, both stateless and stateful. In addition to being able to run native Lua code, the Routing Engine exposes several powerful constructs into Lua through a set of libraries. Here are a few examples:
- Full read/write access to the Reverse Proxy State pertaining to the TCP connection, SSL session, HTTP Request, Virtual Service, Pool etc: For instance, an application DataScript can get access to the fingerprint of the Client SSL Cert, or incoming TCP port, or the number of servers that have passed Health Monitor checks in the Pool.
- DataSets such as String-Groups and IP-Groups and their associated operators such as “Begins_with”, “Contains”, “RegEx” etc: This means that efficient native code implementations of DataSets and Operators can be leveraged inside the application logic that is written in Lua.
- Read/Write variables of different scopes and lifetime: This lets application logic track state within a request, across requests in a connection, across connections in a session, sessions in a Virtual Service etc. A Virtual Service in Avi Vantage is scaled-out across multiple SEs. DataScripts support a distributed table/key-value store for read/write access across the multiple SEs that can be accessed from Lua.
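The idea of variables with different scopes and lifetimes can be illustrated with a small Python model. Everything here is hypothetical: the ScopedState class, its method names, and the scope labels are invented for the sketch, it models a single connection on a single SE, and it does not attempt to show the distributed key-value store that syncs state across SEs.

```python
class ScopedState:
    """Key-value state partitioned by scope; ending a scope drops its variables."""

    SCOPES = ("request", "connection", "ssl_session", "virtual_service", "global")

    def __init__(self):
        self._vars = {scope: {} for scope in self.SCOPES}

    def set(self, scope, key, value):
        self._vars[scope][key] = value

    def get(self, scope, key, default=None):
        return self._vars[scope].get(key, default)

    def end(self, scope):
        """Called when a request, connection, or session finishes."""
        self._vars[scope].clear()


# Count requests on one connection while per-request state is discarded.
state = ScopedState()
for path in ("/a", "/b"):
    state.set("request", "path", path)
    count = state.get("connection", "requests", 0)
    state.set("connection", "requests", count + 1)
    state.end("request")
```

After the loop, the connection-scoped counter has survived both requests while the request-scoped path is gone, which is the distinction between lifetimes that the bullet above describes.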
The picture below shows a UI screenshot of a DataScript that is executed for a request after its HTTP headers are parsed.
The DataScript checks if the request was received over an HTTPS port. If yes, it fetches the Subject Name of the client's SSL cert and the request path. It then matches the path against a DataSet of "Tracked-Paths". If there is a match in the dataset, a new HTTP Header "X-Client-Subject-DN" is inserted in the HTTP request forwarded to the server, and the client's subject name is logged in the request log.
The picture below shows a log of the request. The middle column of the log UI shows the DataScript Log message that indicates the path matched and the client is being tracked. The right column shows the URI/Path of the request that matched the "Tracked-Paths" DataSet.
The picture below is a screenshot of the "Full Headers" for the request Log. It shows that the "X-Client-Subject-DN" header was inserted by the DataScript in the request forwarded to the server.
The example DataScript above shows the power of embedding a full-fledged language runtime in the Application Routing Engine. We chose Lua for its small footprint, its simplicity without compromising expressiveness, and its extensibility, amongst other advantages. Our SE Datapath employs optimized memory allocators for both per-core memory and memory shared across cores. The Lua interpreter makes no assumptions about memory allocation and lets an application override the default allocators. This works great since we can plug in our SE memory allocators and tightly control the memory allocated even inside the Lua runtime.
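For readers without access to the screenshot, the steps just described can be modeled in Python. This is a sketch of the logic only and is not DataScript or Lua syntax: the on_http_request function, the dict keys standing in for proxy state, the contents of TRACKED_PATHS, the use of port 443, and the choice of a prefix match are all assumptions made for the example.

```python
# Hypothetical contents for the "Tracked-Paths" DataSet.
TRACKED_PATHS = ["/payments", "/account"]


def on_http_request(request, log):
    """Model of the steps above, run after the request headers are parsed.

    `request` is a plain dict standing in for proxy state; the keys used
    here are assumptions for illustration, not Avi Vantage field names.
    """
    if request["port"] != 443:  # assumption: 443 is the HTTPS port
        return
    subject = request["client_cert_subject"]
    path = request["path"]
    # Assumption: the dataset is matched with a "begins_with" style operator.
    if any(path.startswith(prefix) for prefix in TRACKED_PATHS):
        request["headers"]["X-Client-Subject-DN"] = subject
        log.append(f"tracking client {subject} on {path}")
```

A request arriving on another port, or for a path outside the dataset, passes through untouched: no header is added and nothing is logged.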
Here are a few examples of the libraries we export to the DataScript/Lua environment:
- avi.vs.<apis>: This exports the VirtualService state. For instance, avi.vs.client_ip() returns the IP address of the client that initiated the connection.
- avi.http.<apis>: This exports all the attributes of the HTTP request/response. For instance, avi.http.get_cookie_names() returns all the cookies in the HTTP request or the response.
- avi.ssl.<apis>: This exports the SSL state. For instance, avi.ssl.cipher() returns the SSL ciphers that were negotiated for the session.
- avi.pool.<apis>: This lets the user read the status of the pool such as number of servers that have passed HealthMonitor checks. It also lets the user select a specific pool or a server within a pool.
- avi.stringgroup.<apis>: This gives access to String and Map DataSets and their associated operators such as "begins_with", "ends_with", "contains", "equals", "regex" etc. Note, the DataSets are system objects that are defined outside DataScripts, for instance, they could be generic whitelists/blacklists that are used elsewhere even outside of Lua. Through the DataScript libraries, they are exposed to Lua for fast pattern matching operations.
- avi.crypto.<apis>: This exposes encrypt/decrypt operations to DataScripts. For instance, these routines could be used to encrypt some sensitive headers as they are sent to the client, and decrypt them before proxying them back to the server.
DataScripts are described in detail in the Avi Networks Knowledge Base here.
In this blog, we listed some of the use cases for Application Level Routing in modern-era cloud applications. We described how the Application Routing Engine in Avi Vantage meets these use cases while satisfying all the core requirements of speed, flexibility, logging, debuggability, fine granular control, usability, multiple matches/multiple actions, and distributed datasets. Our customers love the Policy and DataScript abstractions. For the common cases, they are able to easily configure traffic routing Policies. For more involved use cases, they jump right in and confidently write the DataScripts in Lua using the powerful constructs exported from the native proxy environment.
Want to join us in building the next generation cloud-scale distributed software Application Delivery Controller? We are Hiring!