In the first post of the Cloud-Native series, “Chestnuts in the cloud, or what it means that my software is Cloud Native”, we presented resilience as one of the fundamental attributes that help ensure our systems are reliable and operate with practically no service interruptions.
Let’s start by defining resilience: it is the ability to react to a failure and recover from it to continue operating while minimising any impact on the business. Resilience is not about avoiding failures, but about accepting them and building a service in such a way that it is able to recover and return to a fully functioning state as quickly as possible.
Cloud-Native systems are based on distributed architectures and are therefore exposed to a larger set of failure scenarios compared to the classical monolithic application model. Examples of failure scenarios are:
- Unexpected increases in network latencies that can lead to communication timeouts between components and reduce quality of service.
- Network micro-outages causing connectivity errors.
- Downtime of a component, with restart or change of location, which must be managed transparently to the service.
- Overloading of a component that triggers a progressive increase in its response time and may eventually trigger connection errors.
- Orchestration of operations such as rolling updates (system update strategy that avoids any loss of service) or scaling/de-scaling of services.
- Hardware failures.
Although cloud platforms can detect and mitigate many of the failures in the infrastructure layer on which the applications run, to obtain an adequate level of resilience of our system, it is necessary to implement certain practices or patterns at the level of the application or software system deployed.
Let’s talk now about which techniques or technologies help us achieve resilience in each of the layers presented: infrastructure layer and software layer.
Resilience at the hardware level can be achieved through solutions such as redundant power supplies, redundant arrays of storage drives (RAID), etc. However, these protections cover only certain failures, and we will have to resort to other techniques, such as redundancy and scalability, to achieve the desired level of resilience.
Redundancy consists of, as the word itself indicates, replicating each of the elements that make up the service, so that any task or part of a task can always be performed by more than one component. To do this, we must add a mechanism, such as a load balancer, that distributes the workload between these duplicate ‘copies’ within each workgroup. The level of replication needed depends on the business requirements of the service, and it affects both the cost and the complexity of the service.
It is recommended to identify the critical flows within the service and to add redundancy at each point of those flows, thus avoiding single points of failure. A single point of failure is a component whose failure would cause a total system failure.
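To make the idea of distributing work across redundant copies more concrete, here is a minimal sketch of a round-robin balancer that skips replicas marked as down. The class name `RoundRobinBalancer`, its methods and the replica names are hypothetical, chosen only for illustration; a real load balancer would also run health checks on its own.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hypothetical in-memory balancer over redundant replicas."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.healthy = set(self.replicas)
        self._ring = cycle(self.replicas)  # endless round-robin iterator

    def mark_down(self, replica):
        # Called when a replica is detected as failed.
        self.healthy.discard(replica)

    def mark_up(self, replica):
        # Called when a replica has recovered.
        if replica in self.replicas:
            self.healthy.add(replica)

    def pick(self):
        # Try each replica at most once per call, skipping unhealthy ones.
        for _ in range(len(self.replicas)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy replica available")


balancer = RoundRobinBalancer(["replica-a", "replica-b", "replica-c"])
first_round = [balancer.pick() for _ in range(4)]

balancer.mark_down("replica-b")  # simulate the failure of one copy
after_failure = [balancer.pick() for _ in range(3)]
```

As long as at least one replica stays healthy, callers keep getting an answer, which is the point of avoiding a single copy of any component.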
It is also common to add multi-region redundancy with geo-replication of the information and distribute the load by means of DNS balancing, thus directing each request to the appropriate region according to the distance from its geographical origin.
Although we will discuss scalability in greater depth in future posts, it is worth noting here that designing scalable systems is also fundamental to achieving resilience. Scalability, the capacity to adjust resources to the workload by increasing or decreasing their number, helps to avoid failure situations such as communication timeouts due to excessive response times, service failures due to overload, or the degradation of storage subsystems due to massive information ingestion. There are two types of scaling: horizontal scaling or scale out, and vertical scaling or scale up. In simplified terms, vertical scaling consists of increasing the power of a machine (CPU, memory, disk, etc.), while horizontal scaling involves adding more machines.
The ability to scale a system horizontally is closely related to redundancy. We can see the former as building on the latter: a non-redundant system cannot be horizontally scalable, whereas a redundant system becomes horizontally scalable once we add feedback that tells us, from the real-time load of the system, how far it should grow or shrink to match the demand at any given time. Note that this also establishes a relationship with observability, which is responsible for providing the metrics needed to monitor the load and drive the auto-scaling systems.
There are libraries in many languages to implement resilience techniques in the application, and we can also resort to more orthogonal solutions such as a service mesh, which facilitates this task and keeps it completely decoupled from our business logic.
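The feedback loop described above can be sketched as a simple proportional rule: scale the number of replicas by the ratio between the observed load and the load we would like each replica to carry. This is only an illustrative assumption, not a prescribed algorithm; the function name `desired_replicas`, the load unit (CPU percentage) and the minimum and maximum bounds are all hypothetical.

```python
import math


def desired_replicas(current_replicas, observed_load, target_load,
                     min_replicas=2, max_replicas=10):
    """Return how many replicas we would like, given the observed load.

    observed_load and target_load are average per-replica figures in the
    same unit (here assumed to be CPU percentage).
    """
    if current_replicas < 1 or target_load <= 0:
        raise ValueError("need at least one replica and a positive target")
    # Proportional rule: grow or shrink by the observed/target ratio.
    raw = math.ceil(current_replicas * observed_load / target_load)
    # Keep a redundancy floor and a cost ceiling.
    return max(min_replicas, min(max_replicas, raw))


scale_out = desired_replicas(4, observed_load=90, target_load=60)
scale_in = desired_replicas(4, observed_load=30, target_load=60)
```

The lower bound keeps some redundancy even when the load is very low, and the upper bound caps cost; both values would come from the business requirements of the service.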
As mentioned at the beginning of this post, it is essential to incorporate resilience into the design of the software itself in order to successfully face all the challenges of distributed systems. The logic of the service must treat failure as a regular case and not as an exception: it must define how to act in case of failure and determine the contingency action when the preferred path is not available. The latter is known as the fallback action or backup configuration for that failure case.
Apart from the fallback pattern, there is a set of architecture patterns oriented to providing resilience in a distributed system, such as:
- Circuit Breaker: this pattern helps a service to recover or decouple from both performance drops due to subsystem overloads and complete outages of parts of the application. When the number of consecutive failures reported by a component exceeds a certain level, it is often the prelude to something more serious: the total failure of the affected subsystem. By temporarily blocking further requests, the component in trouble has a chance to recover and avoid further damage. This temporary cushion may be enough for the auto-scaling system to intervene and replicate the overloaded component, thus avoiding any loss of service to its clients.
- Timeouts: simply limiting the time the sender of a request will wait for its response may be the key to avoiding overloads caused by resources piling up while they wait, thus facilitating the resilience of the system. If microservice A requires microservice B and the latter does not respond within the defined timeout, there is no indefinite wait: microservice A regains control and can decide whether to keep trying or not. If the problem was caused by a network outage or an overload of microservice B, a retry may be sufficient to direct the request to an already recovered instance of B or to a new instance free of load. If there are no further retries, microservice A can free resources and execute the defined fallback.
- Retries: the two previous techniques, circuit breaker and timeouts, have already indirectly introduced the importance of retries as a base concept for resilience. But is it possible to incorporate retries in communications between components for free? Let’s imagine, continuing with the previous example, that microservice A makes a request to B and, due to a momentary network outage, B’s response does not reach A. If A incorporates retries, then when the waiting time of that call (the timeout) ends, A will regain control and make the request to B again, so B will do the work twice, with the consequences that may arise. For example, if that request were to subtract a purchase from the stock of products, the withdrawal would be recorded twice and leave an incorrect balance in the stock books. This is why the concept of idempotency is introduced. An idempotent service is immune to duplicate requests, i.e., processing the same request repeatedly does not cause inconsistencies in the final result, which gives rise to “safe retries”. This immunity comes from a design that contemplates idempotency from the beginning. In the stock update example, the request should include a purchase identifier, and microservice B should record each identifier it processes and check that an incoming identifier has not already been processed before applying the change again.
- Cache: now that we know why retries are needed, a cache is a natural complement to them. If you use a cache to automatically store the responses of a microservice, you are helping both to reduce the pressure on it and to provide a fallback in case of certain anomalies. In the case of a retry, the cache ensures that the component does not have to repeat a previously completed job and can return the stored result directly to the caller.
- Bulkhead: this last pattern consists of dividing the distributed system into “isolated” and independent parts, also called pools, so that if one of them fails, the others can continue to function normally. This architectural tool can be seen as a containment technique, comparable to a fire wall or to the watertight compartments that divide a ship into sections and prevent water from spreading between them. It is advisable, for example, to isolate a set of critical components from the standard ones. It should also be noted that such divisions can sometimes reduce resource efficiency and add to the complexity of the solution.
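The circuit breaker described in the list above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a production implementation: the names `CircuitBreaker` and `CircuitOpenError`, the failure threshold and the reset timeout are hypothetical, and the clock is injectable only so that the behaviour can be exercised without waiting.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker blocks a request without calling the target."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0        # consecutive failures seen so far
        self.opened_at = None    # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: block the request so the target can recover.
                raise CircuitOpenError("circuit open; request blocked")
            # Timeout elapsed: half-open, let one trial request through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open if the trial failed.
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()
            raise
        # Success: close the circuit and forget past failures.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would typically catch `CircuitOpenError` and run its fallback action instead of waiting on a component that is known to be in trouble.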
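The interplay between timeouts, retries, idempotency and cached results from the list above can be illustrated with the stock example. Everything in this sketch is hypothetical: `StockService` stands in for microservice B, `LossyChannel` simulates responses that get lost after B has already done the work, and `call_with_retries` plays the role of microservice A. The fixed delay between attempts is an assumption made for brevity.

```python
import time


class StockService:
    """Stand-in for microservice B, made idempotent via purchase ids."""

    def __init__(self, stock):
        self.stock = stock
        self.processed = {}  # purchase_id -> result already returned

    def subtract(self, purchase_id, quantity):
        if purchase_id in self.processed:
            # Duplicate request: replay the stored result, do not redo work.
            return self.processed[purchase_id]
        self.stock -= quantity
        self.processed[purchase_id] = self.stock
        return self.stock


class LossyChannel:
    """Simulates a network that loses the first `lost` responses."""

    def __init__(self, service, lost):
        self.service = service
        self.lost = lost

    def subtract(self, purchase_id, quantity):
        result = self.service.subtract(purchase_id, quantity)  # B does the work
        if self.lost > 0:
            self.lost -= 1
            raise TimeoutError("response from B did not arrive in time")
        return result


def call_with_retries(operation, max_attempts=3, delay=0.01, fallback=None):
    """Retry on timeout; run the fallback once the attempts are exhausted."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return operation()
        except TimeoutError as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                time.sleep(delay)
    if fallback is not None:
        return fallback()
    raise last_error


service = StockService(stock=10)
channel = LossyChannel(service, lost=1)
remaining = call_with_retries(lambda: channel.subtract("purchase-42", 3))
```

Here the first response is lost, A retries, and B recognises the purchase identifier, so the stock is reduced only once. Without the identifier check, the same retry would have subtracted the purchase twice.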
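One simple way to picture the bulkhead pattern from the list above is to give each group of components its own bounded pool of concurrent slots, so that saturating one pool cannot consume the capacity of another. The sketch below uses a semaphore per pool; the names `Bulkhead` and `BulkheadFullError` and the pool sizes are assumptions made for illustration.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a pool has no free slot for a new request."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args, **kwargs):
        # Reject immediately instead of queueing when the pool is saturated.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"pool '{self.name}' is saturated")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# Separate pools: critical components are isolated from standard ones.
critical = Bulkhead("critical", max_concurrent=2)
standard = Bulkhead("standard", max_concurrent=1)
```

The trade-off mentioned in the text is visible here: slots reserved for the critical pool stay idle when only standard traffic is high, which is the price paid for isolation.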
As mentioned above, in a distributed system there are so many components interacting with each other that the probability of something going wrong is very high. Hardware can fail, the network can drop, traffic can overload a component, and so on. We have discussed various techniques to make our software resilient and minimise the impact of these failures. But do we have a way to test the resilience of our system? The answer is yes, and it’s called “Chaos Engineering”.
But what is “Chaos Engineering”? It is a discipline of infrastructure experimentation that exposes systemic weaknesses. This empirical process of verification leads to more resilient systems and builds confidence in their ability to withstand turbulent situations. Experimenting with Chaos Engineering can be as simple as manually executing kill -9 (command to immediately terminate a process on unix/linux systems) on a box within a test environment to simulate the failure of a service. Or it can be as sophisticated as designing and running experiments automatically in a production environment against a small but statistically significant fraction of live traffic.
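The simplest experiment mentioned above, a manual kill -9 in a test environment, can be automated in a few lines. The sketch below is only an illustration under stated assumptions: it targets POSIX systems (where `signal.SIGKILL` exists), the "workers" are dummy processes that just sleep, and `heal` is a hypothetical stand-in for whatever supervisor or orchestrator would normally restart failed components.

```python
import random
import signal
import subprocess
import sys

# Dummy worker: a Python process that simply sleeps (assumption for the demo).
WORKER_CMD = [sys.executable, "-c", "import time; time.sleep(60)"]


def start_worker():
    return subprocess.Popen(WORKER_CMD)


def chaos_experiment(workers, rng):
    """Kill one randomly chosen worker, as `kill -9 <pid>` would."""
    victim = rng.choice(workers)
    victim.send_signal(signal.SIGKILL)
    victim.wait(timeout=5)
    return victim.returncode


def heal(workers):
    """Hypothetical supervisor step: replace every worker that has exited."""
    restarted = 0
    for index, proc in enumerate(workers):
        if proc.poll() is not None:
            workers[index] = start_worker()
            restarted += 1
    return restarted


workers = [start_worker() for _ in range(3)]
try:
    exit_code = chaos_experiment(workers, random.Random(0))
    restarted = heal(workers)
    alive = sum(1 for proc in workers if proc.poll() is None)
finally:
    # Clean up the test environment whatever happens.
    for proc in workers:
        proc.kill()
        proc.wait()
```

The hypothesis being verified is that, after one component is killed abruptly, the system returns to its full set of running replicas. A real experiment would also check that requests kept being served in the meantime.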
There are also supporting libraries and frameworks, such as Chaos Monkey, a tool created by Netflix that randomly terminates virtual machines or containers in production environments, in line with the principles of Chaos Engineering.
It is necessary to identify system weaknesses before they manifest themselves in aberrant behaviour that affects the entire system. Systemic weaknesses can take the form of incorrect backup configurations when a service is unavailable; excessive retries due to mismatched timeouts; service outages when a component of the processing chain collapses due to traffic saturation; massive cascading failures triggered by a single component (which is how single points of failure can be detected); etc.
The most traditional approach when building systems was to treat failure as an exceptional event outside the successful execution path, so it was not contemplated in the basic design of the heart of the service. This has changed radically in the cloud-native world: in distributed architectures, failures appear normally and recurrently in some part of the whole, and this must be assumed from the outset and within the design itself. Thus, when we talk about resilience, we refer to the characteristic that allows services to respond to and recover from failures while limiting their effects on the system as a whole as much as possible. Achieving resilient systems not only has an impact on the quality of the service or application, but also makes it possible to gain cost efficiency and, above all, to avoid losing business opportunities due to loss of service.