We recently published the post Cloud Chestnuts in The Cloud, Or What It Means That My Software, in which we tried to explain what the term Cloud Native means and what attributes our applications and systems must have in order to truly be considered Cloud Native.
We discussed how the heavy resource demands of modern applications create a need for scalability, a need that Cloud Computing answers by offering the necessary resources (almost unlimited) on demand, instantly, and billed only for what is used.
However, any distributed application or system carries an inherent complexity, partly derived from the interdependence of the subsystems that make it up (storage, network, compute, databases, business services/microservices, etc.). Inevitably, hardware will fail from time to time, the network will suffer outages, services may crash and become unresponsive, and so on. In this scenario, it is not enough to move applications from on-prem environments to cloud environments to turn them into Cloud Native applications. They have to be able to survive in this kind of environment, recover, and keep functioning without users noticing these problems. In short, they have to be designed to withstand failure and ultimately be more reliable than the infrastructure they run on: they have to be resilient.
In addition to these possible failures (hardware, network, services, etc.), other factors, such as changes in the business, fluctuations in demand for our services, or the need to run them in different environments, force us to act on our applications: to incorporate new functionality, or simply to keep them working correctly and without interruptions, as users expect.
How easy or difficult it is for our applications to change their behaviour, whether to enable or disable some functionality, to deploy or redeploy service nodes without downtime, or to fail over from resources that have stopped working to others that are still available, is what we are going to talk about in this post.
In our previous article we said that one of the attributes of a Cloud Native application, according to the CNCF, is being "manageable". We talk about manageable, but the term malleable is probably more accurate.
When we say that a material is malleable, we mean that we can change its shape without breaking it, and without casting, melting, or any other industrial or chemical process we can think of. Looking for an analogy in the software world, we could say that an application is malleable when we can modify its behaviour from the outside, without touching its code and without having to stop it; that is to say, "without breaking it", without any user noticing that something has stopped working, even momentarily.

It is important to highlight the difference between an application being malleable (or "manageable") and being maintainable. The latter refers to the ease with which we can change or evolve the application's behaviour from the inside, that is, by making changes to the code, something equally important, but not the subject of this article.
To better understand what we mean, let’s imagine that we have a running application or system that is providing a certain service to N clients, which we need to modify for some reason, for example:
- There is growing/decreasing customer demand, and we need to increase/decrease the number of certain system resources. Note that we are not talking about how to solve scalability here (let's assume our service has been designed stateless and is ready to scale horizontally without problems).
- We have developed a new version of our application, we have tested it in our test environments and we want to run it in a production environment where the code is exactly the same, but the resources such as network, databases, storage, etc. are different.
- We have detected a problem in the configuration of a component that causes the service to behave incorrectly and we need to modify that configuration.
As mentioned previously, the key to achieving this does not lie in moving our applications to a cloud environment. The key lies in following a series of practices when designing their architecture, so that we can not only modify their behaviour but do so in a simple and agile way, while users perceive a system that works correctly at all times.
Well, if this is about making changes to a system that is serving users in production, first of all we need to know when those changes are necessary. In the post Observability, what is it, what does it offer, our colleague Dani Pous introduced the importance of our applications being observable: thanks to observability we know what is happening at all times and can make decisions based on the information gathered through our metrics, logs, and traces.
If we want our application to be malleable, it is essential to know when to make those decisions. We therefore need to invest time in designing the alarms that trigger the automatic mechanisms that change the system's behaviour (for example, detecting a DB cluster that is not responding so we can automatically fail over to another one), as well as the dashboards that give us the information needed to make a manual configuration change and update our application without restarting it (for example, increasing a timeout in a configuration file to avoid rejecting client requests).
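As a minimal sketch of that last idea, a service could re-read a timeout from a small config file on each use, so an operator can raise the value without a restart. The file name, environment variable, and key below are hypothetical, not from the article:

```python
import json
import os

# Hypothetical path; could be overridden per deployment via an env var.
CONFIG_PATH = os.environ.get("APP_CONFIG", "service.json")

def current_timeout(default=5.0):
    """Re-read the timeout on demand, so changing the config file
    takes effect without restarting the service."""
    try:
        with open(CONFIG_PATH) as f:
            return float(json.load(f).get("request_timeout_seconds", default))
    except (OSError, ValueError):
        return default  # fall back if the file is missing or malformed
```

A real service would cache the value briefly or watch the file for changes rather than re-open it on every request, but the principle is the same: the behaviour changes from the outside, with no redeploy.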
Secondly, our application needs some mechanism that lets us change its behaviour externally. We must identify which parts of the application have to be parameterisable (DB connection strings, URLs for invoking web services, threads, memory, or active CPUs for performance tuning, etc.). This configuration or parameterisation is precisely what tends to change between environments (development, integration, production, etc.).
Most readers will have heard of The Twelve-Factor App. For those who do not know it, it is a methodology created some time ago by several Heroku developers, establishing twelve principles that help build cloud applications with benefits such as portability, parity between development and production environments, and greater ease of scaling, among others.
One of these twelve principles concerns application configuration: the application's code stays the same across the environments in which it runs, but the configuration varies, so it is important to keep the configuration separate from the code. It is also important to version the configuration in a version control system, to make it easy to restore a specific configuration if necessary.
Environment Variables
Environment variables have the advantage of being easy to implement and to change between deployments without touching the code, and they are supported by every language and operating system. However, they are not all advantages: environment variables define a global state shared with many other variables, so we must define them carefully so they do not step on each other, and they cannot represent configuration more complex than a text string. Even so, for configuration at the environment level (Development, Staging, Production, etc.) they are a very suitable solution.
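In practice this usually looks like reading variables at startup with sensible defaults. The variable names below are illustrative; note how a numeric setting arrives as a text string and must be parsed explicitly, one of the limitations just mentioned:

```python
import os

# Hypothetical variable names; each environment sets its own values.
db_url = os.environ.get("DATABASE_URL", "postgres://localhost:5432/dev")
log_level = os.environ.get("LOG_LEVEL", "INFO")

# Environment variables are always strings, so non-string settings
# need explicit conversion (and will fail loudly if malformed).
pool_size = int(os.environ.get("DB_POOL_SIZE", "10"))
```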
Command Line Arguments
Another option for configuring simple applications, one that requires no files at all, is command line arguments. Configuration provided on the command line when starting an application is well suited to interaction with scripts. However, as the configuration options grow, command line arguments stop being a manageable option: they become overcomplicated and their format is inflexible.
Configuration Files
Configuration files, on the other hand, also offer many advantages, especially for really complex applications, among other things because they let us represent richer structures that group related parts of our application's logic. However, with configuration files it can be difficult to keep the configuration of all the nodes in a cluster consistent at all times, since we have to distribute that configuration to each node. This problem can be mitigated by incorporating a solution such as etcd or Consul, which offer a distributed key-value store.
Last but not least, we need an automated deployment system that allows us, among other things, to:
- Update all the necessary nodes of a system to the new configuration. The days when one person on the operations team would update the configuration of each node serving a system or component are long gone. Today there are services that support millions of users and run thousands of active nodes. Can anyone imagine updating thousands of nodes other than automatically?
- Manage the scaling/descaling of the components of a system/application in a progressive way without the need to stop the service. This includes tasks such as infrastructure deployment, software deployment, balancer configuration, etc.
Fortunately, the widespread use of containers and of orchestrators such as Kubernetes in Cloud Native applications greatly reduces the problem of configuration distribution, as these platforms offer specialised mechanisms for it, such as Kubernetes' "ConfigMap", which can manage environment variables and command-line parameters as well as configuration files.
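From the application's point of view, a ConfigMap simply surfaces as environment variables or as files mounted into the container, depending on the Pod spec. A sketch of code that works either way (the mount path and setting names are assumptions, since they depend on how the Pod is declared):

```python
import os

def read_setting(name, mount_dir="/etc/config", default=None):
    """Read a setting whether it was injected as an environment
    variable or mounted as a file (e.g. by a Kubernetes ConfigMap).
    The mount_dir is an assumption: it depends on the Pod spec."""
    if name in os.environ:
        return os.environ[name]
    try:
        with open(os.path.join(mount_dir, name)) as f:
            return f.read().strip()
    except OSError:
        return default
```

The application stays decoupled from the orchestrator: the same code runs unchanged on a laptop (env vars) and in a cluster (mounted ConfigMap).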
Kubernetes also facilitates the deployment of new versions through what is known as "Rolling Updates". This technique progressively updates the different instances of our application, hosting the new versions on nodes with available resources while removing instances of the previous version, thus achieving a deployment with the coveted "Zero Downtime".
In all cases we should work with the concept of immutability: both the container images deployed in our application and the configuration objects are immutable. Once the application is deployed, any change requires replacing the container with a new image, or replacing the configuration file or object (for example, a Kubernetes ConfigMap) with a new version.
Cloud Native applications use architectures based on microservices, which makes it easier to develop and evolve applications (independent teams, functional decoupling, technology independence, etc.).
The use of containers to deploy microservices (e.g., Docker) and the increasingly widespread container orchestrators (e.g., K8s) make it easier to scale applications up and down and to manage thousands of nodes within an application or service.
However, all these facilities are not without problems: the large number of nodes that may be serving a Cloud Native application multiplies the number of possible failures, so we must design our systems with the mindset that they will fail.
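One of the simplest expressions of that mindset is retrying calls to dependencies with exponential backoff, so transient failures (a node restarting, a brief network outage) never surface to the user. A minimal sketch; the function names and parameters are illustrative:

```python
import time

def call_with_retries(operation, attempts=3, base_delay=0.1):
    """Retry a failing dependency call with exponential backoff,
    a common way to survive the transient failures described above."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # give up only after the last attempt
            time.sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...
```

Production code would narrow the caught exception types and cap the delay, and many teams reach for a library or a service mesh instead, but the design intent is the same: the failure is expected and handled, not exceptional.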
Additionally, we need to be able to distribute new versions (both code and configuration) across a huge number of instances without users perceiving a loss of service. The sheer number of machines, services, etc. managed within our applications makes it unfeasible for these changes to be manual and also requires us to work with the concept of immutability to ensure that each change is associated with a version that can be restored at any time.