Continuing our series of posts about Cloud Native and the five attributes every cloud application must meet to be considered native, we are going to take a deeper look at one of them: observability. What it consists of, how it differs from ‘classic’ monitoring, and what benefits it brings are the points we will cover in this post. Let’s start.
The term “observability” comes from Rudolf Kalman’s control theory and refers to the ability to infer the internal state of a system from its external outputs. Applied to software systems, it is the ability to understand the internal state of an application from its telemetry. Not all systems expose enough information to be ‘observed’, so we classify as observable those that do. Being observable is one of the fundamental attributes of cloud-native systems.
Telemetry information can be classified into three main categories:
- Logs: probably the most common and widespread mechanism by which the processes or services of a software system emit information about internal events. Historically they are the most detailed source of what happened, and they follow a temporal order. They are key to debugging and understanding what happened inside a system, although some argue that traces could overtake them in this leading role. They are easy to collect, but very voluminous and consequently expensive to retain. Logs can be structured or unstructured (free text); common structured formats include JSON and logfmt, and there are also proposals for semantic standardisation such as OpenTelemetry and the Elastic Common Schema.
- Metrics: quantitative (numerical) information about processes or machines over time. For example, the percentage of CPU, disk, or memory usage of a machine sampled every 30 seconds, or a counter of the total number of errors returned by an API, labelled with the HTTP status returned and the name of the Kubernetes container that processed the request. These time series are identified by a set of labels with values, which also serve as an entry point for exploring telemetry information. Metrics are simple to collect, inexpensive to store, dimensional (which allows quick analysis), and an excellent way to measure overall system health. In a later post we will also see that the values of a metric can carry attached data known as exemplars, also in key/value form, which serve, among other things, to easily correlate a value with other sources of information. For instance, in the API error counter above, an attached exemplar could let us jump directly from the metric to the traces of the request that caused the error, greatly simplifying operation of the system.
- Traces: detailed data about the path executed inside a system in response to an external stimulus (such as an HTTP request, a message in a queue, or a scheduled job). This information is very valuable because it shows the latency from one end of the executed path to the other, and for each individual call made along the way, even in a distributed architecture where the execution may span multiple components or processes. The key to this power lies in context propagation between collaborating components: in a distributed microservices system, for example, components may use HTTP headers to propagate the state needed to stitch the data together from one end to the other. In short, traces let us understand execution paths, find and optimise bottlenecks efficiently, and identify errors, making them easier to understand and fix.
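As a concrete illustration of the structured log formats mentioned above, the same event can be rendered as JSON or as logfmt. This is a minimal sketch, not tied to any particular logging library, and the event fields are invented:

```python
import json

def to_json_log(event: dict) -> str:
    """Render an event as a structured JSON log line."""
    return json.dumps(event, sort_keys=True)

def to_logfmt(event: dict) -> str:
    """Render the same event in logfmt (key=value pairs)."""
    parts = []
    for key, value in sorted(event.items()):
        text = str(value)
        if " " in text:
            text = f'"{text}"'  # quote values containing spaces
        parts.append(f"{key}={text}")
    return " ".join(parts)

# A hypothetical error event emitted by an API service.
event = {
    "ts": "2024-01-01T12:00:00Z",
    "level": "error",
    "msg": "request failed",
    "http_status": 502,
}
print(to_json_log(event))
print(to_logfmt(event))
```

Either form can be parsed back mechanically, which is what makes structured logs so much easier to query and correlate than free text.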
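The labelled counter with exemplars described above can be sketched as a toy in-memory model; a real system would use a metrics client such as the Prometheus libraries, and the label and trace id values here are invented:

```python
from collections import defaultdict

class Counter:
    """A minimal labelled counter; each distinct label set is its own series."""
    def __init__(self, name):
        self.name = name
        self.series = defaultdict(int)   # label tuple -> count
        self.exemplars = {}              # label tuple -> last attached exemplar

    def inc(self, labels, exemplar=None):
        key = tuple(sorted(labels.items()))
        self.series[key] += 1
        if exemplar:
            # e.g. {"trace_id": ...} lets us jump from the metric to a trace
            self.exemplars[key] = exemplar

errors = Counter("api_errors_total")
errors.inc({"http_status": "502", "container": "checkout"},
           exemplar={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})
```

The label tuple acts as the series identity, which is what makes metrics “dimensional”: any subset of labels can be used to slice or aggregate the data.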
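The HTTP-header context propagation that traces rely on can be illustrated with the W3C Trace Context `traceparent` format. This is a minimal sketch, omitting the validation a real propagator would perform:

```python
import secrets

def make_traceparent():
    """Build a traceparent header: version-traceid-spanid-flags."""
    trace_id = secrets.token_hex(16)  # 16 random bytes, hex-encoded
    span_id = secrets.token_hex(8)    # 8 random bytes, hex-encoded
    return f"00-{trace_id}-{span_id}-01"

def propagate(incoming: str) -> str:
    """A downstream call keeps the trace id but mints a new span id."""
    version, trace_id, _parent_span, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

header = make_traceparent()
child = propagate(header)
# Both headers share the same trace id, linking the spans end to end.
assert header.split("-")[1] == child.split("-")[1]
```

Because every hop carries the same trace id while minting its own span id, the backend can later reassemble the full call tree, even across process boundaries.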
Historically, these three verticals of information have been referred to as the “three pillars” of observability, and making them work together is essential to maximise the benefit obtained: for example, an alert on a metric can report a malfunction, and the metric’s associated exemplars let us identify the subset of traces tied to the occurrence of the underlying problem.
Finally, we can select the logs related to those traces, gaining all the available context needed to efficiently identify and correct the root cause of the problem. Once the incident has been resolved, we can enrich our observability with new metrics, dashboards, or alerts to anticipate similar problems more proactively in the future.
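The last step of that workflow, narrowing structured logs down to the traces already identified, amounts to a simple filter on a shared trace id. A minimal sketch, with invented log entries:

```python
# Structured log entries that each carry the trace id of the request
# that produced them (values are invented for illustration).
logs = [
    {"ts": 1, "trace_id": "abc123", "msg": "upstream timeout"},
    {"ts": 2, "trace_id": "def456", "msg": "request ok"},
    {"ts": 3, "trace_id": "abc123", "msg": "returning 502"},
]

def logs_for_trace(entries, trace_id):
    """Filter structured logs down to a single trace's context."""
    return [entry for entry in entries if entry["trace_id"] == trace_id]

related = logs_for_trace(logs, "abc123")
assert [e["msg"] for e in related] == ["upstream timeout", "returning 502"]
```

This only works because the logs are structured and stamped with the trace id at emission time; with free-text logs the same correlation would require fragile text matching.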
Why is monitoring not enough, and what does observability offer?
Monitoring lets us detect that something is not working properly, but it does not give us the reasons. Moreover, it can only cover situations foreseen in advance (known knowns). Observability, on the other hand, is based on integrating and relating multiple sources of telemetry data, which together help us understand how the observed software system works, not merely identify problems. The most critical aspect, however, is what is done with the data once it is collected: why rely on pre-defined thresholds when we can automatically detect unusual ‘change points’? It is this kind of ‘intelligence’ that enables the discovery of unknown unknowns.
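The idea of detecting change points instead of fixing thresholds can be illustrated with a toy detector that flags the first sample departing sharply from the recent mean. This is a stand-in for real methods such as CUSUM, and the latency values are invented:

```python
def detect_change_point(series, window=3, factor=3.0):
    """Return the first index whose value departs sharply from the
    mean of the preceding window, or None if no shift is found."""
    for i in range(window, len(series)):
        prev = series[i - window:i]
        mean_prev = sum(prev) / window
        spread = max(prev) - min(prev) or 1  # avoid zero spread
        if abs(series[i] - mean_prev) > factor * spread:
            return i
    return None

# Latency jumps from ~100 ms to ~180 ms at index 5; no fixed
# threshold was configured, yet the shift is flagged automatically.
latency_ms = [100, 102, 99, 101, 100, 180, 182, 179, 181]
print(detect_change_point(latency_ms))  # → 5
```

A fixed threshold at, say, 500 ms would never have fired here; detecting the relative shift surfaces the regression immediately.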
Building real-time topology maps is another capability offered by observability: it lets us establish automatic relationships across all the telemetry gathered, going much further than simple correlation by time. A high-impact example of what these topologies can enable is automatic, real-time incident resolution without human intervention.
Observability also makes performance a first-class activity in software development, by giving us profiling information (step-by-step detail of an execution) on a continuous basis (something that, without the right mechanisms, takes a great deal of effort in distributed systems) and the ability to detect bottlenecks in real time. In addition, simply helping us understand in depth what happens inside a system over time lets us maximise the benefit of load testing (and of any kind of e2e test in general) and opens the door to chaos engineering techniques. Last but not least, it reduces the mean time to resolution (MTTR) of incidents by cutting the time spent on diagnosis, allowing us to focus on fixing the problem.
We can conclude that when a system embraces a mature observability solution, the benefits for the business become tangible: it enables more efficient innovation, and the reduction in implementation times translates into greater team efficiency and consequent cost savings.
For all these reasons, you can see that observability is not a purely operational concern but a cross-cutting responsibility of the whole team, and it is considered a basic practice within the recommendations of the most modern and advanced software engineering.
The key to understanding the problems of distributed systems, problems that appear repeatedly but with great variability, is being able to debug them with evidence rather than conjecture or hypotheses. We must internalise that ‘errors’ are part of the new normal that accompanies complex distributed systems. The degree to which a system is observable is the degree to which it can be debugged, so observability gives a distributed system what a debugger gives a single running process. Finally, it is worth noting that an observable system can be optimised, both at a technical and a business level, much more easily than one that is not.