You can read a lot about observability on various platforms, as it has become increasingly important to assess a system’s current state by analyzing the data it produces. However, the topic can initially seem very complex, and it’s easy to feel overwhelmed by the names of products, technologies, and concepts. In my previous blog post, we looked into some key foundations of observability, including the three pillars, the responsibilities of the platform and delivery team, and the OpenTelemetry framework.
In this post, we build on that knowledge and put it into practice. The goal is to bridge the gap between theory and practice by demonstrating a minimal setup of an observability solution, helping to establish a better understanding of the subject.
Before we dive into the orchestration code and start deploying observability tools, it’s crucial to clearly understand where each component of our observability solution will be placed and why. The setup is divided into three parts: the application landscape, the OpenTelemetry Collector, and the LGTM (Loki, Grafana, Tempo, Mimir) stack.
The application landscape represents a typical service architecture commonly found in many organizations. Depending on the organization, this mesh of intercommunicating services can range from a few services to several hundred. For simplicity, our scenario features a limited number of services to maintain a clear overview of the overall structure.
In our case, we have a cron job that triggers an event in the service called “Emitter” every few minutes. Upon being triggered, the “Emitter” starts sending data to the “Receiver” service.
This setup ensures a continuous stream of data production, eliminating the need to manually trigger events to test the effectiveness of our observability solution. The cron job automates this process, allowing us to focus on configuring and verifying our observability tools.
In the application landscape, our services are operational, but their telemetry data is not yet collected. We need to establish a procedure to gather metrics, logs, and traces from each service and integrate them into our observability solution. For this, we use instrumentation. In the OpenTelemetry ecosystem, there are two approaches to instrumenting our services: code-based instrumentation, where the OpenTelemetry API and SDK are used directly in the application code, and zero-code instrumentation, which captures telemetry from the application without requiring code changes.
For our implementation, we opt for the zero-code solution for simplicity. With this approach, a JAR agent (a Java archive that instruments the application automatically) is attached to our services and captures telemetry data from many popular libraries and frameworks. In addition to the JAR agent, we also need a sidecar on our deployments, which transmits the telemetry data from our services to a central point: the OpenTelemetry Collector.
The OpenTelemetry Collector provides a vendor-neutral framework for receiving, processing, and exporting telemetry data. It can receive data in various formats (such as OTLP, Jaeger, and Prometheus, as well as many commercial/proprietary formats) and transmit it to one or more backends.
For our use case, the OpenTelemetry Collector will receive telemetry data from the sidecars running alongside our services and then push this data to the appropriate tools in our LGTM stack for further processing, visualization and storage.
Now that our telemetry data reaches a centralized point that distributes it to the correct backend components, we should take a deeper look at the LGTM stack: Loki for logs, Grafana for dashboards and visualization, Tempo for traces, and Mimir for metrics. To give you a short recap of what each component in this stack does, we summarize the key ideas behind them below. For more detailed information, please refer to the documentation at https://grafana.com/docs/.
Loki
Grafana Loki is an open-source project that forms a comprehensive logging stack. It simplifies operations and significantly lowers costs through a small index and highly compressed chunks. Unlike other logging systems, Loki indexes only metadata (labels) rather than the log contents, making it highly cost-effective and scalable.
Mimir
Grafana Mimir is an open-source project offering horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus and OpenTelemetry metrics. It allows users to ingest metrics, run queries, create new data with recording rules, and set up alerting rules across multiple tenants, leveraging tenant federation.
Tempo
Grafana Tempo is an open-source, high-scale distributed tracing backend that is easy to use. It enables you to search for traces, generate metrics from spans, and link tracing data with logs and metrics. Tempo is cost-efficient, requiring only object storage to operate.
Grafana
Grafana is an open-source project that enables you to query, visualize, alert on, and explore your metrics, logs, and traces from various sources. Data source plugins support querying time-series databases like Prometheus and CloudWatch, logging tools like Loki and Elasticsearch, tracing backends like Tempo, SQL and NoSQL databases such as Postgres, and many more.
From a high-level perspective, the structure and division between these components are straightforward and intuitive. The OpenTelemetry Collector routes the data streams to the appropriate backend component based on the signal type of the telemetry data. Each component stores its data efficiently and is built for high availability. However, simply having data stored in databases isn’t sufficient for effectively analyzing issues or bottlenecks in our application landscape. This is where Grafana comes into play. Grafana serves as the unified UI for the entire stack, allowing us to query, visualize, alert on, and explore metrics, logs, and traces from various data sources through live dashboards with insightful visualizations.
A short disclaimer before we start the hands-on section: we will not go into detail about every line of orchestration code and assume basic knowledge of Kubernetes. The focus is on illustrating how the components are integrated and how they relate to each other. References and links to the repository are provided throughout, where additional information can be found. We also do not recommend using this exact configuration in a production environment, as no failover concept, backup plan, or security measures are implemented or considered. These aspects were left out to keep the complexity manageable.
For deploying our services and components, we use Argo CD to encapsulate them properly. Throughout the rest of the text, you will find references to Argo CD in deployments and configuration. However, Argo CD is not required to achieve this setup. If you prefer to set this up without Argo CD, you will need to make some adjustments, especially when deploying Helm charts.
In this hands-on guide, we’ll leverage Kubernetes operators and controllers to provide the necessary functionality to get this setup working. The operators and controllers that need to be installed are as follows:
+--------------------------+----------+-----------------------------------------------------------------------------------------------------------------+
| Operator & Controller | Version | Link |
+--------------------------+----------+-----------------------------------------------------------------------------------------------------------------+
| Argo CD | v2.13.1 | https://raw.githubusercontent.com/argoproj/argo-cd/v2.13.1/manifests/install.yaml |
| cert-manager | v1.16.2 | https://github.com/cert-manager/cert-manager/releases/download/v1.16.2/cert-manager.yaml |
| OpenTelemetry | v0.114.0 | https://github.com/open-telemetry/opentelemetry-operator/releases/download/v0.114.0/opentelemetry-operator.yaml |
| Grafana (cluster scoped) | v5.15.1 | https://github.com/grafana/grafana-operator/releases/download/v5.15.1/kustomize-cluster_scoped.yaml |
+--------------------------+----------+-----------------------------------------------------------------------------------------------------------------+
These were the latest versions when this blog post was written. When you recreate this setup, you may consider choosing newer versions; however, this might lead to incompatibilities or break functionality related to the configuration of the components.
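If you are not managing these cluster-wide components through a GitOps workflow, a straightforward way to install them is to apply the published manifests directly. The commands below are only a sketch based on the links in the table above; note that Argo CD expects its namespace to exist beforehand, the OpenTelemetry operator requires cert-manager to be running first, and very large CRDs may require a server-side apply.

kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/v2.13.1/manifests/install.yaml
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.16.2/cert-manager.yaml
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/download/v0.114.0/opentelemetry-operator.yaml
kubectl apply --server-side -f https://github.com/grafana/grafana-operator/releases/download/v5.15.1/kustomize-cluster_scoped.yaml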
It’s finally time to dive into some orchestration code. To build our observability solution, we begin with the workloads that generate telemetry data. By starting here, we can later integrate all the necessary components for exposing and visualizing this data in a step-by-step manner.
Our application landscape is relatively simple. We have two deployments, each running a pod with a service containing a Spring Boot application. These applications have REST endpoints with implemented business logic, generating logs, traces, and metrics. Additionally, they have REST clients configured, enabling us to connect the two services. Consequently, when one application receives a REST call, it produces logs, metrics, and traces, and then makes a REST call to the other pod, which, in turn, generates its own set of telemetry data.
To avoid the need to manually trigger these REST endpoints, a cron job is set up to run every 5 minutes. For more details on the deployments, please refer to this commit, which contains the orchestration for the entire application landscape.
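To give an idea of what this looks like, here is a minimal sketch of such a CronJob; the image, service name, and endpoint path are assumptions for illustration, and the actual definition can be found in the linked commit.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: emitter-trigger                # hypothetical name
  namespace: playground
spec:
  schedule: "*/5 * * * *"              # every 5 minutes
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: trigger
              image: curlimages/curl:8.5.0               # assumed image
              args: ["-s", "http://emitter:8080/emit"]   # assumed Emitter endpoint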
Now we come to the point where we want to ship our telemetry data to a centralized location: the OpenTelemetry Collector. To achieve this, we first need to deploy two components: the OpenTelemetry Collector and the Instrumentation.
The OpenTelemetry Collector is responsible for receiving, processing, and exporting our data to the respective monitoring stack components. The Instrumentation, on the other hand, defines the injected sidecar that will be added to our applications in the application landscape.
Configuring the OpenTelemetry Collector
We will start with a simple configuration for the OpenTelemetry Collector. We need to define “receivers,” “exporters,” and “service” to get the application running. For the receiver, we specify that the OpenTelemetry Collector provides an endpoint on port 4318 to receive telemetry data. In the service section, we simply connect the log pipeline from the receiver to the exporter. Currently, our exporter is set to “debug,” which writes the logs to the console. Later on, we will properly configure this exporter to ship our logs to Loki. For now, writing the logs to the console on the OpenTelemetry Collector is sufficient to confirm that the logs are exporting correctly from the application landscape to the OpenTelemetry Collector. Generally, this debug approach can also be used for traces and metrics, and it is very helpful during the initial setup.
The OpenTelemetry Collector configuration is very flexible and offers functionalities such as filtering, batching, and redacting sensitive information. For more information, read here. To better understand how the pipelines work together, we recommend reading this.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: opentelemetry
  namespace: playground
spec:
  config:
    receivers:
      otlp:
        protocols:
          http:
            endpoint: 0.0.0.0:4318
    exporters:
      debug:
        verbosity: detailed
    service:
      pipelines:
        logs:
          exporters:
            - debug
          receivers:
            - otlp
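As an example of the flexibility mentioned above, the sketch below extends this configuration with a batch processor and an attributes processor that drops a hypothetical sensitive attribute. Both processors ship with the contrib distribution of the collector; the attribute key is purely an assumption for illustration.

    processors:
      batch: {}                    # groups telemetry into batches before exporting
      attributes/redact:
        actions:
          - key: user.email        # hypothetical sensitive attribute
            action: delete
    service:
      pipelines:
        logs:
          receivers:
            - otlp
          processors:
            - attributes/redact
            - batch
          exporters:
            - debug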
Configuring Automatic Instrumentation
To manage automatic instrumentation, the Operator needs to be configured to identify which pods to instrument and which automatic instrumentation to use for those pods. The instrumentation acts as a blueprint, enabling the proper configuration of sidecars. In our case, we configured it like this:
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: opentelemetry-instrumentation
  namespace: playground
spec:
  exporter:
    endpoint: http://opentelemetry-collector:4318
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: "1"
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:2.10.0
    env:
      - name: OTEL_INSTRUMENTATION_MICROMETER_ENABLED
        value: "true"
Enabling Automatic Instrumentation
The last step is to apply the Kubernetes annotation to our deployments. For the automatic instrumentation and auto-injection of the sidecar to work, you may need to restart the deployments. Do this only after the Instrumentation, the OpenTelemetry Collector, and the annotations have been deployed.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: receiver-deployment
  namespace: playground
spec:
  ...
  annotations:
    instrumentation.opentelemetry.io/inject-java: "true"
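If the sidecar does not show up after the annotation has been applied, restarting the deployments forces the operator’s admission webhook to inject it into the newly created pods. A sketch, using the receiver-deployment from above and an assumed emitter-deployment:

kubectl rollout restart deployment/receiver-deployment -n playground
kubectl rollout restart deployment/emitter-deployment -n playground   # name assumed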
Validating the Configuration
Once you see the sidecar being deployed, you just need to wait until the cron job generates some logs. The logs that appear in the containers of the application landscape should now also be accessible in the OpenTelemetry Collector.
If that is not the case, then you should check the logs in the application landscape container. The sidecars will produce helpful logs that can indicate issues, such as a non-working or incorrectly configured connection to the OpenTelemetry Collector.
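To look at the collector side of the connection, the collector’s own logs are the quickest check, since the debug exporter writes every received log record to its console. The operator derives resource names from the OpenTelemetryCollector resource, so in our setup the deployment should be called opentelemetry-collector; adjust the name if yours differs.

kubectl logs deployment/opentelemetry-collector -n playground --tail=100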
This entire section is also covered in this commit.
Deploying Loki
We have successfully configured our logs to be exported to the OpenTelemetry Collector. Now, we need to determine how to export the logs to Loki. To achieve this, we will start by deploying Loki through a Helm chart in our namespace using the Argo CD application. If you are not familiar with the Argo CD application, we recommend reading the documentation.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: "loki"
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  destination:
    namespace: playground
    server: https://kubernetes.default.svc
  project: default
  source:
    chart: loki
    repoURL: https://grafana.github.io/helm-charts
    targetRevision: 6.16.0
    helm:
      values: |
        deploymentMode: SingleBinary
        loki:
          auth_enabled: false
          commonConfig:
            replication_factor: 1
          storage:
            type: 'filesystem'
          useTestSchema: true
        singleBinary:
          replicas: 1
        read:
          replicas: 0
        backend:
          replicas: 0
        write:
          replicas: 0
        gateway:
          enabled: false
For the configuration of the Helm chart, we aim to keep the Loki instance as simple as possible. Since this is only for demonstration purposes, we minimize the number of replicas. We also set the deployment mode to SingleBinary, which does not provide high availability and is only recommended for ingest volumes of up to a few tens of gigabytes of logs per day. Additionally, we use the storage type “filesystem” to avoid dependencies on specific vendors or storage restrictions. This approach, of course, is not suitable for a production environment.
Configuring OpenTelemetry Collector for Loki
After deploying Loki, we can return to the OpenTelemetry Collector and configure the exporter correctly so that the logs are now exported to Loki.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: opentelemetry
  namespace: playground
spec:
  config:
    receivers:
      otlp:
        protocols:
          http:
            endpoint: 0.0.0.0:4318
    exporters:
      otlphttp/loki:
        endpoint: http://loki-headless:3100/otlp
        tls:
          insecure: true
    service:
      pipelines:
        logs:
          exporters:
            - otlphttp/loki
          receivers:
            - otlp
We need to configure the exporter with the correct Loki endpoint and ensure that the receiver and exporter are properly connected in the pipelines. Once that’s done, we can apply the updated OpenTelemetry configuration.
Since both the OpenTelemetry Collector and Loki log very little about the export of logs, we need to check manually whether our wiring is correct, as there is no other convenient way to verify the functionality. To do this, we inspect the Write-Ahead Log (WAL) on Loki. On the Loki pod, we execute the following command:
grep -E 'EmitterApplication' /var/loki/wal/*
If we receive some result containing our logging statement, for example “Received data: Traffic package 3”, then the wiring was successful.
Deploying and Configuring Grafana for Loki
The logs are now being stored in Loki, but we need a way to access them visually and query for certain keywords. To achieve this, we will deploy Grafana. Since we have the Grafana Operator installed, we can deploy a Grafana custom resource, which reduces overhead and simplifies the deployment.
apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: grafana
  labels:
    dashboards: "grafana"
spec:
  config:
    log:
      mode: "console"
    security:
      admin_user: root
      admin_password: secret
To configure Grafana, we define the credentials needed for accessing the Grafana UI.
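To open the UI in this playground setup, a simple port-forward is enough. We assume here that the operator exposes the instance through a service named grafana-service (derived from the resource name) in the playground namespace; afterwards, you can log in at http://localhost:3000 with the credentials defined above.

kubectl port-forward svc/grafana-service 3000:3000 -n playground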
To complete the logging setup, we need to connect Grafana to Loki using a Grafana data source. For that, we deploy the following data source:
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: grafana-datasource-loki
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  datasource:
    name: datasource-loki
    type: loki
    url: http://loki-headless:3100
    access: proxy
    isDefault: false
The configuration of the data source is straightforward. We simply insert the endpoint of our Loki instance along with some other basic required information. After applying the changes, we should see a new data source in Grafana.
With this data source, we can filter the logs based on our service name, for example with a query like the one below.
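This is a hedged LogQL example, assuming Loki’s default OTLP label mapping (the resource attribute service.name is stored as the label service_name) and assuming the Emitter registers itself as “emitter”; adjust the selector to whatever service name your instrumentation reports.

{service_name="emitter"} |= "Received data"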
Great! We have successfully exported our logs from our application landscape to the OpenTelemetry Collector, which in turn exported the logs to Loki. Grafana then uses Loki as a data source to visualize the logs. The same principle of wiring and connecting will be used for traces and metrics.
Since it is sometimes easier to follow the wiring of components directly in a codebase, we have covered this section in this commit.
Deploying Mimir
Similar to our approach with Loki, we will apply the same method for metrics. Before configuring our OpenTelemetry Collector, we will start by deploying Mimir.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: "mimir"
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  destination:
    namespace: playground
    server: https://kubernetes.default.svc
  project: default
  source:
    chart: mimir
    repoURL: https://skyloud.github.io/helm-charts
    targetRevision: 5.5.0
    helm:
      values: |
        deploymentMode: monolithic
        monolithic:
          replicas: 2
          zoneAwareReplication:
            enabled: false
          resources:
            requests:
              memory: 128Mi
        alertmanager:
          enabled: false
Here as well, we use a Helm chart to deploy the component. In this case, we use the Helm chart from Skyloud instead of the one from Grafana, as it offers different deployment modes. We choose the “monolithic” deployment mode. This mode, along with the other configuration settings, allows for a very minimal deployment of Mimir, which is sufficient for our needs.
Configuring OpenTelemetry Collector for Mimir
Once Mimir is deployed, we can start by adding the necessary configuration to our OpenTelemetry Collector to export the metrics received from our application landscape to Mimir.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: opentelemetry
  namespace: playground
spec:
  config:
    receivers:
      otlp:
        protocols:
          http:
            endpoint: 0.0.0.0:4318
    exporters:
      otlphttp/loki:
        endpoint: http://loki-headless:3100/otlp
        tls:
          insecure: true
      prometheusremotewrite/mimir:
        endpoint: http://mimir-nginx:80/api/v1/push
        tls:
          insecure: true
    service:
      pipelines:
        logs:
          exporters:
            - otlphttp/loki
          receivers:
            - otlp
        metrics:
          exporters:
            - prometheusremotewrite/mimir
          receivers:
            - otlp
Following the procedure we used for logs, we extend the OpenTelemetry configuration with an additional exporter and a metrics pipeline that reuses the existing otlp receiver. In this case, we set the exporter endpoint to “http://mimir-nginx:80/api/v1/push”.
To verify that the metrics are being shipped to Mimir, we can check the logs of the mimir-nginx gateway. If the logs contain statements like the following, we know that the wiring of the OpenTelemetry Collector is correct.
200 "POST /api/v1/push HTTP/1.1" 0 "-" "opentelemetry-collector/0.114.0" "-"
Configuring Grafana for Mimir
Since we have already deployed Grafana in the Loki section, we do not need to do that again here. However, we do need to deploy the data source that connects Grafana to Mimir. Specifically, we will deploy the following configuration:
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: grafana-datasource-mimir
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  datasource:
    name: datasource-mimir
    type: prometheus
    url: http://mimir-nginx:80/prometheus
    access: proxy
    isDefault: false
Here again, we need to set the correct URL for Mimir’s endpoint. Mimir is highly compatible with Prometheus, ingesting and storing its metrics, and understanding PromQL. Therefore, Mimir is defined as a Prometheus data source in our configuration.
After deploying this data source, we can refresh Grafana and check if the metrics are available. To verify the functionality, we imported a public dashboard for Spring Boot and configured it to use Mimir as the data source. Here is an example:
This is only a subset of all the available metrics. There are also metrics related to the JVM, logs, and database activity. The zero-code auto instrumentation does a great job of providing us with a solid base of necessary metrics to start with.
We have also provided a concise commit so you can see the entire configuration at a code level.
Deploying Tempo
Now we come to the last telemetry signal in our observability solution: traces. For traces, we will use Tempo. Let’s start by deploying Tempo through a Helm chart in our namespace.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: "tempo"
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  destination:
    namespace: playground
    server: https://kubernetes.default.svc
  project: default
  source:
    chart: tempo
    repoURL: https://grafana.github.io/helm-charts
    targetRevision: 1.14.0
The Helm chart already provides a minimal single binary mode, so we do not need to add any additional configuration.
Configuring OpenTelemetry Collector for Tempo
For the final adjustment, we need to update the OpenTelemetry configuration by adding the exporter and pipeline for Tempo. With this addition, we will finalize the OpenTelemetry configuration.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: opentelemetry
  namespace: playground
spec:
  config:
    receivers:
      otlp:
        protocols:
          http:
            endpoint: 0.0.0.0:4318
    exporters:
      otlp/tempo:
        endpoint: http://tempo:4317
        tls:
          insecure: true
      otlphttp/loki:
        endpoint: http://loki-headless:3100/otlp
        tls:
          insecure: true
      prometheusremotewrite/mimir:
        endpoint: http://mimir-nginx:80/api/v1/push
        tls:
          insecure: true
    service:
      pipelines:
        logs:
          exporters:
            - otlphttp/loki
          receivers:
            - otlp
        metrics:
          exporters:
            - prometheusremotewrite/mimir
          receivers:
            - otlp
        traces:
          exporters:
            - otlp/tempo
          receivers:
            - otlp
We can also use Tempo’s Write-Ahead Log (WAL) to verify whether the traces are being exported to Tempo. To do this, we execute the following command on the Tempo pod:
grep -r -E 'emit' /var/tempo/wal/*
Configuring Grafana for Tempo
We follow the same process as we did for Mimir and Loki by deploying a Grafana data source for Tempo, using the correct Tempo endpoint provided by the Helm chart deployment.
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: grafana-datasource-tempo
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  datasource:
    name: datasource-tempo
    type: tempo
    url: http://tempo:3100
    access: proxy
    isDefault: false
We can now select the Tempo data source in Grafana to display traces between our services. In the provided example, you can easily see the importance of traces. We can follow the communication paths between the services and also observe how long a service takes to process an incoming request. This makes identifying bottlenecks much easier in a large microservice architecture.
With the setup of Tempo, we conclude our practical guide for deploying an observability solution. As with the other sections, we have also provided a commit here that covers the most important parts of deploying and configuring our observability components.
In this blog post, we demonstrated how to easily set up and run an observability solution using open-source components. Initially, this topic can seem complex. Getting the connections and configurations of OpenTelemetry to work required some trial and error. Additionally, the documentation is often unclear or sometimes even missing. We frequently found ourselves checking open issues or merge requests on GitHub, and browsing Stack Overflow to see how things need to be configured or deployed. However, when broken down into several parts, as we did in this blog post, the topic becomes manageable and intuitive.
The observability solution presented here covers some foundational aspects of observability. However, there are several options to extend this setup. For one, we would really like to implement Pyroscope for profiling. This would provide an even deeper and more extensive view of the application. Another tool we are eager to try is Grafana Alloy, Grafana’s distribution of the OpenTelemetry Collector, which was introduced at GrafanaCon 2024. It appears to be highly flexible and simplifies troubleshooting with an embedded debugging UI accessible via the Alloy HTTP server. Lastly, we would like to see an alert manager added to the solution. Seeing alerts being handled and delivered to the responsible team would make our proposed observability solution more production-ready.
As businesses continue to evolve in the cloud era, having the right observability tools is crucial for maintaining system performance and reliability. At &, we specialize in providing comprehensive monitoring solutions using stacks like LGTM to give you deep insights into your cloud native infrastructure. Our expertise in cloud native environments ensures that we can help you enhance your observability practices to find the perfect fit for your needs.
Contact us today to learn more about how we can help you stay ahead with tailored observability solutions, advanced monitoring strategies, and dedicated workshops on these topics.