O11Y on Kubernetes
I’ll show you how to set up your own monitoring stack on Kubernetes.
Most Kubernetes teams start with third-party observability platforms like Datadog or New Relic. They’re fast to set up and cover 80% of needs. But over time, you hit limits: opaque billing, vendor outages, or not being able to store raw logs and traces as long as you’d like. That’s when teams start looking at self-hosted monitoring.
The core open-source stack I’ve used is:
- Prometheus for metrics
- Loki for logs
- Tempo for traces
- Grafana to visualize everything
Together, they give you a vendor-neutral, fully customizable monitoring solution.
Deploying the stack
Helm makes installation straightforward:
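helm repo add grafana https://grafana.github.io/helm-charts
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install loki grafana/loki-stack
# Prometheus charts live in the prometheus-community repo, not the grafana repo.
# kube-prometheus-stack bundles its own Grafana; disable it because Grafana is installed separately below.
helm install prometheus prometheus-community/kube-prometheus-stack --set grafana.enabled=false
helm install tempo grafana/tempo
helm install grafana grafana/grafana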
But real-world setups go beyond the defaults. For example, you’ll almost always need persistent storage for Prometheus and Loki, set through each chart’s values and passed to the release with -f:
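# loki-values.yaml (for grafana/loki-stack)
loki:
  persistence:
    enabled: true
    size: 50Gi

# prometheus-values.yaml (for prometheus-community/kube-prometheus-stack)
prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi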
Wiring it together
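- Prometheus scrapes metrics from pods. Targets are discovered through ServiceMonitor resources when you run the Prometheus Operator, or through prometheus.io/* annotations with the plain Prometheus chart.
- Loki receives logs from Promtail, which runs as a DaemonSet (the loki-stack default) or as sidecars.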
- Tempo ingests spans directly from instrumented apps (e.g. OpenTelemetry).
- Grafana connects to all three, letting you pivot between metrics, logs, and traces in a single dashboard.
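To make the first bullet concrete, here is a minimal sketch of a scrape target declared for the Prometheus Operator. It assumes the operator CRDs are installed (they ship with kube-prometheus-stack). The app name `my-app`, the port name `http-metrics`, and the `release: prometheus` label are placeholders of mine; the label has to match whatever ServiceMonitor selector your Prometheus instance uses.

```yaml
# servicemonitor.yaml: hypothetical example, all names are placeholders
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  labels:
    release: prometheus    # assumed to match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: my-app          # labels on the Service, not on the pods
  endpoints:
    - port: http-metrics   # name of the Service port that serves /metrics
      path: /metrics
      interval: 30s
```

Apply it in the same namespace as the Service it selects. If the target never shows up in Prometheus, a selector label mismatch is the first thing I would check.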
Having all three signals in one place makes debugging powerful: you see a spike in latency (Prometheus), jump into the logs (Loki), and correlate with a distributed trace (Tempo).
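That metrics-to-logs-to-traces pivot depends on Grafana knowing about all three backends. Below is a rough sketch of Helm values for the grafana/grafana chart that provisions them as data sources and links trace IDs in log lines to Tempo. The Service names and ports are assumptions based on release names `prometheus`, `loki`, and `tempo` (confirm them with `kubectl get svc`), and the `trace_id=` log format is hypothetical.

```yaml
# grafana-values.yaml: sketch only, URLs and log format are assumptions
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        uid: prometheus
        access: proxy
        url: http://prometheus-kube-prometheus-prometheus:9090  # assumed Service name
        isDefault: true
      - name: Loki
        type: loki
        uid: loki
        access: proxy
        url: http://loki:3100   # assumed Service name
        jsonData:
          derivedFields:
            # Turn "trace_id=<id>" in a log line into a link to the Tempo trace
            - name: TraceID
              matcherRegex: "trace_id=(\\w+)"
              url: "$${__value.raw}"
              datasourceUid: tempo
      - name: Tempo
        type: tempo
        uid: tempo
        access: proxy
        url: http://tempo:3100  # port is an assumption (3100 or 3200 depending on chart version)
```

You would pass this with something like `helm upgrade --install grafana grafana/grafana -f grafana-values.yaml`. The fixed `uid` values are what let the Loki derived field point at the Tempo data source.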
Lessons learned in practice
- Raw logs can fill disks quickly. Use retention policies and S3-compatible backends for Loki/Tempo.
- Exposing Prometheus endpoints without auth is a common security mistake. Lock it down.
- The stack gives you raw data, but meaningful alerts require tuning (e.g. 99th percentile latency, error budgets).
- Helm makes it easy to install, but version mismatches between components can break integrations. Test upgrades in staging.
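Two of these lessons translate fairly directly into manifests. The sketch below pairs a p99 latency alert with a NetworkPolicy that only lets Grafana pods reach Prometheus. It assumes the Prometheus Operator CRDs are installed, a CNI plugin that enforces NetworkPolicy, and Grafana and Prometheus running in the same namespace. The metric name `http_request_duration_seconds_bucket`, the 500 ms threshold, and the `release: prometheus` label are placeholders of mine, so check the pod labels against your own deployment.

```yaml
# Hypothetical alert: p99 request latency above 500 ms for 10 minutes
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: latency-alerts
  labels:
    release: prometheus   # assumed to match the Prometheus ruleSelector
spec:
  groups:
    - name: latency
      rules:
        - alert: HighP99Latency
          expr: |
            histogram_quantile(
              0.99,
              sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
            ) > 0.5
          for: 10m
          labels:
            severity: page
          annotations:
            summary: "p99 latency above 500 ms for 10 minutes"
---
# Only allow Grafana pods to reach Prometheus on port 9090
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-allow-grafana
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus   # assumed label on the Prometheus pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: grafana   # assumed label on the Grafana pods
      ports:
        - protocol: TCP
          port: 9090
```

Apply both in the namespace where the stack runs. The `for: 10m` clause is one simple way to keep short spikes from paging anyone; the right window depends on your traffic.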
Takeaway
Running your own observability stack is not free — you trade vendor convenience for control. But if you care about cost transparency, long-term data retention, or avoiding lock-in, Loki + Prometheus + Tempo + Grafana is a proven combination. Once configured, it gives your team deep visibility into Kubernetes workloads, and you’re not at the mercy of a third-party SaaS when things go wrong.