When a user reports that the application is slow, or an alert fires at 3 AM, the first question is always the same: what is happening and why? Observability is the practice of instrumenting systems so that these questions can be answered without deploying new code or guessing at the root cause.
The term comes from control theory, where a system is observable if its internal state can be determined from its external outputs. In software engineering, observability means having enough data and tooling to understand the behavior of production systems, including behaviors that were not anticipated during development.
Monitoring vs. Observability
Monitoring and observability are related but distinct concepts. Monitoring answers known questions: Is the server up? Is CPU utilization below 80%? Is the error rate within normal bounds? Monitoring is about checking predefined conditions against thresholds.
Observability goes further. It answers questions you did not know to ask in advance: Why did latency spike for requests from a specific geographic region? Why is a subset of users experiencing errors while others are not? What changed between the working state and the broken state? Observability is about exploring data to understand novel problems.
A system can be well monitored yet poorly observable: you can detect that something is wrong but cannot diagnose the root cause.
The Three Pillars of Observability
Logs
Logs are timestamped records of discrete events. They capture what happened at a specific point in time: a request was received, a database query was executed, an error was thrown, a user logged in. Logs are the most familiar form of telemetry and the easiest to produce.
Structured logging is essential for observability. Instead of free-text log messages, structured logs emit events as key-value pairs or JSON objects. This makes them searchable, filterable, and aggregatable. A structured log entry might include fields like timestamp, service name, request ID, user ID, duration, status code, and error message.
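As a rough illustration of the idea, the sketch below uses Python's standard logging module to emit each event as one JSON object. The field names (request_id, user_id, duration_ms, status_code) and the build_logger helper are assumptions chosen for this example, not a required schema.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    # Hypothetical context fields copied from the record when present.
    CONTEXT_FIELDS = ("request_id", "user_id", "duration_ms", "status_code")

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
        }
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


def build_logger(service_name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines (stderr when stream is None)."""
    logger = logging.getLogger(service_name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    # Values passed via `extra` become attributes on the log record.
    log.info(
        "order placed",
        extra={"request_id": "req-42", "duration_ms": 87, "status_code": 200},
    )
```

Because every entry is a flat JSON object, a log platform can filter on status_code or group by request_id without parsing free text.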
Common logging tools include the ELK stack (Elasticsearch, Logstash, Kibana), Grafana Loki, Datadog Logs, and Splunk. These platforms ingest, index, and provide search capabilities across millions of log entries.
Best practices for logging:
- Log at appropriate levels: DEBUG for development details, INFO for normal operations, WARN for concerning conditions, ERROR for failures.
- Include correlation identifiers like request IDs and trace IDs to connect related log entries across services.
- Avoid logging sensitive data such as passwords, tokens, or personal information.
- Use sampling for high-volume, low-value log streams to control costs.
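Two of these practices, redacting sensitive fields and sampling low-value streams, can be sketched with the standard library alone. The key list and the keep_ratio parameter below are illustrative assumptions; a real deployment would choose its own.

```python
import logging
import random
from typing import Optional

# Hypothetical list of field names that must never reach the log pipeline.
SENSITIVE_KEYS = {"password", "token", "authorization", "ssn"}


def redact(fields: dict) -> dict:
    """Return a copy of `fields` with sensitive values masked."""
    return {
        key: ("[REDACTED]" if key.lower() in SENSITIVE_KEYS else value)
        for key, value in fields.items()
    }


class SamplingFilter(logging.Filter):
    """Keep every WARNING-or-higher record; keep a fraction of the rest."""

    def __init__(self, keep_ratio: float, rng: Optional[random.Random] = None):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.rng = rng or random.Random()

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True  # never drop warnings or errors
        return self.rng.random() < self.keep_ratio


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.addFilter(SamplingFilter(keep_ratio=0.1))  # keep ~10% of INFO/DEBUG
    logger = logging.getLogger("search")
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    logger.info("query served %s", redact({"user": "ana", "token": "abc123"}))
    logger.error("index unavailable")  # always emitted
```

Sampling only below WARNING keeps costs down without losing the entries most likely to matter during an incident.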
Metrics
Metrics are numerical measurements collected over time. Unlike logs, which record individual events, metrics aggregate data into time series: values measured at regular intervals. This makes them efficient to store, fast to query, and ideal for dashboards and alerting.
The four primary metric types are:
- Counters track cumulative values that only increase (apart from resetting to zero when the process restarts), like the total number of requests served or errors encountered.
- Gauges measure a value that can go up or down, like current memory usage, active connections, or queue depth.
- Histograms capture the distribution of values, like request latency. They enable percentile calculations (p50, p95, p99) that are more meaningful than averages for understanding user experience.
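- Summaries are similar to histograms but calculate percentiles inside the instrumented application rather than on the metrics server at query time. The trade-off is that precomputed percentiles cannot be meaningfully aggregated across instances.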
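To make the first three types concrete, here is a minimal in-memory sketch. It is not the API of Prometheus or any client library; the class names, bucket bounds, and the quantile_upper_bound helper are assumptions made for illustration. The histogram stores only per-bucket counts, so a percentile is reported as the upper bound of the bucket that contains it.

```python
import bisect
import math


class Counter:
    """Cumulative value that can only go up."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters only increase")
        self.value += amount


class Gauge:
    """Point-in-time value that can go up or down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations per bucket; the last bucket is +Inf."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.bucket_counts = [0] * (len(self.bounds) + 1)
        self.count = 0
        self.sum = 0.0

    def observe(self, value: float) -> None:
        # bisect_left puts a value equal to a bound into that bound's bucket.
        self.bucket_counts[bisect.bisect_left(self.bounds, value)] += 1
        self.count += 1
        self.sum += value

    def quantile_upper_bound(self, q: float) -> float:
        """Smallest bucket bound covering at least fraction q of observations."""
        if self.count == 0:
            raise ValueError("no observations")
        target = math.ceil(q * self.count)
        running = 0
        for bound, n in zip(self.bounds + [math.inf], self.bucket_counts):
            running += n
            if running >= target:
                return bound
        return math.inf


if __name__ == "__main__":
    latency = Histogram(bounds=[0.1, 0.5, 1.0])  # seconds, hypothetical buckets
    for _ in range(90):
        latency.observe(0.05)  # most requests are fast
    for _ in range(10):
        latency.observe(0.8)   # a slow tail
    print("mean:", latency.sum / latency.count)
    print("p50 <=", latency.quantile_upper_bound(0.50))
    print("p99 <=", latency.quantile_upper_bound(0.99))
```

In this made-up workload the mean sits close to the fast requests, while the p99 bucket exposes the slow tail that one in ten users actually experiences.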
Prometheus is a widely adopted open-source metrics platform, often paired with Grafana for visualization. Commercial alternatives include Datadog, New Relic, and Dynatrace. Two frameworks help decide which metrics to collect: the RED method (Rate, Errors, Duration) for request-driven services, and the USE method (Utilization, Saturation, Errors) for resources such as CPUs, disks, and network interfaces.
Traces
Distributed tracing follows a request as it flows through multiple services. A trace represents the entire journey of a request, and each step within that journey is a span. Spans capture the service name, operation, duration, status, and parent-child relationships.
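The relationship between a trace and its spans can be pictured as a tree. The sketch below models spans as plain records and prints them indented by parent-child depth; the Span fields and the render helper are assumptions for illustration, not the data model of any particular tracing backend.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    operation: str
    start_ms: float
    end_ms: float
    status: str = "OK"

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def render(spans: List[Span], parent_id: Optional[str] = None, depth: int = 0) -> List[str]:
    """Return one line per span, indented to show parent-child nesting."""
    lines = []
    children = [s for s in spans if s.parent_id == parent_id]
    for span in sorted(children, key=lambda s: s.start_ms):
        indent = "  " * depth
        lines.append(
            f"{indent}{span.service}:{span.operation} "
            f"{span.duration_ms:.0f}ms {span.status}"
        )
        lines.extend(render(spans, span.span_id, depth + 1))
    return lines


if __name__ == "__main__":
    trace = [
        Span("t1", "a", None, "gateway", "GET /checkout", 0, 120),
        Span("t1", "b", "a", "orders", "create_order", 10, 100),
        Span("t1", "c", "b", "db", "INSERT orders", 20, 90, "ERROR"),
    ]
    print("\n".join(render(trace)))
```

Reading the tree top to bottom shows both where the time went and which span first reported the failure.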
Traces are indispensable in microservices architectures where a single user action might trigger calls across ten or more services. Without tracing, diagnosing latency or errors across service boundaries requires manual correlation of logs and metrics from multiple systems.
OpenTelemetry has emerged as the standard for distributed tracing instrumentation. It provides SDKs for most programming languages and exporters for popular backends like Jaeger, Zipkin, Tempo, and commercial platforms.
Key tracing concepts:
- Context propagation passes trace and span IDs across service boundaries through HTTP headers or message metadata.
- Sampling controls the percentage of traces that are captured. Head-based sampling decides at the start of a trace, before the outcome is known. Tail-based sampling decides after the trace is complete, which allows capturing all error traces regardless of the sampling rate, at the cost of buffering spans until the decision is made.
- Span attributes add contextual information to spans, such as HTTP method, URL, database query, or user ID.
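The first two concepts can be sketched in a few lines. The header layout below follows the W3C Trace Context traceparent format (version, trace ID, parent span ID, flags). The sampling functions are simplified illustrations written for this article, not the algorithm of OpenTelemetry or any other SDK, and the function names are assumptions.

```python
import re
import secrets

# traceparent: version "00", 32 hex trace ID, 16 hex span ID, 2 hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_trace_id() -> str:
    return secrets.token_hex(16)


def new_span_id() -> str:
    return secrets.token_hex(8)


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool) -> dict:
    """Write the current trace context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers: dict):
    """Read (trace_id, parent_span_id, sampled) from incoming headers, or None."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if not match:
        return None
    trace_id, parent_span_id, flags = match.groups()
    return trace_id, parent_span_id, (int(flags, 16) & 1) == 1


def head_sample(trace_id: str, ratio: float) -> bool:
    """Decide up front from the trace ID alone, so every service agrees."""
    return int(trace_id[:16], 16) < ratio * 2**64


def tail_sample(span_statuses, trace_id: str, ratio: float) -> bool:
    """Decide after the trace completes: always keep traces with an error."""
    if "ERROR" in span_statuses:
        return True
    return head_sample(trace_id, ratio)


if __name__ == "__main__":
    trace_id = new_trace_id()
    outgoing = inject({}, trace_id, new_span_id(), head_sample(trace_id, 0.1))
    print(outgoing)
    print(extract(outgoing))  # what the downstream service would recover
```

Deriving the head-based decision from the trace ID, rather than from a fresh random number in each service, is one way to keep a trace from being recorded by some services and dropped by others.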
How the Three Pillars Work Together
Logs, metrics, and traces are most powerful when correlated. A typical debugging workflow might start with a metric alert showing elevated error rates, then move to traces to identify which service is causing the errors, and finally examine logs from that service to find the specific error message and stack trace.
Correlation depends on shared identifiers. When the same trace ID appears in log entries and in metric exemplars (sample trace references attached to a metric data point), you can move from a spike on a dashboard to an example trace and then to that trace's logs. Observability platforms such as Grafana, Datadog, and Honeycomb support this kind of navigation.
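The last step of that workflow, going from a failed span to the matching log lines, amounts to a join on the trace ID. The sketch below shows this with plain dictionaries; the record shapes and the investigate function are assumptions for illustration, and a real platform would run the equivalent query against its own stores.

```python
from collections import defaultdict


def index_logs_by_trace(log_entries):
    """Group structured log entries by their trace_id field."""
    index = defaultdict(list)
    for entry in log_entries:
        if entry.get("trace_id"):
            index[entry["trace_id"]].append(entry)
    return index


def investigate(spans, log_entries):
    """For every failed span, collect ERROR logs from the same trace and service."""
    logs_by_trace = index_logs_by_trace(log_entries)
    findings = []
    for span in spans:
        if span["status"] != "ERROR":
            continue
        messages = [
            entry["message"]
            for entry in logs_by_trace.get(span["trace_id"], [])
            if entry["service"] == span["service"] and entry["level"] == "ERROR"
        ]
        findings.append(
            {
                "trace_id": span["trace_id"],
                "service": span["service"],
                "messages": messages,
            }
        )
    return findings


if __name__ == "__main__":
    spans = [
        {"trace_id": "t1", "service": "gateway", "status": "OK"},
        {"trace_id": "t1", "service": "payments", "status": "ERROR"},
    ]
    logs = [
        {"trace_id": "t1", "service": "payments", "level": "ERROR",
         "message": "card processor timeout"},
        {"trace_id": "t1", "service": "gateway", "level": "INFO",
         "message": "request received"},
    ]
    for finding in investigate(spans, logs):
        print(finding)
```

Log entries without a trace ID fall out of the join entirely, which is why attaching correlation identifiers at the point of logging matters.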
OpenTelemetry: The Convergence Standard
OpenTelemetry is a CNCF project that provides a unified set of APIs, SDKs, and tools for generating and collecting telemetry data. It supports logs, metrics, and traces through a single instrumentation framework. The OpenTelemetry Collector can receive telemetry in multiple formats and export it to any supported backend.
Adopting OpenTelemetry provides vendor flexibility. You can switch between backends, such as moving from Jaeger to Datadog, without re-instrumenting your applications, typically by changing exporter or Collector configuration instead of application code. This decoupling of instrumentation from analysis is a significant architectural advantage.
Building an Observability Strategy
- Start with service-level objectives. Define what good looks like for your users in terms of availability, latency, and error rates. Use these SLOs to guide what you instrument and alert on.
- Instrument early. Adding observability after a production incident is expensive and stressful. Instrument services during development and make telemetry a first-class concern.
- Prioritize actionable alerts. Alert on symptoms that affect users, not on internal metrics that may not correlate with user impact. Every alert should have a clear response procedure.
- Control costs. Observability data can grow quickly. Use sampling, retention policies, and tiered storage to manage costs without sacrificing visibility into critical systems.
- Practice with game days. Regularly test your observability setup by introducing controlled failures and verifying that your tools help you diagnose the problem quickly.
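To make the first and third points more concrete, the sketch below turns an availability SLO into an error budget and a burn rate, two numbers that can drive symptom-based alerts. The function names, the example figures, and the paging threshold are all assumptions for illustration; appropriate targets and thresholds depend on the service.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize how much of the error budget a window of traffic has used.

    slo_target is the fraction of requests that must succeed, e.g. 0.999.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1")
    allowed_failures = (1 - slo_target) * total_requests
    error_ratio = failed_requests / total_requests if total_requests else 0.0
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_consumed": (
            failed_requests / allowed_failures if allowed_failures else 0.0
        ),
        # Burn rate 1.0 means errors arrive exactly as fast as the SLO allows.
        "burn_rate": error_ratio / (1 - slo_target),
    }


def should_page(burn_rate: float, threshold: float) -> bool:
    """Page a human only when the budget is burning faster than the threshold."""
    return burn_rate >= threshold


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 failures, 99.9% SLO.
    report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    for name, value in report.items():
        print(f"{name}: {value:.3f}")
    print("page:", should_page(report["burn_rate"], threshold=10.0))
```

Alerting on burn rate ties the page to user-visible impact: a brief CPU spike that causes no failed requests leaves the budget untouched and wakes no one.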
Observability is not a product you buy or a tool you install. It is a property of systems that are designed to explain their own behavior. Investing in observability pays dividends every time an incident occurs, a performance regression is detected, or a capacity decision needs data to back it up.