| Metadata | Value |
|---|---|
| Source | Confluence - Standard for Applications Observability |
| Status | Ready for Comments |
| Last Synced | 2026-01-08 |
| Space | Architecture (ARCH) |
Ensure every service is observable by default, enabling fast detection, diagnosis, and resolution of issues while controlling cost.
This standard defines when to use logging, APM, tracing, metrics, and alerting, and sets the minimum bar to go live.
| Pillar | Definition | Example |
|---|---|---|
| Logging | Captures discrete events in a structured format (e.g., JSON) with context like timestamps, severity, and correlation IDs. Logs are essential for debugging and forensic analysis. | Search transaction_id=123 to find all related errors in seconds. |
| APM | Combines traces, metrics, and profiling to give developers actionable insights into code-level performance and dependencies. | After deployment, APM compares performance baselines and flags a 30% increase in DB query time, allowing rollback before customers notice. |
| Tracing | Provides a full picture of a request's journey across services using spans and timing data. Traces help pinpoint latency bottlenecks and broken dependencies in distributed systems. | Identify an 800ms latency spike in the switching service, causing authorisation delays. |
| Metrics | Numeric time series (counters, gauges, histograms) that measure system health, performance, and trends. Metrics power dashboards, SLOs, and alerting. | Generic performance indicators (latency, error rates, throughput) or custom indicators (number of processed transactions, failed authorisations). |
| Alerting | Converts telemetry into actionable notifications. Alerts should be symptom-based (e.g., SLO violations) and avoid noise from low-level resource metrics. | Fire an alert if more than 5% of authorisations failed over the last 5 minutes. |
| Need | Use | Notes |
|---|---|---|
| Auditable facts & discrete events (new features, unhappy paths, errors) | Logging | Structured JSON logs; include the relevant fields from the Logging Guidelines |
| End-to-end request flow across services | Tracing | Requires correlationId propagation; create spans for remote calls, DB, queues |
| Health over time, SLIs/SLOs (latency, error rate, saturation) | Metrics | Use counters, gauges, and histograms; they feed alerting and capacity planning |
| Code-level performance, flame graphs, error analytics, RUM | APM | Use Coralogix APM & Profiler. Keep on for all internet-facing services |
| Human attention for user/business impact | Alerting | Alerts must map to an SLO, runbook, and owner. Prefer multi-signal alerts |
We standardise on OpenTelemetry (OTel) for instrumentation.
| Rule | Description |
|---|---|
| MUST | Send all observability data to Coralogix |
| MUST | Tag every signal with: service, env (prod\|uat\|dev), version (build SHA/semver), host |
| MUST NOT | Log secrets or raw PII; follow Logging Guidelines |
| SHOULD | Keep observability in code using OpenTelemetry SDK |
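
The tagging rule above can be illustrated with a minimal sketch. It assumes a small helper, `common_tags`, that is hypothetical and not part of any SDK; in an OpenTelemetry setup, values like these are usually attached once as resource attributes so that every log, metric, and span carries them. The service name and version shown are placeholders.

```python
import socket

# Environments permitted by the standard.
ALLOWED_ENVS = {"prod", "uat", "dev"}


def common_tags(service: str, env: str, version: str) -> dict[str, str]:
    """Build the tag set that every log, metric, and span should carry."""
    if env not in ALLOWED_ENVS:
        raise ValueError(f"env must be one of {sorted(ALLOWED_ENVS)}, got {env!r}")
    if not service or not version:
        raise ValueError("service and version are required")
    return {
        "service": service,
        "env": env,
        "version": version,  # build SHA or semver
        "host": socket.gethostname(),
    }


# Placeholder values for illustration only.
tags = common_tags("payments-api", "uat", "1.4.2")
```

Validating the tag set at start-up, rather than at query time, means a misconfigured service fails fast instead of emitting telemetry that cannot be filtered by environment.
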
| Rule | Description |
|---|---|
| MUST | Emit structured JSON logs with a standard envelope |
| MUST | Use levels correctly: ERROR for actionable failures; WARNING for unexpected unhappy paths; no INFO by default |
| SHOULD | Avoid high-cardinality fields in messages (explodes indexes and slows search) |
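
A minimal sketch of these logging rules, using only the Python standard library. The envelope field names (`timestamp`, `level`, `message`, `correlationId`, `context`) are assumptions for illustration; the authoritative envelope is whatever the Logging Guidelines define. Note that the message stays constant while variable data goes into separate fields, which keeps high-cardinality values out of the message text.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed envelope."""

    def __init__(self, service: str, env: str, version: str) -> None:
        super().__init__()
        self.static = {"service": service, "env": env, "version": version}

    def format(self, record: logging.LogRecord) -> str:
        envelope = {
            "timestamp": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            **self.static,
            # Values passed via `extra=` become attributes on the record.
            "correlationId": getattr(record, "correlationId", None),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(envelope)


def build_logger(name: str, formatter: logging.Formatter) -> logging.Logger:
    """Create a logger that emits WARNING and above as JSON lines."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(formatter)
    logger.handlers = [handler]
    logger.setLevel(logging.WARNING)  # no INFO by default
    logger.propagate = False
    return logger


logger = build_logger("payments", JsonFormatter("payments-api", "prod", "1.4.2"))
logger.warning(
    "authorisation declined",  # constant message, no IDs embedded
    extra={"correlationId": "abc123", "context": {"reason": "insufficient_funds"}},
)
```
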
| Rule | Description |
|---|---|
| MUST NOT | Override the traceId that is set automatically |
| MUST | Propagate a well-defined correlationId (the traceId) across APIs, events, DB calls, etc. |
| MUST | Create spans for network calls, DB operations, external APIs, and message processing |
| MUST | Target a low trace sampling rate (assume 1%) to stay within budget |
| COULD | Create custom business correlationId for complex flows (e.g., paymentReferenceId) |
| Rule | Description |
|---|---|
| MUST | Enable Coralogix APM for all services; prioritise internet-facing and critical internal services |
| SHOULD | Keep logs, metrics, and traces as defaults for non-critical services if APM costs are prohibitive |
| COULD | Promote non-critical services to first-class citizens during incident analysis |
| Rule | Description |
|---|---|
| MUST | Expose service metrics (latency, throughput, error rate) and key resource metrics |
| MUST | Define SLIs/SLOs for critical user journeys; store in Coralogix and review periodically |
| SHOULD | Keep metric label sets bounded; avoid user or request-level labels |
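
One way to keep label sets bounded is to filter labels through an allow-list before recording a metric. The sketch below is illustrative only: the allow-list contents and the `status_code` to `status_class` mapping are assumptions, not part of this standard.

```python
# Hypothetical allow-list of low-cardinality label names.
ALLOWED_LABELS = {"service", "env", "route", "method", "status_class"}


def bounded_labels(raw: dict[str, str]) -> dict[str, str]:
    """Keep only low-cardinality labels and collapse status codes into classes."""
    labels = {k: v for k, v in raw.items() if k in ALLOWED_LABELS}
    status = raw.get("status_code")
    if status is not None:
        # 503 -> "5xx": five possible values instead of dozens.
        labels["status_class"] = f"{str(status)[0]}xx"
    return labels


labels = bounded_labels(
    {
        "service": "payments-api",
        "route": "/payments/{id}",  # route template, not the raw path
        "user_id": "u-1",           # dropped: request-level label
        "status_code": "503",
    }
)
```

Request-level values such as `user_id` remain available as log or span attributes, where high cardinality is far less costly than in metric labels.
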
| Rule | Description |
|---|---|
| MUST | Alert on SLO burn rate and symptom-based conditions; each alert has an owner, a runbook link, a severity, and auto-resolve criteria |
| MUST NOT | Alert on raw host metrics unless they indicate imminent user impact |
| SHOULD | Group related alerts and route them automatically |
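
As a worked illustration of burn-rate alerting: the burn rate is the observed error rate divided by the error budget (1 minus the SLO target), so a value of 1.0 means the budget is being spent exactly on schedule. The sketch below assumes a 99.9% SLO, a long and a short evaluation window, and a paging threshold of 14.4 (roughly 2% of a 30-day budget spent in one hour). All of these numbers are assumptions to be tuned per service.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total == 0:
        return 0.0  # no traffic, nothing burned
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


def should_page(long_window: float, short_window: float, threshold: float = 14.4) -> bool:
    """Page only when both windows burn fast, so brief blips do not page."""
    return long_window >= threshold and short_window >= threshold


# Hypothetical counts: 1-hour window and 5-minute window.
long_burn = burn_rate(errors=200, total=10_000, slo_target=0.999)
short_burn = burn_rate(errors=30, total=1_000, slo_target=0.999)
page = should_page(long_burn, short_burn)
```

Requiring both windows to exceed the threshold is one way to get the auto-resolve behaviour the rules ask for: once the short window recovers, the condition clears without manual action.
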
| Pillar | 🐢 Crawl | 🚶 Walk | 🏃 Run |
|---|---|---|---|
| Logging | Structured JSON logs that contain entire entities and mirror application flows | Log objects reduced to meaningful events (e.g., unhappy paths), with trimmed payloads | Dynamic log levels through feature toggles; logging limited to errors and warnings on unexpected paths (little to no INFO) |
| Tracing | TraceId across applications | Tail sampling; custom correlationIds covering multi-application flows | Adaptive sampling |
| Metrics | Generic Performance Indicators (latency, errors, throughput) | Service-Specific Performance Indicators (e.g., number of processed transactions) | Full RED/USE coverage; Use metrics as criteria for application autoscaling |
| APM | Auto-instrumentation; DB/HTTP spans | Dependency maps; N+1 detection (high number of similar DB queries per request) | Continuous profiling; regression diff |
| Alerting | Few actionable alerts with runbooks | Alerting on error-budget consumption (e.g., over 2% failed authorisations in 5 minutes); severity drives paging decisions | Predictive thresholds that account for quiet hours |
| Component | Configuration |
|---|---|
| Ingestion | OpenTelemetry SDKs → OTel Collector → Coralogix |
| Correlation | Logs include a defined correlationId (e.g., paymentReferenceId) so Coralogix auto-links logs ↔ traces ↔ APM |
| Dashboards | Each service MUST have a Coralogix dashboard template instance (golden signals + service SLOs) |
| Runbooks | Link runbooks from Coralogix alerts |
| Retention (Starter Policy) | Logs: hot 3 days (re-hydrate from archive); Traces: 7 days indexed; Metrics: 6 months |
Teams may request retention changes via the Technology Review Board (TRB) and Financial Control Group (FCG) processes.
| Pitfall | Guidance |
|---|---|
| Business data vs. telemetry | Do NOT use observability tools (e.g., Coralogix metrics) to track revenue or business KPIs; use data systems (Snowflake, PowerBI) for those. Use telemetry only for operational health. |
| High-cardinality labels | Avoid user_id or raw GUIDs in metric labels; use attributes in logs/traces instead. |
| PII/secrets in telemetry | Scrub sensitive data at source or via OTel Collector processors. |
| Missing trace/log correlation | Always include trace_id and span_id in logs when tracing is enabled. |
| Alert fatigue | Page on SLO symptoms, not low-level resource metrics. |
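
Scrubbing at source can be as simple as masking known sensitive keys before an event is serialised. The sketch below is a minimal illustration: the key list and the card-number pattern are assumptions, and a real deployment would take them from the Logging Guidelines and may add a second layer in the OTel Collector.

```python
import re

# Hypothetical deny-list of field names whose values must never be logged.
SENSITIVE_KEYS = {"password", "token", "authorization", "card_number", "cvv"}
# Rough pattern for card-like digit runs embedded in free text.
_CARD_LIKE = re.compile(r"\b\d{13,19}\b")


def scrub(event: dict) -> dict:
    """Return a copy of a log event with sensitive fields masked."""
    clean = {}
    for key, value in event.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = scrub(value)  # recurse into nested context
        elif isinstance(value, str):
            clean[key] = _CARD_LIKE.sub("[REDACTED]", value)
        else:
            clean[key] = value
    return clean


safe = scrub(
    {
        "message": "card 4111111111111111 declined",
        "password": "hunter2",
        "context": {"token": "abc", "amount": 10},
    }
)
```
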
Plan early. Observability cost scales with event volume, metric cardinality, and retention.
Note: Budget is a first-class design concern – treat it like latency or availability. Engage IDP/Architecture during design review.
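
A rough sizing sketch can make these cost drivers concrete during design review. The functions below are back-of-the-envelope estimates, not a pricing model, and every input (event volumes, payload sizes, label cardinalities) is a hypothetical figure; only the 3-day log and 7-day trace retention and the 1% sampling assumption come from this standard.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Worst-case number of time series for one metric (product of label values)."""
    return prod(label_cardinalities.values())


def stored_gb(events_per_day: int, avg_bytes: int, retention_days: int,
              sample_rate: float = 1.0) -> float:
    """Approximate retained volume in GB for a stream of events."""
    return events_per_day * sample_rate * avg_bytes * retention_days / 1e9


# Bounded labels: 1 service x 3 envs x 20 routes x 5 status classes.
bounded = series_count({"service": 1, "env": 3, "route": 20, "status_class": 5})
# The same metric with a user-level label added.
unbounded = series_count({"service": 1, "env": 3, "route": 20, "status_class": 5,
                          "user_id": 100_000})

# Logs kept hot for 3 days; traces sampled at 1% and indexed for 7 days.
log_gb = stored_gb(events_per_day=50_000_000, avg_bytes=800, retention_days=3)
trace_gb = stored_gb(events_per_day=200_000_000, avg_bytes=500, retention_days=7,
                     sample_rate=0.01)
```

With these inputs, one user-level label multiplies the series count by the number of users, which is why cardinality tends to dominate metric cost long before retention does.
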
| Role | Responsibility |
|---|---|
| Development teams | Own instrumentation, dashboards, and SLOs |
| Development teams / SysOps | Own first-line alert response |
| IDP team | Own Coralogix tenancy, agents/collectors, guardrails, access management, cost governance, templates, and golden paths |
| Architecture | Own this Standard, related Guidelines, and the design review checklist |