Standard for Applications Observability

Metadata	Value
Source	Confluence - Standard for Applications Observability
Status	Ready for Comments
Last Synced	2026-01-08
Space	Architecture (ARCH)

1. Purpose

Ensure every service is observable by default, enabling fast detection, diagnosis, and resolution of issues while controlling cost.

2. Scope

This standard defines when to use logging, APM, tracing, metrics, and alerting, and sets the minimum bar for go-live.

The Five Pillars of Observability

Pillar	Definition	Example
Logging	Captures discrete events in a structured format (e.g., JSON) with context like timestamps, severity, and correlation IDs. Logs are essential for debugging and forensic analysis.	Search `transaction_id=123` to find all related errors in seconds.
APM	Combines traces, metrics, and profiling to give developers actionable insights into code-level performance and dependencies.	After deployment, APM compares performance baselines and flags a 30% increase in DB query time, allowing rollback before customers notice.
Tracing	Provides a full picture of a request's journey across services using spans and timing data. Traces help pinpoint latency bottlenecks and broken dependencies in distributed systems.	Identify an 800ms latency spike in the switching service, causing authorisation delays.
Metrics	Numeric time series (counters, gauges, histograms) that measure system health, performance, and trends. Metrics power dashboards, SLOs, and alerting.	Generic performance indicators (latency, error rates, throughput) or custom indicators (number of processed transactions, failed authorisations).
Alerting	Converts telemetry into actionable notifications. Alerts should be symptom-based (e.g., SLO violations) and avoid noise from low-level resource metrics.	Fire an alert if more than 5% of authorisations failed over the last 5 minutes.

3. "When to Use What" (Decision Rules)

Need	Use	Notes
Auditable facts & discrete events (new features, unhappy paths, errors)	Logging	Structured JSON logs; include relevant fields from Logging guideline
End-to-end request flow across services	Tracing	Requires correlationId propagation; create spans for remote calls, DB, queues
Health over time, SLIs/SLOs (latency, error rate, saturation)	Metrics	Including counters/gauges/histograms. Use for alerting and capacity planning
Code-level performance, flame graphs, error analytics, RUM	APM	Use Coralogix APM & Profiler. Keep on for all internet-facing services
Human attention for user/business impact	Alerting	Alerts must map to an SLO, runbook, and owner. Prefer multi-signal alerts

Rules of Thumb

If you need to answer "what happened?" → Logs
If you need to answer "where did it slow/fail?" → Traces + APM
If you need to answer "is it healthy?" → Metrics + SLOs
If you need a pager to fire → Alert on SLO symptoms

4. Requirements

4.1. OpenTelemetry Rationale

We standardise on OpenTelemetry (OTel) for instrumentation to:

Decouple code from backend vendors
Ensure consistent semantics across services
Enable cost control via centralised processing (sampling, filtering, redaction)
Allow backend flexibility without re-instrumenting applications
Support multi-vendor strategies

4.2. Common Requirements

Rule	Description
MUST	Send all observability data to Coralogix
MUST	Tag every signal with: `service`, `env` (prod|uat|dev), `version` (build SHA/semver), `host`
MUST NOT	Log secrets or raw PII; follow Logging Guidelines
SHOULD	Keep observability in code using OpenTelemetry SDK

4.3. Logging Requirements

Rule	Description
MUST	Emit structured JSON logs with a standard envelope
MUST	Use levels correctly: `ERROR` for failures that require action; `WARNING` for unexpected unhappy paths; no `INFO` by default
SHOULD	Avoid high-cardinality fields in messages (explodes indexes and slows search)
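
The rules above can be sketched with Python's standard `logging` module. This is a minimal illustration, not the approved envelope: the four tags come from the Common Requirements, while the other field names, the `payments-api` service name, and the helper names are assumptions. The Logging Guidelines remain the reference for the real envelope.

```python
import json
import logging
import socket
from datetime import datetime, timezone

# Tag names come from the Common Requirements; the values are hypothetical.
STATIC_TAGS = {
    "service": "payments-api",
    "env": "dev",
    "version": "1.4.2",
    "host": socket.gethostname(),
}


class JsonEnvelopeFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        envelope = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "severity": record.levelname,
            "message": record.getMessage(),
            **STATIC_TAGS,
        }
        # Correlation fields are optional and attached per call via `extra=`.
        for key in ("trace_id", "span_id", "correlation_id"):
            value = getattr(record, key, None)
            if value is not None:
                envelope[key] = value
        return json.dumps(envelope)


def build_logger(name: str = "app") -> logging.Logger:
    """Return a logger that emits JSON and suppresses INFO by default."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonEnvelopeFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.WARNING)
    logger.propagate = False
    return logger
```

A call such as `build_logger().error("authorisation failed", extra={"trace_id": "..."})` then emits one JSON line. Identifiers travel as separate fields, not inside the message text, which keeps messages low-cardinality.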

4.4. Tracing Requirements

Rule	Description
MUST NOT	Override the automatically generated `traceId`
MUST	Propagate a well-defined correlationId (the `traceId`) across APIs, events, DB calls, etc.
MUST	Create spans for network calls, DB operations, external APIs, and message processing
MUST	Target a low trace sampling rate (assume 1%) to stay within budget
COULD	Create custom business `correlationId` for complex flows (e.g., `paymentReferenceId`)

4.5. APM Requirements

Rule	Description
MUST	Enable Coralogix APM for all services; prioritise internet-facing and critical internal services
SHOULD	Keep logs, metrics, and traces as defaults for non-critical services if APM costs are prohibitive
COULD	Promote non-critical services to first-class citizens during incident analysis

4.6. Metrics Requirements

Rule	Description
MUST	Expose service metrics (latency, throughput, error rate) and key resource metrics
MUST	Define SLIs/SLOs for critical user journeys; store in Coralogix and review periodically
SHOULD	Keep metric label sets bounded; avoid user or request-level labels
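
One way to keep label sets bounded is to pass every label through an allow-list before it reaches the metrics library. The sketch below uses an in-memory counter to show the effect. The label names, permitted values, and class names are assumptions for illustration, not a prescribed schema.

```python
from collections import Counter
from typing import Dict, Tuple

# Hypothetical allow-list: label name -> permitted values.
ALLOWED_LABELS = {
    "method": {"GET", "POST", "PUT", "DELETE"},
    "status_class": {"2xx", "3xx", "4xx", "5xx"},
    "outcome": {"approved", "declined", "error"},
}

LabelSet = Tuple[Tuple[str, str], ...]


def bounded_labels(raw: Dict[str, str]) -> LabelSet:
    """Drop unknown label names and collapse unknown values to 'other'."""
    cleaned = {}
    for name, allowed in ALLOWED_LABELS.items():
        if name in raw:
            cleaned[name] = raw[name] if raw[name] in allowed else "other"
    return tuple(sorted(cleaned.items()))


class RequestCounter:
    """Toy counter keyed by label set, standing in for a real metric."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def inc(self, **labels: str) -> None:
        self._counts[bounded_labels(labels)] += 1

    def series_count(self) -> int:
        """Number of distinct time series this counter would create."""
        return len(self._counts)
```

With this guard, a caller that passes `user_id` by mistake still produces a single series, because the unknown label is dropped before counting.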

4.7. Alerting Requirements

Rule	Description
MUST	Alert on SLO burn rate and symptom-based conditions; each alert has owner, runbook link, severity, auto-resolve criteria
MUST NOT	Alert on raw host metrics unless they indicate imminent user impact
SHOULD	Group & route automatically
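
Burn rate is the observed error rate divided by the error budget (1 minus the SLO target). A burn rate of 1 spends the budget exactly over the SLO period. The sketch below checks a long and a short window together, so an alert fires only while the problem is still ongoing. The 99.9% target and the 14.4 threshold (roughly 2% of a 30-day budget spent in one hour) are illustrative assumptions, not values set by this standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    total: int   # requests observed in the window
    failed: int  # requests that violated the SLI


def burn_rate(window: Window, slo_target: float) -> float:
    """Observed error rate divided by the error budget (1 - SLO target)."""
    if window.total == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (window.failed / window.total) / error_budget


def should_page(
    long_window: Window,
    short_window: Window,
    slo_target: float = 0.999,   # assumed SLO target
    threshold: float = 14.4,     # assumed burn-rate threshold
) -> bool:
    """Page only when both windows burn budget faster than the threshold."""
    return (
        burn_rate(long_window, slo_target) >= threshold
        and burn_rate(short_window, slo_target) >= threshold
    )
```

For example, 200 failures in 10,000 requests over the long window and 20 in 1,000 over the short window both give a burn rate of about 20 against a 99.9% target, so `should_page` returns `True`. If the short window has recovered to 1 failure in 1,000, it returns `False`.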

5. Maturity Levels (Crawl/Walk/Run)

Pillar	🐢 Crawl	🚶 Walk	🏃 Run
Logging	Structured JSON logs containing entire entities, mirroring application flows	Logs reduced to meaningful events (e.g., unhappy paths) with trimmed payloads	Dynamic log levels controlled through feature toggles. Logging limited to errors and warnings on unexpected paths (little to no INFO)
Tracing	`traceId` across applications	Tail sampling; custom correlationIds covering multi-application flows	Adaptive sampling
Metrics	Generic Performance Indicators (latency, errors, throughput)	Service-Specific Performance Indicators (e.g., number of processed transactions)	Full RED/USE coverage; Use metrics as criteria for application autoscaling
APM	Auto-instrumentation; DB/HTTP spans	Dependency maps; N+1 detection (high number of similar DB queries per request)	Continuous profiling; regression diff
Alerting	Few actionable alerts with runbooks	Alerting on error budget (e.g., over 2% failed authorisations in 5 minutes). Severity drives paging decisions	Predictive thresholds considering quiet hours

6. Implementation Defaults (Coralogix)

Component	Configuration
Ingestion	OpenTelemetry SDKs → OTel Collector → Coralogix
Correlation	Logs include a defined correlationId (e.g., `paymentReferenceId`) so Coralogix auto-links logs ↔ traces ↔ APM
Dashboards	Each service MUST have a Coralogix dashboard template instance (golden signals + service SLOs)
Runbooks	Link runbooks from Coralogix alerts
Retention (Starter Policy)	Logs: hot 3 days (re-hydrate from archive); Traces: 7 days indexed; Metrics: 6 months

Teams may request retention changes via Technology Review Board (TRB) and Financial Control Group (FCG) processes.

7. Service Readiness Checklist (Go-Live Gate)

Coralogix APM is enabled
Trace propagation verified across application dependencies
Structured logs with correlation fields; PII redaction verified
Golden metrics exported; SLI/SLOs defined in Coralogix
At least two symptom-based alerts with owner & runbook
Dashboard published in Coralogix; link in the Developer Portal Applications Catalog
Cost controls configured (sampling, drop filters, indexes) and reviewed

8. Common Pitfalls

Pitfall	Guidance
Business data vs. telemetry	Do NOT use observability tools (e.g., Coralogix metrics) to track revenue or business KPIs. Use Data systems (Snowflake, PowerBI). Use telemetry only for operational health.
High-cardinality labels	Avoid `user_id` or raw GUIDs in metric labels; use attributes in logs/traces instead.
PII/secrets in telemetry	Scrub sensitive data at source or via OTel Collector processors.
Missing trace/log correlation	Always include `trace_id` and `span_id` in logs when tracing is enabled.
Alert fatigue	Page on SLO symptoms, not low-level resource metrics.
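
Scrubbing at source can be sketched as a function applied to every event before it is logged or attached to a span. The key lists, the digit-run pattern, and the function names are hypothetical, and the truncated HMAC is only one possible hashing strategy. The approved redaction and hashing rules take precedence.

```python
import hashlib
import hmac
import re
from typing import Any, Dict

# Hypothetical key lists; real ones come from the Logging Guidelines.
SECRET_KEYS = {"password", "authorization", "api_key", "card_number"}
HASHED_KEYS = {"email", "customer_id"}
# Long digit runs that could be card numbers.
_PAN_LIKE = re.compile(r"\b\d{13,19}\b")


def _pseudonymise(value: str, key: bytes) -> str:
    """Keyed hash so values stay joinable without exposing the original."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def scrub(event: Dict[str, Any], key: bytes) -> Dict[str, Any]:
    """Return a copy of `event` with secrets removed and PII hashed."""
    clean: Dict[str, Any] = {}
    for name, value in event.items():
        lowered = name.lower()
        if lowered in SECRET_KEYS:
            clean[name] = "[REDACTED]"
        elif lowered in HASHED_KEYS:
            clean[name] = _pseudonymise(str(value), key)
        elif isinstance(value, dict):
            clean[name] = scrub(value, key)  # recurse into nested objects
        elif isinstance(value, str):
            clean[name] = _PAN_LIKE.sub("[REDACTED]", value)
        else:
            clean[name] = value
    return clean
```

The function returns a new dictionary and leaves the input untouched, so business logic can keep using the original values while only the scrubbed copy reaches telemetry.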

9. Budget & Cost Controls

Plan early. Observability cost scales with event volume, metric cardinality, and retention.

During Planning:

Set a monthly budget envelope for each domain/service and review in technical design
Choose default sampling for traces (e.g., 10% default, 100% for new workflows)
Apply dynamic log-level configuration (raise verbosity on demand)
Prefer metrics from logs for counters over retaining verbose logs
Prefer traces over logs for traceback exercises
Keep labels low-cardinality; explode dimensions only when they unlock clear value
Use tiered retention and indexing in Coralogix (index only what you search)
Track cost per request for observability as a KPI

Note: Budget is a first-class design concern – treat it like latency or availability. Engage IDP/Architecture during design review.
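
The default-sampling choice above can be sketched as deterministic head sampling keyed on the trace ID, so every service reaches the same keep-or-drop decision for a given trace. This is similar in spirit to ratio-based samplers in OpenTelemetry SDKs but is not their implementation. The function names and the idea of a "new workflows" set are assumptions for illustration.

```python
def keep_trace(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic sampling: the same trace ID always gets the same answer."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the trace ID as a uniform value in [0, 2**64).
    low_bits = int(trace_id_hex[-16:], 16)
    return low_bits < int(ratio * 2**64)


def sampling_ratio(
    workflow: str, new_workflows: frozenset, default: float = 0.10
) -> float:
    """Sample 100% of workflows flagged as new, otherwise the default ratio."""
    return 1.0 if workflow in new_workflows else default
```

Because the decision depends only on the trace ID and the ratio, lowering the ratio later (for example towards 1%) reduces volume without producing partially sampled traces, as long as all services apply the same ratio.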

10. Governance, Exceptions, and Compliance

This standard is strongly recommended by default
Exceptions allowed for justified cases (e.g., extreme latency constraints, embedded devices) via Technology Review Board (TRB)
Exception requests require: rationale, risk assessment, alternate controls, and a path to compliance
Non-compliance should be recorded as technical debt with agreed remediation date
Security & privacy policies apply: do not emit secrets or raw PII; use approved redaction and hashing strategies

11. Ownership

Role	Responsibility
Development teams	Own instrumentation, dashboards, and SLOs
Development teams / SysOps	Own first-line alert response
IDP team	Own Coralogix tenancy, agents/collectors, guardrails, accesses, cost governance, templates, and golden paths
Architecture	Own this Standard, related Guidelines and design review checklist