Monitoring tools without a defined observability model produce dashboards that are visually complete and operationally useless. When every metric is visible, but none are prioritized, on-call engineers face the same cognitive load during an incident as during routine operations. Everything looks potentially relevant; nothing is clearly critical. For a platform handling live hospital workflows, that ambiguity translates directly into slower incident response and higher operational risk. The first work we did was not configuration - it was definition.
- Observability layer separation and SLI definition: We structured the platform's observability into three distinct layers: infrastructure health covering CPU, memory, pod restarts, and cluster-level indicators; application-level metrics covering request rates, error rates, and latency distributions per service; and service-level objectives tied directly to business and clinical impact thresholds. Each layer had defined ownership and defined response expectations. We worked with the CTO and platform team to identify the indicators that actually reflected user-facing reliability, not everything Prometheus could scrape, but the specific signals whose degradation meant a clinical workflow was affected. This exercise surfaced several metrics the team had been ignoring that turned out to be leading indicators of the incidents they were most frequently investigating.
- Alert rationalization and threshold definition: With SLIs defined, we reviewed the existing alert configuration end-to-end and removed triggers that fired on metric anomalies with no direct path to user impact. The remaining alert set was smaller, more specific, and tied to defined SLO thresholds rather than arbitrary static values. We pushed back on the instinct to keep broad alerting as a safety net; alert fatigue in an on-call rotation is not a minor inconvenience; it is a patient safety risk in a clinical environment where a delayed response to a real incident could affect care delivery. The team needed to trust that a firing alert meant something, which meant accepting that some anomalies would not alert at all.

