Control Plane Instability Incident - June 26, 2025
Executive Summary
On June 26, 2025, from 05:58 UTC to 20:38 UTC (14 hours, 40 minutes), the control plane was stuck in a crash loop. While it was crash-looping, no new or updated entities could be reconciled, which affected some customers attempting to make configuration changes. Because of monitoring gaps, the failure went undetected until 20:11 UTC, more than 14 hours after it began. The root cause was a bug in the control plane's configuration handling: processing a specific sandbox configuration led to a nil pointer dereference and a panic. Once the issue was detected, it was resolved by deploying a targeted hotfix at 20:34 UTC.
Timeline of Events (UTC)
- 05:58: A specific sandbox configuration was applied, triggering a panic in the control plane, which then entered a crash loop. The failure was not detected at this point.
- 20:11: The instability was detected, 14 hours and 13 minutes after onset, and an investigation was formally initiated.
- 20:26: The on-call team identified the crash loop and correlated it with the problematic configuration from earlier in the day.
- 20:34: A hotfix that handles the configuration gracefully was deployed, and the control plane stabilized.
- 20:38: The system was confirmed to be fully operational. The incident was declared resolved.
Root Cause
The direct cause was a nil pointer dereference within the control plane's reconciliation loop, triggered by a specific, unhandled sandbox configuration. The service did not validate this edge case, so the dereference caused a panic. The service's automated restart policy then restarted the process after each panic, and the failure recurred, producing a persistent crash loop. A significant contributing factor was a gap in monitoring and prober coverage: neither detected this class of failure, which stretched time-to-detection to more than 14 hours.
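The failure mode described above can be illustrated with a minimal Go sketch. It is not the control plane's actual code: the Entity and SandboxConfig types, the validate, reconcileOne, and reconcileAll functions, and the apply callback are all hypothetical names chosen for illustration. The sketch assumes the panic came from dereferencing an optional sandbox section, and shows two generic defenses: rejecting the input before it is dereferenced, and converting a panic on one entity into an error so the loop can continue.

```go
package main

import (
	"errors"
	"fmt"
)

// SandboxConfig and Entity are hypothetical types used only for illustration.
type SandboxConfig struct {
	Runtime string
}

type Entity struct {
	Name    string
	Sandbox *SandboxConfig // nil when the sandbox section is absent
}

// errInvalidSandbox marks configurations that are rejected up front.
var errInvalidSandbox = errors.New("invalid sandbox configuration")

// validate rejects entities the reconciler cannot handle, so later code
// never dereferences a nil Sandbox pointer.
func validate(e *Entity) error {
	if e == nil {
		return errors.New("nil entity")
	}
	if e.Sandbox == nil {
		return fmt.Errorf("%w: entity %q has no sandbox section", errInvalidSandbox, e.Name)
	}
	return nil
}

// reconcileOne validates the entity, then runs apply. A panic inside apply
// is converted into an error instead of crashing the process.
func reconcileOne(e *Entity, apply func(*Entity) error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered panic during reconcile: %v", r)
		}
	}()
	if verr := validate(e); verr != nil {
		return verr
	}
	return apply(e)
}

// reconcileAll keeps going after a failure and reports failures by entity name.
func reconcileAll(entities []*Entity, apply func(*Entity) error) map[string]error {
	failed := make(map[string]error)
	for _, e := range entities {
		if err := reconcileOne(e, apply); err != nil {
			name := "<nil>"
			if e != nil {
				name = e.Name
			}
			failed[name] = err
		}
	}
	return failed
}

func main() {
	entities := []*Entity{
		{Name: "ok", Sandbox: &SandboxConfig{Runtime: "default"}},
		{Name: "missing-sandbox"}, // the kind of input that would otherwise panic
	}
	apply := func(e *Entity) error {
		_ = e.Sandbox.Runtime // safe only because validate ran first
		return nil
	}
	for name, err := range reconcileAll(entities, apply) {
		fmt.Printf("skipped %s: %v\n", name, err)
	}
}
```

Recovering from a panic per entity is a design choice with trade-offs: it keeps one bad input from blocking all reconciliation, but recovered panics should still be logged and alerted on so they are not silently absorbed.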
Resolution
- Detection: The issue was detected at 20:11 UTC, more than 14 hours after the initial failure, through a combination of cascading alerts and manual investigation. The primary automated probers did not exercise the failing code path, so they did not report it.
- Mitigation: The immediate fix was a hotfix, deployed at 20:34 UTC. The patch modified the control plane logic to parse the new configuration type correctly, preventing the panic.
Corrective and Preventative Actions
- Enhance Prober Coverage: Update health and liveness probers to exercise edge-case configurations, so that reconciliation failures are detected sooner.
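
One possible shape for such a prober is sketched below in Go. It is an illustration under stated assumptions, not the existing prober: the ProbeCase type, the Reconciler interface, the runProbes and probeFake functions, the configuration keys, and the fakeControlPlane stand-in are all hypothetical. The idea is to submit a table of edge-case configurations end to end, each with its own deadline, and to treat a reconcile that does not complete in time as a failure, since a process that is restarting repeatedly may still pass a simple liveness check.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// ProbeCase is one synthetic configuration submitted by the prober.
// The shape is hypothetical.
type ProbeCase struct {
	Name   string
	Config map[string]string
}

// Reconciler is a hypothetical client interface for the control plane.
type Reconciler interface {
	Apply(ctx context.Context, c ProbeCase) error
}

// runProbes submits every case with its own deadline and returns the names
// of the cases that failed or did not complete in time.
func runProbes(ctx context.Context, r Reconciler, cases []ProbeCase, timeout time.Duration) []string {
	var failed []string
	for _, c := range cases {
		cctx, cancel := context.WithTimeout(ctx, timeout)
		err := r.Apply(cctx, c)
		cancel()
		if err != nil {
			failed = append(failed, c.Name)
		}
	}
	return failed
}

// fakeControlPlane stands in for the real service. When stalled is true it
// never completes a reconcile, which is how a crash loop can look to a client.
type fakeControlPlane struct{ stalled bool }

func (f fakeControlPlane) Apply(ctx context.Context, c ProbeCase) error {
	if !f.stalled {
		return nil
	}
	<-ctx.Done() // block until the probe deadline expires
	return ctx.Err()
}

// probeFake runs a small edge-case table against the fake and returns
// the number of failed cases.
func probeFake(stalled bool) int {
	cases := []ProbeCase{
		{Name: "baseline", Config: map[string]string{"sandbox.runtime": "default"}},
		{Name: "empty-sandbox", Config: map[string]string{"sandbox.runtime": ""}},
		{Name: "no-sandbox", Config: map[string]string{}},
	}
	failed := runProbes(context.Background(), fakeControlPlane{stalled: stalled}, cases, 20*time.Millisecond)
	return len(failed)
}

func main() {
	fmt.Println("failures when healthy:", probeFake(false))
	fmt.Println("failures when stalled:", probeFake(true))
}
```

In a real deployment the failure count would feed an alert rather than standard output, and the case table would grow each time an incident reveals a configuration shape the probers had not covered.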