Partial Platform unavailability
Incident Report for DoiT International
Postmortem

Post Mortem: System Outage for DoiT Console App
Date: April 24, 2023

Incident Summary: On April 24, 2023, the DoiT Console App returned intermittent 404 errors and NO_BACKEND_SELECTED statuses. This significantly disrupted the app's functionality and degraded the user experience.

Timeline (Eastern Time):

  • April 24, 2023, 11:23 AM: Initial report of the issue: users are getting intermittent 404 errors when accessing the DoiT Console
  • April 24, 2023, 1:30 PM: We identify Google App Engine as the source of the outage
  • April 24, 2023, 1:42 PM: We report the issue to GCP as a P1 case
  • April 24, 2023, 2:34 PM: Google Cloud Support starts investigating the issue
  • April 24, 2023, 3:06 PM: Evgeny Zislis provides an attachment with additional details
  • April 24, 2023, 5:28 PM: Marcos R, the Escalation Manager, gets involved in the case
  • April 24, 2023, 5:46 PM: Marcos R reports that Google's SRE team is investigating
  • April 24, 2023, 7:07 PM: Alex N from Google Cloud Support Americas identifies a change to Google Front End (GFE) servers as the cause of the issue
  • April 24, 2023, 8:56 PM: Alex N confirms the issue is mitigated after rolling back the change

Root Cause: The root cause of the issue was a change rolled out to a subset of Google Front End (GFE) servers. With this change, a GFE could occasionally select an App Engine target that did not contain the correct service, resulting in 400-series errors (observed by users as 404s) and the NO_BACKEND_SELECTED status.

Resolution and Recovery: Google's product engineers rolled back the change on the affected GFEs, which resolved the issue. The rollback was confirmed complete on April 24, 2023, at 8:56 PM Eastern Time.
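
Until the rollback took effect, the suggested workaround was to refresh or retry. For scripted API clients, that workaround might look something like the sketch below. It is only an illustration: `fetch_with_retry` and its parameters are hypothetical names, not part of any DoiT or Google API, and retrying on 404 only makes sense when the requested resource is known to exist.

```python
import time
from typing import Callable, Tuple


def fetch_with_retry(
    fetch: Callable[[], Tuple[int, bytes]],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    """Call fetch() until it returns a status other than 404 or attempts run out.

    `fetch` is any caller-supplied function returning (status_code, body).
    Hypothetical helper for illustration only.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = fetch()
        if status != 404:
            return status, body
        if attempt < max_attempts - 1:
            # Exponential backoff between attempts: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
    # Every attempt returned 404; hand the last response back to the caller.
    return status, body
```

The `sleep` parameter is injectable so the backoff schedule can be checked without actually waiting.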

Going forward, our team will continue to monitor the app's performance and promptly report any anomalies to ensure the best possible user experience for the DoiT Console App.

Posted Apr 25, 2023 - 04:44 UTC

Resolved
This incident has been resolved.
Posted Apr 25, 2023 - 04:29 UTC
Identified
The issue has been identified and is being fixed by the cloud provider.
Posted Apr 24, 2023 - 23:00 UTC
Update
We are continuing to investigate this issue.
Posted Apr 24, 2023 - 16:34 UTC
Investigating
We are investigating intermittent 404 statuses. As a temporary workaround, refresh the page or retry the request.
Posted Apr 24, 2023 - 16:32 UTC
This incident affected: Core Platform (Cloud Cost Analytics, APIs, Dashboards, DoiT Console).