Post-mortem

Issue Summary

Timeline

Root Cause

Impacts

Remediation and Prevention

  • Add safeguards to disable features not yet in service.
  • Harden the GFE testing stack to reduce the risk of a latent bug in production binaries causing a task to restart.
  • Pursue additional isolation between shards of GFE pools to reduce the scope of failures.
  • Create a consolidated dashboard of all configuration changes for GFE pools, so that engineers can more quickly observe, correlate, and identify problematic changes to the system.
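
To make the last item more concrete, here is a minimal sketch of the kind of correlation such a dashboard might support: listing the configuration changes applied shortly before an incident began. Everything in it is an assumption for illustration; the `ConfigChange` record, the `changes_before` helper, the pool names, the timestamps, and the one-hour window are hypothetical and do not come from the incident report or from any real GFE tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ConfigChange:
    """A hypothetical record of one configuration change to a pool."""
    pool: str
    applied_at: datetime
    description: str


def changes_before(changes, incident_start, window=timedelta(hours=1)):
    """Return changes applied within `window` before the incident, newest first."""
    earliest = incident_start - window
    recent = [c for c in changes if earliest <= c.applied_at <= incident_start]
    return sorted(recent, key=lambda c: c.applied_at, reverse=True)


# Illustrative data only; none of these changes are from the real incident.
changes = [
    ConfigChange("pool-a", datetime(2024, 1, 1, 9, 0), "rotate certificates"),
    ConfigChange("pool-b", datetime(2024, 1, 1, 11, 40), "enable new feature flag"),
    ConfigChange("pool-a", datetime(2024, 1, 1, 11, 55), "update backend mapping"),
]
incident = datetime(2024, 1, 1, 12, 0)

suspects = changes_before(changes, incident)
for change in suspects:
    print(change.applied_at.isoformat(), change.pool, change.description)
```

Sorting newest first puts the changes closest to the incident at the top, which is usually where an on-call engineer would start looking. A real dashboard would pull these records from the systems that apply the changes rather than from a hard-coded list.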
