A project post-mortem, also called a project retrospective, is a process for evaluating how successfully (or unsuccessfully) a project met its goals.
Any software program can be prone to failure due to a wide range of possible factors: bugs, heavy traffic, security issues, technical problems, human error, and so on.
A postmortem is an important step in the lifecycle of an always-on service, especially in the tech industry. The findings from your postmortem should feed right back into your planning process. This ensures that the critical remediation work identified in the postmortem finds a place in upcoming work and is balanced against other priorities.
Issue Summary
On Tuesday, 17 July 2018, from 12:17 to 12:49 PDT, customers using Google Cloud App Engine, Google HTTP(S) Load Balancer, or TCP/SSL Proxy Load Balancers experienced elevated error rates. Customers observed errors consisting of either 502 return codes or connection resets.
Timeline
12:19 : Automated monitoring alerted Google’s engineering team to the event (see the monitoring sketch after this timeline).
12:44 : The team had identified the probable root cause and deployed a fix.
12:49 : The fix became effective and the rate of 502s rapidly returned to a normal level.
12:55 : Services experienced degraded latency for several minutes longer as traffic returned and caches warmed; serving then fully recovered.
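The automated check at 12:19 can be pictured with a minimal sketch in Python. This is not Google’s actual monitoring stack, and the window size and alert threshold below are assumptions; the principle is simply to track the share of 502 responses over a recent window and page when it crosses a threshold.

```python
from collections import deque

# Hypothetical sliding-window monitor: alert when the share of 502
# responses over the last N requests exceeds a threshold.
WINDOW_SIZE = 1000       # number of recent requests to consider (assumed)
ERROR_THRESHOLD = 0.05   # alert above 5% 502s (assumed)

class ErrorRateMonitor:
    def __init__(self, window_size=WINDOW_SIZE, threshold=ERROR_THRESHOLD):
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, status_code):
        """Record one response and return True if the alert should fire."""
        self.window.append(status_code == 502)
        if len(self.window) < self.window.maxlen:
            return False  # not enough data yet
        error_rate = sum(self.window) / len(self.window)
        return error_rate > self.threshold

# Example: feed observed status codes into the monitor.
monitor = ErrorRateMonitor()
for code in [200] * 950 + [502] * 60:
    if monitor.record(code):
        print("ALERT: elevated 502 rate, paging on-call engineers")
        break
```

In production such checks run on distributed metrics pipelines rather than a single process, but the alerting logic follows the same idea.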
Root Cause
Google’s Global Load Balancers are based on an architecture that allows clients to have low-latency connections anywhere in the world, while taking advantage of Google’s global network to serve requests to backends regardless of the region they are located in. The Google Front End (GFE) development team was in the process of adding features to the GFE to improve security and performance. One of the features contained a bug which would cause the GFE to restart; this bug had not been detected in either testing or initial rollout. While some requests were correctly answered, other requests were interrupted (leading to connection resets) or denied due to a temporary lack of capacity while the GFEs were coming back online.
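This failure mode can be pictured with a greatly simplified, hypothetical model (not the real GFE implementation): a request that lands on a restarting frontend is dropped, and when no instance in the pool is serving, requests are rejected with 502s until capacity returns.

```python
import random

# Greatly simplified, hypothetical model of a frontend pool during restarts.
class Frontend:
    def __init__(self, name):
        self.name = name
        self.restarting = False

def route(pool):
    healthy = [fe for fe in pool if not fe.restarting]
    if not healthy:
        return 502                      # temporary lack of capacity
    frontend = random.choice(pool)      # a request may still land on a dying instance
    if frontend.restarting:
        return "connection reset"       # the in-flight request is interrupted
    return 200

pool = [Frontend(f"gfe-{i}") for i in range(4)]
pool[0].restarting = True               # the buggy feature crashes one instance
print([route(pool) for _ in range(8)])  # mix of 200s with occasional resets
```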
Impacts
Cloud CDN cache hits dropped 70% due to decreased references to Cloud CDN URLs from services behind Cloud HTTP(S) Load Balancers and an inability to validate stale cache entries or insert new content on cache misses. Services running on Google Kubernetes Engine and using the Ingress resource would have served 502 return codes as mentioned above. Google Cloud Storage traffic served via Cloud Load Balancers was also impacted.
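Why the cache hit rate fell can be seen in a minimal sketch (hypothetical, not Cloud CDN’s actual logic): a cache miss has to reach the origin through the load balancer, and while that path returns 502s the cache cannot insert new content (revalidating stale entries fails the same way), so the hit rate drops.

```python
CACHE = {}  # url -> content cached from the origin

def fetch_from_origin(url, origin_healthy):
    """Stand-in for a request through the HTTP(S) load balancer to the origin."""
    if not origin_healthy:
        return None, 502
    return f"content for {url}", 200

def serve(url, origin_healthy):
    if url in CACHE:
        return CACHE[url], "cache hit"
    # Cache miss: the content must be fetched from the origin to be inserted.
    content, status = fetch_from_origin(url, origin_healthy)
    if status != 200:
        return None, f"error {status}"  # nothing inserted -> no future hits
    CACHE[url] = content
    return content, "filled from origin"

print(serve("/app.js", origin_healthy=False))  # (None, 'error 502') during the incident
print(serve("/app.js", origin_healthy=True))   # filled once the origin path recovers
print(serve("/app.js", origin_healthy=True))   # subsequent requests are cache hits
```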
Other Google Cloud Platform services were not impacted. For example, applications and services that use direct VM access, or Network Load Balancing, were not affected.
Remediation and Prevention
Google engineers were alerted to the issue within 3 minutes and immediately began investigating. At 12:44 PDT, the team identified the root cause and promptly reverted the configuration change, and the affected GFEs ceased their restarts. As all GFEs returned to service, traffic resumed its normal levels and behavior.
In addition to fixing the underlying cause, Google engineers implemented changes to both prevent and reduce the impact of this type of failure in several ways, by:
- Adding additional safeguards to disable features that are not yet in service (a minimal feature-flag sketch follows this list).
- Further hardening the GFE testing stack to reduce the risk of a latent bug in production binaries causing a task to restart.
- Pursuing additional isolation between different shards of GFE pools in order to reduce the scope of failures.
- Creating a consolidated dashboard of all configuration changes for GFE pools, allowing engineers to more easily and quickly observe, correlate, and identify problematic changes to the system.
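As a concrete illustration of the first item, here is a minimal feature-flag sketch; the flag name and in-process registry are hypothetical and not GFE’s actual mechanism. Code for a feature that is not yet in service stays behind a flag that defaults to off, so shipping the binary does not expose the new, possibly buggy path.

```python
# Hypothetical in-process feature-flag registry; flags default to off.
FEATURE_FLAGS = {
    "experimental_request_rewrite": False,  # assumed flag name, not a real GFE flag
}

def feature_enabled(name: str) -> bool:
    # Unknown flags are treated as disabled.
    return FEATURE_FLAGS.get(name, False)

def handle_request(path: str) -> str:
    if feature_enabled("experimental_request_rewrite"):
        # New, not-yet-launched code path: unreachable until the flag is flipped,
        # so a latent bug here cannot restart the serving process in production.
        return f"rewritten:{path}"
    # Existing, well-tested default path.
    return f"served:{path}"

print(handle_request("/index.html"))  # -> "served:/index.html" while the flag is off
```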