Explanation of this week’s outage
Posted by Staff
On Monday, August 9th, The City was intermittently unavailable for 2 hours and 41 minutes due to a cascading hardware failure of redundant systems. In addition, on Tuesday, August 10th, there was a delay in email delivery of up to 4 hours. While both of these issues were beyond our control, the buck stops here, and we’ve spent countless hours working with our hosting provider to get them resolved.
Our lead engineers have had extensive discussions with executives at our hosting provider to talk about what happened, why it happened, and what can be done to help prevent this kind of failure in the future.
The City is critical to the life and mission of your church, and diminished performance like this isn’t just an inconvenience for you; it’s unacceptable.
We’ve invested heavily in the stability, reliability, and responsiveness of The City. This is the first time since the inception of The City that we’ve had an outage of this significance, and we’re working hard to make sure it doesn’t happen again. We are very sorry that this happened at all.
Following is a summary of the post-mortem investigation.
Monday:
8:00AM – There was a hardware failure with one of the pair of load balancers, causing the unit to go offline. The secondary load balancer picked up the traffic with no noticeable effect. Hosting provider was notified of the failure and technicians began to prepare a new load balancer for operation.
8:20AM – The secondary load balancer had a software fault that resulted in degraded performance leading to complete failure. Steps were taken to try to track down where in the network the problem was originating.
8:45AM – Hosting provider’s technicians were able to identify the secondary load balancer as the source of the issue. Hosting provider then attempted several fixes, which left the site in an unstable state. The site was up and down until about 1:00PM.
1:00PM – Hosting provider deployed a debugged version of the software to the secondary load balancer, which successfully brought up the site and returned traffic to normal operation. At this time, the hosting provider decided not to bring up the replacement primary load balancer in an effort to avoid introducing more uncertainty into the situation.
Tuesday:
1:00PM – Our monitoring system alerted us that mail was not being delivered.
1:00–2:00PM – Troubleshooting with hosting provider to determine the cause of the problem.
2:00PM – Hosting provider reduced the number of delayed job workers from 8 to 1 because there was a problem with table locking. This was a temporary fix to get mail processing again.
2:30PM – Mail processing observed to be happening, but slowly.
3:00PM – Logging enabled to troubleshoot mail delivery speed issues.
3:30PM – Hosting provider decided it was better to let mail process slowly, rather than attempt further intervention and possibly break or lose mail.
Wednesday:
8:00AM – A fix was applied, but there was not enough mail volume to confirm complete resolution.
1:30PM – Daily Digest email volume picked up, and the mail situation was confirmed as resolved. The root cause was identified as a misconfiguration of mail server rate limiting in the new data center.
We are still monitoring mail for any alerts or errors.
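For readers curious why a misconfigured rate limit shows up as hours of delay rather than lost mail, here is a minimal Python sketch. The function name and every number in it are hypothetical, chosen only for illustration; they are not taken from our mail system or from the hosting provider's configuration.

```python
import math


def drain_time_hours(backlog, arrivals_per_hour, limit_per_hour):
    """Estimate the hours needed to clear a mail backlog under a send-rate limit.

    Hypothetical helper for illustration only. Returns math.inf when the
    limit does not exceed the arrival rate, because the queue never empties.
    """
    if limit_per_hour <= arrivals_per_hour:
        return math.inf
    # The queue shrinks only by the part of the limit not used by new mail.
    return backlog / (limit_per_hour - arrivals_per_hour)


# Made-up numbers: 8,000 queued messages, 1,000 new messages per hour,
# and a server allowed to send 3,000 messages per hour.
print(drain_time_hours(backlog=8000, arrivals_per_hour=1000, limit_per_hour=3000))  # 4.0
```

Under these assumed numbers the mail is still delivered, only late. If the limit were set at or below the arrival rate, the backlog would keep growing until the limit was corrected.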