Description: One of our store servers failed, resulting in intermittent downtime and page loading issues for visitors. Normally, when a store server fails, our load balancers detect the failure and automatically take the server offline so that it no longer receives live traffic. We are investigating why, in this case, the load balancer continued to send live traffic to this store server throughout the day, even as it was failing. The downtime and page loading issues were a result of live traffic being sent to a failed server node.
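For readers who want a picture of the detection step described above, here is a minimal sketch of how a load balancer health check commonly works in general. It is not our actual configuration: the names `Backend`, `record_check`, `simulate`, and the `fail_threshold` of 3 are all hypothetical, and the sketch is not a statement about the cause of this incident, which is still under investigation.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """Hypothetical record of one server behind a load balancer."""
    name: str
    consecutive_failures: int = 0
    in_rotation: bool = True


def record_check(backend, healthy, fail_threshold=3):
    """Apply one health check result and return whether the server
    still receives live traffic afterwards."""
    if healthy:
        # Any passing check resets the failure streak.
        backend.consecutive_failures = 0
        backend.in_rotation = True
    else:
        backend.consecutive_failures += 1
        # The server is only removed after enough failures in a row.
        if backend.consecutive_failures >= fail_threshold:
            backend.in_rotation = False
    return backend.in_rotation


def simulate(results, fail_threshold=3):
    """Run a sequence of check results (True = passed) against a fresh
    backend and return its in-rotation state after each check."""
    backend = Backend("store-1")  # assumed server name, for illustration
    return [record_check(backend, ok, fail_threshold) for ok in results]
```

In this sketch, `simulate([False, False, False])` removes the server on the third failed check, while an alternating pattern such as `simulate([False, True, False, True, False])` never reaches the threshold and the server stays in rotation. That is one general way a consecutive-failure check can behave differently for steady and intermittent faults.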
Impact: Because the failure was intermittent, with some visitors able to load pages and others not, it is difficult to put an exact figure on the downtime as a number of hours. Sales today were lower than average across all stores as a result of this issue. Because sales patterns normally vary from day to day, it is difficult to know how much of the shortfall is due to the error. As an estimate, for informational purposes and to help with understanding the issue, we believe the impact was equivalent to a block of approximately 3 hours of downtime.
We assure you that this is a very unusual occurrence, and we take pride in our excellent average uptime over our history. We're doing everything we can to better understand the failures and to put improved detection and prevention measures in place.