Earlier this week we had a problem at one of our data centers. The way it was successfully handled makes for an interesting story that I’d like to share.
The outage
We use multiple data centers in different locations. One of them lost power at two of its core switches, which meant our servers in that data center were not connected to the internet for about an hour and 40 minutes. Now, this isn’t supposed to happen. The providers we use have all sorts of processes in place to try to ensure that they are online all the time. However, our experience over the past 10 years of doing this has shown that every now and then a series of unforeseen events happens and they do go offline. To deal with this unlikely eventuality, we have designed our hosting architecture to provide uninterrupted service even when these outages occur.
What would have happened in the past?
Depending on when in our 10-year history this happened, we could have had one of the following outcomes:
- Loss of service for all of our customers for the entire outage
- Loss of service for some of our customers during the outage
- Loss of service for a portion of our customers’ visitors for a part of the outage
In any case, it would have been all hands on deck while we tried to find a solution, and there would have been a lot of communication with our customers letting them know that there was a problem and what we were doing to fix it, and then afterwards, what we had done to reduce the possibility of it happening again. In short, there would have been a lot of stress all around.
So what was the impact?
We have over 100 sites whose search and navigation are partially hosted in the data center segment that had the outage.
- For the first 50 seconds, a third of the visitors searching on those sites experienced a single 30-second delay while their browser timed out and then tried our next data center. Thereafter they will have had service as normal.
- For the next 100 seconds, a third of the visitors to those sites experienced a single sub-second delay while their browser got a “connection refused” response and tried another data center. Thereafter they will have had service as normal.
- After that, the data center was automatically removed from the DNS and no further requests will have been sent to it (a sketch of this kind of automated DNS failover follows this list).
- Once our servers came back online, the data center was added back into the DNS and operations continued as normal.
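To make that failover a little more concrete, here is a minimal sketch of a health-check loop that removes an unresponsive data center from DNS and adds it back once it recovers. This is only an illustration of the general technique, not our actual system: the addresses, hostnames and the update_dns_record helper are hypothetical placeholders, and a real setup would call the DNS provider’s API and handle failures more carefully.

```python
import socket
import time

# Hypothetical data centers serving search traffic.
# Each entry maps the A record value we publish to the host we probe.
DATA_CENTERS = {
    "203.0.113.10": "dc1.example.com",  # example/documentation addresses only
    "203.0.113.20": "dc2.example.com",
}

CHECK_INTERVAL = 10            # seconds between health checks
FAILURES_BEFORE_REMOVAL = 3    # tolerate brief blips before pulling a record


def is_healthy(host, port=80, timeout=5):
    """Return True if we can open a TCP connection to the data center."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def update_dns_record(address, present):
    """Placeholder: add or remove the A record via the DNS provider's API."""
    print(("adding" if present else "removing") + f" {address} in DNS")


def monitor():
    failures = {addr: 0 for addr in DATA_CENTERS}
    in_dns = {addr: True for addr in DATA_CENTERS}
    while True:
        for addr, host in DATA_CENTERS.items():
            if is_healthy(host):
                failures[addr] = 0
                if not in_dns[addr]:  # data center has recovered
                    update_dns_record(addr, present=True)
                    in_dns[addr] = True
            else:
                failures[addr] += 1
                if in_dns[addr] and failures[addr] >= FAILURES_BEFORE_REMOVAL:
                    update_dns_record(addr, present=False)  # stop new lookups resolving here
                    in_dns[addr] = False
        time.sleep(CHECK_INTERVAL)


if __name__ == "__main__":
    monitor()
```

Even with automation like this, DNS removal isn’t instant: visitors who already have the old record cached keep using it until the TTL expires, which is why the browser-level retries described above still matter for the first minute or two of an outage.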
The data center has informed us that they have made some changes to their processes and hardware so that this type of fault won’t happen again. They are a little stronger now than they were before. However, I’m sure they will still suffer another outage at some stage.
In summary, during a 150-second period, a third of the visitors searching on the affected sites will have experienced a single delay, but after that delay they will have been able to use the services as normal. We didn’t hear anything from any of our customers and it was all handled without any stress. This is a significant improvement over what we’ve done in the past.
I wanted to highlight this story because it shows one of the unseen benefits of our service. Our less experienced and cheaper competitors don’t offer this sort of redundancy. Most website operators that host their own search are also unable to provide this level of redundancy. Our customers expect our services to be running all of the time. We are able to provide the level of service that we do because of our dedication to continually improving the way we do things, combined with the knowledge gained through years of experience.
3 thoughts on “The story of the outage that wasn't”
Could you expand on some of the technical details? What kind of architecture do you use to replicate data in multiple centers, and how frequently do you update? This seems particularly tough if your search is hooked into an inventory system or auction timer.
Great article, Shaun. This comes at a particularly good time for me as we are evaluating Software as a Service Ecommerce platforms, and one of the potential benefits I see is the lower cost compared with developing and maintaining this type of architecture ourselves. If we were to try to do this on our own for our own Ecommerce site, the costs would get very ridiculous.
Pingback: 100% Uptime – Guaranteed! - SLI Systems Blog