The beginning of July was a big week for data center outages with some notable Internet impact. I've written before about cloud resiliency and uptime in my 'Changing a tire while going 60mph' and 'Big, white puffy clouds can still evaporate' posts.
Amongst those outages was a Google App Engine interruption, which goes onto the tally sheet as yet another failure of a supposedly infallible cloud platform service. Ted Dziuba has some interesting viewpoints on the App Engine outage, and brings up a key element that makes all the difference with these types of outages: transparency. If something fails, people like to know why it happened (especially in light of all the resiliency/redundancy claims of the cloud) and what is being done so it doesn't happen again. Google's transparency amounted to a basic statement that something bad happened with GFS, which affected BigTable, which in turn affected App Engine's DataStore, and that they are going to fix it. Now, compare that level of detail to the information offered by Amazon after their S3 outages in July 2008 and February 2008. There is a notable difference. So a message to SaaS vendors: just because customers don't need to know the platform and implementation details behind your SaaS doesn't mean they don't want to know...especially in light of a catastrophic failure. Telling customers what went wrong lets them carry out their own post-mortem and due-diligence analysis to ensure they haven't shackled themselves to a SaaS vendor whose outages stem from silly oversights that are likely to repeat as life goes on.
Actually, transparency and efficient communication during the outage event were also key issues for some of the other outages last week. Jeremy Irish blogged about the fire at the Fisher Plaza data center that left his systems offline for 29 hours. Apparently news of what exactly was happening was scarce, even for the first responders who were physically present at the building, and little effort was made to keep folks informed. Eventually they found out that it was a fire in a basement-level electrical room, which triggered the sprinklers and in turn ruined additional electrical equipment and generators, ultimately causing significant damage to the power infrastructure.
Overall, Murphy and his laws are a fact of life, and vendors (cloud, SaaS, or other) must do what is in their power to have contingency plans if they truly want to maintain availability and business continuity. And SaaS vendors, take note: just because you can hide the implementation and platform details of your service doesn't mean you should hide the operational details of your service...especially during a service disruption.
Until next time,