On the one hand, I was a bit surprised to read Holzle admit a preference for human involvement in remediating certain types of outages. It seems Google has automated controls to handle single-system and small-scale failures (a whole equipment rack, etc.) within a single data center, but larger-scale outages (such as a whole data center) are mitigated manually. On the other hand, I can understand the desire for explicit human oversight; large-scale failures tend to have unpredictable cascading effects that are hard to account for with automated systems. In fact, Holzle mentions that the very systems designed to handle larger-scale outages tend to be the same systems responsible for the outages; the February Gmail outage is an explicit example.
Having said that, I still believe clouds need to be as self-sustaining as possible (preferably in an automated fashion) despite the current limits of technology and engineering. That implied resiliency and redundancy is one of the core value propositions of clouds. Without it, a cloud is devalued into not much more than a traditional managed hosting platform/service.
My suggestion to organizations considering a purchase from a cloud-based vendor is to have a due-diligence discovery session regarding the cloud architecture that vendor has engineered. You want to see something akin to a fairly detailed Visio diagram depicting multiple datacenter/regional locations, with the systems in each location and the roles they provide to the overall cloud platform and offered services. You want to see redundancy in systems and roles, with distribution and backups across disparate regions. Look for single points of failure: point to any given node and ask how the cloud as a whole handles the loss of that node and avoids a service interruption. In particular, point to an entire datacenter/region and ask what happens if that entire area of the cloud system goes down due to a large-scale power outage, natural disaster, etc. Also be wary of how cloud entrance points could be affected during an outage. It is great if the overall service will naturally shift processing to a secondary region when the entire first region experiences an outage, but it is not so great if you have to explicitly go back and reconfigure your organization to point to the secondary region. Even though the processing survived, your point of entrance changed, and thus you still had an outage. You definitely want to ensure all points of presence/entrance are fault tolerant.
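To make the entrance-point question a bit more concrete, here is a minimal Python sketch of the failover logic you would otherwise end up owning yourself if the vendor's points of entrance are not fault tolerant. Everything in it is an assumption for illustration: the hostnames, the `pick_entry_point` and `simulate_outage` helpers, and the idea of an ordered list of regions are all hypothetical, and a real vendor would more likely handle this for you behind a single address.

```python
from typing import Callable, Iterable, Optional, Set

# Hypothetical regional entry points, listed in order of preference.
REGIONS = ["us-east.example.com", "eu-west.example.com"]


def pick_entry_point(
    endpoints: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint whose health probe succeeds, or None."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A probe that raises (timeout, refused connection, etc.)
            # is treated the same as an unhealthy endpoint.
            continue
    return None


def simulate_outage(down: Set[str]) -> Callable[[str], bool]:
    """Build a fake health probe in which every endpoint in `down` is unreachable."""
    def probe(endpoint: str) -> bool:
        if endpoint in down:
            raise ConnectionError(f"{endpoint} is unreachable")
        return True
    return probe


# The whole primary region is down: traffic should land on the secondary.
print(pick_entry_point(REGIONS, simulate_outage({"us-east.example.com"})))
```

If your organization would have to run something like this by hand (or edit a configuration file) when a region disappears, then the entrance point is not really fault tolerant, regardless of how well the processing behind it fails over. In the sketch, a real health probe would replace `simulate_outage`; what counts as "healthy" is a question for the vendor.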
Overall, going with a cloud vendor still requires you to think about redundancy, fault tolerance, business continuity, etc. It might also require a little extra effort up front to ferret out the appropriate details from your cloud vendor during the initial evaluation. Fortunately, once you are satisfied with the resiliency offered by the vendor's cloud architecture, you can focus your attention elsewhere; the vendor is the one that has to deal with the implementation and ongoing maintenance of the necessary architectural redundancy. And that is one headache that is definitely nice to outsource.