Microsoft, although it once dominated the IT world, now struggles to compete with cloud computing giants such as Google, Amazon, Rackspace and VMware. Its services, from Hotmail to the recently released cloud service SkyDrive, have suffered interruptions, and that makes us think twice before moving to the Microsoft cloud. In cloud computing, any provider may experience an outage at any time; what matters is how it recovers and how it builds its future systems to keep outages to a minimum. This article looks at Microsoft’s cloud outages and how they were handled and recovered from.
Microsoft had an outage of about three hours on September 8, 2011 that affected Windows Live services such as Hotmail and SkyDrive. The affected users did not lose any data during the outage. The investigation team found that the interruption was due to an issue in the Domain Name Service (DNS). An update to the tool that balances network traffic did not work properly, which caused the issue. The problem itself did not last long, although it took some time for the fix to propagate to all users around the world. Arthur De Haan, VP of Windows Live Test and Service Engineering, expressed regret for the inconvenience caused by this outage.
Before the recent one, Microsoft’s Hotmail had an outage on December 31, 2010, which www.infoworld.com listed as the fourth worst in its top ten cloud outages. A number of users found that their emails and folders were missing. Although Microsoft monitors the health of email accounts through automated systems, an error occurred in a script, which removed a small number of real users’ accounts along with the test accounts. As a result, the 17355 affected users found that their emails and folders had gone missing, although only the inbox location had been removed from the directory servers. When these users logged in, they saw a new inbox without their old messages and with a welcome note for signing up for a Hotmail account.
The support teams tried to trace the cause but could not, and on December 31st they raised a ticket with the wrong team, which added extra time to the resolution. The ticket was then prioritised to the correct team on January 1st, which led to the root cause being found. Work to restore the users’ old emails and merge them with their new emails continued until January 2nd, when it was completed for 16035 users, and went on until January 5th for the remaining 1320 affected users. Mike Schakwitz from Microsoft apologised to Hotmail users and assured them that 100% of the data had been recovered.
The outage on Microsoft Sidekick lasted 6 days, the longest for Microsoft, and left users unable to access the calendar, address book and other services. Microsoft Azure also had an outage, along with the Sidekick, on March 13, 2009, which left users without access to their apps.
The lesson from these outages is that users need to analyse cloud providers’ disaster recovery plans and procedures and their average downtime. Almost all the cloud leaders experience outages, so we should look at how they handle an issue and how soon they recover from it. Most of the outages Microsoft experienced were minor, such as email issues and loss of access to calendars, address books and the cloud storage service. The most recent outage on Hotmail and SkyDrive shows that outages cannot be eliminated for as long as cloud computing exists. Cloud providers have to improve their recovery plans and focus on how fast they can resolve an issue, and Microsoft has clearly improved when we compare the recent outage with the previous ones.
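To make the "average downtime" comparison concrete, here is a minimal Python sketch that turns an outage duration into an availability figure. The outage durations come from this article; the 365-day measurement window and the `availability` helper are assumptions made for illustration, not how any provider officially reports uptime.

```python
from datetime import timedelta


def availability(downtime: timedelta, period: timedelta) -> float:
    """Return the fraction of `period` during which the service was up."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    if downtime < timedelta(0) or downtime > period:
        raise ValueError("downtime must be between zero and the period")
    # Dividing one timedelta by another yields a plain float ratio.
    return 1.0 - downtime / period


# Durations as reported in the article; the one-year window is an assumption.
outages = {
    "Windows Live (September 8, 2011)": timedelta(hours=3),
    "Sidekick": timedelta(days=6),
}
window = timedelta(days=365)

for name, downtime in outages.items():
    print(f"{name}: {availability(downtime, window):.4%} available over the window")
```

A figure like this only compares total time lost. It says nothing about data loss or how long full recovery took, which is why the recovery process matters as much as the raw downtime.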