Millions of Skype users found themselves unable to connect to the VoIP network due to shortcomings in the company's P2P network management algorithm. The shortcomings effectively led to an unforeseen but preventable denial of service.
Not so coincidentally, the outage occurred just as Microsoft released its regularly scheduled round of monthly security updates. Skype initially suspected that one or more of the updates had caused the outage. However, after consulting with Microsoft engineers, Skype soon realized that the fault was entirely its own.
"The Microsoft team was fantastic to work with, and after going through the potential causes, it appeared clearer than ever to us that our software’s P2P network management algorithm was not tuned to take into account a combination of high load and supernode rebooting," wrote Villu Arak on the company's blog.
Skype "supernodes" are end-user computers that help route the VoIP traffic of other Skype users. Without enough supernodes available, the Skype network can be brought to a screeching halt. That's precisely what happened in this case. Because so many computers were being rebooted within roughly the same time window, many supernodes were unavailable.
"The high number of post-update reboots affected Skype’s network resources. This caused a flood of log-in requests, which, combined with the lack of peer-to-peer network resources at the time, prompted a chain reaction that had a critical impact. The self-healing mechanisms of the P2P network upon which Skype’s software runs have worked well in the past. Simply put, every single time Skype has needed to recover from reboots that naturally accompany a routine Windows Update, there hasn’t been a problem.
"Unfortunately, this time, for the first time, Skype was unable to rise to the challenge and the reasons for this were exceptional. In this instance, the day’s Skype traffic patterns, combined with the large number of reboots, revealed a previously unseen fault in the P2P network resource allocation algorithm Skype used. Consequently, the P2P network’s self-healing function didn’t work quickly enough," wrote Arak.
This raises the obvious question of why this type of incident hasn't happened before. Arak went on to explain that in previous instances where many supernodes were offline, Skype network usage was not high, so the problem never surfaced until now.
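To see why neither factor alone was enough, consider a deliberately simplified capacity check. This is only an illustrative sketch and not Skype's actual algorithm: the function name, the fixed per-supernode capacity, and every number below are assumptions made up for the example.

```python
# Toy model (hypothetical): the network copes only if the supernodes that
# are still online can carry both ongoing sessions and the log-in requests
# from clients coming back after a reboot.

def network_copes(supernodes_online, active_sessions, relogin_requests,
                  capacity_per_supernode=100):
    """Return True if total demand fits within online supernode capacity.

    capacity_per_supernode is an assumed figure chosen for illustration.
    """
    capacity = supernodes_online * capacity_per_supernode
    demand = active_sessions + relogin_requests
    return demand <= capacity


# Many reboots, but light usage: half of 1,000 assumed supernodes are offline.
reboots_only = network_copes(500, active_sessions=20_000, relogin_requests=20_000)

# Heavy usage, but few reboots: almost all supernodes stay online.
high_load_only = network_copes(950, active_sessions=80_000, relogin_requests=2_000)

# Heavy usage and many reboots at once: demand outstrips remaining capacity.
both_together = network_copes(500, active_sessions=80_000, relogin_requests=20_000)

print(reboots_only, high_load_only, both_together)
```

With these made-up figures, each factor on its own stays within capacity, while the two together do not, which mirrors the combination Arak described.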
Arak added that Skype has since adjusted the algorithm used to tune the network and to handle changes in core network size, so such an occurrence shouldn't happen again.