Jim Allchin Addresses Windows NT's Reliability

Comdex '99: Microsoft Senior Vice President Jim Allchin, who's leading the charge for Windows 2000 (Win2K), President Steve Ballmer, and Vice President Deborah Willingham met with the press at Comdex on Monday. Allchin used the opportunity to announce that he thinks he's fixed Windows NT 4.0's reliability problems. "You're looking at the complaints department for Windows," Allchin said as he stepped to the podium. He immediately launched into a narrative about how his team figured out why NT servers went down and how Microsoft has tried to fix the problems in Win2K Server.
Microsoft started a reliability initiative 2 years ago and has since put about 500 person-years of work into it. According to Allchin, the company was fascinated that some NT users found NT robust and reliable while others complained about its reliability. "So we grabbed all the logs off of their machines, about 5000 servers we did this, had terabytes of information, and we could measure exactly when the machine was up, when it was down, and then we could start to figure out why it went down," said Allchin.
The result? Allchin's team discovered that 65 percent of reboots were planned. The planned reboots included forced reboots from application installs, hardware changes, and OS configuration changes (45 percent) and hygienic reboots (another 20 percent), which users performed because they believed that regular rebooting somehow cleaned the system. This type of rebooting, Allchin implied, was unnecessary. About 35 percent of the reboots were unplanned, which breaks down as follows: 14 percent were core OS failures (e.g., blue screens), 13 percent were hardware failures (e.g., memory failures), and most of the rest were device-driver problems, an overwhelming majority of the unplanned reboots. A significant part of these device-driver failures were specifically related to antivirus software.
Allchin detailed his plan of attack to address users' reliability concerns. "The first thing we did is we purchased, for tens of millions of dollars, a company that has one of the most advanced tools that we've ever seen for being able to analyze source code and look for problems in the source code before you actually go down the testing process," he said. "This code is able to find and initialize variables and find problems in memory leakages. We've run this over NT sources, and fixed literally thousands of problems that this has been able to discover." Allchin's team stress-tested every build of the OS, simulating months of run time each night. The company also hired security experts to attack the system in every way and assigned a full-time team to do a complete code review.
Allchin's team then tried to work with third-party independent software vendors (ISVs) to address driver-based failures. Microsoft created a driver-verifier testing tool that third-party ISVs can use to make sure their drivers integrate properly with Windows. Microsoft also made it impossible for third-party vendors to modify core OS files during installs. In addition, the team created a kill-tree process, which can kill an entire group of processes and get past a crashing application without necessarily rebooting the server.
Then the team worked to reduce planned reboots. Microsoft added features such as Service Pack slipstreaming, which reduces or eliminates system downtime when adding fixes. The team also tried to make it as easy as possible to add components without rebooting.
Finally, the team published best practices. "We had all these customers that were having such a great experience, and we talked to some others who were saying, 'oh, well, we're having to reboot.' And we get into it and operational practices make a big difference. If you treat it like a mission-critical environment, you had a better experience with the system," Allchin said.
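To make the kill-tree idea concrete, here is a minimal sketch in Python of how one might work out which processes belong to a tree and in what order to end them. It is an illustration only, not Microsoft's implementation: the function name `kill_order`, the `parent_of` table, and the process IDs are all hypothetical, and the sketch stops at computing the order rather than terminating anything.

```python
from collections import defaultdict


def kill_order(parent_of, root):
    """Return root and all its descendants, with every child listed before its parent.

    parent_of is a hypothetical snapshot mapping each process ID to its parent's ID.
    """
    # Invert the child -> parent table into parent -> [children].
    children = defaultdict(list)
    for pid, ppid in parent_of.items():
        children[ppid].append(pid)

    order = []
    seen = set()
    stack = [root]
    while stack:
        pid = stack.pop()
        if pid in seen:
            # Guard against a malformed snapshot that contains a cycle.
            continue
        seen.add(pid)
        order.append(pid)
        stack.extend(children.get(pid, []))

    # The walk above visits each parent before its children; reversing it
    # yields an order in which children come first.
    order.reverse()
    return order


# Made-up example: process 100 is the crashing application, 200 is unrelated.
snapshot = {100: 1, 101: 100, 102: 100, 103: 101, 200: 1}
doomed = kill_order(snapshot, 100)
```

In this example `doomed` holds processes 100 through 103 and leaves the unrelated process 200 alone. Ending children before parents is just one possible design choice; a real tool would also have to cope with processes that start or exit while the tree is being walked.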
In addition to trying to minimize reboots in Win2K, Microsoft has added fault-tolerant features, such as load balancing and rolling upgrades, in which you can take down one server in a cluster for an upgrade while service continues without interruption. You can read the full transcript of Allchin's speech online or watch the press conference online.
