System scalability is a topic that tends to annoy systems administrators. Does anyone ever actually upgrade a running system in a production environment? Do companies seriously consider upgradability as anything more than an option on a checklist when the time comes to buy server and system hardware? What do systems administrators do when their servers aren't keeping up with the needs of their users?
Network applications can bog down in three ways. First, applications can become CPU-bound, meaning that the supply of available CPU time doesn't keep up with the application's demand. Second, applications can become I/O-bound, meaning that the load on the storage subsystems creates a continual bottleneck from which the systems never recover. Third, applications can become network-bound, meaning that network traffic prevents users from getting acceptable performance. Fortunately, tools are available to help network managers avoid these situations. For example, network monitoring applications can keep an eye on the health of the wire; capacity-planning tools can help you estimate how much capacity and processing power the hardware that runs your applications needs; and system management tools, both built-in and third-party, let you track your network servers' behavior.
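To make the three categories concrete, here is a minimal sketch of how monitoring samples might be sorted into them. The field names, the 85 percent saturation threshold, and the "80 percent of samples" rule are all assumptions chosen for illustration; real monitoring tools expose different counters and you would tune the cutoffs for your own servers.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One monitoring reading; each value is a fraction from 0.0 to 1.0 (hypothetical fields)."""
    cpu_util: float   # share of CPU time in use
    disk_util: float  # share of time the storage subsystem is busy
    net_util: float   # share of link bandwidth in use


def classify(samples, threshold=0.85, sustained=0.8):
    """Return the labels of resources that are saturated in a sustained way.

    A resource counts as a bottleneck when its utilization is at or above
    `threshold` in at least a `sustained` fraction of the samples.
    Both cutoffs are illustrative assumptions, not industry standards.
    """
    if not samples:
        return []
    fields = {
        "CPU-bound": "cpu_util",
        "I/O-bound": "disk_util",
        "network-bound": "net_util",
    }
    bottlenecks = []
    for label, field in fields.items():
        hot = sum(1 for s in samples if getattr(s, field) >= threshold)
        if hot / len(samples) >= sustained:
            bottlenecks.append(label)
    return bottlenecks


if __name__ == "__main__":
    readings = [Sample(0.95, 0.30, 0.20) for _ in range(10)]
    print(classify(readings))
```

Requiring the condition to hold across most samples, rather than in a single reading, is what separates a continual bottleneck from a momentary spike.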
Considerations Before You Upgrade
Some problems have seemingly obvious solutions. For example, are you running out of disk space? Storage is cheap, so you can just buy more. However, solutions are rarely that simple. If the disk space problem is a continual one, do you need to look at your policies for cleaning up wasted space on the servers? Are you prepared to fight with users who keep a couple of gigabytes' worth of files on your servers and have mailboxes that contain hundreds of megabytes of email and attached files? And even when adding hardware is the right answer, downtime is the primary concern in any upgrade. If your storage systems don't allow hot upgrades, what is the business cost every time you must bring a server down to add another disk?
I've worked with too many businesses where a daily fact of life was the broadcast message from the IT department that a specific server was going down "for a few minutes." When those minutes become hours, and when such messages start appearing every day, something is wrong. The problem might be the OS, physical hardware, or application choice, but good reasons rarely exist for disrupting business in such a way. "We're upgrading the servers" is not an excuse that any user wants to hear. If you're performing a major OS upgrade or setting up new line-of-business applications that require modifications to your servers, ideally, you've planned the rollout to cause minimal disruption to your users. But if your upgrades involve adding memory, drives, or processors, you're likely reacting to a perceived problem, not implementing a well-thought-out plan.
When I've needed to upgrade servers in corporate networks, I've been a big fan of using duplicate servers. I build at least one extra server and use this hot-spare system to duplicate any server on the network or to recreate a server with additional drives or memory. If I need to upgrade the OS, I set up the spare server with the appropriate applications, test it to make sure everything works, then sneak it onto the network at a time when the switch affects the fewest users. The additional cost for the hardware is usually offset by the reduced system downtime. If your problem is simply upgrading drives and memory, you can use the same duplicate-server technique to update your network servers. This technique might not work for every situation, but it will handle 90 percent of the network server upgrades you deal with.
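The trade-off between spare hardware and downtime can be put into rough numbers. The sketch below is a back-of-the-envelope estimate only; the function names and every figure in the example (outage counts, user counts, hourly cost, and the price of the spare) are hypothetical and should be replaced with your own data.

```python
def downtime_cost(outages_per_year, minutes_per_outage, users_affected, cost_per_user_hour):
    """Estimate the yearly cost of lost user time caused by server outages."""
    hours_down = outages_per_year * minutes_per_outage / 60
    return hours_down * users_affected * cost_per_user_hour


def spare_pays_for_itself(spare_cost, cost_without_spare, cost_with_spare):
    """True if the yearly downtime savings cover the price of the spare server."""
    return (cost_without_spare - cost_with_spare) >= spare_cost


if __name__ == "__main__":
    # Hypothetical figures: monthly 30-minute outages for 50 users at $40 per user-hour.
    before = downtime_cost(12, 30, 50, 40.0)
    # With a tested spare swapped in off-hours, assume 5-minute cutovers for 5 users.
    after = downtime_cost(12, 5, 5, 40.0)
    print(before, after, spare_pays_for_itself(8000.0, before, after))
```

The estimate counts only lost user time. Lost orders, missed deadlines, and administrator overtime would push the real cost of downtime higher.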
You'll notice that I haven't mentioned CPU upgrades. I don't favor buying SMP-capable boxes without a full complement of processors. Until the appearance of Profusion architecture machines, I don't think there was a compelling reason to buy SMP-capable computers. In addition, performing processor upgrades on SMP systems is a pain once the processors are no longer current. The difficulty is that upgrading a Pentium-based SMP machine requires you to provide another processor that falls within a narrow range of stepping and revision codes. Try telling a vendor that to add a second processor to your server, you need a Pentium Pro 200 Stepping 2, Revision B chip 18 months after that version of the chip has been superseded. If you're buying a 2-way-capable box, buy it with two processors; if you're buying a 4-way, buy it with four processors, and so on. Granted, some bus and performance concerns might make you want to buy an 8-way-capable Profusion system with fewer than the maximum complement of CPUs, but that is the only exception I'm willing to make.
However, the multiple-processor question might soon become moot. Microsoft is moving away from building bigger machines (despite the pending introduction of 16- and 32-way hardware running Windows 2000 Datacenter Server). From most Microsoft development groups, I hear, "Scale out, not up."
Scaling Out to the Rescue
Scaling out means growing an application's capacity by adding computers, not by moving to bigger servers. (For a discussion about scaling out versus scaling up, see Mark Smith, Editorial, "Windows 2000 Datacenter Server," August 2000.) The idea behind concepts such as Microsoft's Distributed interNet Applications (DNA) architecture and products such as Application Center is to allow the creation of distributed applications that administrators can easily deploy and upgrade across multiple systems. Scaling out doesn't rule out using huge SMP boxes. I'd still want all the horsepower I could get for a big SQL Server database-intensive application, but the ability to expand the capabilities of my applications by throwing more hardware at them in the form of complete servers has definite appeal.
Hardware vendors are champing at the bit for large distributed applications. Having provided rack-mounted systems in 1U (1.75") form factors for the growing ISP market, these vendors want to take the rack-mount systems into corporate America. Companies such as IBM already offer dual-processor Pentium III systems in a 1U form factor, as well as full-blown fault-tolerant systems with four CPUs and a raft of hot-swappable devices in form factors as small as 7U (12.25").
The software necessary to start building applications that scale out already exists. Some of you are probably scaling out but don't realize it. Do you have a load balancer sitting in front of your Web farm? If so, then you've started to scale out. From there, building clusters of clusters and load balancing across the clustered applications is only a short step. And as hardware prices continue to drop and OSs include more distributed-application features, the buy-in point for these technologies becomes lower. If Microsoft is able to deliver on its promises with Application Center, making applications work in this load-balanced, distributed environment will be possible for every competent systems administrator. Benefits exist for administrators who manage hardware, too. No more "special" server configurations—just get another standard model and plug it into the network. When was the last time anybody was concerned about ease of use for hardware managers?
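For readers who haven't looked inside a load balancer, the core idea of scaling out a Web farm can be sketched in a few lines. This is a toy round-robin dispatcher, not how any particular product works; the class name, method names, and server names are hypothetical, and a real load balancer would also handle health checks, session affinity, and weighting.

```python
from collections import Counter


class RoundRobinPool:
    """Hand out servers in rotation; capacity grows by adding another server."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._index = 0  # position of the next server to hand out

    def add_server(self, name):
        """Scale out: plug another standard server into the rotation."""
        self._servers.append(name)

    def pick(self):
        """Return the server that should handle the next request."""
        if not self._servers:
            raise RuntimeError("no servers in pool")
        server = self._servers[self._index]
        self._index = (self._index + 1) % len(self._servers)
        return server


if __name__ == "__main__":
    pool = RoundRobinPool(["web1", "web2"])
    print(Counter(pool.pick() for _ in range(6)))
    pool.add_server("web3")  # no special configuration, just another box
    print(Counter(pool.pick() for _ in range(6)))
```

Adding "web3" changes nothing for the callers: they keep asking the pool for a server, and each machine simply receives a smaller share of the requests.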