When you look at the technology involved in the new Pentium Pro and the complexity of the architecture, it's surprising that this chip runs as fast as it does. Yet, performance is not a problem--quite the opposite, actually. Using BAPCo benchmarks, the 150-MHz HP Vectra XU system outpaced nearly everything else in the lab, hands down. But what gives this CPU the edge?
The challenge Intel faced was to create a next-generation processor while maintaining backward compatibility and ease of replication; that is, Intel had to be able to build it on the same manufacturing lines that produced the 100-MHz P5s. (That process, 0.6µm Bipolar Complementary Metal Oxide Semiconductor, or BiCMOS, is typically implemented as dual NAND gates with coupled inputs and a bipolar junction transistor (BJT) driving the output. BiCMOS is an adaptation of older CMOS technologies, and in this implementation it improves performance by about 15% overall by reducing gate delays at greater fanouts. See Figure A.) So Intel had to turn to improvements in microarchitecture rather than changes in the production process.
The main improvement lies in what Intel calls Dynamic Execution, a code-scheduling technique that until now has been found only in RISC designs. This technique, combined with key changes in the processor bus, pipeline, and functional units, makes the Pentium Pro a formidable contender.
The Pentium Pro combines elements of both CISC and RISC architectures. The concepts of pipelining, small and fast functional units (such as instruction fetch, decode, etc.), and code scheduling come from the world of RISC. Other elements, such as large complex instructions composed of smaller micro-operations, come from CISC.
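The idea of a complex instruction being composed of smaller micro-operations can be illustrated with a toy decoder. This is only a sketch under stated assumptions: the instruction syntax, the `decode` function, the temporary names `tmp0` and `tmp1`, and the particular load/operate/store breakdown are all hypothetical and are not taken from Intel's actual decoder.

```python
def decode(instruction):
    """Split a toy 'op dest, src' instruction into micro-ops.

    Memory operands are written as [reg]. The breakdown is illustrative only.
    """
    op, operands = instruction.split(maxsplit=1)
    dest, src = [part.strip() for part in operands.split(",")]
    uops = []
    if src.startswith("["):
        # Memory source: load it into a temporary first.
        uops.append(f"load tmp0 <= mem{src}")
        src = "tmp0"
    if dest.startswith("["):
        # Memory destination: read, modify, then write back.
        uops.append(f"load tmp1 <= mem{dest}")
        uops.append(f"{op} tmp1 <= tmp1, {src}")
        uops.append(f"store mem{dest} <= tmp1")
    else:
        # Register destination: a single simple operation.
        uops.append(f"{op} {dest} <= {dest}, {src}")
    return uops


if __name__ == "__main__":
    for text in ("add eax, ebx", "add eax, [ebx]", "add [ebx], eax"):
        print(text, "->", decode(text))
```

In this toy model a register-to-register add stays a single micro-op, while an add to a memory location becomes three simple steps that small, fast functional units could handle separately.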
Code scheduling allows instructions to be executed out of program order. This greatly reduces delays caused by data dependencies within a certain range of code. For example, the fragment R1 <= mem[R0]; R2 <= R1+R2; R5 <= R5+1; R6 <= R6-R3 would be held up in a non-scheduled pipeline, because the second operation, which updates R2, has to wait for the load from memory into R1, and the CPU stalls in the meantime. By analyzing the program flow and executing these instructions out of order, the CPU can reorganize them: the memory access runs, the R2 operation is held in an Instruction Pool pending completion of the load, and the R5 and R6 operations proceed normally, as they have no dependency on the first two. Then, when all operations have completed, they are moved from the Instruction Pool to the Retire Unit, where they are put back into proper program order (see Figure B).
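To make the scheduling argument concrete, here is a minimal Python sketch that runs the four-instruction fragment through a toy machine twice: once strictly in order, and once picking any ready instruction from a pool. The 3-cycle load latency, the limit of one issued instruction per cycle, and the names `Instr`, `raw_dependencies`, and `simulate` are illustrative assumptions, not details of the Pentium Pro. Only read-after-write dependencies are modeled, which is sufficient for this fragment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    text: str
    dest: str
    srcs: tuple
    latency: int  # cycles until the result is available (assumed values)


PROGRAM = [
    Instr("R1 <= mem[R0]", "R1", ("R0",), 3),  # load; 3-cycle latency is an assumption
    Instr("R2 <= R1+R2", "R2", ("R1", "R2"), 1),
    Instr("R5 <= R5+1", "R5", ("R5",), 1),
    Instr("R6 <= R6-R3", "R6", ("R6", "R3"), 1),
]


def raw_dependencies(program):
    """For each instruction, the earlier instructions whose results it reads."""
    last_writer = {}
    deps = []
    for idx, ins in enumerate(program):
        deps.append({last_writer[r] for r in ins.srcs if r in last_writer})
        last_writer[ins.dest] = idx
    return deps


def simulate(program, out_of_order):
    """Issue at most one instruction per cycle.

    Returns (issue order, cycle at which the last result is available).
    """
    deps = raw_dependencies(program)
    waiting = list(range(len(program)))
    finish = {}  # instruction index -> cycle its result becomes available
    issued = []
    cycle = 0
    while waiting:
        # In order: only the oldest waiting instruction may issue.
        # Out of order: any waiting instruction may issue, oldest first.
        candidates = list(waiting) if out_of_order else waiting[:1]
        for idx in candidates:
            if all(d in finish and finish[d] <= cycle for d in deps[idx]):
                finish[idx] = cycle + program[idx].latency
                issued.append(idx)
                waiting.remove(idx)
                break
        cycle += 1
    return issued, max(finish.values())


if __name__ == "__main__":
    print("in order:    ", simulate(PROGRAM, out_of_order=False))
    print("out of order:", simulate(PROGRAM, out_of_order=True))
```

Under these assumptions the in-order run issues instructions 0, 1, 2, 3 and finishes at cycle 6, while the out-of-order run issues 0, 2, 3, 1 and finishes at cycle 4. Results would still be retired in program order.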
The other major part of Dynamic Execution is speculative execution, whereby the dataflow engine predicts program flow (branches) with 90% accuracy. For example, in a code loop controlled by a branch condition rather than a set number of cycles, the engine predicts whether the branch will fall through or loop back. It can then fetch additional instructions from the Instruction Pool and begin executing them based on that prediction, without having to wait for the loop to terminate.
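The article does not describe how the prediction is made, so the following is only a generic illustration: a textbook 2-bit saturating counter, swapped in here as a stand-in for whatever mechanism the Pentium Pro actually uses. The class and function names, the starting state, and the loop shape are all assumptions.

```python
class TwoBitPredictor:
    """Textbook 2-bit saturating counter.

    States 0 and 1 predict 'not taken'; states 2 and 3 predict 'taken'.
    """

    def __init__(self):
        self.state = 2  # start at 'weakly taken' (arbitrary choice)

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def loop_branch_outcomes(iterations, trips):
    """Outcomes of a loop-closing branch: it loops back (taken) on every
    iteration but the last, then falls through. Repeated `trips` times."""
    return ([True] * (iterations - 1) + [False]) * trips


def accuracy(outcomes):
    """Fraction of outcomes the predictor guesses correctly."""
    predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct / len(outcomes)


if __name__ == "__main__":
    print(accuracy(loop_branch_outcomes(iterations=10, trips=5)))
```

For a ten-iteration loop run five times, this toy predictor is right 45 times out of 50: it mispredicts only the final fall-through of each trip. That this works out to 90% is a property of the chosen loop length, not a reproduction of the accuracy figure quoted above.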
There are other significant changes in the Pentium Pro's microarchitecture that boost it far beyond the older Pentium:
- Moving from a 5-stage to a 12-stage pipeline with smaller units, thus reducing pipestage time by 33%
- Using non-blocking cache to minimize the impact of cache misses
- Using dual-ported registers for simultaneous loads/stores
- Using register renaming to increase the number of available runtime registers (a buffer in the CPU core) while maintaining compatibility with older Intel architectures possessing fewer named registers
- Adding a glueless multiprocessor capability, which allows up to four Pentium Pro CPUs on a single bus without significant overhead (by merely extending the CPU bus to include additional sockets) and supports more than four CPUs in a system
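Of the changes listed above, register renaming is the easiest to show in a few lines. The sketch below assumes a simple rename table that maps each named register to a slot in a larger internal buffer and gives every new result a fresh slot. The `rename` function, the `P0`..`P9` slot names, and the buffer size of ten are hypothetical; the sketch also never frees a slot, which a real design would have to do.

```python
def rename(program, arch_regs, num_physical):
    """Rename a list of (dest, srcs) instructions onto a larger register buffer.

    Every write gets a fresh physical slot, so later reuse of the same named
    register no longer conflicts with earlier instructions still in flight.
    """
    mapping = {reg: f"P{i}" for i, reg in enumerate(arch_regs)}
    free = [f"P{i}" for i in range(len(arch_regs), num_physical)]
    renamed = []
    for dest, srcs in program:
        # Read the current mappings of the sources before remapping dest.
        new_srcs = tuple(mapping[s] for s in srcs)
        if not free:
            raise RuntimeError("no free physical register; a real core would stall")
        mapping[dest] = free.pop(0)
        renamed.append((mapping[dest], new_srcs))
    return renamed


ARCH_REGS = ["EAX", "EBX", "ECX", "EDX", "ESI", "EDI"]

# EAX is written twice; without renaming, the second write would have to
# wait until the EDX instruction has read the first value of EAX.
PROGRAM = [
    ("EAX", ("EBX", "ECX")),
    ("EDX", ("EAX",)),
    ("EAX", ("ESI", "EDI")),
    ("EBX", ("EAX", "EDX")),
]

if __name__ == "__main__":
    for before, after in zip(PROGRAM, rename(PROGRAM, ARCH_REGS, num_physical=10)):
        print(before, "->", after)
```

After renaming, the two writes to EAX land in different slots (`P6` and `P8`), so the third instruction no longer has to wait for the second one to read the old value. The program still sees only the familiar named registers.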
4-layer metal BiCMOS, 5.5M transistors
Dual-cavity package (dual die); 691x691 mils
2.9V, 20W (peak)--14W (typical)
True 32-bit Intel architecture, 64-bit data path, superpipelined
Superscalar Level 2 and 3 (two instruction pipes, dispatch/retire three instructions per clock)