To increase the system speed
When we look at the overall system, it is apparent that not all things have progressed at the same rate. The largest bottleneck is the memory. We already accept a slower clock speed for the memory, but the microprocessor still spends a lot of its time humming a tune and bending paper clips while it waits for information to arrive. During the life of the microprocessor, the clock speed has increased from 0.1 MHz to about 3 GHz, an increase of about 30 000 times. During this time, DRAM memories have grown much bigger but have become only about 2000 times faster.
Modern microprocessors have about 128 kbyte of on-board RAM, called a cache. When the microprocessor has to go to the external memory for information, it saves a copy of the address and the information in case they are needed again. It also saves the address and information from the next memory location. The reasoning is that since nearly all languages are procedural, the next location is the one most likely to be accessed next; if not, the program may jump back to a previous address to repeat part of a program, as in a counting loop used to produce a delay. When the microprocessor next requires access to the memory, it first checks the high-speed cache to see if the information is stored there. If it is, we have scored a 'hit' and the system has gained speed; if it is not, we have a 'miss' and the main memory is used. The new information is then stored in the cache for later.
This cache is sometimes called a level 1 cache, or L1 cache. This implies that there may be a level 2 cache – and there is. The L2 cache is usually 256 kbyte.
When data is needed, the microprocessor checks the level 1 cache first, then level 2 and, lastly, the main memory.
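To make the lookup order concrete, here is a minimal sketch in C of a two-level cache sitting in front of a main memory. The sizes, the direct-mapped organization and all the names (read_word, line_t and so on) are illustrative assumptions, not the design of any real processor, which would work in multi-byte lines and would also fetch the neighbouring location as described above.

#include <stdbool.h>
#include <stdint.h>

/* A minimal direct-mapped cache model: sizes and names are
   illustrative assumptions, not a real processor's design. */
#define L1_LINES 1024      /* pretend L1 holds 1024 words */
#define L2_LINES 4096      /* pretend L2 holds 4096 words */
#define MEM_WORDS 65536

typedef struct {
    bool     valid;
    uint32_t address;      /* full address stored as the tag */
    uint32_t data;
} line_t;

static line_t   l1[L1_LINES];
static line_t   l2[L2_LINES];
static uint32_t main_memory[MEM_WORDS];

/* Check L1 first, then L2, and only then the (slow) main
   memory, copying the result into both caches on a miss. */
uint32_t read_word(uint32_t addr)
{
    line_t *a = &l1[addr % L1_LINES];
    if (a->valid && a->address == addr)
        return a->data;                 /* L1 hit */

    line_t *b = &l2[addr % L2_LINES];
    uint32_t data;
    if (b->valid && b->address == addr)
        data = b->data;                 /* L2 hit */
    else
        data = main_memory[addr];       /* miss: go to main memory */

    /* Store a copy for next time. */
    *b = (line_t){ true, addr, data };
    *a = (line_t){ true, addr, data };
    return data;
}

On a hit in either cache the slow trip to main memory is avoided; on a miss, both caches are updated so that the next access to the same address is fast.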
Pipelining
Putting too much reliance on the clock frequency is like saying that the maximum rpm of the engine determines the maximum speed of a vehicle. True, up to a point, but other things like gearbox ratios are also significant. Doing 9000 rpm in first gear will not break any speed records. The real speed of a microprocessor also depends on how much useful work is done during each clock cycle. This is where pipelining is really helpful, and it is now incorporated in all microprocessors.
Let’s assume we have some numbers to move from the memory to the arithmetic and logic unit (ALU):
Clock pulse 1: a number is moved from a memory location to the accumulator.
Clock pulse 2: the number is moved from the accumulator to the ALU.
If another number is to be loaded, the whole process has to be repeated, so loading two numbers takes four clock pulses, three numbers take six clock pulses, and so on.
During the first clock pulse, a number is being moved along the bus between the memory and the accumulator and so the other part of the bus between the accumulator and the ALU is not used. During the second pulse, we still have one section of the bus idle (Figure 11.1).
Figure 11.1 One clock pulse moves one number
Pipelining is the process of making better use of the buses. While one number is shifted from the memory to the accumulator, we can use the same clock pulse to shift another number from the accumulator into the ALU along the other section of the bus. In this way, we get more action for each clock pulse and so the microprocessor completes instructions faster without an increase in the clock speed (Figure 11.2).
Figure 11.2 One clock pulse moves two numbers
Getting two jobs done on the same clock cycle is a significant improvement in speed without any increase in the clock rate, and getting three pieces of information moving, or jobs done, at once is better still. Incidentally, the Pentium manages five, the Pentium Pro can manage 12 and the Pentium 4 can keep up to 126 instructions 'in flight'. Unfortunately, we can never get pipelining to work this well on all instructions, but every little helps.
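Counting clock pulses shows the size of the gain. The short C sketch below compares the unpipelined scheme of Figure 11.1 (two pulses per number) with the pipelined one of Figure 11.2, in which a new number enters the pipeline on every pulse once it is full. It is a back-of-envelope model of those two figures only, not of any real device.

#include <stdio.h>

/* Clock pulses needed to move n numbers from memory into the ALU,
   assuming the two bus sections of Figures 11.1 and 11.2. */
static int unpipelined(int n) { return 2 * n; }   /* 2 pulses each      */
static int pipelined(int n)   { return n + 1; }   /* overlap both moves */

int main(void)
{
    for (int n = 1; n <= 5; n++)
        printf("%d numbers: %d pulses unpipelined, %d pipelined\n",
               n, unpipelined(n), pipelined(n));
    return 0;
}

The pipelined count is n + 1 because the first number still needs two pulses, but every later number finishes just one pulse after its predecessor. The same argument for a pipeline with k stages gives k + (n - 1) pulses for n items instead of k x n.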
If we wished to AND two binary numbers, we could do it by using a logic gate as we saw in Chapter 5 or we could use a microprocessor executing an instruction code. Now, comparing middle-of-the-range devices, the logic gate would complete the task in 8 ns but a comparable microprocessor (80386, 25 MHz) would take a minimum of 80 ns.
This type of comparison established the belief that, given a choice, hardware is always faster than software. In the above case, it is 10 times faster.
Given the job of carrying out a hundred such instructions, we had a choice:
Software method = 100 operations × 80 ns = 8000 ns (8 µs)
Hardware method = 1 operation at, say, 240 ns + 100 hardware operations × 8 ns = 1040 ns
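The same comparison can be run as a little C program, using only the figures quoted above (the 240 ns set-up cost is the text's 'say' figure). It also shows where the break-even point lies.

#include <stdio.h>

/* Timings from the comparison above: 80 ns per software AND,
   8 ns per gate operation, plus a one-off 240 ns to set the
   hardware going. All figures are the text's illustrative ones. */
#define SOFT_NS   80.0
#define GATE_NS    8.0
#define SETUP_NS 240.0

int main(void)
{
    for (int n = 1; n <= 100; n *= 10) {
        double soft = n * SOFT_NS;
        double hard = SETUP_NS + n * GATE_NS;
        printf("%3d ANDs: software %6.0f ns, hardware %6.0f ns\n",
               n, soft, hard);
    }
    return 0;
}

With these figures the hardware route wins as soon as more than three such operations are needed; for a one-off AND, the set-up cost makes the software route quicker.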
This philosophy was followed throughout the development of 4- and 8-bit microprocessors. It gave rise to more complex hardware and a steady increase in the size of the instruction set, from a little under 50 instructions for the 4004 up to nearly 250 in the case of the Pentium Pro.
In the mid 1980s, the hardware-for-speed approach began to be questioned. The ever-increasing number and complexity of the operation codes was reversed in some designs. These microprocessors were called RISC (Reduced Instruction Set Computers) and the 'old-fashioned' designs were dubbed CISC (Complex Instruction Set Computers). History has not proved as black and white as this suggests; it is much more a matter of shades of grey, with new designs being neither wholly CISC nor RISC. Predominantly CISC microprocessors outnumber RISC designs by a wide margin, at least 60:1. This does not imply that they are better, simply that they have a greater proportion of the market, and as we know, there is a lot more to market dominance than having the best product. Sadly. CISC designs include all the 8-bit microprocessors, the Pentium and Pentium Pro and all of the 68000 family, whereas RISC includes the Digital Alphas and the IBM/Motorola PowerPCs.
RISC versus CISC
Both RISC and CISC microprocessors employ all the go-faster techniques such as pipelining, superscalar structures and caches. In a superscalar architecture, two ALUs share the processing, rather like having two microprocessors. So, what are the real differences?
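As a toy illustration of the superscalar idea, this C sketch hands two independent jobs to two ALUs on each 'clock cycle', so a queue of work drains twice as fast as it would with one ALU. The queue, the alu function and the two-at-a-time rule are all invented for illustration; a real superscalar design must also check that paired instructions do not depend on one another.

#include <stdio.h>

/* A toy superscalar dispatcher: two ALUs each take one job per
   cycle, so the queue drains twice as fast as with one ALU. */
typedef struct { int a, b; } job_t;

static int alu(job_t j) { return j.a + j.b; }  /* stand-in for any ALU op */

int main(void)
{
    job_t queue[] = { {1,2}, {3,4}, {5,6}, {7,8}, {9,10}, {11,12} };
    int n = sizeof queue / sizeof queue[0];

    for (int i = 0, cycle = 1; i < n; i += 2, cycle++) {
        int r0 = alu(queue[i]);                        /* ALU 0 */
        int r1 = (i + 1 < n) ? alu(queue[i + 1]) : 0;  /* ALU 1 */
        printf("cycle %d: ALU0 -> %d, ALU1 -> %d\n", cycle, r0, r1);
    }
    return 0;
}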
By analysing the code actually produced by compilers, we find that a small number of different instructions account for a very large proportion of the object code produced. Most popular are the instructions that deal with data being moved around.
At this point a curious switch of design occurred. You will remember that the 'normal' or CISC microprocessor included a microprogram in its instruction decoder or control unit. This microprogram was responsible for the internal steps necessary to carry out the instructions in the instruction code. So the microprocessor that we have been praising for its use of hardware to gain speed is actually being run internally by software.
The RISC approach was to reduce the number of instructions available but keep them simple and execute them fast: the number of instructions was reduced to under a hundred. In CISC designs, by contrast, the instruction code could easily be enhanced by adding some extras to the microprogram, so it was tempting to do so, and no pruning of previous instructions was possible owing to the need to maintain compatibility with previous versions.
Following the cries of 'hardware is faster than software', it seemed a logical step to do away with the microprogram and replace it with hardware that could carry out the simple steps necessary. This hardware was made simpler by keeping all the instructions the same length so that pipelining was easier to organize. The only disadvantage of these constant-length instructions is that they all have to be as long as the longest, and so the total program length is increased.
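A small sketch of why constant-length instructions help: if every instruction is, say, 32 bits wide (a common RISC choice, assumed here), the processor always knows where the next instruction starts and can fetch it before the current one is even decoded, which is exactly what a pipeline needs. With variable-length instructions it cannot find the next one until it has at least partly decoded the current one. The opcode and register field layout below is invented purely for illustration.

#include <stdint.h>
#include <stdio.h>

/* With fixed 32-bit instructions, fetch is simply a stride through
   memory: the next instruction always starts one word further on,
   so the pipeline can fetch it before the current one is decoded.
   The opcode/register field layout is invented for illustration. */
static void decode(uint32_t instr)
{
    uint32_t opcode = instr >> 26;           /* top 6 bits  */
    uint32_t rd     = (instr >> 21) & 0x1f;  /* next 5 bits */
    printf("opcode %2u, dest register %2u\n", opcode, rd);
}

int main(void)
{
    uint32_t program[] = { 0x04200000u, 0x08400000u, 0x0c600000u };
    int n = sizeof program / sizeof program[0];

    for (int pc = 0; pc < n; pc++)  /* every instruction is one word */
        decode(program[pc]);
    return 0;
}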