The Secret of High-Performance CPUs: New Technologies
● Johan de Gelas (translated for Computer Newspaper by Shuangmu)
The Rise MP6 is the first x86 CPU with a true dual FPU pipeline. So is its performance better than the PII's? If you have read the Rise MP6 review, you know that the MP6's FPU performance is only on the level of the Pentium MMX, which uses a single FPU pipeline. So how will the K7 perform? Will its three FPU pipelines really improve performance when running real programs?
Obviously, designing a CPU that can handle a large number of instructions in each clock cycle is the best way to improve CPU performance. In the previous article we saw that program branches prevent many instructions from being issued to the execution units in the same clock cycle, and that the answer is to predict which way the branch will go. Besides program branches, however, there are three main reasons why an x86 CPU cannot issue a large number of instructions in every clock cycle:
· Instruction dependencies;
· The x86 architecture has only 8 general-purpose registers;
· The required execution unit is busy and not available.
Each is explained below.
First, instruction dependencies
A "dependency" means that an instruction needs the result of a previous instruction before it can be executed. Programmers use such constructs all the time, and CPU designers hate them. Look at the example below:
A = C * 2
B = A + 2
As long as the value of variable A is unknown, B = A + 2 cannot be computed. In other words, the CPU scheduler cannot issue instruction 2 to an execution unit until the result of instruction 1 has been written back. In the previous article we saw that program branches stall CPUs with long pipelines, and that the solution is branch prediction. This time the problem is caused by dependencies: the longer the pipeline, the longer an instruction takes to complete, and the longer a dependent instruction must wait for the result of the previous one, which again stalls the CPU. The more complete solution is out-of-order execution combined with larger buffers.
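To make the stall concrete, here is a minimal sketch of the situation above. The 3-cycle result latency is a hypothetical figure chosen for illustration, not a property of any particular CPU:

    # Minimal sketch of the stall caused by a read-after-write dependency.
    # Assumption (hypothetical): a result is available 3 clock cycles after
    # its instruction issues.
    RESULT_LATENCY = 3

    issue_1 = 1                              # instruction 1: A = C * 2
    ready_A = issue_1 + RESULT_LATENCY       # cycle in which A is written back

    # Instruction 2: B = A + 2 cannot issue before A exists, even if an
    # execution unit is sitting idle.
    issue_2 = max(issue_1 + 1, ready_A)

    print(f"instruction 1 issues in cycle {issue_1}, A is ready in cycle {ready_A}")
    print(f"instruction 2 cannot issue before cycle {issue_2}")

However the details of forwarding are arranged, instruction 2 spends several cycles doing nothing but waiting.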
Second, the 8 general-purpose registers of the x86 architecture
Some people criticize the fact that Intel's twelve-year-old x86 instruction set architecture has only 8 general-purpose registers. They use this fact to argue that x86 CPUs (the PII, the K7, and so on) are a bunch of "dinosaurs" that can no longer compete with RISC CPUs (such as the Motorola G3 and G4 or Sun's UltraSPARC II).
Indeed, only 8 general-purpose registers does sound ridiculous. Everyone knows that registers are used to hold operand values temporarily. If the CPU wants to add two values, it needs two registers to hold the operands and one register to hold the result; the CPU cannot operate directly on values in main memory, and a superscalar CPU wants to perform many such calculations in parallel. Eight registers cannot meet that demand: to issue 3 instructions in the same clock cycle you may need 3 output registers and 6 input registers, 9 in total. What can be done?
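The arithmetic, assuming the worst case of two source operands and one destination per instruction, looks like this:

    # Worst-case register demand of one 3-wide issue group, assuming each
    # instruction reads 2 source registers and writes 1 destination register.
    instructions = 3
    sources, destinations = 2, 1

    needed = instructions * (sources + destinations)
    print(needed)   # 9 distinct registers, one more than the 8 that x86 exposes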
Intel's and NexGen's clever engineers found a way to break through this restriction: register renaming. With register renaming, hidden registers inside the CPU become usable. The PII has 40 such registers and the K6 has 48, while the classic Pentium has none. We can safely assume that the K7 and Intel's P7 will also have a large number of them to raise the average number of instructions executed per clock cycle.
Register renaming is now firmly established in superscalar CPUs. The renaming happens during decoding: the decoder translates the "visible" x86 register names into "hidden" internal register names, essentially remapping the x86 registers onto those hidden registers through a table. So although from the outside a modern x86 CPU still has only 8 general-purpose registers, the number of registers actually in use is much larger. The renaming and retirement logic looks very complicated, but it is what keeps x86 CPUs competitive.
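The following sketch illustrates the idea of such a remapping table. The names and the pool size (40 physical registers, loosely inspired by the PII figure quoted above) are assumptions for illustration only; real rename hardware also has to free physical registers at retirement, which is not modelled here. Note how two writes to the same architectural register land in different physical registers, so the false dependency between them disappears:

    # Toy register-rename table: 8 architectural registers are remapped onto
    # a larger pool of hidden physical registers (40 here, as an assumption).
    class RenameTable:
        def __init__(self, num_physical=40):
            self.free = list(range(8, num_physical))   # physical regs not yet in use
            self.map = {f"R{i}": i for i in range(8)}  # architectural -> physical

        def rename_source(self, reg):
            # A source operand reads whatever physical register currently
            # holds the architectural value.
            return self.map[reg]

        def rename_dest(self, reg):
            # A destination gets a *fresh* physical register, so an earlier
            # reader of the old value is not disturbed (no WAR/WAW hazard).
            phys = self.free.pop(0)
            self.map[reg] = phys
            return phys

    rt = RenameTable()
    print(rt.rename_dest("R1"))    # first write to R1 -> physical register 8
    print(rt.rename_source("R1"))  # a later read of R1 sees physical register 8
    print(rt.rename_dest("R1"))    # a second write to R1 -> physical register 9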
Third, register renaming and out-of-order execution by the numbers
To illustrate the performance gained from register renaming and out-of-order execution, we simulate the execution of 8 arithmetic instructions on a superscalar CPU. Assume it can decode 2 instructions per clock cycle, and that the result of a calculation is retired three clock cycles after the instruction is issued. In the table below, readers only need to look at the "issue" and "retire" columns.
1. In-order execution
(1) In the first clock cycle, two instructions are issued; they do not depend on each other, so their results are retired three clock cycles later (in the 4th clock cycle);
(2) In the second clock cycle we run into an instruction dependency for the first time: instruction 3 needs the result of instruction 2, so instruction 3 cannot be issued;
(3) With in-order execution, instructions 4, 5 and 6 cannot be issued before instruction 3. Instruction 3 can only be issued in the fifth clock cycle, when the result of instruction 2 becomes available;
(4) In the sixth clock cycle there is a bigger problem: we want to write a result to register R1, but that would corrupt the result of instruction 5. So instruction 6 can only be issued once R1 is free again, in the 10th clock cycle.
As the table shows, even though this CPU is superscalar and can decode 2 instructions per clock cycle, its performance is very poor: the effective IPC (instructions executed per clock cycle) is only 0.53.
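A minimal simulation of this in-order case is sketched below. The 2-wide decode and 3-cycle result latency match the assumptions above, but the 8-instruction stream is a made-up one (the article's exact table is not reproduced), so the resulting IPC differs from the 0.53 figure; the point is only to show how one stalled instruction blocks everything behind it:

    # Toy in-order, 2-wide issue model (hypothetical instruction stream).
    ISSUE_WIDTH, LATENCY = 2, 3

    # Each entry: (destination register, source registers it depends on).
    program = [
        ("R1", []),        # 1
        ("R2", []),        # 2
        ("R3", ["R2"]),    # 3: needs the result of instruction 2
        ("R4", ["R3"]),    # 4: needs the result of instruction 3
        ("R5", []),        # 5: independent
        ("R6", []),        # 6: independent
        ("R7", []),        # 7: independent
        ("R8", []),        # 8: independent
    ]

    ready_at = {}          # register -> cycle in which its value is available
    issue_cycle = []
    cycle, next_instr = 1, 0
    while next_instr < len(program):
        issued = 0
        while issued < ISSUE_WIDTH and next_instr < len(program):
            dest, srcs = program[next_instr]
            if all(ready_at.get(s, 0) <= cycle for s in srcs):
                issue_cycle.append(cycle)
                ready_at[dest] = cycle + LATENCY
                next_instr += 1
                issued += 1
            else:
                break      # in-order: a stalled instruction blocks everything behind it
        cycle += 1

    print("issue cycles:", issue_cycle)                        # [1, 1, 4, 7, 7, 8, 8, 9]
    print("IPC ~", round(len(program) / max(issue_cycle), 2))  # ~0.89 for this stream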
2. In-order execution with register renaming
If a register that the program needs is still in use, we can simply put the data into one of the "hidden" registers instead. Because register R1 is renamed in the sixth clock cycle, instructions 6 and 8 no longer hold up the CPU. The result is a 50% increase in IPC; register renaming really does compensate for a shortcoming of the x86 architecture.
3. Out-of-order execution with register renaming
With in-order execution, the pipeline stalls as soon as we hit an instruction dependency. With out-of-order execution we can skip ahead to the next non-dependent instruction and issue it. That way the execution units stay busy and wasted time is kept to a minimum. Out-of-order execution allows us to issue instructions 4 through 8 before instruction 3, and their results can be retired immediately after instruction 3 retires (in-order retirement is required for an x86 CPU). The effective IPC rises by another 25%. However, the PII and K6 gain only limited benefit from out-of-order execution, because when the CPU hits an instruction dependency it must still find enough non-dependent instructions to issue.
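As a contrast, here is the same hypothetical stream from the in-order sketch above, now issued out of order. Again the numbers illustrate the mechanism rather than the article's exact 25% figure:

    # Same toy stream, issued out of order: in each cycle up to 2 instructions
    # whose source values are ready may issue, regardless of program order.
    # (A real x86 CPU still retires results in order; retirement is not
    # modelled in this sketch.)
    ISSUE_WIDTH, LATENCY = 2, 3

    program = [
        ("R1", []), ("R2", []), ("R3", ["R2"]), ("R4", ["R3"]),
        ("R5", []), ("R6", []), ("R7", []), ("R8", []),
    ]

    ready_at = {}
    issue_cycle = {}       # instruction index -> cycle in which it issued
    cycle = 1
    while len(issue_cycle) < len(program):
        issued = 0
        for i, (dest, srcs) in enumerate(program):
            if i in issue_cycle or issued == ISSUE_WIDTH:
                continue
            if all(ready_at.get(s, 0) <= cycle for s in srcs):
                issue_cycle[i] = cycle
                ready_at[dest] = cycle + LATENCY
                issued += 1
        cycle += 1

    print("issue cycles:", [issue_cycle[i] for i in range(len(program))])
    print("IPC ~", round(len(program) / max(issue_cycle.values()), 2))

The independent instructions no longer wait behind the dependency chain, so everything is issued in 7 cycles instead of 9 and the IPC of this toy stream rises from about 0.89 to about 1.14.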
The WinChip's performance shows that an in-order CPU with a large level-one cache can compete with out-of-order CPUs: clock for clock, an in-order CPU with a large cache performs better than an out-of-order CPU with only a small cache, even though so much cache does not come cheap.
Rise's engineers made a mistake here: the MP6's level-one cache is only 16KB, so its hit rate is no better than that of other CPUs, which makes it difficult to "feed" its three pipelines. That is unfortunate, because an in-order CPU is not very complex, so a bigger cache would have been achievable. If the Rise CPU had a larger level-one cache and a higher clock frequency, it would be a fierce opponent for the out-of-order K6-2, with better floating-point performance (dual FPU pipelines) and a lower cost as well. The MP6-II, which integrates 256KB of level-two cache, may correct this mistake, provided it reaches a satisfactory clock frequency. The K7, for its part, uses large buffers, so it can find enough non-dependent instructions to issue in time. A large level-one cache, large buffers and out-of-order execution make it much easier to "feed" the K7's FPU pipelines than the two pipelines of the Rise MP6, and therefore make them more efficient.
Fourth, busy execution units
Why do modern CPUs have more execution units than decoders? Look at real program code and you will notice that at any given moment one particular instruction type dominates, be it x87 floating-point, 3DNow! or integer instructions. Each execution unit handles one specific instruction type. The CPU designer must therefore provide enough execution units of each type to absorb whatever the decoders deliver.
The K7's three decoders are general-purpose; in other words, most of the time they can output 3 instructions. So if all three are integer instructions, the K7 has to provide 3 integer execution units; if they are MMX, 3DNow! or x87 instructions, 3 FPU pipelines are needed to match the decoders. Do you see? To fully satisfy its decoders, the K7 has to provide 3 execution units for each type of instruction. Of course, the K7 can only issue and execute 3 instructions per clock cycle under ideal conditions; we already know that instruction dependencies and program branches make it hard for a CPU to do that much work in parallel.
Note that the K7's arrangement is also more sensible than the K6-III's: the K6-III has one floating-point unit and two 3DNow!/MMX units, but it cannot issue an x87 instruction and a 3DNow! instruction at the same time, because the two types share the same registers. The K7's multimedia unit consists of 3DNow!, x87 and MMX units.
Fifth, summary
The main reason the old x86 architecture can still compete with RISC today is that modern x86 CPUs have advanced RISC-style cores, register renaming, out-of-order execution and other technologies, and thereby break through the limitations of the x86 architecture.
The K7 is not revolutionary; it looks more like the most refined of the evolutionary, out-of-order x86 CPUs. Intel's Merced uses a completely different, new design. Will that CPU crush its opponents?