Pentium IV processor architecture analysis
Intel released the Pentium 4 1.4 GHz and Pentium 4 1.5 GHz processors on the morning of the 20th, US Pacific time (evening in Beijing). By the time Intel released the Pentium 4, we had already completed comprehensive testing of it. The release of the Pentium 4 at 1.4 and 1.5 GHz shows that Intel is starting to win back lost ground and to reclaim the speed crown from AMD. Looking back, before AMD released the Athlon in August 1999, Intel's Pentium II and Celeron processors dominated the high-performance desktop market; Intel enjoyed great success there and captured a large part of the processor market. Over the past six months, however, Intel's processor roadmap has stumbled repeatedly, and people cannot help wondering whether the Intel era is coming to an end. Despite the mistakes Intel has made in that half year, it still has many loyal supporters who hope it will launch a new processor to save face. Intel, of course, is well aware of its own situation, and it has pinned all its effort and hope on its main product for the next five years: the Pentium 4. Let us take a comprehensive look at the Pentium 4 processor.

P4 processor design architecture

1. Ultra-long pipeline

Intel introduces a genuinely new IA-32 x86 architecture in the P4, called the NetBurst microarchitecture. Intel's PIII is a superscalar design with a 12-stage pipeline, while the P4 is a superscalar design with a 20-stage pipeline that incorporates multiple pipelines operating in parallel. AMD's Athlon has a 10-stage integer pipeline and a 15-stage floating-point pipeline. The bug in the previously recalled 1.13 GHz PIII certainly stemmed from a conflict between its design architecture and the manufacturing process; AMD, having already moved to a copper process, has not run into similar problems while raising its clock speeds.
Intel therefore had no choice but to redesign the microarchitecture for the P4 while keeping the manufacturing process unchanged. The core of today's P4 integrates an 8 KB L1 data cache, a 12K micro-op trace cache (often written as 12 KB), and a 256 KB L2 cache, all running at the processor's clock speed. The speed limit of the 0.18 micron process used for the PIII is around 1.1-1.2 GHz. The P4 starts at 1.4 GHz, a speed that the PIII, built on the same process but a different architecture, cannot reach. The 1.4 GHz and 1.5 GHz P4 processors Intel released today still use the 0.18 micron process with aluminum interconnects.

So how did Intel use the process that topped out with the 1.13 GHz PIII to build 1.4 and 1.5 GHz P4 processors? The answer lies in the architecture. The defining trade-off of the NetBurst microarchitecture is between the processor's clock frequency and the amount of work each pipeline stage completes per clock cycle: the less work each stage does, the higher the clock frequency can go. Reducing the work done per stage inevitably affects the processor's overall computing capability, however, so the P4's higher frequency comes at the cost of some per-clock performance. Intel claims that the NetBurst design lets the P4 run about 40% faster than the PIII's speed limit, which puts the upper limit of a NetBurst processor on this process at roughly 1.55-1.7 GHz. That is currently enough to compete with AMD's copper-process Thunderbird processors and their 100 MHz DDR bus. In addition, the move to a 0.13 micron copper process in the second half of next year should give the P4 further headroom.
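The headroom figure quoted above is simple arithmetic and can be checked with a short sketch. It assumes the 40% gain applies uniformly to both ends of the PIII's 1.1-1.2 GHz range; the function name is hypothetical and used only for illustration.

```python
def scaled_frequency_range(low_ghz, high_ghz, gain):
    """Scale a frequency range by a fractional gain (0.40 means 40% higher)."""
    factor = 1.0 + gain
    return (low_ghz * factor, high_ghz * factor)

# PIII ceiling on the 0.18 micron process, per the article: 1.1-1.2 GHz.
# Assumption: the claimed 40% gain applies equally to both ends of the range.
low, high = scaled_frequency_range(1.1, 1.2, 0.40)
print(f"Estimated NetBurst ceiling: {low:.2f}-{high:.2f} GHz")
```

The result, about 1.54-1.68 GHz, agrees with the rounded 1.55-1.7 GHz range given in the text.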
2. Drawbacks of the ultra-long pipeline and how Intel compensates

The NetBurst microarchitecture in the P4 can be described as a stopgap: it raises processor speed without overhauling the existing 0.18 micron aluminum process. But the drawbacks of this approach are obvious. First, because the P4 uses a 20-stage ultra-long pipeline, each stage is assigned less work, and the number of instructions completed per clock cycle, that is, the IPC (instructions per clock), falls. The Pentium 4 is therefore less efficient per clock cycle, which is one of the main reasons it trails the Athlon in many preliminary tests. Second, current processors rely on branch prediction to keep the core pipeline working efficiently. When a prediction is wrong, the pipeline must be flushed and execution restarted from the head of the pipeline. Flushing the 12-stage pipeline of the Pentium III and re-executing the micro-operations takes less time than doing the same on the P4 with its 20-stage pipeline.

Intel uses a series of techniques to compensate for these two problems. First, to offset the higher cost of mispredictions, Intel built a 4K-entry BTB (branch target buffer) into the P4 to store the results of the jumps predicted by the branch prediction unit. This is far larger than the BTB of the PIII, which holds only 512 entries. Intel also added an "advanced branch prediction unit" to the P4 architecture, aiming to raise branch prediction accuracy to 93-94%, which Intel puts at about 33% better than the PIII's branch prediction.
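The trade-off described above can be made concrete with a rough model. The sketch below assumes, purely for illustration, that one in five instructions is a branch, that a misprediction costs as many cycles as the pipeline has stages, and that an ideal pipeline retires one instruction per cycle. None of these figures come from Intel, and the 90% accuracy and 1.0 GHz clock used for the PIII are placeholders.

```python
def effective_cpi(stages, accuracy, branch_fraction=0.2, base_cpi=1.0):
    """Average cycles per instruction, including misprediction flushes.

    Assumption: each mispredicted branch wastes `stages` cycles while the
    pipeline is flushed and refilled.
    """
    mispredict_rate = 1.0 - accuracy
    return base_cpi + branch_fraction * mispredict_rate * stages

def relative_throughput(clock_ghz, stages, accuracy):
    """Instructions per nanosecond under the simple model above."""
    return clock_ghz / effective_cpi(stages, accuracy)

# Hypothetical 1.0 GHz PIII with a 12-stage pipeline and 90% accuracy.
p3 = relative_throughput(1.0, stages=12, accuracy=0.90)
# 1.5 GHz P4 with a 20-stage pipeline, first with the same predictor,
# then with accuracy in the 93-94% range mentioned in the article.
p4_same_predictor = relative_throughput(1.5, stages=20, accuracy=0.90)
p4_better_predictor = relative_throughput(1.5, stages=20, accuracy=0.935)

for label, value in [("PIII", p3),
                     ("P4, same predictor", p4_same_predictor),
                     ("P4, better predictor", p4_better_predictor)]:
    print(f"{label}: {value:.3f} instructions/ns")
```

Under these assumptions the longer pipeline pays a higher penalty per instruction at equal accuracy, and the better predictor recovers part of that loss.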
Second, Intel uses a trace cache to store the micro-operations produced by the P4's decode unit, which avoids having to fetch and decode instructions again after a misprediction. The trace cache sits between the instruction decoder and the first stage of the core pipeline: after instructions are fetched and decoded, the resulting micro-operations pass through the trace cache, which stores them and feeds them to the first pipeline stage for execution. The trace cache can hold up to about 12K micro-operations.

On the earlier P6 architecture, many programs running on the PIII repeatedly perform the same operation over large amounts of data, which consumes a lot of resources for little gain in performance. The remedy is to pack such data tightly together and design special instructions that operate on the packed data; these are called SIMD (single instruction, multiple data) instructions.

Pentium IV Processor Architecture Analysis (II)

Intel added the SSE2 instruction set to the P4. SSE2 adds 144 new instructions on top of the SSE instruction set introduced with the PIII. Besides extending the traditional integer MMX operations to 128 bits (128-bit MMX), the new SSE2 instructions provide 128-bit SIMD integer operations and 128-bit double-precision floating-point operations. The introduction of SSE2 helps compensate for the reduced per-clock performance of the P4's pipeline.
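To illustrate the SIMD idea, the sketch below emulates a 128-bit register as a list of lanes and applies one operation to all lanes at once. It is a plain Python model, not real SSE2 code; `lanes_per_register` and `packed_add` are hypothetical names, and integer overflow is modeled as simple wraparound.

```python
REGISTER_BITS = 128  # width of one SSE2 register

def lanes_per_register(element_bits):
    """How many elements of the given width fit in one 128-bit register."""
    return REGISTER_BITS // element_bits

def packed_add(a, b, element_bits):
    """Add two packed unsigned integer vectors lane by lane with wraparound."""
    if len(a) != len(b) or len(a) != lanes_per_register(element_bits):
        raise ValueError("operands must fill exactly one 128-bit register")
    mask = (1 << element_bits) - 1
    return [(x + y) & mask for x, y in zip(a, b)]

print(lanes_per_register(64))  # lanes for 64-bit elements (e.g. doubles)
print(lanes_per_register(32))  # lanes for 32-bit integers
print(packed_add([1, 2, 3, 0xFFFFFFFF], [10, 20, 30, 1], 32))
```

One packed add here does the work of four scalar 32-bit additions, which is the kind of saving the article attributes to SIMD instructions.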
Intel also added the Rapid Execution Engine (REE). Intel states that the REE runs at twice the processor's clock speed, so the ALUs of a 1.5 GHz Pentium 4 run at 3.0 GHz. The ALUs work on a principle similar to DDR memory: part of the ALU circuitry operates on both the rising and falling edges of the processor clock, so an ALU can complete an arithmetic-logic instruction in 0.5 clock cycles. Because the ALUs handle the processor's integer operations, the REE gives the P4 better integer performance than the PIII, with an integer operation completing within half a clock cycle. In the accompanying figure, the processor units shown in gold run at double speed.

For the Pentium 4's front-side bus, Intel uses QDR (quad data rate) technology, which achieves an effective 400 MHz transfer rate by sending four 64-bit data transfers per clock over the 100 MHz system bus (similar to the way ATA-100 and DDR transfer data on both the rising and falling edges of the signal). So although Intel says the Pentium 4's front-side bus runs at 400 MHz, the bus clock is in fact still 100 MHz. The 400 MHz effective transfer rate raises the data bandwidth between the P4's L2 cache and the system memory interface to 3.2 GB/s, whereas the earlier PIII reaches only 1.06 GB/s. The P4's L2 cache is an 8-way, 256-bit, full-speed cache, and the data bandwidth between it and the L1 cache reaches 45 GB/s. The L1 cache is a 4-way, 256-bit, full-speed cache, and the data bandwidth of its interface to the processor core reaches an impressive 22 GB/s.
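The bus figures quoted above follow from clock rate, transfers per clock, and bus width. The sketch below reproduces them. The 133 MHz single-pumped bus used for the PIII and the 1.4 GHz clock used for the L2 path are assumptions, chosen because they match the 1.06 and 45 figures in the text; the function name is hypothetical.

```python
def bandwidth_gb_per_s(clock_mhz, transfers_per_clock, width_bits):
    """Peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = width_bits / 8
    return clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9

p4_fsb = bandwidth_gb_per_s(100, 4, 64)     # quad-pumped 100 MHz bus
p3_fsb = bandwidth_gb_per_s(133, 1, 64)     # assumed 133 MHz PIII bus
l2_path = bandwidth_gb_per_s(1400, 1, 256)  # assumed 256-bit path at 1.4 GHz
alu_ghz = 1.5 * 2                           # double-pumped ALU on a 1.5 GHz part

print(f"P4 front-side bus: {p4_fsb:.1f} GB/s")
print(f"PIII front-side bus: {p3_fsb:.2f} GB/s")
print(f"L2 to L1 path: {l2_path:.1f} GB/s")
print(f"ALU effective clock: {alu_ghz:.1f} GHz")
```

Note that the results come out in gigabytes per second, since the bus width is divided by eight to convert bits to bytes.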
In summary, the trace cache, the double-speed Rapid Execution Engine, the SSE2 instruction set, and the 400 MHz front-side bus all serve to offset the negative impact of the ultra-long pipeline on processor performance.
http://pophard.yeah.net