Virtual registers and a parallel instruction processing component
1. Processor architecture has advanced rapidly over the past ten years. Microprocessor speed has been increasing by 50% to 100% per year. The latest processors reach 200 to 300 million instructions per second (for example the MIPS R8000, Power 620, and P6) and up to 1 billion instructions per second (DEC Alpha 21164). This progress is clearly the most active factor driving the development of the entire computer field. High-speed microprocessors promote the merging of computer and communication technology, accelerate the practical application of multimedia technology, influence the technical route taken by supercomputers, and are profoundly changing the structure of the computer industry.
The rapid improvement in processor performance is due to the invention of the RISC architecture and the exploitation of instruction-level parallelism (ILP). Figure 1 shows the path processors have followed over more than ten years. After analyzing complex instruction set computers (CISC), researchers arrived at the 20%-80% rule: about 20% of the instructions in a CISC instruction set account, on average, for 80% of a program. On this basis John Cocke proposed the design idea of the reduced instruction set computer (RISC). Practice has shown, however, that a reduced number of instructions cannot by itself meet the needs of commercial products, and that the key to the RISC idea is not to reduce the number of instructions but to reduce the average number of cycles per instruction (CPI). With a traditional single-issue structure, which issues only one instruction per cycle, the CPI can approach 1 but cannot fall below it. Only a multiple-issue structure, which can issue several instructions in each cycle, can bring the CPI below 1 or, equivalently, raise the average number of instructions executed per cycle (IPC) above 1. Both single-issue and multiple-issue RISC structures are built on a pipelined structure.
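The CPI and IPC bounds stated above can be checked with a small calculation. The sketch below assumes an idealized pipeline with no stalls; the function names, the pipeline depth of 5, and the issue width of 4 are illustrative assumptions, not figures from this paper.

```python
def ideal_cycles(instructions: int, issue_width: int, pipeline_depth: int) -> int:
    """Cycles needed by an idealized, stall-free pipeline (an assumption).

    The first issue group needs pipeline_depth cycles to complete; every
    later group completes one cycle after the previous one.
    """
    groups = -(-instructions // issue_width)  # ceiling division
    return (pipeline_depth - 1) + groups


def cpi(cycles: int, instructions: int) -> float:
    """Average number of cycles per instruction."""
    return cycles / instructions


def ipc(cycles: int, instructions: int) -> float:
    """Average number of instructions per cycle (the reciprocal of CPI)."""
    return instructions / cycles


if __name__ == "__main__":
    n = 1000
    single = ideal_cycles(n, issue_width=1, pipeline_depth=5)
    multi = ideal_cycles(n, issue_width=4, pipeline_depth=5)
    print("single-issue CPI:", cpi(single, n))  # slightly above 1
    print("four-issue CPI:", cpi(multi, n))     # below 1
    print("four-issue IPC:", ipc(multi, n))     # above 1
```

Even in this best case the single-issue CPI stays just above 1 because of pipeline fill time, while the multiple-issue CPI drops below 1. Real programs do worse than both figures once dependences and resource conflicts cause stalls.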
In a pipelined structure there are data dependences and control (branch) dependences between instructions in adjacent cycles, and resource conflicts can arise when instructions in adjacent cycles use the same resources. Instruction scheduling and compiler optimization must therefore be used to remove dependences and resource conflicts. Instruction scheduling techniques include both software and hardware measures. In a multiple-issue structure, dependences and resource conflicts exist not only between instructions in adjacent cycles but also among the instructions issued in the same cycle, so instruction scheduling becomes even more important. In addition, to execute the multiple instructions issued in each cycle, modern processors contain several execution units. Raising the parallelism of instruction execution (instruction-level parallelism, ILP) in this way rapidly increases processor performance while preserving software compatibility, and this is already a clear direction in the development of modern processor architecture.
To make pipelined execution on multiple execution units more efficient, Intel's Pentium, besides absorbing many RISC design ideas, provides several physical pipelines: two perform integer operations and one performs floating-point operations. The IBM POWER 601 and 603 provide more physical pipelines. Pentium and Power are currently the most promising processors in the world. In Japan, much attention is paid to multithreaded structures, whose design idea is to divide the instruction stream into multiple threads.
In each cycle one instruction is issued from each thread. Because instructions in different threads do not depend on one another, the instructions issued in the same cycle are, in theory, independent. After these instructions are issued, ILP can therefore be improved on multiple execution units or multiple pipelines.
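The multithreaded issue policy described above, one instruction from each thread per cycle, can be sketched as follows. This is only an illustration: the function name `issue_per_cycle` and the representation of instructions as strings are assumptions, and the sketch ignores dependences inside a thread beyond keeping each thread in program order.

```python
from collections import deque


def issue_per_cycle(threads):
    """Return the issue group for each cycle.

    threads is a list of instruction lists, one per thread. In every cycle
    one instruction is taken from the front of each thread that still has
    instructions left, so the members of a group come from different threads.
    """
    queues = [deque(t) for t in threads]
    schedule = []
    while any(queues):  # an empty deque is falsy
        group = [q.popleft() for q in queues if q]
        schedule.append(group)
    return schedule


if __name__ == "__main__":
    threads = [["a1", "a2", "a3"], ["b1"], ["c1", "c2"]]
    for cycle, group in enumerate(issue_per_cycle(threads)):
        print(cycle, group)
```

The issue width shrinks as threads run out of instructions, which is one reason the gain from multithreading depends on how evenly the work is divided among threads.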
2. Parallel execution of instruction processing components and data processing components
Across the six stages in the development of modern processors shown in Figure 1, and especially in the recent stages in which researchers have worked to reduce CPI (or increase IPC), it is not difficult to see that researchers have focused on how to improve the parallelism of data-processing execution, that is, the parallelism achieved on multiple execution units or multiple pipelines after instructions have been issued. The new concept of an instruction processing component proposed in this paper rests on the following question: why has attention traditionally gone to improving the parallelism of data processing after instruction issue, while the large batch of not-yet-issued instructions "lying" in memory is left untouched? Those unissued instructions should be fetched in advance and processed while the data-processing side is executing instructions that have already been processed. In other words, a highly parallel processor of the future should have two major components working in parallel, an instruction processing component and a data processing component; see Figure 2.
In the data processing component, the instruction issue unit issues several mutually independent instructions in each cycle to multiple execution units, which execute them in parallel to achieve a high degree of instruction parallelism. This is the traditional multiple-issue RISC structure. At the same time, the instruction prefetch unit fetches a batch of instructions from instruction memory and sends them to the instruction processing unit, where they are preprocessed, scheduled, and freed of dependences. The processed instructions then flow to the instruction issue unit.
As a result, the number of instructions that can be issued in each cycle for parallel execution necessarily increases, which improves instruction-level parallelism (ILP). In the data processing component of Figure 2 we use a register file and virtual registers. The working principle and advantages of virtual registers, and the two components for instruction processing and data processing, are discussed below.
3. Features and principles of virtual registers
The architecture and compiler optimization of modern RISC processors concentrate on exploiting the instruction-level parallelism of program execution. A program itself contains a certain amount of parallelism, yet in current processors and compilers physical registers are, almost without exception, assigned in the code generation phase. This first "serializes" a program that had inherent parallelism, and instruction-level parallelism is then extracted from the serialized instruction stream. This traditional practice has constrained people's thinking and places great limitations on the exploitation of ILP. In addition, once physical registers are assigned, the size of a basic block is also limited. According to analysis, a basic block contains only 5 to 8 instructions, and effective instruction scheduling is difficult within a block of that size. There are of course many methods that enlarge basic blocks or schedule across basic blocks, but these are remedies applied under static allocation of physical registers.
In response to these limitations we propose the concept of the virtual register (VR). The physical implementation of virtual registers is similar to that of virtual memory (VM). Virtual memory has a management mechanism that enlarges the address space available to the user by swapping with external storage.
The management mechanism for virtual registers swaps with the cache, enlarging the register file space available. More importantly, from a logical point of view a VR corresponds to an intermediate result in the program. A compiler based on the virtual register concept assigns virtual registers during code generation; actual physical registers are allocated dynamically only when the program executes. Thus, at code generation time, because a large number of logical registers can be assigned, the program's parallelism does not suffer significant loss; at execution time, the dynamically allocated physical registers offer great flexibility, which can significantly improve instruction-level parallelism. See Figure 3.
Dynamic allocation of virtual registers has the further significant advantage that it simplifies compiler design. Because the traditional physical register file is limited in size (generally 32 registers, and fewer in the Intel architecture), great effort has to go into optimizing register allocation during code generation. Moreover, because physical registers are few, adjacent basic blocks are likely to use the same register numbers to hold their intermediate results, which adds extra dependences between basic blocks. If a multi-window register file is used, with each window corresponding to a procedure, the whole physical register file becomes very large and some local registers are not fully utilized. The allocation of virtual registers in this paper is dynamic. The number of registers occupied by each function can be adjusted dynamically, which amounts to a register window whose size changes adaptively, so that utilization is higher. This is another advantage of virtual registers.
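The dynamic allocation described above can be illustrated with a minimal sketch. It assumes that a virtual register receives a physical register on its first write, that the oldest mapping is swapped out to a backing store (standing in for the cache) when no physical register is free, and that a swapped-out register is brought back on its next read. The class name, the method names, and the oldest-first replacement policy are assumptions made for illustration; the paper does not specify them.

```python
class VirtualRegisterFile:
    """Toy model: unbounded virtual register names, few physical registers."""

    def __init__(self, num_physical: int):
        self.free = list(range(num_physical))  # free physical register numbers
        self.mapping = {}                      # virtual name -> physical number
        self.values = [None] * num_physical    # contents of physical registers
        self.spilled = {}                      # virtual name -> value held in the cache

    def _spill_oldest(self):
        # Assumed policy: swap out the oldest mapping (dicts keep insertion order).
        victim = next(iter(self.mapping))
        phys = self.mapping.pop(victim)
        self.spilled[victim] = self.values[phys]
        self.free.append(phys)

    def write(self, vreg: str, value):
        self.spilled.pop(vreg, None)           # a new value replaces any swapped-out copy
        if vreg not in self.mapping:
            if not self.free:
                self._spill_oldest()
            self.mapping[vreg] = self.free.pop()
        self.values[self.mapping[vreg]] = value

    def read(self, vreg: str):
        if vreg in self.spilled:               # swap the value back into a physical register
            self.write(vreg, self.spilled[vreg])
        return self.values[self.mapping[vreg]]

    def release(self, vreg: str):
        """Free the physical register once the virtual register's lifetime ends."""
        self.free.append(self.mapping.pop(vreg))


if __name__ == "__main__":
    vrf = VirtualRegisterFile(num_physical=2)
    for name, value in [("v1", 10), ("v2", 20), ("v3", 30)]:
        vrf.write(name, value)
    print(vrf.read("v1"), vrf.read("v2"), vrf.read("v3"))
```

Three virtual registers share two physical registers here, and the program never names a physical register. That is the property that lets code generation avoid binding intermediate results to a small fixed register set.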
4. Virtual registers in out-of-order execution and register renaming
The use of virtual registers in out-of-order execution and register renaming is shown in Figure 4. Contemporary processors (POWER620 and Intel P6) use out-of-order execution to improve instruction-level parallelism. After instructions are prefetched from the instruction cache and decoded, the operations of multiple instructions (or the micro-operations into which they are decomposed) are placed in a reorder buffer. The reorder buffer dispatches instructions out of order to multiple execution units or multiple pipelines according to their dependences and the state of the conventional physical register file. Usually the instruction or operation whose source operands are ready first executes first, and its result is held temporarily in a conventional physical register. The retire unit is an in-order component: it takes the intermediate results of out-of-order execution and, according to the recorded characteristics of each instruction and the "lifetime" value of the operand held in a register, restores the order in which the original program would have produced its results under sequential execution, and stores them to data memory.
When virtual registers are used in Figure 4, they do more than hold intermediate results temporarily. As with logical addresses in virtual memory, a mapping mechanism translates them into physical addresses, and the mapping table can also hold the characteristics of the register, its lifetime counter value, and other flags such as protection and ordering, so that the retire unit can restore results to the original order. The mapping of a virtual register's logical address to a physical register can also be used for register renaming; see Figure 5.
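How a mapping table supports register renaming can be sketched in a few lines. The example assumes an unlimited pool of physical registers named P0, P1, and so on, and represents each instruction only by its destination and source registers. The function `rename` and the three-instruction sequence are illustrative assumptions, not the structure of Figure 5.

```python
def rename(instructions):
    """Rename architectural registers to fresh physical registers.

    instructions is a list of (dest, sources) pairs in program order.
    Each write gets a new physical register, and each read uses the most
    recent mapping, so only true (read-after-write) dependences remain.
    """
    mapping = {}     # architectural name -> current physical name
    next_phys = 0
    renamed = []
    for dest, sources in instructions:
        new_sources = []
        for src in sources:
            if src not in mapping:             # live-in value: give it a register
                mapping[src] = f"P{next_phys}"
                next_phys += 1
            new_sources.append(mapping[src])
        mapping[dest] = f"P{next_phys}"        # a fresh register for every write
        next_phys += 1
        renamed.append((mapping[dest], tuple(new_sources)))
    return renamed


if __name__ == "__main__":
    # RA is written, read, and then written again: the second write carries
    # write-after-read and write-after-write (false) dependences on RA.
    program = [("RA", ("RP",)), ("RB", ("RA",)), ("RA", ("RK",))]
    for before, after in zip(program, rename(program)):
        print(before, "->", after)
```

After renaming, the two writes to RA target different physical registers, so the third instruction no longer has to wait for the second one to read the old value.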
Register renaming is used to eliminate false dependences between registers. For example, in the following sequence RA carries false dependences, which can be eliminated by renaming RA to RM and RN:
RA := 8[RP]
RB := RA 156
RA := 45 RK
In addition, register renaming has recently been used for a more important purpose, namely pursuing software compatibility. For example, Cyrix has recently claimed that its M1 product is compatible with the Intel Pentium. It does so mainly through register renaming: the M1 uses a RISC-style register file (32 general registers), and the mapping mechanism renames the 8 registers of the Intel architecture onto these general registers, gaining the advantages of a RISC structure while remaining compatible with Intel's CISC processors.
The structure in Figure 5 can be used both for reordering by the retire unit after out-of-order execution and for register renaming. The virtual register renaming table in the figure is in fact a fully associative content-addressable memory (CAM). It is searched associatively using the virtual register's logical address and the characteristics recorded in the reorder buffer. Each entry holds the characteristics of the virtual register, a count indicating its lifetime, and other flags such as protection and ordering, and a hit yields the mapped physical register.
5. Instruction processing component
The instruction processing component and its cooperation with the data processing component are shown in Figure 6. For the queue of instructions prefetched from the instruction cache, multiple decoders must decode several instructions simultaneously to improve the parallelism of instruction processing.
Handling branch instructions is very important, so a branch processing unit is generally provided. Its function may be to process branch condition codes, as in the RS6000, or to predict branches, as in the Pentium. Decoded information and branch-processing information are sent to the dependence detector and the reorder buffer, where the characteristics of each instruction are detected and recorded. The reorder buffer schedules instructions and operations for execution, exchanging information with the virtual register management unit, and sends instructions that are free of dependences and whose source operands are ready first to the instruction issue unit. This ensures that several instructions can be issued simultaneously in each cycle and executed in parallel on multiple execution units or multiple pipelines. The function of the virtual registers and their mapping table was described in the previous section. The reorder buffer, the virtual register mapping table, and the execution units are all connected to the retire unit, which combines their information and writes the out-of-order results back, in original program order, to the register file or the data cache. The instruction processing component also contains a memory instruction disambiguation unit, dedicated to disambiguating the memory access instructions in the instruction stream. We will discuss this in a separate paper.
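The cooperation described in this section, with ready-first issue from the reorder buffer followed by in-order retirement, can be sketched as a small cycle-by-cycle simulation. It assumes that only true dependences remain (false ones having been removed by renaming), that each instruction has a fixed latency of at least one cycle, and that the issue width is 2. The function `run` and the four-instruction program are illustrative assumptions, not a model of Figure 6.

```python
def run(program, issue_width=2):
    """Simulate ready-first issue and in-order retirement.

    program is a list of (dest, sources, latency) tuples in program order.
    Returns (issue_order, retire_order) as lists of instruction indices.
    """
    # For every instruction, find the earlier instructions it truly depends on.
    last_writer, deps = {}, []
    for i, (dest, sources, _) in enumerate(program):
        deps.append({last_writer[s] for s in sources if s in last_writer})
        last_writer[dest] = i

    done_at = {}                     # index -> cycle in which its result is ready
    issued = set()
    issue_order, retire_order = [], []
    head, cycle = 0, 0               # head: oldest instruction not yet retired
    while head < len(program):
        # Retire strictly in program order: only a finished head may leave.
        while head < len(program) and done_at.get(head, cycle + 1) <= cycle:
            retire_order.append(head)
            head += 1
        # Issue the oldest instructions whose source operands are ready.
        slots = issue_width
        for i in range(head, len(program)):
            if slots == 0:
                break
            if i in issued:
                continue
            if all(done_at.get(d, cycle + 1) <= cycle for d in deps[i]):
                issued.add(i)
                issue_order.append(i)
                done_at[i] = cycle + program[i][2]
                slots -= 1
        cycle += 1
    return issue_order, retire_order


if __name__ == "__main__":
    program = [
        ("r1", (), 3),        # 0: long-latency operation
        ("r2", ("r1",), 1),   # 1: needs the result of 0
        ("r3", (), 1),        # 2: independent
        ("r4", ("r3",), 1),   # 3: needs the result of 2
    ]
    issue_order, retire_order = run(program)
    print("issued: ", issue_order)
    print("retired:", retire_order)
```

Instructions 2 and 3 are issued ahead of instruction 1, which is waiting for its operand, yet all four retire in program order. This is the division of labour between the reorder buffer and the retire unit that the section describes.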
6. Conclusion
A processor structure with an instruction processing component will significantly improve the parallelism of the processor. Introducing the new concept of virtual registers will greatly improve the parallelism of instruction execution within the data processing component. While the data processing component works, the instruction processing component has already fetched the instruction stream from the instruction cache in advance (in a traditional processor structure these instructions simply lie in instruction storage, which is in effect a waste of information). After instruction processing, a large number of control and data dependences can be removed, and the memory disambiguation unit resolves the ambiguity of a large number of memory access instructions. In summary, the number of independent instructions that flow from the instruction processing component to the instruction issue unit can be greatly increased, so that instruction-level parallelism (ILP) is fully exploited. All of this is completely transparent to the programmer, so software compatibility is maintained. We believe this new type of processor architecture has a promising future. Virtual registers and the instruction processing component are new concepts of ours, and we will publish further papers, simulation models, and experimental data in this area.