1. Introduction
This manual describes in detail how to write optimized assembly language code, with particular focus on the Pentium® family of microprocessors.
Most of the information here is based on my own research. Many people have sent me useful information and corrections for this manual, and I update it whenever I receive important new information. This manual is therefore more accurate, detailed and precise than other sources of information, and it contains many details that cannot be found anywhere else. In many cases this information lets you calculate exactly how many clock cycles a small piece of code will take. I do not claim, however, that all information in this manual is exact: some timings are difficult or impossible to measure exactly, and I do not have access to the internal technical documentation that the writers of Intel's manuals have.
The following versions of the Pentium processor are discussed in this manual:
    abbreviation   name
    PPlain         plain old Pentium (without MMX)
    PMMX           Pentium with MMX
    PPro           Pentium Pro
    PII            Pentium II (including Celeron and Xeon)
    PIII           Pentium III (including variants)
This manual uses MASM 5.10 assembly syntax. There is no official standard for x86 assembly language, but this is the closest you can get to a de facto standard, because almost all assemblers have a MASM 5.10 compatible mode. (I do not recommend using MASM version 5.10 itself, though, because it has serious bugs in 32-bit mode. Use TASM or a later version of MASM.)
Some of the remarks in this manual may seem like criticism of Intel. This should not be taken to mean that other brands are better. The Pentium family of microprocessors fares better in comparative benchmarks, is better documented, and has more testability features than competing brands. For these reasons, neither I nor anybody else has done comparative testing of competing brands.
Programming in assembly language is much more difficult than programming in a high-level language. Making bugs is very easy, and finding them is very difficult. Now you have been warned! I assume that the reader already has experience with assembly programming. If not, please read some books on the subject and get some programming experience before you start doing complicated optimizations.
The hardware design of the PPlain and PMMX chips has many features which are optimized specifically for some commonly used instructions or instruction combinations, rather than using general optimization methods. Consequently, the rules for optimizing software for this design are quite complicated and have many exceptions, but the possible gain in performance may be substantial. The PPro, PII and PIII processors have a very different design, where the processor itself takes care of much of the optimization work by executing instructions out of order. The more complicated design of these processors creates many potential bottlenecks, however, so there may still be a lot to gain by optimizing manually for them. The Pentium 4 processor has yet another design, and the guidelines for optimizing for the Pentium 4 are quite different. This manual does not cover the Pentium 4; the reader is referred to the manuals from Intel.
Before you start converting your code to assembly, make sure that your algorithm is optimal. Often you can improve a piece of code much more by improving the algorithm than by converting it to assembly.
Next, you have to identify the critical parts of your program. Often more than 99% of the CPU time is spent in the innermost loop of a program. In this case you should optimize only this loop and leave everything else in a high-level language. Some assembly programmers waste a lot of energy optimizing the wrong parts of their programs, and the only significant effect of their effort is that the programs become more difficult to debug and maintain.
If it is not obvious where the critical parts of your program are, you may use a profiler to find them. If it turns out that the bottleneck is disk access, then you may modify your program to make disk access sequential in order to improve the hit rate of the disk cache, rather than turning to assembly programming. If the bottleneck is graphics output, then you may look for a way of reducing the number of calls to graphics procedures. Some high-level language compilers offer relatively good optimization for specific processors, but further optimization by hand can usually do better.
Please don't send your programming questions to me. I am not going to do your homework for you!
Good luck with your reading!
2. Literature
A lot of useful literature and tutorials can be downloaded from Intel's WWW site or obtained in print or on CD-ROM. It is recommended that you study this literature in order to get acquainted with the microprocessor architecture. However, the documents from Intel are not always accurate. The tutorials in particular contain many errors (evidently, the people at Intel have not tested their own examples).
I will not give the URLs here because the file locations change often. You can find the documents you need by using a search facility or by following the links from www.agner.org/assem. Some documents are in .pdf format. If you do not have software for viewing or printing PDF files, you can download the Acrobat file reader from http://www.adobe.com/.
The use of MMX and XMM (SIMD) instructions for optimizing specific applications is described in several application notes. The instruction set itself is described in various manuals and tutorials.
VTUNE is a software tool from Intel for optimizing code. I have not tested it and therefore cannot give any evaluation of it here.
Many sources other than Intel also have useful information. These sources are listed in the FAQ for the newsgroup comp.lang.asm.x86. For other Internet resources, follow the links from www.agner.org/assem.
3. Calling assembly functions from high-level languages
You can either use inline assembly or write a whole subroutine in assembly language and link it into your project. If you choose the latter, it is recommended that you use a compiler that can translate high-level code directly into assembly, so that you get the function calling method right. Most C++ compilers can do this.
The way parameters are transferred depends on the calling convention:
    calling method   parameter order on stack          parameters removed by
    _cdecl           first parameter at low address    caller
    _stdcall         first parameter at low address    subroutine
    _fastcall        compiler specific                 subroutine
    _pascal          first parameter at high address   subroutine
The methods for function calling and name mangling can be quite complicated. There are many different calling conventions, and different compilers are not compatible with each other in this respect. If you call assembly language subroutines from C++, the best method is to declare your functions extern "C" and _cdecl. The function name in the assembly code must then be prefixed with an underscore (_), and the code must be assembled with case sensitivity on externals (option -mx). Example:
; extern "C" int _cdecl square (int x);
_square PROC NEAR                     ; integer square function
PUBLIC  _square
        MOV     EAX, [ESP+4]          ; x is just above the return address
        IMUL    EAX                   ; EDX:EAX = EAX * EAX, result in EAX
        RET
_square ENDP
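As a rough illustration of the calling convention table, the following Python sketch computes where each parameter would sit relative to ESP on entry to a 32-bit near procedure. The function name `param_offsets` and the model behind it (a 4-byte return address at [ESP], every parameter rounded up to a whole number of dwords) are assumptions made for this illustration only. `_fastcall` is left out because its register usage is compiler specific.

```python
def param_offsets(sizes, convention="cdecl"):
    """Return the [ESP+offset] of each parameter on entry to a 32-bit near procedure.

    sizes: parameter sizes in bytes, in declaration order.
    Assumed model: the return address occupies [ESP]..[ESP+3] and each
    parameter is rounded up to a whole number of dwords.
    """
    slots = [(size + 3) // 4 * 4 for size in sizes]
    if convention in ("cdecl", "stdcall"):
        order = range(len(slots))                # first parameter at low address
    elif convention == "pascal":
        order = reversed(range(len(slots)))      # first parameter at high address
    else:
        raise ValueError("layout is compiler specific: " + convention)
    offsets = [0] * len(slots)
    offset = 4                                   # skip the return address
    for i in order:
        offsets[i] = offset
        offset += slots[i]
    return offsets

print(param_offsets([4]))                 # int x        -> [4]
print(param_offsets([4, 8]))              # int, double  -> [4, 8]
print(param_offsets([4, 8], "pascal"))    # int, double  -> [12, 4]
```

The first result matches the `[ESP+4]` used to fetch x in the square function.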
If you need overloaded functions, overloaded operators, member functions and other C++ specific features, then you have to code the function in C++ first and let the compiler translate it to assembly in order to get the right linking information and calling method. These details differ from compiler to compiler and are seldom documented. If you want an assembly function with a calling method other than extern "C" and _cdecl to be callable from different compilers, you need to give it one name for each compiler. For example, an overloaded square function:
; int square (int x);
SQUARE_I PROC NEAR                    ; integer square function
@square$qi      LABEL NEAR            ; link name for Borland compiler
?square@@YAHH@Z LABEL NEAR            ; link name for Microsoft compiler
_square__Fi     LABEL NEAR            ; link name for GNU compiler
PUBLIC  @square$qi, ?square@@YAHH@Z, _square__Fi
        MOV     EAX, [ESP+4]
        IMUL    EAX
        RET
SQUARE_I ENDP

; double square (double x);
SQUARE_D PROC NEAR                    ; double precision floating point square function
@square$qd      LABEL NEAR            ; link name for Borland compiler
?square@@YANN@Z LABEL NEAR            ; link name for Microsoft compiler
_square__Fd     LABEL NEAR            ; link name for GNU compiler
PUBLIC  @square$qd, ?square@@YANN@Z, _square__Fd
        FLD     QWORD PTR [ESP+4]
        FMUL    ST(0), ST(0)
        RET
SQUARE_D ENDP
This method works because all of these compilers use the _cdecl calling convention by default for overloaded functions. For member functions (methods), however, even the calling method differs between compilers: the Borland and GNU compilers use the _cdecl method with the 'this' pointer as the first parameter, while Microsoft uses the _stdcall method with the 'this' pointer in ECX.
In general, you should not expect different compilers to be compatible at the object file level if you use any of the following: long double, member pointers, virtual functions, new, delete, exceptions, system calls, or standard library functions.
Register usage in 16-bit mode DOS or Windows, C or C++: a 16-bit return value is in AX, a 32-bit return value in DX:AX, and a floating point return value in ST(0). Registers AX, BX, CX, DX, ES and the arithmetic flags may be changed by the procedure; all other registers must be saved and restored. A procedure can rely on SI, DI, BP, DS and SS being unchanged across a call to another procedure.
Register usage in 32-bit Windows, C++ and other programming languages: an integer return value is in EAX, a floating point return value in ST(0). Registers EAX, ECX and EDX (not EBX) may be changed by the procedure; all other registers must be saved and restored. A procedure can rely on EBX, ESI, EDI, EBP and all segment registers being unchanged across a call to another procedure. Segment registers must not be changed, not even temporarily. CS, DS, ES and SS all point to the flat segment group. FS is used by the operating system. GS is unused, but reserved. Flags may be changed by a procedure with the following restrictions: the direction flag is 0 by default; it may be set temporarily, but must be cleared before any call or return. The interrupt flag cannot be cleared. The floating point register stack is empty at the entry of a procedure and must be empty at return, except for ST(0) if it is used for the return value. MMX registers may be changed by the procedure, but must then be cleared with EMMS before returning and before calling any other procedure that may use floating point registers. All XMM registers may be changed by procedures. Rules for passing parameters and return values in XMM registers are described in Intel's application note AP 589.
4. Debugging and verifying
Debugging assembly code can be quite hard and frustrating, as you have probably already discovered. I recommend that you start by writing the piece of code you want to optimize as a subroutine in a high-level language. Next, write a small test program that tests your subroutine thoroughly. Make sure the test program goes into all branches.
When the high-level language subroutine works with your test program, you are ready to translate the code to assembly language.
Now you can start to optimize. Each time you have made a modification, you should run the test program to see if the code still works correctly. Number all your versions and save them, so that you can go back and test them again in case you discover an error that the test program did not catch (such as writing to a wrong address).
Test the speed of the most critical part of your program with the method described in chapter 30 or with a test program. If the code is significantly slower than expected, the most probable causes are: cache misses (chapter 7), misaligned operands (chapter 6), first time penalty (chapter 8), branch mispredictions (chapter 22), instruction fetch problems (chapter 15), register read stalls (chapter 16), or long dependency chains (chapter 20).
Highly optimized code tends to be very difficult for others to read and understand, and even for yourself when you get back to it after some time. In order to make it possible to maintain the code, it is important that you organize it into small logical units (procedures or macros) with a well-defined interface and clear comments. The more complicated the code is, the more important good documentation becomes.
5. Memory model
The Pentiums are designed primarily for 32-bit code, and performance on 16-bit code is poor. Segmenting your code and data also degrades performance significantly, so you should generally prefer 32-bit flat mode and an operating system that supports this mode. Unless otherwise specified, all examples in this manual assume a 32-bit flat memory model.
6. Alignment
All data in RAM should be aligned to addresses divisible by 2, 4, 8 or 16 according to this scheme:
    operand size    alignment on          alignment on
                    PPlain and PMMX       PPro, PII and PIII
    1  (byte)       1                     1
    2  (word)       2                     2
    4  (dword)      4                     4
    6  (fword)      4                     8
    8  (qword)      8                     8
    10 (tbyte)      8                     16
    16 (oword)      n.a.                  16
On PPlain and PMMX, misaligned data takes at least 3 extra clock cycles to access if a 4-byte boundary is crossed. The penalty is higher when a cache line boundary is crossed.
On PPro, PII and PIII, misaligned data costs 6-12 extra clock cycles when a cache line boundary is crossed. Misaligned operands smaller than 16 bytes that do not cross a 32-byte boundary give no penalty.
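Whether a given operand crosses one of these boundaries is simple arithmetic. The Python sketch below is only an illustration. The helper name `crosses_boundary` and the example addresses are made up, and the sketch does not model the actual penalties.

```python
def crosses_boundary(address, size, boundary):
    """True if an operand of `size` bytes at `address` straddles a multiple of `boundary`."""
    return address // boundary != (address + size - 1) // boundary

# A dword (4 bytes) at three hypothetical addresses:
for address in (0x1000, 0x1002, 0x101E):
    print(hex(address),
          "crosses 4-byte boundary:", crosses_boundary(address, 4, 4),
          "crosses 32-byte cache line:", crosses_boundary(address, 4, 32))
```

With these addresses, the dword at 0x1002 crosses only a 4-byte boundary (the PPlain and PMMX penalty), while the dword at 0x101E also crosses a cache line boundary, which is penalized on all the processors discussed here.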
Aligning data by 8 or 16 on a dword-size stack can be a problem. A useful method is to set up an aligned frame pointer. A function with aligned local data may look like this:
_FuncWithAlign PROC NEAR
        PUSH    EBP                          ; prolog code
        MOV     EBP, ESP
        AND     EBP, -8                      ; align frame pointer by 8
        FLD     DWORD PTR [ESP+8]            ; function parameter
        SUB     ESP, LocalSpace + 4          ; allocate local space
        FSTP    QWORD PTR [EBP-LocalSpace]   ; store something in aligned space
        ...
        ADD     ESP, LocalSpace + 4          ; epilog code: restore ESP
        POP     EBP                          ; (AGI stall on PPlain/PMMX)
        RET
_FuncWithAlign ENDP
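The arithmetic behind this prolog can be checked with a small Python sketch. It is an illustration only: the function `aligned_frame` and the example stack addresses are hypothetical, and it assumes that ESP is dword aligned on entry and that LocalSpace is a multiple of 8.

```python
def aligned_frame(esp_at_entry, local_space):
    """Mimic PUSH EBP / MOV EBP,ESP / AND EBP,-8 / SUB ESP,LocalSpace+4."""
    esp = esp_at_entry - 4            # PUSH EBP
    ebp = esp & -8                    # AND EBP,-8 rounds down to a multiple of 8
    esp -= local_space + 4            # SUB ESP, LocalSpace + 4
    local_base = ebp - local_space    # address used by [EBP-LocalSpace]
    return esp, ebp, local_base

for entry in (0x0012FF7C, 0x0012FF78):     # two dword-aligned example values of ESP
    esp, ebp, local_base = aligned_frame(entry, 16)
    print(hex(entry), "ESP =", hex(esp), "EBP =", hex(ebp),
          "locals at", hex(local_base), "remainder mod 8:", local_base % 8)
```

The extra 4 bytes in `LocalSpace + 4` are what keep the aligned area above the new ESP when the AND has to round EBP down by 4.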
While aligning data is always important, aligning code is not necessary on the PPlain and PMMX. The principles for aligning code on PPro, PII and PIII are explained in chapter 15.
7. Cache
The PPlain and PPro have 8 KB of on-chip cache (level one cache) for code and 8 KB for data. The PMMX, PII and PIII have 16 KB for code and 16 KB for data. Data in the level one cache can be read or written in just one clock cycle, whereas a cache miss may cost many clock cycles. It is therefore important that you understand how the cache works in order to use it efficiently.
The data cache consists of 256 or 512 lines of 32 bytes each. Each time you read a data item which is not cached, the processor reads an entire cache line from memory. The cache lines are always aligned to a physical address divisible by 32. When you have read a byte at an address divisible by 32, the next 31 bytes can be read or written at almost no extra cost. You can take advantage of this by arranging data items which are used near each other into aligned blocks of 32 bytes. If, for example, you have a loop which accesses two arrays, then you may combine the two arrays into one array of structures, so that data which are used together are also stored together.
If the size of an array or other data structure is a multiple of 32 bytes, then you should preferably align it by 32.
The cache is set-associative. This means that a cache line cannot be assigned to an arbitrary memory address. Each cache line has a 7-bit set-value which must match bits 5 through 11 of the physical RAM address (bits 0-4 define the 32 bytes within a cache line). The PPlain and PPro have two cache lines for each of the 128 set-values, so there are two possible cache lines to assign to any RAM address. The PMMX, PII and PIII have four.
The consequence of this is that the cache can hold no more than two or four different data blocks which have the same value in bits 5-11 of the address. You can determine whether two addresses have the same set-value by the following method: strip off the lower 5 bits of each address to get a value divisible by 32. If the difference between the two truncated addresses is a multiple of 4096 (= 1000H), then the addresses have the same set-value. Let me illustrate this with the following piece of code, where ESI holds an address divisible by 32:
AGAIN:  MOV     EAX, [ESI]
        MOV     EBX, [ESI + 13*4096 +  4]
        MOV     ECX, [ESI + 20*4096 + 28]
        DEC     EDX
        JNZ     AGAIN
The three addresses used here all have the same set-value, because the differences between the truncated addresses are multiples of 4096. This loop performs very poorly on the PPlain and PPro. At the time you read ECX there is no free cache line with the proper set-value, so the processor takes the least recently used of the two possible cache lines, that is the one which was used for EAX, fills it with the data from [ESI+20*4096] to [ESI+20*4096+31], and reads ECX. Next, when reading EAX, you find that the cache line which held the value for EAX has been discarded, so you take the least recently used line, which is the one holding the EBX value, and so on. You get nothing but cache misses, and the loop takes something like 60 clock cycles. If the third line is changed to:
        MOV     ECX, [ESI + 20*4096 + 32]
then we have crossed a 32-byte boundary, so that we no longer have the same set-value as in the first two lines, and there is no problem assigning a cache line to each of the three addresses. The loop now takes only 3 clock cycles (except for the first time), which is a very considerable improvement. As already mentioned, the PMMX, PII and PIII have 4-way caches, so that you have four cache lines with the same set-value. (Some Intel documents erroneously say that the PII cache is 2-way.)
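The eviction pattern described here can be reproduced with a toy model. The Python sketch below simulates a set-associative cache with 128 sets, 32-byte lines and least-recently-used replacement, and counts misses for the two versions of the loop. It is a simplified illustration, not a description of the real hardware: the names `set_value`, `same_set` and `count_misses` and the value chosen for ESI are assumptions, and addresses are treated as if they were physical addresses.

```python
from collections import OrderedDict

LINE_SIZE = 32      # bytes per cache line
NUM_SETS = 128      # set-value = address bits 5-11

def set_value(address):
    return (address >> 5) & (NUM_SETS - 1)

def same_set(a, b):
    """The test from the text: truncate to a multiple of 32, compare modulo 4096."""
    return ((a & ~31) - (b & ~31)) % 4096 == 0

def count_misses(addresses, ways, iterations):
    """Run the access pattern repeatedly through an LRU set-associative cache."""
    sets = [OrderedDict() for _ in range(NUM_SETS)]
    misses = 0
    for _ in range(iterations):
        for address in addresses:
            line = address // LINE_SIZE
            lines = sets[set_value(address)]
            if line in lines:
                lines.move_to_end(line)          # mark as most recently used
            else:
                misses += 1
                if len(lines) == ways:
                    lines.popitem(last=False)    # evict the least recently used line
                lines[line] = True
    return misses

ESI = 0x10000                                    # hypothetical address divisible by 32
slow = [ESI, ESI + 13*4096 + 4, ESI + 20*4096 + 28]
fast = [ESI, ESI + 13*4096 + 4, ESI + 20*4096 + 32]

print(count_misses(slow, ways=2, iterations=100))   # 300: every access misses
print(count_misses(fast, ways=2, iterations=100))   # 3: only the first pass misses
print(count_misses(slow, ways=4, iterations=100))   # 3: a 4-way cache holds all three lines
```

The model counts misses only. It says nothing about how many clock cycles each miss costs.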
It may be very difficult to determine whether your data addresses have the same set-values, especially if they are scattered around in different segments. The best thing you can do to avoid problems of this kind is to keep all data used in the critical part of your program within one contiguous block no bigger than the cache, or within two contiguous blocks no bigger than half that size (for example one block for static data and another block for data on the stack). This makes sure that your cache lines are used optimally.
If the critical part of your code accesses big data structures or random data addresses, then you may want to keep all frequently used variables (counters, pointers, control variables, etc.) within a single contiguous block of at most 4 KB, so that you have a complete set of cache lines free for accessing random data. Since you probably need stack space anyway for subroutine parameters and return addresses, the best thing is to copy all frequently used static data to dynamic variables on the stack, and copy them back again outside the critical loop if they have been changed.
Reading a data item which is not in the level one cache causes an entire cache line to be filled from the level two cache, which takes approximately 200 ns (that is 20 clock cycles on a 100 MHz system or 40 clock cycles on a 200 MHz system), but the bytes you ask for first are available already after 50-100 ns. If the data item is not in the level two cache either, then you will get a delay of something like 200-300 ns. This delay is somewhat longer if you cross a DRAM page boundary. (The DRAM page size is 1 KB for 4 and 8 MB 72-pin RAM modules, and 2 KB for 16 and 32 MB modules.)
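The conversion between nanoseconds and clock cycles used in this paragraph is a one-line calculation. The Python sketch below is only a convenience for redoing it at other clock frequencies, and the function name `ns_to_clocks` is made up for the illustration.

```python
def ns_to_clocks(nanoseconds, clock_mhz):
    """Convert a delay in nanoseconds to core clock cycles at the given frequency."""
    return nanoseconds * clock_mhz / 1000

for mhz in (100, 200):
    print(mhz, "MHz:", ns_to_clocks(200, mhz), "clock cycles for a 200 ns cache line fill")
```

The same delay in nanoseconds costs proportionally more clock cycles on a faster processor, which is why cache misses matter more as the clock frequency goes up.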
When reading big blocks of data from memory, the speed is limited by the time it takes to fill cache lines. You can sometimes improve speed by reading data in a non-sequential order: before you finish reading data from one cache line, start reading the first item from the next cache line. This method can increase reading speed by 20-40% when reading from main memory or level two cache on PPlain and PMMX, and from level two cache on PPro, PII and PIII. A disadvantage of this method is that the program code becomes very clumsy and difficult to understand. For further information on this trick see http://www.intelligentfirm.com/.
On PPlain and PMMX, when you write to an address which is not in the level one cache, the value goes right through to the level two cache or the RAM (depending on how the level two cache is set up). This takes approximately 100 ns. If you write eight or more times to the same 32-byte block of memory without also reading from it, and the block is not in the level one cache, then it may be advantageous to make a dummy read from the block first to load it into a cache line. All subsequent writes to the same block will then go to the cache instead, which takes only one clock cycle each. On PPlain and PMMX, there is sometimes a small penalty for writing repeatedly to the same address without reading in between. On PPro, PII and PIII, a write miss normally loads a cache line, but it is possible to set up an area of memory to behave differently, for example video RAM (see "Pentium Pro Family Developer's Manual, vol. 3: Operating System Writer's Guide").
Other ways of speeding up memory reads and writes are discussed in chapter 27.8 below.
The PPlain and PPro have two write buffers; the PMMX, PII and PIII have four. On the PMMX, PII and PIII you may have up to four unfinished writes to uncached memory without delaying the subsequent instructions. Each write buffer can handle operands up to 64 bits wide.
Temporary data may conveniently be stored on the stack, because the stack area is very likely to be in the cache. However, you should be aware of the alignment problems if your data elements are bigger than the stack word size.
If the life ranges of two data structures do not overlap, then they may share the same RAM area to increase cache efficiency. This is consistent with the common practice of allocating space for temporary variables on the stack.
Storing temporary data in registers is of course even more efficient. Since registers are a scarce resource, you may want to use [ESP] rather than [EBP] for addressing data on the stack, in order to free EBP for other purposes. Just don't forget that the value of ESP changes every time you do a PUSH or POP. (You cannot use ESP under 16-bit Windows, because the timer interrupt will modify the high word of ESP at unpredictable places in your code.)
There is a separate cache for code, which is similar to the data cache. The size of the code cache is 8 KB on PPlain and PPro, and 16 KB on PMMX, PII and PIII. It is important that the critical part of your code (the innermost loops) fits in the code cache. Frequently used pieces of code, or routines which are used together, should preferably be stored near each other. Seldom used branches or procedures should be put away at the bottom of your code or somewhere else.
8. First time versus repeated execution
A piece of code usually takes much more time the first time it is executed than when it is repeated. The reasons are the following:
1. Loading the code from RAM into the cache takes longer than executing it.
2. Any data accessed by the code has to be loaded into the cache, which may take much more time than executing the instructions. When the code is repeated, the data is more likely to be in the cache.
3. Jump instructions are not in the branch target buffer the first time they execute, and are therefore less likely to be predicted correctly. See chapter 22.
4. On the PPlain, decoding the code is a bottleneck. If it takes one clock cycle to determine the length of an instruction, then it is not possible to decode two instructions per clock cycle, because the processor does not know where the second instruction begins. The PPlain solves this problem by remembering the length of any instruction which has remained in the cache since the last time it was executed. As a consequence, a set of instructions will not pair on the PPlain the first time they are executed, unless the first of the two instructions is only one byte long. The PMMX, PPro, PII and PIII have no penalty on first time decoding.
For these four reasons, a piece of code inside a loop generally takes much more time the first time it executes than the subsequent times.
If you have a big loop which does not fit into the code cache, then you will get penalties all the time because it does not run from the cache. You should therefore try to reorganize the loop to make it fit into the cache.
If you have very many jumps, calls and branches inside a loop, then you may repeatedly get the penalty of branch target buffer misses.
Likewise, if a loop repeatedly accesses a data structure too big for the data cache, then you will get the penalty of data cache misses all the time.
9. Address generation interlock (AGI) (PPlain and PMMX)
It takes one clock cycle to calculate the address needed by an instruction which accesses memory. Normally, this calculation is done at a separate stage in the pipeline while the preceding instruction or instruction pair is executing. But if the address depends on the result of an instruction executing in the preceding clock cycle, then you have to wait an extra clock cycle for the address to be calculated. This is called an AGI stall. Example:
        ADD EBX,4 / MOV EAX,[EBX]         ; AGI stall
The stall in this example can be removed by putting some other instructions in between ADD EBX,4 and MOV EAX,[EBX], or by rewriting the code to:
        MOV EAX,[EBX+4] / ADD EBX,4
You can also get an AGI stall with instructions which use ESP implicitly for addressing, such as PUSH, POP, CALL and RET, if ESP has been changed in the preceding clock cycle by instructions such as MOV, ADD or SUB. The PPlain and PMMX have special circuitry to predict the value of ESP after a stack operation, so that you do not get an AGI stall after changing ESP with PUSH, POP or CALL. You can get an AGI stall after RET only if it has an immediate operand to add to ESP.
Examples:
        ADD ESP,4 / POP ESI               ; AGI stall
        POP EAX / POP ESI                 ; no stall, pair
        MOV ESP,EBP / RET                 ; AGI stall
        CALL L1 / L1: MOV EAX,[ESP+8]     ; no stall
        RET / POP EAX                     ; no stall
        RET 8 / POP EAX                   ; AGI stall
The LEA instruction is also subject to an AGI stall if it uses a base or index register which has been changed in the preceding clock cycle. Example:
        INC ESI / LEA EAX,[EBX+4*ESI]     ; AGI stall
PPro, PII and PIII have no AGI stalls for memory reads and LEA, but they do have AGI stalls for memory writes. This is not very significant unless the subsequent code has to wait for the write to finish.
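For the PPlain and PMMX, the basic AGI condition can be expressed as a simple set intersection. The Python sketch below is a deliberately reduced model: the function `agi_stall` and the way instructions are described (as sets of register names) are assumptions for the illustration, and the special ESP prediction after PUSH, POP and CALL is not modelled.

```python
def agi_stall(previous_writes, address_registers):
    """True if the address depends on a register written in the preceding clock cycle."""
    return bool(set(previous_writes) & set(address_registers))

# Each case: (text, registers written by the first instruction,
#             registers used to form the address in the second instruction)
cases = [
    ("ADD EBX,4 / MOV EAX,[EBX]",     {"EBX"}, {"EBX"}),
    ("MOV EAX,[EBX+4] / ADD EBX,4",   {"EAX"}, set()),
    ("INC ESI / LEA EAX,[EBX+4*ESI]", {"ESI"}, {"EBX", "ESI"}),
    ("ADD ESP,4 / POP ESI",           {"ESP"}, {"ESP"}),
]
for text, writes, address in cases:
    print(text, "->", "AGI stall" if agi_stall(writes, address) else "no stall")
```

A check like this is easy to apply by hand when reading a listing: look at every memory operand and ask whether any register inside the brackets was written in the previous clock cycle.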
10. Pairing integer instructions (PPlain and PMMX)
10.1 Perfect pairing
The PPlain and PMMX have two pipelines for executing instructions, called the U-pipe and the V-pipe. Under certain conditions it is possible to execute two instructions simultaneously, one in the U-pipe and one in the V-pipe. This can almost double the speed. It is therefore advantageous to reorder your instructions to make them pair.
The following instructions are pairable in either pipe:
    MOV register, memory, or immediate into register or memory
    PUSH register or immediate, POP register
    LEA, NOP
    INC, DEC, ADD, SUB, CMP, AND, OR, XOR
    and some forms of TEST (see chapter 26.14)
The following instructions are pairable in the U-pipe only:
    ADC, SBB
    SHR, SAR, SHL, SAL with an immediate count
    ROR, ROL, RCR, RCL with an immediate count of 1
The following instructions can execute in either pipe, but are only pairable when in the V-pipe:
    near call
    short and near jump
    short and near conditional jump
All other integer instructions can execute in the U-pipe only, and are not pairable.
Two consecutive instructions will pair when the following conditions are met:
1. The first instruction is pairable in the U-pipe and the second instruction is pairable in the V-pipe.
2. The second instruction does not read or write a register which the first instruction writes to. Examples:
        MOV EAX, EBX / MOV ECX, EAX       ; read after write, do not pair
        MOV EAX, 1   / MOV EAX, 2         ; write after write, do not pair
        MOV EBX, EAX / MOV EAX, 2         ; write after read, pair OK
        MOV EBX, EAX / MOV ECX, EAX       ; read after read, pair OK
        MOV EBX, EAX / INC EAX            ; read and write after read, pair OK
3. In rule 2, partial registers are treated as full registers. Example:
        MOV AL, BL / MOV AH, 0            ; writes to different parts of the same register, do not pair
4. Two instructions which both write to parts of the flags register can pair despite rules 2 and 3. Example:
        SHR EAX, 4 / INC EBX              ; pair OK
5. An instruction which writes to the flags can pair with a conditional jump despite rule 2. Example:
        CMP EAX, 2 / JA LabelBigger       ; pair OK
6. The following instruction combinations can pair despite the fact that they both modify the stack pointer:
        PUSH + PUSH, PUSH + CALL, POP + POP
7. There are restrictions on the pairing of instructions with prefixes. There are several types of prefixes:
    - Instructions addressing a non-default segment have a segment prefix.
    - Instructions using 16-bit data in 32-bit mode, or 32-bit data in 16-bit mode, have an operand size prefix.
    - Instructions using 32-bit base or index registers in 16-bit mode have an address size prefix.
    - Repeated string instructions have a repeat prefix.
    - Locked instructions have a LOCK prefix.
    - Many instructions which were not implemented on the 8086 processor have a two-byte opcode where the first byte is 0FH. The 0FH byte behaves as a prefix on the PPlain, but not on the later versions. The most common instructions with 0FH prefix are: MOVZX, MOVSX, PUSH FS, POP FS, PUSH GS, POP GS, LFS, LGS, LSS, SETcc, BT, BTC, BTR, BTS, BSF, BSR, SHLD, SHRD, and IMUL with two operands and no immediate operand.
On the PPlain, a prefixed instruction can only execute in the U-pipe, except for conditional near jumps.
On the PMMX, instructions with operand size, address size, or 0FH prefix can execute in either pipe, whereas instructions with segment, repeat, or lock prefix can only execute in the U-pipe.
8. An instruction which has both a displacement and immediate data is not pairable on the PPlain, and only pairable in the U-pipe on the PMMX:
        MOV DWORD PTR DS:[1000], 0        ; not pairable, or only in U-pipe
        CMP BYTE PTR [EBX+8], 1           ; not pairable, or only in U-pipe
        CMP BYTE PTR [EBX], 1             ; pairable
        CMP BYTE PTR [EBX+8], AL          ; pairable
(Another problem with instructions which have both a displacement and immediate data on the PMMX is that such instructions may be longer than 7 bytes, which means that only one instruction can be decoded per clock cycle, as explained in chapter 12.)
9. Both instructions must be preloaded and decoded. This is explained in chapter 8.
10. There are special pairing rules for MMX instructions on the PMMX:
    - MMX shift, pack or unpack instructions can execute in either pipe, but cannot pair with other MMX shift, pack or unpack instructions.
    - MMX multiply instructions can execute in either pipe, but cannot pair with other MMX multiply instructions. They take 3 clock cycles, and the last 2 clock cycles can overlap with subsequent instructions in the same way as floating point instructions can (see chapter 24).
    - An MMX instruction which accesses memory or integer registers can execute only in the U-pipe and cannot pair with a non-MMX instruction.
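Rules 2 and 3 above lend themselves to a mechanical check. The Python sketch below covers only these two rules. The exceptions for flags (rules 4 and 5) and for the stack pointer (rule 6) are not modelled, nor is the question of which pipe an instruction can use. The table `FULL` and the function names are assumptions made for this illustration.

```python
# Map partial registers to the full register they belong to (rule 3).
FULL = {"AL": "EAX", "AH": "EAX", "AX": "EAX",
        "BL": "EBX", "BH": "EBX", "BX": "EBX",
        "CL": "ECX", "CH": "ECX", "CX": "ECX",
        "DL": "EDX", "DH": "EDX", "DX": "EDX"}

def full(registers):
    return {FULL.get(r, r) for r in registers}

def registers_allow_pairing(first_writes, second_reads, second_writes):
    """Rules 2 and 3 only: the second instruction must not read or write
    any register that the first instruction writes."""
    written = full(first_writes)
    return not (written & (full(second_reads) | full(second_writes)))

# (text, registers written by 1st, registers read by 2nd, registers written by 2nd)
examples = [
    ("MOV EAX,EBX / MOV ECX,EAX", {"EAX"}, {"EAX"}, {"ECX"}),
    ("MOV EAX,1 / MOV EAX,2",     {"EAX"}, set(),   {"EAX"}),
    ("MOV EBX,EAX / MOV EAX,2",   {"EBX"}, set(),   {"EAX"}),
    ("MOV EBX,EAX / MOV ECX,EAX", {"EBX"}, {"EAX"}, {"ECX"}),
    ("MOV EBX,EAX / INC EAX",     {"EBX"}, {"EAX"}, {"EAX"}),
    ("MOV AL,BL / MOV AH,0",      {"AL"},  set(),   {"AH"}),
]
for text, w1, r2, w2 in examples:
    print(text, "->", "pair OK" if registers_allow_pairing(w1, r2, w2) else "do not pair")
```

Note that only the registers written by the first instruction matter. What the first instruction reads never prevents pairing.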
10.2 Imperfect pairing
There are situations where the two instructions in a pair will not execute simultaneously, or will only partially overlap in time. They should still be considered a pair, though, because the first instruction executes in the U-pipe and the second in the V-pipe. No subsequent instruction can start to execute before both instructions in the imperfect pair have finished.
Imperfect pairing happens in the following cases:
1. If the second instruction suffers an AGI stall (see chapter 9).
2. Two instructions cannot access the same dword of memory simultaneously. The following examples assume that ESI is divisible by 4:
        MOV AL, [ESI] / MOV BL, [ESI+1]
The two operands are within the same dword, so they cannot execute simultaneously. The pair takes 2 clock cycles.
        MOV AL, [ESI+3] / MOV BL, [ESI+4]
Here the two operands are on each side of a dword boundary, so they pair perfectly and take only 1 clock cycle.
3. Rule 2 is extended to the case where bits 2-4 are the same in the two addresses (cache bank conflict). For dword addresses this means that the difference between the two addresses should not be divisible by 32. Examples:
        MOV [ESI], EAX / MOV [ESI+32000], EBX     ; imperfect pairing
        MOV [ESI], EAX / MOV [ESI+32004], EBX     ; perfect pairing
Pairable integer instructions which do not access memory take one clock cycle to execute, except for mispredicted jumps. MOV instructions to or from memory also take only one clock cycle if the data area is in the cache and properly aligned. There is no speed penalty for using complex addressing modes such as scaled index registers.
A pairable integer instruction which reads from memory, does some calculation, and stores the result in a register or in the flags takes 2 clock cycles (read/modify instructions).
A pairable integer instruction which reads from memory, does some calculation, and writes the result back to memory takes 3 clock cycles (read/modify/write instructions).
4. If a read/modify/write instruction is paired with a read/modify or read/modify/write instruction, then they will pair imperfectly.
The number of clock cycles used is given in this table:
                             Second instruction
    First instruction        MOV or           read/modify    read/modify/write
                             register only
    MOV or register only     1                2              3
    read/modify              2                2              3
    read/modify/write        3                4              5
Examples:
        ADD [mem1], EAX / ADD EBX, [mem2]         ; 4 clock cycles
        ADD EBX, [mem2] / ADD [mem1], EAX         ; 3 clock cycles
5. When two paired instructions both take extra time due to cache misses, misalignment, or jump misprediction, then the pair will take more time than either instruction alone, but less than the sum of the two.
6. A pairable floating point instruction followed by FXCH will make an imperfect pair if the next instruction is not a floating point instruction.
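The bank conflict test of rules 2 and 3 and the clock cycle table of rule 4 can both be written down compactly. The Python sketch below is an illustration with made-up names (`bank_conflict`, `PAIR_CLOCKS`, `pair_clocks`) and a hypothetical value for ESI. The table is indexed as [first instruction][second instruction], which is the orientation that agrees with the two ADD examples.

```python
def bank_conflict(address1, address2):
    """Rules 2 and 3: a conflict arises when bits 2-4 of both addresses are equal."""
    return (address1 ^ address2) & 0b11100 == 0

# PAIR_CLOCKS[first][second]; "simple" stands for MOV or register-only instructions.
PAIR_CLOCKS = {
    "simple":            {"simple": 1, "read/modify": 2, "read/modify/write": 3},
    "read/modify":       {"simple": 2, "read/modify": 2, "read/modify/write": 3},
    "read/modify/write": {"simple": 3, "read/modify": 4, "read/modify/write": 5},
}

def pair_clocks(first, second):
    return PAIR_CLOCKS[first][second]

ESI = 0x2000      # hypothetical address divisible by 4
print(bank_conflict(ESI, ESI + 1))          # True: same dword
print(bank_conflict(ESI + 3, ESI + 4))      # False: either side of a dword boundary
print(bank_conflict(ESI, ESI + 32000))      # True: bits 2-4 are equal
print(bank_conflict(ESI, ESI + 32004))      # False
print(pair_clocks("read/modify/write", "read/modify"))   # ADD [mem1],EAX / ADD EBX,[mem2] -> 4
print(pair_clocks("read/modify", "read/modify/write"))   # ADD EBX,[mem2] / ADD [mem1],EAX -> 3
```

The asymmetry of the table is the practical point: when a read/modify instruction and a read/modify/write instruction are paired, putting the read/modify instruction first saves a clock cycle.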
In order to avoid imperfect pairing you have to know which instructions go into the U-pipe and which go into the V-pipe. You can find this out by looking backwards in your code for instructions which are unpairable, pairable only in one of the pipes, or unable to pair because of one of the rules above.
Imperfect pairing can often be avoided by reordering your instructions. Example:
L1:     MOV     EAX, [ESI]
        MOV     EBX, [ESI]
        INC     ECX
Here the two MOV instructions form an imperfect pair because they both access the same memory location, so the sequence takes 3 clock cycles. You can improve the code by reordering the instructions so that INC ECX pairs with one of the MOV instructions.
L2:     MOV     EAX, OFFSET A
        XOR     EBX, EBX
        INC     EBX
        MOV     ECX, [EAX]
        JMP     L1
The pair INC EBX / MOV ECX,[EAX] is imperfect because the latter instruction has an AGI stall. The sequence takes 4 clock cycles. If you insert a NOP or any other instruction so that MOV ECX,[EAX] pairs with JMP L1 instead, then the sequence takes only 3 clock cycles.
The next example is in 16-bit mode, assuming that SP is divisible by 4:
L3:     PUSH    AX
        PUSH    BX
        PUSH    CX
        PUSH    DX
        CALL    FUNC
Here the PUSH instructions form two imperfect pairs, because both operands in each pair go into the same dword of memory. PUSH BX could possibly pair perfectly with PUSH CX (because they go on each side of a dword boundary), but it does not because it has already been paired with PUSH AX. The sequence therefore takes 5 clock cycles. If you insert a NOP or any other instruction so that PUSH BX pairs with PUSH CX, and PUSH DX with CALL FUNC, then the sequence takes only 3 clock cycles. Another way to solve the problem is to make sure that SP is not divisible by 4. Knowing whether SP is divisible by 4 in 16-bit mode can be difficult, so the best way to avoid this problem is to use 32-bit mode.
11. Splitting complex instructions into simpler ones (PPlain and PMMX)
You may split up read/modify and read/modify/write instructions to improve pairing. Example:
        ADD [mem1],EAX / ADD [mem2],EBX           ; 5 clock cycles
This code may be split up into a sequence which takes only 3 clock cycles:
        MOV ECX,[mem1] / MOV EDX,[mem2] / ADD ECX,EAX / ADD EDX,EBX
        MOV [mem1],ECX / MOV [mem2],EDX
Likewise, you may split up non-pairable instructions into pairable instructions:
        PUSH    [mem1]
        PUSH    [mem2]                ; non-pairable
Split up into:
        MOV     EAX, [mem1]
        MOV     EBX, [mem2]
        PUSH    EAX
        PUSH    EBX                   ; everything pairs
Other examples of non-pairable instructions which may be split up into simpler, pairable instructions:
        CDQ                           split into   MOV EDX,EAX / SAR EDX,31
        NOT EAX                       change to    XOR EAX,-1
        NEG EAX                       split into   XOR EAX,-1 / INC EAX
        MOVZX EAX,BYTE PTR [mem]      split into   XOR EAX,EAX / MOV AL,BYTE PTR [mem]
        JECXZ                         split into   TEST ECX,ECX / JZ
        LOOP                          split into   DEC ECX / JNZ
        XLAT                          change to    MOV AL,[EBX+EAX]
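That the arithmetic replacements give the same register results can be checked by emulating 32-bit arithmetic. The Python sketch below does this for CDQ, NOT and NEG on a few sample values. It is an illustration only: the function names are made up, and it compares register contents but not flags, which the replacement sequences may set differently.

```python
MASK = 0xFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit pattern as a signed integer."""
    return x - (1 << 32) if x & 0x80000000 else x

def cdq(eax):
    """Reference: EDX = high dword of EAX sign-extended to 64 bits."""
    return (to_signed(eax) >> 32) & MASK

def cdq_split(eax):
    """MOV EDX,EAX / SAR EDX,31"""
    edx = eax
    return (to_signed(edx) >> 31) & MASK

def not_ref(eax):
    return ~eax & MASK

def not_split(eax):
    """XOR EAX,-1"""
    return eax ^ MASK

def neg_ref(eax):
    return -eax & MASK

def neg_split(eax):
    """XOR EAX,-1 / INC EAX"""
    return ((eax ^ MASK) + 1) & MASK

samples = [0, 1, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF, 0x12345678]
for v in samples:
    assert cdq(v) == cdq_split(v)
    assert not_ref(v) == not_split(v)
    assert neg_ref(v) == neg_split(v)
print("all replacements agree on", len(samples), "sample values")
```

The same kind of check is a cheap safeguard whenever an instruction is replaced by a hand-written sequence, in the spirit of the test program recommended in chapter 4.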
If splitting instructions does not improve speed, then you may keep the complex or non-pairable instructions in order to reduce code size.
Splitting instructions is not needed on the PPro, PII and PIII, except when the split instructions generate smaller code.