How to optimize the Pentium series processor's code from: http:www.codingnow.com2000downloadpentopt.htm#26

xiaoxiao2021-03-06 72

How to Optimize for the Pentium Family of MicroProcessors

Contents

Introduction Literature Calling assembly functions from high level language Debugging and verifying Memory model Alignment Cache First time versus repeated execution Address generation interlock (PPlain and PMMX) Pairing integer instructions (PPlain and PMMX)

Perfect pairing Imperfect pairing Splitting complex instructions into simpler ones (PPlain and PMMX) Prefixes (PPlain and PMMX) Overview of PPro, PII and PIII pipeline Instruction decoding (PPro, PII and PIII) Instruction fetch (PPro, PII and PIII) Register renaming ( PPRO, PII and PIII)

Eliminating Dependencies Register Read Stalls Out of Order Execution (PPRO, PII AND PIII) Retirement (PPRO, PII AND PIII) Partial Stalls (PPRO, PII and PIII)

Partial register stalls Partial flags stalls Flags stalls after shifts and rotates Partial memory stalls Dependency chains (PPro, PII and PIII) Searching for bottlenecks (PPro, PII and PIII) Jumps and branches (all processors)

Branch prediction in PPlain Branch prediction in PMMX, PPro, PII and PIII Avoiding jumps (all processors) Avoiding conditional jumps by using flags (all processors) Replacing conditional jumps by conditional moves (PPro, PII and PIII) Reducing code size (all processors) Scheduling floating point code (PPLAIN AND PMMX) Loop Optimization (All Processors)

LOOPS in Pplain and Pmmx Loops in PPRO, PII AND PIII PROBLEMATIC INSTRUCTIONS

XCHG (all processors) Rotates through carry (all processors) String instructions (all processors) Bit test (all processors) Integer multiplication (all processors) WAIT instruction (all processors) FCOM FSTSW AX (all processors) FPREM (all processors) FRNDINT (all processors) FSCALE and exponential function (all processors) FPTAN (all processors) FSQRT (PIII) MOV [MEM], ACCUM (PPlain and PMMX) TEST instruction (PPlain and PMMX) Bit scan (PPlain and PMMX) FLDCW (PPro, PII and PIII) Special topicsLEA instruction (all processors) Division (all processors) Freeing floating point registers (all processors) Transitions between floating point and MMX instructions PMMX, PII and PIII) Converting from floating point to integer (All processors) Using integer instructions to do floating point operations (All processors) Using floating point instructions to do integer operations (PPlain and PMMX) Moving blocks of data (All processors) Self-modifying code (All processors) Detecting processor type (All processors) list of instruction Timings for PPLAIN AND PMMX

Integer INSTRUCTIONS FLOATING POINT INSTRUCTIONS MMX INSTRUCTIONS (PMMX) List of instruction Timings and Micro-Op Breakdown for PPRO, PII AND PIII

INTEGER INSTRUCTIONS FLOATING POINT INSTRUCTIONS MMX Instructions (PII and PIII) XMM Instructions (PIII) Testing Speed Comparison of The Different Microprocessors

INTRODUCTION

This Manual Describes in detail how to write Optimized Assembly Language Code, with Particular Focus on the Pentium® Family of Microprocessors.

Most of the information herein is based on my own research. Many people have sent me useful information and corrections for this manual, and I keep updating it whenever I have new important information. This manual is therefore more accurate, detailed, comprehensive and exact than any other source of information, and it contains many details not found anywhere else. This information will enable you in many cases to calculate exactly how many clock cycles a piece of code will take. I do not claim, though, that all information in this manual is exact: Some timings etc. can be difficult or impossible to measure exactly, and I do not have access to the inside information on technical implementations that the writers of Intel manuals have.The following versions of Pentium processors are discussed in this manual:

AbbreviationNamepplainPlain Old Pentium (WITHOUT MMX) PMMXPENTIUM WITH MMXPROPENTIUM PROPIIPENTIUM II (Including Celeron and Xeon) PiiIpendium III (Including Variants)

The assembly language syntax used in this manual is MASM 5.10 syntax. There is no official standard for X86 assembly language, but this is the closest you can get to a de facto standard since most assemblers have a MASM 5.10 compatible mode. (I do not Recommend Using Masm Version 5.10 Though, Because IT Has A Serious Bug in 32 bit mode. Use tasm or a later version of masm).

Some of the remarks in this manual may seem like a criticism of Intel. This should not be taken to mean that other brands are better. The Pentium family of microprocessors compare well with competing brands, they are better documented, and have better testability features. For these reasons, no competing brand has been subjected to the same level of independent research by me or by anybody else.Programming in assembly language is much more difficult than high level language. Making bugs is very easy, and finding them is very difficult. Now you have been warned! It is assumed that the reader is already experienced in assembly programming. If not, then please read some books on the subject and get some programming experience before you begin to do complicated optimizations.

The hardware design of the PPlain and PMMX chips has many features which are optimized specifically for some commonly used instructions or instruction combinations, rather than using general optimization methods. Consequently, the rules for optimizing software for this design are complicated and have many exceptions, but the possible gain in performance may be substantial. The PPro, PII and PIII processors have a very different design where the processor takes care of much of the optimization work by executing instructions out of order, but the more complicated design of these processors generate many potential bottlenecks, so there may be a lot to gain by optimizing manually for these processors The Pentium 4 processor has yet another design, and the optimization guidelines for Pentium 4 are quite different from previous versions This manual does not cover the Pentium 4 -.. the Reader is refered to manuals from intel.

Before you start to convert your code to assembly, make sure that your algorithm is optimal. Often you can improve a piece of code much more by improving the algorithm than by converting it to assembly code.Next, you have to identify the critical parts of your program. Often more than 99% of the CPU time is spent in the innermost loop of a program. in this case you should optimize only this loop and leave everything else in high level language. Some assembly programmers waste a lot of energy optimizing the Wrong Parts of Their Programs, The Only Significant Effect of Their Effort Being That The Programs Become More Difficult To Debug and Maintain!

If it is not obvious where the critical parts of your program are then you may use a profiler to find them. If it turns out that the bottleneck is disk access, then you may modify your program to make disk access sequential in order to improve disk .

Some High Level Language Compilers Offer Relative, Good Optimization for Specific Processors, But Further Optimization by Hand CAN Usually Make It Much Better.

Please don't send Your Programming Questions to me. I am Not Gonna Do Your HomeWork for you!

Good luck with your hunt for nanoseconds!

2. LitrateuRe

A lot of useful literature and tutorials can be downloaded for free from Intel's www site or acquired in print or on CD-ROM. It is recommended that you study this literature in order to get acquainted with the microprocessor architecture. However, the documents from Intel are not always accurate -. especially the tutorials have many errors (evidently, they have not tested their own examples) .I will not give the URL's here because the file locations change very often you can find the documents you need by using the search Facilities at: developer.intel.com or folow the links from www.agner.org/assem

Some Documents Are IN .pdf format. If you don't have.com www.adobe.com www.adobe.com

The use of mmx and xmm (simd) instructions for Optimizing Specific Applications Are Described in Several Application Notes. The instruction set is described in Various Manuals and Tutorials.

VTune Is A Software Tool from Intel for Optimizing Code. I have not test it and can there ip at the IT Here.

A Lot of Other Sources Than Intel Also Have Useful Information. Thase Sources Are Listed In The Faq for the newsGroup Comp.lang.asm.x86. For other Internet restources Follow the links from www.agner.org/assem

3. Calling assembly functions from High Level Language

You can either use inline assembly or code a subroutine entirely in assembly language and link it into your project. If you choose the latter option, then it is recommended that you use a compiler which is capable of translating high level code directly to assembly. This Assouting Method Right. MOST C Compilers Can do this.

The way of transferring parameters depends on the calling convention: calling convention parameter order on stack parameters removed by _cdecl first par at low address caller _stdcall first par at low address subroutine _fastcall compiler specific subroutine _pascal first par at high address subroutine...

The methods for function calling and name mangling can be quite complicated. There are many different calling conventions, and the different brands of compilers are not compatible in this respect. If you are calling assembly language subroutines from C , then the best method in terms of consistency and compatibility is to declare your functions extern "C" and _cdecl The assembly code must then have the function name prefixed by an underscore (_) and be assembled with case sensitivity on externals (option -mx) Example..:

; extern "c" INT _CDECL Square (int X);

_SQUARE PROC NEAR; Integer Square Function

Public _square

MOV EAX, [ESP 4]

Imul EAX

RET

_SQUARE ENDP

If you need to make overloaded functions, overloaded operators, methods, and other C specialties then you have to code it in C first and make your compiler translate it to assembly in order to get the right linking information and calling method. These details are different for different brands of compilers and seldom documented. If you want an assembly function with any other calling method than extern "C" and _cdecl to be callable from code compiled with different compilers then you need to give it one public name for each compiler. for Example An Overloaded Square Function:

INT Square (int X);

Square_i proc Near; Integer Square FUNCTION

@ Square $ qi Label Near; Link Name for Borland Compiler? Square @@ yahh @ z label Near; Link Name for Microsoft Compiler

_SQUARE__FI Label Near; Link Name for GNU Compiler

Public @ Square $ qi,? Square @@ yahh @ z, _square__fi

MOV EAX, [ESP 4]

Imul EAX

RET

Square_i endp

Double Square (double x);

Square_d proc Near; Double Precision Float Square Function

@ Square $ qd Label Near; Link Name for Borland Compiler

? Square @@ yann @ z Label Near; Link Name for Microsoft Compiler

_SQUARE__FD Label Near; Link Name for GNU Compiler

Public @ Square $ qd,? Square @@ yann @ z, _square__fd

FLD Qword PTR [ESP 4]

Fmul St (0), ST (0)

RET

Square_d endp

This method works because all these compilers use the _cdecl calling convention by default for overloaded functions. For methods (member functions), however, not even the calling convention is the same for all compilers (Borland and Gnu compilers use _cdecl convention with the 'this 'Pointer First, Microsoft Uses The_stdcall convention with the' this' Pointer IN ECX).

In general, you should not expect different compilers to be compatible on the object file level if you are using any of these constructs: long double, member pointers, virtual methods, new, delete, exceptions, system function calls, or standard library functions .

Register USAGE IN 16 Bit Mode Dos or Windows, C or C : 16-Bit Return Value In DX: AX: AX, FLOATING POINT RETURN VALUE IN ST (0). Registers AX, BX, CX, DX , ES and arithmetic flags may be changed by the procedure; all other registers must be saved and restored A procedure can rely on SI, DI, BP, DS and SS being unchanged across a call to another procedure.Register usage in 32 bit Windows. , C and other programming languages:.. Integer return value in EAX, floating point return value in ST (0) Registers EAX, ECX, EDX (not EBX) may be changed by the procedure; all other registers must be saved and restored Segment registers can not be changed, not even temporarily. CS, DS, ES, and SS all point to the flat segment group. FS is used by the operating system. GS is unused, but reserved. Flags may be changed by the procedure with the following Restrictions: The Direction Flag IS 0 by Default. The Direction Flag May Be set Temporarily, But Must Be Cleared Before Any Call OR return. The interrupt flag can not be cleared. The floating point register stack is empty at the entry of a procedure and must be empty at return, except for ST (0) if it is used for return value. MMX registers may be changed by the procedure and if so cleared by EMMS before returning and before calling any other procedure that may use floating point registers. All XMM registers may be modified by procedures. Rules for passing parameters and return values in XMM registers are described in Intel's application note AP 589. A Procedure Can Rely ON EBX, ESI, EDI, EBP AND All Segment Registers Being Unchanged Across A Call To Another Procedure.

4. Debugging and verifying

Debugging assembly code can be quite hard and frustrating, as you probably already have discovered. I would recommend that you start with writing the piece of code you want to optimize as a subroutine in a high level language. Next, write a test program that will test your subroutine thoroughly. Make sure the test program goes into all branches and boundary cases.When your high level language subroutine works with your test program then you are ready to translate the code to assembly language.

Now you can start to optimize. Each time you have made a modification you should run it on the test program to see if it works correctly. Number all your versions and save them so that you can go back and test them again in case you discover An Error That The Test Program Didn't Catch (SUCH AS WRITIG TO A WRONG Address).

Test the speed of the most critical part of your program with the method described in chapter 30 or with a test program If the code is significantly slower than expected, then the most probable causes are:. Cache misses (chapter 7), misaligned operands ( Chapter 6), First Time Penalty (Chapter 8), Branch Mispredictions (Chapter 22), Instruction Fetch Problems (Chapter 15), Register Read Stalls (16), or Long Dependency Chains (Chapter 20).

Highly optimized code tends to be very difficult to read and understand for others, and even for yourself when you get back to it after some time. In order to make it possible to maintain the code it is important that you organize it into small logical units (Procedures or Macros) with a well-defined interface and appropriate Comments. The More complicated the code is to read, The more important is a good documentation.

5. Memory Model

The Pentiums are designed primarily for 32 bit code, and the performance is inferior on 16 bit code. Segmenting your code and data also degrades performance significantly, so you should generally prefer 32 bit flat mode, and an operating system which supports this mode. The Code Examples Shown in this Manual Assumeth A 32 Bit Flat Memory Model, UNSS OtherWise Specified.6. Alignment

All Data in Ram Should Be Aligned to Addresses Divisible By 2, 4, 8, or 16 According to this scheme:

Alignment Operand Size PPLAIN AND PMMX PPRO, PII AND PIII 1 (Byte) 11 2 (Word) 22 4 (DWORD) 44 6 (FWORD) 48 8 (qword) 88 10 (TBYTE) 816 16 (OWORD) N.A.16

ON PPLAIN AND PMMX, Misaligned Data Will Take At Least 3 Clock Cycles Extra To Access IF A 4 byte Boundary Is Crossed. The Penalty Is Higher When a Cache Line Boundary is crossed.

ON PPRO, PII AND PIII, MISALIGNED DATA WILL COSTYO 6-12 Clocks Extra When a cache line boundary is crossed. Misaligned Operands Smaller Than 16 Bytes That Do Not Cross A 32 byte Boundary Give No Penalty.

Aligning Data By 8 or 16 on a dword size stack may be a problem. A common method is to set up an aligned frame pointer. A Function with aligned local Data May Look Like this:

_FuncWithalign Proc Near

Push EBP; ProLog Code

MOV EBP, ESP

And EBP, -8; Align Frame Pointer by 8

FLD DWORD PTR [ESP 8]; Function Parameter

Sub ESP, Localspace 4; Allocate Local Space

FSTP QWORD PTR [ebp-localspace]; Store Something in Aligned Space

...

Add ESP, LocalSpace 4; Epilog Code. Restore ESP

POP EBP; (AGI Stall on PPLAIN / PMMX)

RET

_Funcwithalign endp

While Aligning Data IS ALWAYS IMPORTANT, Aligning Code Is Not Necessary on the PPLAIN AND PMMX. Principles for Aligning Code on PPRO, PII AND PIII Are Explained in Chapter 15.7. Cache

THE PPLAIN AND PPRO HAVE 8 KB OF ON-Chip Cache (Level One Cache) for Code, AND 8 KB for Data. The PMMX, PII AND PIII HAVE 16 KB for Code and 16 KB for Data. Data in The Level 1 Cache CAN Be Read Or Written to In Just One Clock Cycle, WHEREAS A Cache Miss May Cost Many Clock Cycles. It is the the cache works in Order to Use it..

The data cache consists of 256 or 512 lines of 32 bytes each. Each time you read a data item which is not cached, the processor will read an entire cache line from memory. The cache lines are always aligned to a physical address divisible by 32 . When you have read a byte at an address divisible by 32, then the next 31 bytes can be read or written to at almost no extra cost. you can take advantage of this by arranging data items which are used near each other together into aligned blocks of 32 bytes of memory. If, for example, you have a loop which accesses two arrays, then you may interleave the two arrays into one array of structures, so that data which are used together are also stored together.

If The size of an array or other data structure is a multiple of 32 bytes, the you will preformed Align IT by 32.

The cache is set-associative. This means that a cache line can not be assigned to an arbitrary memory address. Each cache line has a 7-bit set-value which must match bits 5 through 11 of the physical RAM address (bit 0- 4 define the 32 bytes within a cache line). The PPlain and PPro have two cache lines for each of the 128 set-values, so there are two possible cache lines to assign to any RAM address. The PMMX, PII and PIII have four .The consequence of this is that the cache can hold no more than two or four different data blocks which have the same value in bits 5-11 of the address. You can determine if two addresses have the same set-value by the following method : Strip Off The Lower 5 Bits of Each Address To Get A Value Divisible by 32. IF The Difference Between The Two Truncated Addresses IS A Multiple Of 4096 (= 1000H), THE Addresses Have The Same Set-Value.

Let me illustrate this by the folowing piece of code, where esi holds an address divisible by 32:

Again: MOV EAX, [ESI]

MOV EBX, [ESI 13 * 4096 4]

MOV ECX, [ESI 20 * 4096 28]

Dec edx

Jnz Again

The three addresses used here all have the same set-value because the differences between the truncated addresses are multipla of 4096. This loop will perform very poorly on the PPlain and PPro. At the time you read ECX there is no free cache line with the Proper set-value so the processor takes the least all Was, That Is The One Which Was Used for Eax, And Fills It With The Data from [ESI 20 * 4096] To [ESI 20 * 4096 31] and reads ECX. Next, when reading EAX, you find that the cache line that held the value for EAX has now been discarded, so you take the least recently used line, which is the one holding the EBX value, and so ON .. You Have Nothing But Cache Misses and The Loop Takes Something Like 60 Clock Cycles. If The Third Line Is Changd To: Mov ECX, [ESI 20 * 4096 32]

then we have crossed a 32 byte boundary, so that we do not have the same set-value as in the first two lines, and there will be no problem assigning a cache line to each of the three addresses. The loop now takes only 3 clock cycles (except for the first time) -! a very considerable improvement As already mentioned, the PMMX, PII and PIII have 4-way caches so that you have four cache lines with the same set-value (Some Intel documents erroneously say. That the PII Cache IS 2-WAY).

It may be very difficult to determine if your data addresses have the same set-values, especially if they are scattered around in different segments. The best thing you can do to avoid problems of this kind is to keep all data used in the critical part or your program within one contiguous block not bigger than the cache, or two contiguous blocks no bigger than half that size (for example one block for static data and another block for data on the stack). This will make sure that your cache lines are used optimally.If the critical part of your code accesses big data structures or random data addresses, then you may want to keep all frequently used variables (counters, pointers, control variables, etc.) within a single contiguous block of max 4 kbytes so that you have a complete set of cache lines free for accessing random data. Since you probably need stack space anyway for subroutine parameters and return addresses, the best thing is to copy all frequently used static data to dynamic Variables on the stack, and copy the..

Reading a data item which is not in the level one cache causes an entire cache line to be filled from the level two cache, which takes approximately 200 ns (that is 20 clocks on a 100 MHz system or 40 clocks on a 200 MHz system) , but the bytes you ask for first 40-100 ns. if the data item is not in The Level Two Cache Either, The Level Get a Delay of Something Like 200-300 NS. this delay Will Be Somewhat Longer IF you cross A DRAM Page Boundary. (The Size of A DRAM Page IS 1 KB for 4 AND 8 MB 72 PIN RAM MODULES, AND 2 KB FOR 16 and 32 MB MODULES).

When reading big blocks of data from memory, the speed is limited by the time it takes to fill cache lines You can sometimes improve speed by reading data in a non-sequential order:. Before you finish reading data from one cache line start reading the first item from the next cache line This method can increase reading speed by 20 -.. 40% when reading from main memory or level 2 cache on PPlain and PMMX, and from level 2 cache on PPro, PII and PIII A disadvantage of this method is of course that the program code becomes extremely clumsy and complicated. For further information on this trick see www.intelligentfirm.com.When you write to an address which is not in the level 1 cache, then the value will go right through to the level 2 cache or to the RAM (depending on how the level 2 cache is set up) on the PPlain and PMMX. This takes approximately 100 ns. If you write eight or more times to the same 32 byte block of memory without also reading from IT, And The Block is Not in The Level O ne cache, then it may be advantageous to make a dummy read from the block first to load it into a cache line. All subsequent writes to the same block will then go to the cache instead, which takes only one clock cycle. On PPlain and Pmmx, There is Sometimes A Small Penalty for Writing Repeatedly To The Same Address WITHOUTETING IN BETWEEN.

On PPro, PII and PIII, a write miss will normally load a cache line, but it is possible to setup an area of memory to perform differently, for example video RAM (See Pentium Pro Family Developer's Manual, vol 3:. Operating System Writer's ").

Other Ways of Speeding Up Memory Reads and Writes Are Discussed in Chapter 27.8 Below.

The PPlain and PPro have two write buffers, PMMX, PII and PIII have four. On the PMMX, PII and PIII you may have up to four unfinished writes to uncached memory without delaying the subsequent instructions. Each write buffer can handle operands up to 64 bits wide.Temporary data may conveniently be stored on the stack because the stack area is very likely to be in the cache. However, you should be aware of the alignment problems if your data elements are bigger than the stack word size.

If the life ranges of two data structures do not overlap, then they may share the same RAM area to increase cache efficiency. This is consistent with the common practice of allocating space for temporary variables on the stack.

Storing temporary data in registers is of course even more efficient. Since registers is a scarce ressource you may want to use [ESP] rather than [EBP] for addressing data on the stack, in order to free EBP for other purposes. Just don ' t forget that the value of ESP changes every time you do a PUSH or POP. (you can not use ESP under 16-bit Windows because the timer interrupt will modify the high word of ESP at unpredictable places in your code.)

There is a separate cache for code, which is similar to the data cache. The size of the code cache is 8 kb on PPlain and PPro and 16 kb on the PMMX, PII and PIII. It is important that the critical part of your code (the innermost loops) fit in the code cache. Frequently used pieces of code or routines which are used together should preferable be stored near each other. Seldom used branches or procedures should be put away in the bottom of your code or somewhere else.

8. First Time Versus REPEATED EXECUTION

A piece of code usually takes much more time the first time it is executed than when it is repeated The reasons are the following:. Loading the code from RAM into the cache takes longer time than executing it Any data accessed by the code has to. be loaded into the cache, which may take much more time than executing the instructions. When the code is repeated then the data are more likely to be in the cache. Jump instructions will not be in the branch target buffer the first time they execute, and therefore are less likely to be predicted correctly. See chapter 22. In the PPlain, decoding the code is a bottleneck. If it takes one clock cycle to determine the length of an instruction, then it is not possible to decode two instructions per clock cycle, because the processor does not know where the second instruction begins. The PPlain solves this problem by remembering the length of any instruction which has remained in the cache since last time it was executed. As a consequence of this, a set of instructions will not pair in the PPlain the first time they are executed, unless the first of the two instructions is only one byte long. The PMMX, PPro, PII and PIII have no penalty on first time decoding.

Fors the Subsequent Times.

If you have a big loop which does not fit into the code cache then you will get penalties all the time because it does not run from the cache. You should therefore try to reorganize the loop to make it fit into the cache.

IF you have very much one jumps, calls, and branches inside a loop, the you get the pen point the branch target buffer miss.

Likewise, IF a loop refeatedly acfor

It takes one clock cycle to calculate the address needed by an instruction which accesses memory. Normally, this calculation is done at a separate stage in the pipeline while the preceding instruction or instruction pair is executing. But if the address depends on the result of an instruction executing in the preceding clock cycle, then you have to wait an extra clock cycle for the address to be calculated This is called an AGI stall Example:.. ADD EBX, 4 / MOV EAX, [EBX]; AGI stall The stall in This Example Can Be Removed by Putting Some Other Instructions in Between Add Ebx, 4 and Mov EX, [EBX] or by ReWriting The Code To: MOV EAX, [EBX 4] / Add EBX, 4

You can also get an AGI stall with instructions which use ESP implicitly for addressing, such as PUSH, POP, CALL, and RET, if ESP has been changed in the preceding clock cycle by instructions such as MOV, ADD, or SUB. The PPlain and PMMX have special circuitry to predict the value of ESP after a stack operation so that you do not get an AGI delay after changing ESP with PUSH, POP, or CALL. you can get an AGI stall after RET only if it has an immediate operand To Add to ESP.

EXAMPLES:

Add ESP, 4 / POP ESI; AGI STALL

POP Eax / Pop ESI; No Stall, Pair

MOV ESP, EBP / RET; AGI Stall

Call L1 / L1: MOV Eax, [ESP 8]; No Stall

RET / POP EAX; No Stall

RET 8 / POP EAX; AGI Stall

The Lea Instruction is Also Subject To An Agi Stall IT Uses A Base or Index Register Which Has Been Changed In The Preceding Clock Cycle. EXAMPLE: INC ESI / LEA ESI]; AGI Stall

PPRO, PII AND PIII HAVE NO AGI Stalls for Memory Reads and LEA, But They Do Have Agi Stalls for Memory Writes. This is not VERY SIGNIFICANT UNSE Subsequent Code Has to Wait for the Write To Finish.

10. Pairing Integer Instructions (PPLAIN AND PMMX)

10.1 Perfect Pairing

The PPlain and PMMX have two pipelines for executing instructions, called the U-pipe and the V-pipe. Under certain conditions it is possible to execute two instructions simultaneously, one in the U-pipe and one in the V-pipe. This can Almost Double The Speed. It is Therefore Advantageous To Reorder Your Instructions to make them pair.

The Following Instructions Are Pairable In Either Pipe:

Mov Register, Memory, Or Immediate Into Register or Memory Push Register Or Immediate, Pop Register Lea, NOP Inc, DEC, Add, Sub, CMP, AND, OR, XOR, AND Some Forms of Test (See Chapter 26.14)

The Following Instructions Are Pairable in The U-PIPE ONLY:

ADC, SBB SHR, SAR, SHL, SAL with IMMEDIAT ROR, ROL, RCR, RCL with an IMMEDITE COUNT OF 1

The Following Instructions Can Execute In Either Pipe But Are ONLY PAIRABLE WHENIN THE V-PIPE:

Near Call Short and Near Jump Short and Near Conditional Jump.

All Other Integer INSTRUCTIONS CAN EXECUTE IN THE U-PIPE ONLY, AND ARE NOT PAIRABLE.

Two Consecutive Instructions Will Pair When The Following Conditions Are Met:

1. The first instruction is pairable in The U-PIPE AND THE Second Instruction IS Pairable In The V-PIPE.

2. The Second Instruction Does Not Read or Write a Register Which To. Examples: Mov Eax, EBX / MOV ECX, EBX / MOV ECX, EBX / MOV ECX, EBX / MOV ECX, EBX / MOV ECX, EBX / MOV ECX, EBX / MOV ECX, EBX / MOV ECX, EBX / MOV ECX, EBX / MOV EAX

MOV Eax, 1 / Mov Eax, 2; Write After Write, Do Not Pair

MOV EBX, EAX / MOV Eax, 2; Write After Read, Pair OK

MOV EBX, EAX / MOV ECX, EAX; Read after Read, Pair OK

MOV EBX, EAX / Inc Eax; Read and Write After Read, Pair OK

3. In Rule 2 Partial Registers Are Treated As Full Registers. EXAMPLE:

MOV Al, BL / MOV AH, 0

Writes to Different Parts of The Same Register, Do Not Pair

4. Two instructions Which Both Write to Parts of The Flags Register CAN Pair Despite Rule 2 and 3. Example:

SHR EAX, 4 / Inc EBX; Pair OK

5. An Instruction Which Writes to the Flags Can Pair with a Conditional Jump Despite Rule 2. Example:

CMP EAX, 2 / JA Labelbigger; Pair OK

6. The Following Instruction Combinations Can Pair Despite The Fact That Thei Both Modify The Stack Pointer:

Push Push, Push Call, POP POP

7. There Are Restrictions on the Pairing of Instructions with prefix. There Are Several Types of Prefixes:

instructions addressing a non-default segment have a segment prefix. instructions using 16 bit data in 32 bit mode, or 32 bit data in 16 bit mode have an operand size prefix. instructions using 32 bit base or index registers in 16 bit mode have an address size prefix. repeated string instructions have a repeat prefix. locked instructions have a LOCK prefix. many instructions which were not implemented on the 8086 processor have a two byte opcode where the first byte is 0FH. The 0FH byte behaves as a prefix on the PPLAIN, But Not On The Other Versions. The Most Common Instructions with 0fh Prefix Are: Movzx, Movsx, Push Fs, POP FS, PUSH GS, POP GS, LFS, LGS, LSS, SETCC, BT, BTC, BTR, BTS, BSF, BSR, SHLD, SHRD, AND IMUL WITH TWO OPERANDS AND NO IMMEDITERAND.ON THE PPLAIN, A Prefixed Instruction Can Only Except for Conditional Near Jumps.

ON The PMMX, Instructions with Operands, Address Size, or 0FH Prefix CAN Execute IN Either Pipe, WHEREAS INSTRUCTIONS with Segment, Repeat, or Lock Prefix Can Only Execute In The U-PIPE.

8. An Instruction Which Has Both a Displacement and Immediate Data IS Not Pairable On The PPLAIN AND OnLY PAIRABLE IN THE U-PIPE ON THE PMMX:

MOV DWORD PTR DS: [1000], 0; Not Pairable or Only in U-PIPE

CMP BYTE PTR [EBX 8], 1; Not Pairable or Only in U-PIPE

CMP Byte PTR [EBX], 1; Pairable

CMP BYTE PTR [EBX 8], Al; Pairable

(Another problem with instructions which have both a displacement and immediate data on the PMMX is that such instructions may be longer than 7 bytes, which means that only one instruction can be decoded per clock cycle, as explained in chapter 12.)

9. Both Instructions Must Be Preloaded and Decoded. This is expected in chapter 8.10. There Are Special Pairing Rules for MMX Instructions on the PMMX:

MMX shift, pack or unpack instructions can execute in either pipe but can not pair with other MMX shift, pack or unpack instructions. MMX multiply instructions can execute in either pipe but can not pair with other MMX multiply instructions. They take 3 clock cycles and the last 2 clock cycles can overlap with subsequent instructions in the same way as floating point instructions can (see chapter 24). an MMX instruction which accesses memory or integer registers can execute only in the U-pipe and can not pair with a non-MMX instruction.

10.2 Imperfect Pairing

There are situations where the two instructions in a pair will not execute simultaneously, or only partially overlap in time. They should still be considered a pair, though, because the first instruction executes in the U-pipe, and the second in the V- .

Imperfect Pairing Will Happen in The Following Cases:

1. If The Second Instructions Suffers An AGI Stall (See Chapter 9).

. 2. Two instructions can not access the same DWORD of memory simultaneously The following examples assume that ESI is divisible by 4: MOV AL, [ESI] / MOV BL, [ESI 1] The two operands are within the same DWORD, so they can not execute simultaneously. The pair takes 2 clock cycles. MOV AL, [ESI 3] / MOV BL, [ESI 4] Here the two operands are on each side of a DWORD boundary, so they pair perfectly, and take only one Clock Cycle.

3. Rule 2 is extended to the case where bit 2-4 is the same in the two addresses (cache bank conflict) For DWORD addresses this means that the difference between the two addresses should not be divisible by 32. Examples:. MOV [ ESI], EAX / MOV [ESI 32000], EBX; Imperfect Pairing

MOV [ESI], ESI 32004], EBX; Perfect Pairing

Pairable integer instructions which do not access memory take one clock cycle to execute, except for mispredicted jumps. MOV instructions to or from memory also take only one clock cycle if the data area is in the cache and properly aligned. There is no speed penalty for Using Complex Addressing Modes Such as scaled index registers.

A Pairable Integer Instruction Which Reads from Memory, Does Some Calculation, and Stores The Result IN A Register OR Flags, Takes 2 Clock Cycles (Read / Modify Instructions).

A Pairable Integer Instruction Which Reads from Memory, Does Some Calculation, And Writes The Result Back to The Memory, Takes 3 Clock Cycles (Read / Modify / Write Instructions).

4. IF a read / modify / write instruction ispaired with a read / modify or read / modify / write instruction, THEY WILL PAIR IMPERFECTLY.

The Number of Clock Cycles Used is Given In The Following Table:

First INSTRUCOND INSTRUCTION MOV OR Register Only Read / Modify Read / Modify / Write Mov or Register Only 1 2 3 Read / Modify 2 2 3 Read / Modify / Write 3 4 5

Example: Add [Mem1], Eax / Add EBX, [MEM2]; 4 Clock Cycles Add Ebx, [MEM2] / Add [Mem1], EAX; 3 Clock Cycles

5. When Two Paired Instructions Both Take Extra Time Due To Cache Misses, Misalignment, or Jump Misprediction, The Pair Will Take More Time Than Each Instruction, But Less Than Sum of The Two.

6. A pairable floating point instruction followed by FXCH will make imperfect pairing if the next instruction is not a floating point instruction.In order to avoid imperfect pairing you have to know which instructions go into the U-pipe, and which to the V- For one of the rules above.

Imperfect pairing can offten be avoided by reordering your instructions. EXAMPLE:

L1: MOV Eax, [ESI]

MOV EBX, [ESI]

Inc ECX

Here the two MOV instructions form an imperfect pair because they both access the same memory location, and the sequence takes 3 clock cycles. You can improve the code by reordering the instructions so that INC ECX pairs with one of the MOV instructions.

L2: MOV Eax, Offset A

XOR EBX, EBX

Inc EBX

MOV ECX, [EAX]

JMP L1

The pair INC EBX / MOV ECX, [EAX] is imperfect because the latter instruction has an AGI stall. The sequence takes 4 clocks. If you insert a NOP or any other instruction so that MOV ECX, [EAX] pairs with JMP L1 instead .

The next example is in 16 bit mode, Assuming That SP is Divisible By 4:

L3: Push AX

Push BX

PUSH CX

Push dx

Call func

Here the PUSH instructions form two imperfect pairs, because both operands in each pair go into the same dword of memory. PUSH BX could possibly pair perfectly with PUSH CX (because they go on each side of a DWORD boundary) but it does not because it has already been paired with PUSH AX. The sequence therefore takes 5 clocks. If you insert a NOP or any other instruction so that PUSH BX pairs with PUSH CX, and PUSH DX with CALL FUNC, then the sequence will take only 3 clocks. Another way to solve the problem is to make sure that SP is not divisible by 4. Knowing whether SP is divisible by 4 or not in 16 bit mode can be difficult, so the best way to avoid this problem is to use 32 bit mode. 11. Splitting Complex Instructions INTO SIMPLER ANES (PPLAIN and PMMX)

You may split up read / modify and read / modify / write instructions to improve pairing Example:. ADD [mem1], EAX / ADD [mem2], EBX; 5 clock cycles This code may be split up into a sequence which takes only 3 Clock Cycles:

MOV ECX, [MEM1] / MOV EDX, [MEM2] / Add ECX, EAX / Add Edx, EBX

MOV [MEM1], ECX / MOV [MEM2], EDX

LIKEWISE You May Split Up Non-Pairable Instructions Into Pairable Instructions:

Push [MEM1]

Push [MEM2]; non-pairable

Split Up Into:

MOV EAX, [MEM1]

Mov EBX, [MEM2]

Push EAX

Push Ebx; Everything PAIRS

Other examples of non-pairable instructions which may be split up into simpler pairable instructions: CDQ split into: MOV EDX, EAX / SAR EDX, 31 NOT EAX change to XOR EAX, -1 NEG EAX split into XOR EAX, -1 / INC Eax Movzx Eax, Byte Ptr [Mem] split Into xor Eax, Eax / Mov Al, Byte Ptr [MEM] JECXZ Split Into Test ECX, ECX / JZ LOOP SPLIT INTO DEC ECX / JNZ XLAT CHANGE to MOV Al, [EBX EAX ]

If splitting instructions does not improve speed, then you may keep the complex or nonpairable instructions in order to reduce code size.Splitting instructions is not needed on the PPro, PII and PIII, except when the split instructions generate fewer uops.

12. prefixes (PPLAIN AND PMMX)

An Instruction with One or More Prefixes May Not Be Aable To Execute In The V-PIPE (SE Chapter 10, SECT.7), AND IT May Take More Than One Clock Cycle to Decode.

On the PPLAIN, The DECODING DELAY IS One Clock Cycle for Each Prefix Except for the 0fh Prefix of Conditional Near Jumps.

The PMMX has no decoding delay for 0FH prefix. Segment and repeat prefixes take one clock extra to decode. Address and operand size prefixes take two clocks extra to decode. The PMMX can decode two instructions per clock cycle if the first instruction has a segment or repeat prefix or no prefix, and the second instruction has no prefix. Instructions with address or operand size prefixes can only decode alone on the PMMX. Instructions with more than one prefix take one clock extra for each prefix.

Address size prefixes can be avoided by using 32 bit mode. Segment prefixes can be avoided in 32 bit mode by using a flat memory model. Operand size prefixes can be avoided in 32 bit mode by using only 8 bit and 32 bit integers.

Where prefixes are unavoidable, the decoding delay may be masked if a preceding instruction takes more than one clock cycle to execute. The rule for the PPlain is that any instruction which takes N clock cycles to execute (not to decode) can 'overshadow' the decoding delay of N-1 prefixes in the next two (sometimes three) instructions or instruction pairs. in other words, each extra clock cycle that an instruction takes to execute can be used to decode one prefix in a later instruction. This shadowing effect even extends across a predicted branch. Any instruction which takes more than one clock cycle to execute, and any instruction which is delayed because of an AGI stall, cache miss, misalignment, or any other reason except decoding delay and branch misprediction, has a shadowing effect .

The PMMX has a similar shadowing effect, but the mechanism is different. Decoded instructions are stored in a transparent first-in-first-out (FIFO) buffer, which can hold up to four instructions. As long as there are instructions in the FIFO buffer you get no delay. When the buffer is empty then instructions are executed as soon as they are decoded. The buffer is filled when instructions are decoded faster than they are executed, ie when you have unpaired or multi-cycle instructions. The FIFO buffer is emptied when instructions execute faster than they are decoded, ie when you have decoding delays due to prefixes. The FIFO buffer is empty after a mispredicted branch. The FIFO buffer can receive two instructions per clock cycle provided that the second instruction is without prefixes and None of the instructions area longer Than 7 Bytes. The Two Execution Pipelines (U and V) Can Each Receive One Instruction Per Clock Cycle from the FIFFER.EXAMPLES: CLD / Rep Movsd The CLD Installer ruction takes two clock cycles and can therefore overshadow the decoding delay of the REP prefix. The code would take one clock cycle more if the CLD instruction was placed far from the REP MOVSD. CMP DWORD PTR [EBX], 0 / MOV EAX, 0 / SETNZ AL The CMP instruction takes two clock cycles here because it is a read / modify instruction. The 0FH prefix of the SETNZ instruction is decoded during the second clock cycle of the CMP instruction, so that the decoding delay is hidden on the PPlain ( The pmmx has no decoding delay for the 0fh.

Prefix Penalties In PPRO, PII AND PIII Are Described in Chapter 14.

13. Overview of PPRO, PII and PIII PIPELINE

The architecture of the PPro, PII and PIII microprocessors is well explained and illustrated in various manuals and tutorials from Intel. It is recommended that you study this material in order to get an understanding of how these microprocessors work. I will describe the structure briefly here with particular focus on those elements that are important for optimizing code.Instruction codes are fetched from the code cache in aligned 16-byte chunks into a double buffer that can hold two 16-byte chunks. The code is passed on from the double buffer to the decoders in blocks which I will call ifetch blocks (instruction fetch blocks). The ifetch blocks are usually 16 bytes long, but not aligned. The purpose of the double-buffer is to make it possible to decode an instruction that crosses a 16- BYTE Boundary (IE An Address Divisible By 16).

The ifetch block goes to the instruction length decoder, which determines where each instruction begins and ends, and next to the instruction decoders. There are three decoders so that you can decode up to three instructions in each clock cycle. A group of up to three Instructions That Are Decoded In The Same Clock Cycle IS Called A Decode Group.

. The decoders translate instructions into micro-operations, abbreviated uops Simple instructions generate only one uop, while more complex instructions may generate several uops For example, the instruction ADD EAX, [MEM] is decoded into two uops:. One for reading the source .

The three decoders are called D0, D1, and D2. D0 can handle all instructions, while D1 and D2 can handle only simple instructions that generate one uop.The uops from the decoders go via a short queue to the register allocation table (RAT) . The execution of uops work on temporary registers which are later written to the permanent registers EAX, EBX, etc. The purpose of the RAT is to tell the uops which temporary registers to use, and to allow register renaming (see later).

After the RAT, the uops to go the reorder buffer (ROB). The purpose of the ROB is to enable out-of-order execution. A uop stays in the reservation station until the operands it needs are available. If an operand for one Uop is delayed Because A Previous Uop That Genereates The Operand Is Not Finished Yet, The Rob May Find Another Uop Later In The Meantime in Order to Save Time.

The uops that are ready for execution are sent to the execution units, which are clustered around five ports: Port 0 and 1 can handle arithmetic operations, jumps, etc. Port 2 takes care of all reads from memory, port 3 calculates addresses for memory Writes, and port 4 does memory writes.

When an instruction has been executed then it is marked in the ROB as ready to retire. It then goes to the retirement station. Here the contents of the temporary registers used by the uops are written to the permanent registers. While uops can be executed out ORDER, The Must Be Retired in Order.

In The Following Chapters, I Will Describe in detail how to optimize the throughput of each step in the pipeline.

14. Instruction Decoding (PPRO, PII AND PIII)

I am describing instruction decoding before instruction fetching here because you need to know how the decoders work in order to understand the possible delays in instruction fetching.The decoders can handle three instructions per clock cycle, but only when certain conditions are met. Decoder D0 can .

To Summarize The Rules for Decoding Two or Three Instructions in The Same Clock Cycle:

The first instruction (D0) generates no more than 4 uops, The second and third instructions generate no more than 1 uop each, The second and third instructions are no more than 8 bytes long each, The instructions must be contained within the same 16 bytes IFETCH Block (See next chapter).

........................... ..

An Instruction That Genere Than 4 uops Takes Two or More Clock Cycles To Decode, and no Other Instructions Can Decode in Parallel.

It follows from the rules above that the decoders can produce a maximum of 6 uops per clock cycle if the first instruction in each decode group generates 4 uops and the next two generate 1 uop each. The minimum production is 2 uops per clock cycle, which YOU GET WHEN ALL INSTRUCTIONS Generate 2 UOPS EACH, SO THAT D1 and D2 Are Never USED.

For maximum throughput, it is recommended that you order your instructions according to the 4-1-1 pattern: instructions that generate 2 to 4 uops can be interspearsed with two simple 1-uop instructions for free, in the sense that they do not add To the decoding time. EXAMPLE: MOV EBX, [MEM1]; 1 UOP (d0)

INC EBX; 1 UOP (D1)

Add Eax, [MEM2]; 2 UOPS (D0)

Add [MEM3], EAX; 4 UOPS (D0)

THIS Takes 3 Clock Cycles To Decode. You Can Save ONE CLOCK CYCLE BY REORDERING The Instructions Into Two Decode Groups:

Add Eax, [MEM2]; 2 UOPS (D0)

MOV EBX, [MEM1]; 1 UOP (D1)

INC EBX; 1 UOP (D2)

Add [MEM3], EAX; 4 UOPS (D0)

The decoders now generate 8 uops in two clock cycles, which is probably satisfactory. Later stages in the pipeline can handle only 3 uops per clock cycle so with a decoding rate higher than this you can assume that decoding is not a bottleneck. However, complications In The Fetch Mechanism CAN Delay Decoding As Described in The next chapter

You can see How Many Uops Each Instruction Generated IN The Tables in Chapter 29.

Instruction Prefixes Can Also Incoders. Instructions CAN Have Several Kinds of Prefixes:

An operand size prefix is needed when you have a 16-bit operand in a 32-bit environment or vice versa. (Except for instructions that can only have one operand size, such as FNSTSW AX). An operand size prefix gives a penalty of a few clocks if the instruction has an immediate operand of 16 or 32 bits because the length of the operand is changed by the prefix Examples: ADD BX, 9; no penalty because immediate operand is 8 bitsMOV WORD PTR [MEM16], 9;. Penalty Because Operand IS 16 BITS The Last Instruction SHOULD BE CHANGED TO: MOV EAX, 9

MOV WORD PTR [MEM16], AX; no penalty because no immediate An address size prefix is used when you use 32-bit addressing in 16 bit mode or vice versa This is seldom needed and should generally be avoided The address size prefix gives.. a penalty whenever you have an explicit memory operand (even when there is no displacement) because the interpretation of the r / m bits in the instruction code is changed by the prefix. Instructions with only implicit memory operands, such as string instructions, have no penalty with address size prefix. segment prefixes are used when you address data in a non-default data segment. segment prefixes give no penalty on the PPro, PII and PIII. Repeat prefixes and lock prefixes give no penalty in the decoders. There is always A Penalty if you have more Than One Prefix. This Penalty is Usually One Clock Per Prefix.

15. Instruction Fetch (PPRO, PII and PIII)

The code is fetched in aligned 16-bytes chunks from the code cache and placed in the double buffer, which is called so because it can contain two such chunks. The code is then taken from the double buffer and fed to the decoders in blocks which are usually 16 bytes long, but not necessarily aligned by 16. I will call these blocks ifetch blocks (instruction fetch blocks). If an ifetch block crosses a 16 byte boundary in the code then it needs to take from both chunks in the double buffer . So the purpose of the double buffer is to allow instruction fetching across 16 byte boundaries.The double buffer can fetch one 16-bytes chunk per clock cycle and can generate one ifetch block per clock cycle. The ifetch blocks are usually 16 bytes long, But Can Be Shorter if The IS A Predicted Jump In The Block. (See Chapter 22 About Jump PREDIX).

Unfortunately, the double buffer is not big enough for handling fetches around jumps without delay. If the ifetch block that contains the jump instruction crosses a 16-byte boundary then the double buffer needs to keep two consecutive aligned 16-bytes chunks of code in order to generate it. If the first instruction after the jump crosses a 16-byte boundary, then the double buffer needs to load two new 16-bytes chunks of code before a valid ifetch block can be generated. This means that, in the worst case , the decoding of the first instruction after a jump can be delayed for two clock cycles. You get one penalty for a 16-byte boundary in the ifetch block containing the jump instruction, and one penalty for a 16-byte boundary in the first instruction after the jump. you can get bonus if you have more than one decode group in the ifetch block that contains the jump because this gives the double buffer extra time to fetch one or two 16-byte chunks of code in advance for the instructions af ter the jump. The bonuses can compensate for the penalties according to the table below. If the double buffer has fetched only one 16-byte chunk of code after the jump, then the first ifetch block after the jump will be identical to this chunk, that is, aligned to a 16-byte boundary. In other words, the first ifetch block after the jump will not begin at the first instruction, but at the nearest preceding address divisible by 16. If the double buffer has had time to load two 16-Byte Chunks, THE New IFETCH Block Can Cross A 16-Byte Boundary And Beg. These Rules Are Summarized in The Following Table:

Number ofdecode groupsin ifetch-blockcontaining jump16-byteboundary in thisifetch-block16-byteboundary in firstinstruction afterjumpdecoder delayalignment of firstifetch after jump1000by 161011to instruction1101by 161112to instruction2000to instruction2010to instruction2100by 162111to instruction3 or more000to instruction3 or more010to instruction3 or more100to instruction3 or more110to instructionJumps delay the fetching so that a loop Always Takes At Least Two Clock Cycles More Per Itemion Than The Number of 16 Byte Boundaries in the loop.

A further problem with the instruction fetch mechanism is that a new ifetch block is not generated until the previous one is exhausted. Each ifetch block can contain several decode groups. If a 16 bytes long ifetch block ends with an unfinished instruction, then the next ifetch block will begin at the beginning of that instruction. The first instruction in an ifetch block always goes to decoder D0, and the next two instructions go to D1 and D2, if possible. The consequence of this is that D1 and D2 are used less than optimally. If the code is structured according to the recommended 4-1-1 pattern, and an instruction intended to go into D1 or D2 happens to be the first instruction in an ifetch block, then that instruction has to go into D0 with the result that one clock cycle is wasted. This is probably a hardware design flaw. At least it is suboptimal design. The consequence of this problem is that the time it takes to decode a piece of code can vary considerably depending on where the First IFETCH Block Begins.

If decoding speed is critical and you want to avoid these problems then you have to know where each ifetch block begins. This is quite a tedious job. First you need to make your code segment paragraph-aligned in order to know where the 16-byte boundaries are. Then you have to look at the output listing from your assembler to see how long each instruction is. (It is recommended that you study how instructions are coded so that you can predict the lengths of the instructions.) If you know where one ifetch block begins then you can find where the next ifetch block begins in the following way:.. Make the block 16 bytes long If it ends at an instruction boundary then the next block will begin there If it ends with an unfinished instruction then the next block will begin at the beginning of this instruction. (Only the lengths of the instructions counts here, it does not matter how many uops they generate or what they do). This way you can work your way all through the code and mark WHERE EACH . Ifetch block begins The only problem is knowing where to start If you know where one ifetch block is then you can find all the subsequent ones, but you have to know where the first one begins Here are some guidelines..:

The first ifetch block after a jump, call, or return can begin either at the first instruction or at the nearest preceding 16-bytes boundary, according to the table above. If you align the first instruction to begin at a 16-byte boundary then you can be sure that the first ifetch block begins here. you may want to align important subroutine entries and loop entries by 16 for this purpose. If the combined length of two consecutive instructions is more than 16 bytes then you can be certain that the second one does not fit into the same ifetch block as the first one, and consequently you will always have an ifetch block beginning at the second instruction. you can use this as a starting point for finding where subsequent ifetch blocks begin. The first ifetch block after a branch misprediction begins at a 16-byte boundary. As explained in chapter 22.2, a loop that repeats more than 5 times will always have a misprediction when it exits. The first ifetch block after such a loop will therefor e begin at the nearest preceding 16-byte boundary. Other serializing events also cause the next ifetch block to start at a 16-byte boundary. Such events include interrupts, exceptions, self-modifying code, and serializing instructions such as CPUID, IN, And out.i am sure you want an esample now:

Address INSTRUCTION Length UOPS Expected Decoder

-------------------------------------------------- --------------------

1000H MOV ECX, 1000 5 1 D0

1005H ll: MOV [ESI], EAX 2 2 d0

1007H MOV [MEM], 0 10 2 D0

1011h Lea EBX, [EAX 200] 6 1 D1

1017H MOV BYTE PTR [ESI], 0 3 2 D0101AH BSR EDX, EAX 3 2 D0

101DH MOV BYTE PTR [ESI 1], 0 4 2 D0

1021H DEC ECX 1 1 D1

1022H JNZ LL 2 1 D2

Let '

s assume that the first ifetch block begins at 1000h and ends at 1010h. This is before the end of the MOV [MEM], 0 instruction so the next ifetch block will begin at 1007h and end at 1017h address. This is at an instruction boundary so the third ifetch block will begin at 1017h and cover the rest of the loop. The number of clock cycles it takes to decode this is the number of D0 instructions, which is 5 per iteration of the LL loop. The last ifetch block contained three decode blocks covering the last five instructions, and it has one 16-byte boundary (1020h). Looking at the table above we find that the first ifetch block after the jump will begin at the first instruction after the jump, that is the LL label at 1005h, and end at 1015h. This is before the end of the LEA instruction, so the next ifetch block will go from 1011h to 1021h, and the last one from 1021h covering the rest. Now the LEA instruction and the DEC instruction both fall At The Beginning of An iFETCH Block Which Forces them to go into D0. We now have 7 instructions in D0 and the loop takes 7 clocks to decode in the second iteration. The last ifetch block contains only one decode group (DEC ECX / JNZ LL) and has no 16-byte boundary. According to the table, the next ifetch block after the jump will begin at a 16-byte boundary, which is 1000h. This will give us the same situation as in the first iteration, and you will see that the loop takes alternatingly 5 and 7 clock cycles to decode. Since there are no other bottlenecks, the complete loop will take 6000 clocks to run 1000 iterations. If the starting address had been different so that you had a 16-byte boundary in the first or the last instruction of the loop The IT Would Take 8000 Clocks.

If you reorder the loop so that no D1 or D2 instructions fall at the beginning of an ifetch block then you can make it take only 5000 clocks.The example above was deliberately constructed so that fetch and decoding is the only bottleneck. The easiest way to avoid this problem is to structure your code to generate much more than 3 uops per clock cycle so that decoding will not be a bottleneck despite the penalties described here. In small loops this may not be possible and then you have to find out how to optimize The instruction fetch and decoding.

One thing you can do is to change the starting address of your procedure in order to avoid 16-byte boundaries where you do not want them. Remember to make your code segment paragraph aligned so that you know where the boundaries are.

If you insert an ALIGN 16 directive before the loop entry then the assembler will put in NOP's and other filler instructions up to the nearest 16 byte boundary. Most assemblers use the instruction XCHG EBX, EBX as a 2-byte filler (the so called 2 -byte NOP). Whoever got this idea, it's a bad one because this instruction takes more time than two NOP's on most processors! If the loop executes many times then whatever is outside the loop is unimportant in terms of speed and you do not have to care about the suboptimal filler instructions. But if the time taken by the fillers is important then you may select the filler instructions manually. you may as well use filler instructions that do something useful, such as refreshing a register in order to avoid register read stalls (see chapter 16.2) for example, if you are using register EBP for addressing but seldom write to it, then you may use MOV EBP, EBP or ADD EBP, 0 as filler in order to reduce the possibilities of register read stalls. IF you have nothing useful to do, you may use FXCH ST (0) as a good filler because it does not put any load on the execution ports, provided that ST (0) contains a valid floating point value.Another possible remedy is to reorder Your instructions in Order To Get The iFetch Boundaries Where The I ''

Yet another possibility is to manipulate instruction lengths. Sometimes you can substitute one instruction with another one with a different length. Many instructions can be coded in different versions with different lengths. The assembler always chooses the shortest possible version of an instruction, but it is FOR EXAMPLE, DEC ECX IS One Byte Long, Sub ECX, 1 IS 3 BYTES, AND You CAN CODE A 6 BYtes Version with a long immediate Operand Using this trick: Sub ECX, 9999

Org $ -4

DD 1

Instructions with a memory operand can be made one byte longer with a SIB byte, but the easiest way of making an instruction one byte longer is to add a DS: segment prefix (DB 3Eh) The microprocessors generally accept redundant and meaningless prefixes (except. Lock) Not Exceed 15 Bytes. Even INSTRUCTIONS WITHOUT A MEMORY OPERAND CAN HAVE A Segment Prefix. So if you want the dec ECX INSTRUCTION TO BE 2 bytes long, Write:

DB 3EH

Dec ECX

Remember that you get a penalty in the decoder if an instruction has more than one prefix It is possible that instructions with meaningless prefixes -. Especially repeat and lock prefixes - will be used in future processors for new instructions when there are no more vacant instruction codes But I Would Consider IT Safe To Use A Segment Prefix with anything.

With these methods it will usually be pasile to put the meansetch boundaries where you want, although it can be becom, although..

16. Register Renaming (PPRO, PII AND PIII)

16.1 Eliminating Dependencies

Register renaming is an advanced technique used by these Microprocessors to Remove Dependencies Between Different Parts of the code. Example:

Mov Eax, [MEM1] Imul Eax, 6

MOV [MEM2], EAX

Mov Eax, [MEM3]

INC EAX

MOV [MEM4], EAX

Here the last three instructions are independent of the first three in the sense that they do not need any result from the first three instructions. To optimize this on earlier processors you would have to use a different register instead of EAX in the last three instructions and reorder the instructions so that the last three instructions could execute in parallel with the first three instructions. The PPro, PII and PIII processors do this for you automatically. They assign a new temporary register for EAX every time you write to it. Thereby the MOV EAX, [MEM3] Instruction Becomes Independent of the Preceding Instructions. With out-of-order execution it is likely to finish the move to [mem4] before the slow imul instruction is finished.

Register renaming goes fully automatically. A new temporary register is assigned as an alias for the permanent register every time an instruction writes to this register. An instruction that both reads and writes a register also causes renaming. For example the INC EAX instruction above uses one .

All general purpose registers, stack pointer, flags, floating point registers, MMX registers, XMM registers and segment registers can be renamed. Control words, and the floating point status word can not be renamed and this is the reason why the use of these registers is .

A common way of setting a register to zero is XOR EAX, EAX or SUB EAX, EAX. These instructions are not recognized as independent of the previous value of the register. If you want to remove the dependency on slow preceding instructions then use MOV EAX , 0.Register renaming is controlled by the register alias table (RAT) and the reorder buffer (ROB). The uops from the decoders go to the RAT via a queue, and then to the ROB and the reservation station. The RAT can handle On the means.

................ ..

16.2 Register Read Stalls

But there is another limitation which may be quite serious, and that is that you can only read two different permanent register names per clock cycle This limitation applies to all registers used by an instruction except those registers that the instruction writes to only Example..:

MOV [EDI ESI], EAX

MOV EBX, [ESP EBP]

The first instruction generates two uops: one that reads EAX and one that reads EDI and ESI The second instruction generates one uop that reads ESP and EBP EBX does not count as a read because it is only written to by the instruction Let's assume... that these three uops go through the RAT together. I will use the word triplet for a group of three consecutive uops that go through the RAT together. Since the ROB can handle only two permanent register reads per clock cycle and we need five register reads, our triplet will be delayed for two extra clock cycles before it comes to the reservation station. With 3 or 4 register reads in the triplet it would be delayed by one clock cycle.The same register can be read more than once in the same triplet without Adding to the count. if the instructions above area change to:

MOV [EDI ESI], EDI

MOV EBX, [EDI EDI]

...................... ..

A Register That Is Going Uop Is Stored in The Rob So That Can Be Read for Free Until It Is Written Back, Which Takes At Least 3 Clock Cycles, And Usually More. Write-Back Is The end of the execution stage where the value becomes available. in other words, you can read any number of registers in the RAT without stall if their values are not yet available from the execution units. The reason for this is that when a value becomes available it is immediately written directly to any subsequent ROB entries that need it. But if the value has already been written back to a temporary or permanent register when a subsequent uop that needs it goes into the RAT, then the value has to be read from the register file, which has only two read ports. There are three pipeline stages from the RAT to the execution unit so you can be certain that a register written to in one uop-triplet can be read for free in at least the next three triplets. If The Writeback Is Delayed by Reordering, Slow Instructions, Dependency Chains, Cache Misses, or by Any Other Kind of Stall, The Register CAN Be Read for Free Further Down The Instruction Stream.example:

MOV EAX, EBX

SUB ECX, EAX

Inc EBX

Mov Edx, [EAX]

Add ESI, EBX

Add Edi, ECX

These 6 instructions generate 1 uop each. Let's assume that the first 3 uops go through the RAT together. These 3 uops read register EBX, ECX, and EAX. But since we are writing to EAX before reading it, the read is free and we get no stall. The next three uops read EAX, ESI, EBX, EDI, and ECX. Since both EAX, EBX and ECX have been modified in the preceding triplet and not yet back then they can be read for free, so that only written ESI and EDI count, and we get no stall in the second triplet either. If the SUB ECX, EAX instruction in the first triplet is changed to CMP ECX, EAX then ECX is not written to and we will get a stall in the second triplet for reading ESI, EDI and ECX. Similarly, if the INC EBX instruction in the first triplet is changed to NOP or something else then we will get a stall in the second triplet for reading ESI, EBX and EDI.No uop can read more than Therefore, All Instructions That Need To Read More Two Registers Are Split Up Into Two or More U OPS.

To count the number of register reads, you have to include all registers which are read by the instruction. This includes integer registers, the flags register, the stack pointer, floating point registers and MMX registers. An XMM register counts as two registers, except when only part of it is used, as eg in ADDSS and MOVHLPS. Segment registers and the instruction pointer do not count. For example, in SETZ AL you count the flags register but not AL. ADD EBX, ECX counts both EBX and ECX, ...................

The FXCH instruction is a special case. It works by renaming, but does not read any values so that it does not count in the rules for register read stalls. An FXCH instruction behaves like 1 uop that neither reads nor writes any registers with regard to the rules for register read stalls.Don't confuse uop triplets with decode groups. A decode group can generate from 1 to 6 uops, and even if the decode group has three instructions and generates three uops there is no guarantee that the three Uops Will Go Into the Rat Together.

The Queue Between The Decoders and the Rat Is So SHORT (10 uops) That You Cannot Assume That Register Read Stalls Do Not Stall The Decoders Or That Flu Now Stall The Rat.

It is very difficult to predict which uops go through the RAT together unless the queue is empty, and for optimized code the queue should be empty only after mispredicted branches Several uops generated by the same instruction do not necessarily go through the RAT together;. The uops are simply taken consecutively from the queue, three at at time The sequence is not broken by a predicted jump:. uops before and after the jump can go through the RAT together Only a mispredicted jump will discard the queue and start over again so. That the next three uops are sag.

If three consecutive uops read more than two different registers then you would of course prefer that they do not go through the RAT together. The probability that they do is one third. The penalty of reading three or four written-back registers in one triplet of uops is one clock cycle. You can think of the one clock delay as equivalent to the load of three more uops through the RAT. With the probability of 1/3 of the three uops going into the RAT together, the average penalty will be the equivalent of 3/3 = 1 uop. to calculate the average time it will take for a piece of code to go through the RAT, add the number of potential register read stalls to the number of uops and divide by three. You can see that It does not pay to remove the stall by putting in an extra instruction unless you know for sure which uops go into the RAT together or you can prevent more than one potential register read stall by one extra instruction.In situations where you aim at a Throughput of 3 uops per clock, THE LIMIT OF TWO Permanent Register Reads Per Clock Cycle May Be a problematic bottleneck to handle. Possible Ways to remove Register Read Stalls Are:

keep uops that read the same register close together so that they are likely to go into the same triplet keep uops that read different registers spaced so that they can not go into the same triplet place uops that read a register no more than 3 -.. 4 triplets after an instruction that writes to or modifies this register to make sure it has not been written back before it is read (it does not matter if you have a jump between as long as it is predicted). If you have reason to expect the register write to be delayed for whatever reason then you can safely read the register somewhat further down the instruction stream. use absolute addresses instead of pointers in order to reduce the number of register reads. you may rename a register in a triplet where it Doesn't Cause A Stall in Order To Prevent a Read Stall for this Register in One or More Later Triplets. EXAMPLE: MOV ESP, ESP / ... / MOV EX, [ESP 8]. This Method COSTS An Extra uop and Therefore doesn't pay unsses the expected avee number er of read stalls prevented is more than 1 / 3.For instructions that generate more than one uop you may want to know the order of the uops generated by the instruction in order to make a precise analysis of the possibility of register read stalls. I Have the theefore listed the MOST Common Case Below.

. Writes to memory A memory write generates two uops The first one (to port 4) is a store operation, reading the register to store The second uop (port 3) calculates the memory address, reading any pointer registers Examples:.. MOV [ EDI], EAX First Uop Reads Eax, Second Uop Reads EDI. FSTP Qword PTR [EBX 8 * ECX] First Uop Reads St (0), Second Uop Reads EBX AND ECX.

Read and modify An instruction that reads a memory operand and modifies a register by some arithmetic or logical operation generates two uops. The first one (port 2) is a memory load instruction reading any pointer registers, the second uop is an arithmetic instruction (port 0 or 1) reading and writing to the destination register and possibly writing to the flags Example:. ADD EAX, [ESI 20] First uop reads ESI, second uop reads EAX and writes EAX and flags.Read/modify/write A read / modify / write instruction generates four uops. The first uop (port 2) reads any pointer registers, the second uop (port 0 or 1) reads and writes to any source register and possibly writes to the flags, the third uop (port 4 ) reads only the temporary result which does not count here, the fourth uop (port 3) reads any pointer registers again. Since the first and the fourth uop can not go into the RAT together you can not take advantage of the fact that they read the Same Pointer Registers. EXAMPLE: OR [ESI EDI], EA X The first uop reads ESI and EDI, the second uop reads EAX and writes EAX and the flags, the third uop reads only the temporary result, the fourth uop reads ESI and EDI again. No matter how these uops go into the RAT you can be sure that the uop that reads EAX goes together with one of the uops that read ESI and EDI. A register read stall is therefore inevitable for this instruction unless one of the registers has been modified recently.

Push register A push register instruction generates 3 uops. The first one (port 4) is a store instruction, reading the register. The second uop (port 3) generates the address, reading the stack pointer. The third uop (port 0 or 1 ) subtracts the word size from the stack pointer, reading and modifying the stack pointer.Pop register A pop register instruction generates 2 uops. The first uop (port 2) loads the value, reading the stack pointer and writing to the register. The second UOP (port 0 or 1) Adjusts The Stack Pointer, Reading and Modifying The Stack Pointer.

Call A near call generates 4 uops (port 1, 4, 3, 01). The first two uops read only the instruction pointer which does not count because it can not be renamed. The third uop reads the stack pointer. The last uop reads And Modifies The Stack Pointer.

Return A Near Return GeneRates 4 UOPS (Port 2, 01, 01, 1). The first uop reads the stack pointer. The Third Uop Reads and Modifier The Stack Pointer.

An Example of How To Avoid A Register Read Stall is Given in Example 2.6.

17. Out of Order Execution (PPRO, PII and PIII)

The reorder buffer (ROB) can hold 40 uops. Each uop waits in the ROB until all its operands are ready and there is a vacant execution unit for it. This makes out-of-order execution possible. If one part of the code is DELAYED BECAUSE OF A Cache Miss The IT Won't delay later parts of the code ife...............

Writes to memory can not execute out of order relative to other writes. There are four write buffers, so if you expect many cache misses on writes or you are writing to uncached memory then it is recommended that you schedule four writes at at time and make sure the processor has something else to do before you give it the next four writes. Memory reads and other instructions can execute out of order, except IN, OUT and serializing instructions.If your code writes to a memory address and soon after reads from the same address, then the read may by mistake be executed before the write because the ROB does not know the memory addresses at the time of reordering. This error is detected when the write address is calculated, and then the read operation (which was executed speculatively ) HAS TO BE RE-DONE. The Penalty For this Is Approximately 3 Clocks. The Only Way To Avoid this Penalty Is To Make Sure The Execution Unit Has Other Things To Do Between A Write and A Subsequent Read from The SA ME MEMORY ADDRESS.

There are several execution units clustered around five ports. Port 0 and 1 are for arithmetic operations etc. Simple move, arithmetic and logic operations can go to either port 0 or 1, whichever is vacant first. Port 0 also handles multiplication, division, integer shifts and rotates, and floating point operations. Port 1 also handles jumps and some MMX and XMM operations. Port 2 handles all reads from memory and a few string and XMM operations, port 3 calculates addresses for memory write, and port 4 executes all memory write operations. In chapter 29 you'll find a complete list of the uops generated by code instructions with an indication of which ports they go to. Note that all memory write operations require two uops, one for port 3 and one for port 4, while memory read operations use only one uop (port 2) .In most cases each port can receive one new uop per clock cycle. This means that you can execute up to 5 uops in the same clock cycle if they go to five different ports, But sin CE THERE IS A LIMIT OF 3 UOPS Per Clock Earlier in The Pipeline You Will Never Execute More Than 3 Uops Per CLOCK ON AVERAGE.

You must make sure that no execution port receives more than one third of the uops if you want to maintain a throughput of 3 uops per clock. Use the table of uops in chapter 29 and count how many uops go to each port. If port 0 and 1 are saturated while port 2 is free then you can improve your code by replacing some MOV register, register or MOV register, immediate instructions with MOV register, memory in order to move some of the load from port 0 and 1 to port 2.

Most Uops Take Only ONE Clock Cycle To Execute, But Multiplications, Division, and Many Floating Point Operations Take More:

Floating point addition and subtraction takes 3 clocks, but the execution unit is fully pipelined so that it can receive a new FADD or FSUB in every clock cycle before the preceding ones are finished (provided, of course, that they are independent) .Integer multiplication takes 4 clocks, floating point multiplication 5, and MMX multiplication 3 clocks Integer and MMX multiplication is pipelined so that it can receive a new instruction every clock cycle Floating point multiplication is partially pipelined:.. The execution unit can receive a new FMUL instruction two clocks after the preceding one, so that the maximum throughput is one FMUL per two clock cycles. The holes between the FMUL's can not be filled by integer multiplications because they use the same circuitry. XMM additions and multiplications take 3 and 4 clocks respectively, and are Fully Pipelined. But Since Each Logical XMM Register Is Implement As Two Physical 64-Bit Registers, You NEED TWO UOPS for a Packed XMM Operation, And The Throughput Will The BE One Arithmetic XMM Instruction Every Two Clock Cycles. Xmm Add and Multiply Instructions CAN EXECUTE IN Parallel Because They Don't Use The Same Execution Port.

Integer and floating point division takes up to 39 clocks and is not pipelined. This means that the execution unit can not begin a new division until the previous division is finished. The same applies to squareroot and transcendental functions.

Also jump instructions, calls, and returns are not fully pipelined. You can not execute a new jump in the first clock cycle after a preceding jump. So the maximum throughput for jumps, calls, and returns is one for every two clocks.

. You should, of course, avoid instructions that generate many uops The LOOP XX instruction, for example, should be replaced by DEC ECX / JNZ XX.If you have consecutive POP instructions then you may break them up to reduce the number of uops:

POP ECX / POP EBX / POP EAX; Can Be Changed TO:

MOV ECX, [ESP] / MOV EBX, [ESP 4] / MOV EAX, [ESP] / Add ESP, 12 The former code generate 6 uops, the latter generates "doing the same

PUSH instructions is less advantageous because the split-up code is likely to generate register read stalls unless you have other instructions to put in between or the registers have been renamed recently. Doing it with

Call and

RET INSTRUCTIONS WILL Interfere with prediction in the return stack buffer. Note Also That the

Add ESP INSTRUCTION CAUSE AN AGI Stall in Earlier Processors.

18. Retirement (PPRO, PII and PIII)

Retirement is a process where the temporary registers use by the uops are copy.

The retirement station can handle three uops per clock cycle. This may not seem like a problem because the throughput is already limited to 3 uops per clock in the RAT. But retirement may still be a bottleneck for two reasons. Firstly, instructions must retire in order. If a uop is executed out of order then it can not retire before all preceding uops in the order have retired. And the second limitation is that taken jumps must retire in the first of the three slots in the retirement station. Just like decoder D1 and D2 can be idle if the next instruction only fits into D0, the last two slots in the retirement station can be idle if the next uop to retire is a taken jump. This is significant if you have a small loop where the number of uops in the loop is not divisible by three.All uops stay in the reorder buffer (ROB) until they retire. The ROB can hold 40 uops. This sets a limit to the number of instructions that can execute during the long delay of a division or Other slow operation. Before the division is finished the ROB will be filled up with executed uops waiting to retire. Only when the division is finished and retired can the subsequent uops begin to retire, because retirement takes place in order.

In case of speculative execution of predicted branches (see chapter 22) the speculatively executed uops can not retire until it is certain that the prediction was correct. If the prediction turns out to be wrong then the speculatively executed uops are discarded without retirement.

19. Partial Stalls (PPRO, PII and PIII)

19.1 Partial Register Stalls

Partial Register Stall Is A Problem That Occurs WHEN You Write To Part of A 32 Bit Register And Later Read from The Whole Register OR A Bigger Part of It. EXAMPLE: MOV Al, Byte Ptr [M8]

MOV EBX, EAX; Partial Register Stall

This gives a delay of 5-6 clocks. The reason is that a temporary register has been assigned to AL (to make it independent of AH). The execution unit has to wait until the write to AL has retired before it is possible to combine The value from al with the value of the rest of eax. The stall can be avoided by changing to code to:

Movzx EBX, Byte Ptr [Mem8]

And Eax, 0FFFFFFFFFFF00h

OR EBX, EAX

Of Course You Can Also Avoid The Partial Stalls by Putting In Other Instructions After the Write To The Partial Register So That It Has Time To Retire Before Read from The Full Register.

You Should Be Aware of Partial Stalls WHENEVER You Mix Different Data Sizes (8, 16, And 32 Bits):

MOV BH, 0

Add BX, Ax; Stall

Inc EBX; Stall

You don't get a stall at Writing to the full register, or a bigger part of it:

Mov Eax, [MEM32]

Add BL, Al; NO Stall

Add Bh, Ah; No Stall

MOV CX, AX; No Stall

MOV DX, BX; Stall

The easiest way to avoid partial register stalls is to always use full registers and use MOVZX or MOVSX when reading from smaller memory operands. These instructions are fast on the PPro, PII and PIII, but slow on earlier processors. Therefore, a compromise is offered When You Want Your Code to Perform Reasonably Well on All Processors. The Replacement for Movzx Eax, Byte Ptr [M8] Looks Like this:

XOR EAX, EAX

MOV Al, Byte Ptr [M8]

The PPro, PII and PIII processors make a special case out of this combination to avoid a partial register stall when later reading from EAX. The trick is that a register is tagged as empty when it is XOR'ed with itself. The processor remembers that THE ZERO, SO THAT $ Eax Area

MOV Al, 3

MOV EBX, EAX; no Stall

XOR AH, AH

MOV Al, 3

MOV BX, AX; No Stall

XOR EAX, EAX

Mov Ah, 3

MOV EBX, EAX; Stall

SUB EBX, EBX

MOV BL, DL

MOV ECX, EBX; No Stall

MOV EBX, 0

MOV BL, DL

MOV ECX, EBX; Stall

MOV BL, DL

XOR EBX, EBX; NO Stall

Setting a register to zero by subtracting it from itself Works The Same as the xor, but setting it to zero with the Mov Instruction Doesn't Prevent The Stall.

You can set the xor outside a loop:

XOR EAX, EAX

MOV ECX, 100

LL: MOV Al, [ESI]

MOV [EDI], EAX; No Stall

Inc ESI

Add Edi, 4

Dec ECX

JNZ LL

In the Upper 24 Bits of Eax Are The Upper 24 Bits of Eax Are The Upload, Misprediction, or OR Other Serializing Event.

You Should Remember To Neutralize Any Partial Register You Have Used Before Calling a Subroutine That Might Push The Full Register:

Add BL, Al

MOV [MEM8], BL

XOR EBX, EBX; Neutralize BL

Call _highlevelfunction

Most High Level Language Procedures Push Ebx At The Start of The Procedure Which Would Generate A Partial Register Stall In The Example Aboveness BL.

Setting a register to zero with the xor method Doesn't Break ITS Dependency On Earlier Instructions:

Div EBX

MOV [MEM], EAX

Mov Eax, 0; Break DependencyXor Eax, Eax; Prevent Partial Register Stall

MOV Al, Cl

Add Ebx, EAX

Setting Eax To Zero TWICE Here Seem Redundant, But Without the Mov Eax, 0 The Last Instructions Would Have To Wait For The Slow Div To Finish, And Without Xor Eax, Eax You Would Have A Partial Register Stall.

The FNSTSW AX instruction is special: in 32 bit mode it behaves as if writing to the entire EAX In fact, it does something like this in 32 bit mode:. AND EAX, 0FFFF0000h / FNSTSW TEMP / OR EAX, TEMP hence, you don 'T Get a Partial Register Stall When Reading Eax After this Instruction in 32 bit mode:

Fnstsw AX / MOV EBX, EAX; Stall ONLY IF 16 BIT MODE

MOV AX, 0 / FNSTSW AX; Stall ONLY IF 32 Bit Mode

19.2 Partial Flags Stalls

The Flags Register Can Also Cause Partial Register Stalls:

CMP EAX, EBX

Inc ECX

JBE XX; Partial Flags Stall

The JBE instruction reads both the carry flag and the zero flag. Since the INC instruction changes the zero flag, but not the carry flag, the JBE instruction has to wait for the two preceding instructions to retire before it can combine the carry flag from the CMP instruction and the zero flag from the INC instruction. This situation is likely to be a bug rather than an intended combination of flags. to correct it change INC ECX to ADD ECX, 1. A similar bug that causes a partial flags stall is SAHF / Jl xx. The Jl Instruction Tests The Sign Flag and the overflow flag, But Sahf Doesn't Change The overflow flag. To Correct IT, Change JL XX To JS XX.

Unexpectedly (and contrary to what intel manuals say) You Also Get a Partial Flags Stall An Instruction That Modifies Some of the Flag Bits When Reading Only Unmodified Flag Bits: CMP EAX, EBX

Inc ECX

JC XX; Partial Flags Stall

But not When Reading Only Modified Bits:

CMP EAX, EBX

Inc ECX

JE xx; no stall

Partial flags stalls are likely to occur on instructions that read many or all flags bits, ie LAHF, PUSHF, PUSHFD The following instructions cause partial flags stalls when followed by LAHF or PUSHF (D):. INC, DEC, TEST, bit tests, Bit Scan, CLC, STC, CMC, CLD, STD, CLI, STI, MUL, IMUL, AND All Shifts and Rotates. The Following Instructions: and, OR, XOR, Add, ADC, SUB, SBB , CMP, NEG. It is strange that TEST and aND behave differently while, by definition, they do exactly the same thing to the flags. You may use a SETcc instruction instead of LAHF or PUSHF (D) for storing the value of a flag In ORDER TO AVOID A Stall.

EXAMPLES:

Inc Eax / Pushfd; Stall

Add Eax, 1 / Pushfd; No Stall

SHR EAX, 1 / Pushfd; Stall

SHR EAX, 1 / OR EAX, EAX / PUSHFD; NO Stall

Test EBX, EBX / LAHF; Stall

And EBX, EBX / LAHF; NO Stall

Test EBX, EBX / Setz Al; No Stall

CLC / Setz Al; Stall

CLD / setz al; no stall

The Penalty for Partial Flags Stalls Is Approximately 4 Clocks.

19.3 Flags Stalls After Shifts and Rotates

You Can Get a stall reasoning the partial flags stall at a shift or rotate, Except for Shifts and Rotates by One (Short Form):

SHR EAX, 1 / JZ XX; No Stall

SHR EAX, 2 / JZ XX; Stall

SHR EAX, 2 / OR EAX, EAX / JZ XX; No Stall

SHR EAX, 5 / JC XX; StallShr Eax, 4 / SHR EAX, 1 / JC XX; No Stall

SHR EAX, CL / JZ XX; Stall, Even IF CL = 1

Shrd EAX, EBX, 1 / JZ XX; Stall

ROL EBX, 8 / JC XX; Stall

The Penalty For There Stalls Is ApproxImately 4 Clocks.

19.4 Partial Memory Stalls

A Partial Memory Stall Is Somewhat Analog to a Partial Register Stall. It Occurs When You Mix Data Sizes for the Same Memory Address:

MOV BYTE PTR [ESI], Al

MOV EBX, DWORD PTR [ESI]; Partial Memory Stall

Here you get a stall because the processor has to combine the byte written from AL with the next three bytes, which were in memory before, to get the four bytes needed for reading into EBX. The penalty is approximately 7-8 clocks.

Unlike The Partial Register Stalls, You Also Get A Partial Memory Stall When You Write a Bigger Operand To Memory and Then Read Part of It, IF The Smaller Part Doesn't Start At The Same Address:

Mov DWORD PTR [ESI], EAX

MOV BL, BYTE PTR [ESI]; No Stall

MOV BH, BYTE PTR [ESI 1]; Stall

You can avoid this stall by changing the last line to MOV BH, AH, But Such a Solution IS NOT POSSIBLE IN A SITION LIKE THIS:

Fistp Qword PTR [EDI]

MOV EAX, DWORD PTR [EDI]

MOV EDX, DWORD PTR [EDI 4]; Stall

Interestingly, You Can Also Get A Partial Memory Stall When Writing and Reading Completely Different Addresses If They Happen To Have The Same Set-Value In Different Cache Banks:

MOV BYTE PTR [ESI], Al

MOV EBX, DWORD PTR [ESI 4092]; No Stall

MOV ECX, DWORD PTR [ESI 4096]; Stall

20. Dependency Chains (PPRO, PII AND PIII)

A series of instructions where each instruction depends on the result of the preceding one is called a dependency chain Long dependency chains should be avoided, if possible, because they prevent out-of-order and parallel execution.Example.:

MOV EAX, [MEM1]

Add Eax, [MEM2]

Add Eax, [MEM3]

Add Eax, [MEM4]

MOV [MEM5], EAX

In this EaxMple, The Add Instructions Generate 2 Uops Each, One forread from Memory (Port 2), AND One for adding (port 0 or 1). The read uops can execute out or order, while the add uops virt wait for T previous uops to finish This dependency chain does not take very long to execute, because each addition adds only 1 clock to the execution time But if you have slow instructions like multiplications, or even worse:.. divisions, then you should definitely do something to Break the dependency chain. The Way to do this is to use multiple accumulators:

Mov Eax, [MEM1]; Start First Chain

MOV EBX, [MEM2]; Start Other Chain in Different Accumulator

Imul Eax, [MEM3]

Imul EBX, [MEM4]

Imul Eax, EBX; JOIN CHAINS IN THE END

MOV [MEM5], EAX

Here, The Second Imul Instruction Can Start Before The First One IS Finished. Since The Imul Instruction Has A Delay of 4 Clocks and is Fully Pipelined, You May Have Up To 4 Accumulators.

FLOATING POINT INTRUCTIONS HAVE A LONGER DELAY THAN INTEGER INSTRUCTIONS, SO you Should Definitely Break Up Long Dependency Chains with floating point instructions:

FLD [MEM1]; Start First Chain

FLD [MEM2]; Start Second Chain in Different Accumulator

Fadd [MEM3]

FXCH

Fadd [MEM4]

FXCH

Fadd [MEM5]

Fadd; Join Chains in the End

FSTP [MEM6]

You need a lot of FXCH instructions for this, but do not worry: they are cheap FXCH instructions are resolved in the RAT by register renaming so they do not put any load on the execution ports An FXCH does count as 1 uop.. In the Rat, Rob, And Retirement Station, Though.

IF The Dependency Chain Is Long You NEED Three Accumulators:

FLD [MEM1]; Start First Chain

FLD [MEM2]; Start Second Chain

FLD [MEM3]; Start Third Chain

Fadd [MEM4]; Third Chain

FXCH ST (1)

Fadd [Mem5]; SECOND CHAIN

FXCH ST (2)

Fadd [MEM6]; First Chain

FXCH ST (1)

Fadd [MEM7]; Third Chain

FXCH ST (2)

Fadd [MEM8]; Second Chain

FXCH ST (1)

Fadd; Join First and Third Chain

Fadd; Join with Second Chain

FSTP [MEM9]

Avoid Storing Intermediate Data in Memory and Read ThemmediatelyAfterwards:

MOV [Temp], EAX

MOV EBX, [TEMP]

There is a penalty for attempting to read from a memory address before a previous write to that address is finished. In the example above, change the last instruction to MOV EBX, EAX or put some other instructions in between.

There IS One Situation Where You Cannot Avoid Storing Intermediate Data In Memory, And That Is When Transferring Data from An Integer Register To A Floating Point Register, or Vice Versa. For Example:

MOV EAX, [MEM1]

Add Eax, [MEM2]

MOV [Temp], EAX

Fild [Temp]

IF you don't have anything to put in between the Write to Temp and the read from teit, the you may consider Using A Floating Point Register Instead of Eax:

Fild [MEM1]

FIADD [MEM2]

Consecutive jumps, calls, or returns may also be considered dependency chains. The throughput for these instructions is one jump per two clock cycles. It is therefore recommended that you give the microprocessor something else to do between the jumps.21. Searching for bottlenecks ( PPRO, PII and PIII)

When optimizing code for these processors, it is important to analyze where the bottlenecks are. Spending time on optimizing away one bottleneck does not make sense if there is another bottleneck which is narrower.

IF you expect code cache misses the you will..

If you expect many data cache misses then forget about everything else and concentrate on how to restructure your data to reduce the number of cache misses (chapter 7), and avoid long dependency chains after a data read cache miss (chapter 20).

IF you have managened the division.

Dependency chains tend to hamper out-of-order execution (chapter 20). Try to break long dependency chains, especially if they contain slow instructions such as multiplication, division, and floating point instructions.

If you have many jumps, calls, or returns, and especially if the jumps are poorly predictable, then try if some of them can be avoided. Replace conditional jumps with conditional moves if possible, and replace small procedures with macros (chapter 22.3).

If you are mixing different data sizes (8, 16, and 32 bit integers) then look out for partial stalls. If you use PUSHF or LAHF instructions then look out for partial flags stalls. Avoid testing flags after shifts or rotates by more than 1 (Chapter 19) .IF You AT A THROUGHPUT OF 3 uops per clock cycle the be aware of possible delays in instruction fetch and decoding (Chapter and 14 and 15), especially in small loops.

The limit of two permanent register reads per clock cycle may reduce your throughput to less than 3 uops per clock cycle (chapter 16.2). This is likely to happen if you often read registers more than 4 clock cycles after they last were modified. This may , for example, happen if you offten us ,,,,,,,,,,,,,,,,,,,,,,,

A Throughput of 3 UOps Per Clock Requires That No Execution Port Gets More Than One Third of THE UOPS (Chapter 17).

The Retirement Station CAN Handle 3 UOPS Per Clock, But May Be Slightly Less Effective For Taken Jumps (Chapter 18).

22. Jumps and branches (all processors)

The Pentium family of processors attempt to predict where a jump will go to, and whether a conditional jump will be taken or fall through. If the prediction is correct, then it can save a considerable amount of time by loading the subsequent instructions into the pipeline And Start Decoding Theme Before The Prediction Turns Out To Be Wrong, The Pipeline Has To BE Flushed, Which Will Cost A Penalty Depending on The length of the pipeline.

The predictions are based on a Branch Target Buffer (BTB) which stores the history for each branch or jump instruction and makes predictions based on the prior history of executions of each instruction. The BTB is organized like a set-associative cache where new entries are allocated according to a pseudo-random replacement method.When optimizing code, it is important to minimize the number of misprediction penalties. This requires a good understanding of how the jump prediction works.

The branch prediction mechanisms are not described adequately in Intel manuals or anywhere else. I am therefore giving a very detailed description here. This information is based on my own research (with the help of Karki Jitendra Bahadur for the PPlain).

In the following, I will use the term 'control transfer instruction' for any instruction which can change the instruction pointer, including conditional and unconditional, direct and indirect, near and far, jumps, calls, and returns. All these instructions use prediction.

22.1 Branch prediction in Pplain

The branch prediction mechanism for the PPlain is very different from the other three processors. Information found in Intel documents and elsewhere on this subject is directly misleading, and following the advises given is such documents is likely to lead to sub-optimal code.

The PPlain has a branch target buffer (BTB), which can hold information for up to 256 jump instructions. The BTB is organized like a 4-way set-associative cache with 64 entries per way. This means that the BTB can hold no more than 4 entries with the same set value. Unlike the data cache, the BTB uses a pseudo random replacement algorithm, which means that a new entry will not necessarily displace the least recently used entry of the same set-value. How the set-value is calculated will be explained later Each BTB entry stores the address of the jump target and a prediction state, which can have four different values:. state 0: "strongly not taken" state 1: "weakly not taken" state 2: "weakly Taken "State 3:" Strongly Taken "

A branch instruction is predicted to jump when in state 2 or 3, and to fall through when in state 0 or 1. The state transition works like a two-bit counter, so that the state is incremented when the branch is taken, and decremented when it falls through. The counter saturates, rather than wrap around, so that it does not decrement beyond 0 or increment beyond 3. Ideally, this would provide a reasonably good prediction, because a branch instruction would have to deviate twice from what it does Most of the time, before the prediction change.

However, this mechanism has been compromised by the fact that state 0 also means 'unused BTB entry'. So a BTB entry in state 0 is the same as no BTB entry. This makes sense, because a branch instruction is predicted to fall through if IT HAS NO BTB Entry. This Improves The Utilization of The BTB, Because A Branch Instruction Which IS Seldom Taken Will Most of The Time Not Take Up Any BTB Entry.

Now, if a jumping instruction has no BTB entry, then a new BTB entry will be generated, and this new entry will always be set to state 3. This means that it is impossible to go from state 0 to state 1 (except for a very special case discussed later). From state 0 you can only go to state 3, if the branch is taken. If the branch falls through, then it will stay out of the BTB.This is a serious design flaw. By throwing state 0 entries out of the BTB and always setting new entries to state 3, the designers apparently have given priority to minimizing the first time penalty for unconditional jumps and branches often taken, and ignored that this seriously compromises the basic idea behind the mechanism and reduces the performance in small innermost loops. The consequence of this flaw is, that a branch instruction which falls through most of the time will have up to three times as many mispredictions as a branch instruction which is taken most of the time. (Apparently, Intel engineers have B Een UnaWare of this flaw untric I public my findings.

You May Take this Asymmetry Into Account by Organizing Your Branches So That The isy. Consider for Example this if -hen-else Construction:

Test Eax, EAX

JZ A

JMP E

If branch 1 is executed more often than branch 2, and branch 2 is seldom executed twice in succession, then you can reduce the number of branch mispredictions by up to a factor 3 by swapping the two branches so that the branch instruction will jump more often Than Fall THROUGH:

Test Eax, EAX

JNZ A

JMP E

(This is contrary to the recommendations in intel's manuals and tutorials).

There may be reasons to put the most often executed branch first, however:. Putting seldom executed branches away in the bottom of your code can improve code cache utilization A branch instruction seldom taken will stay out of the BTB most of the time, possibly improving BRANCH INSTRUCTION. The Asymmetry In Branch Prediction Only The PPLICTION.

These considerations have little weight, however, for small critical loops, so I would still recommend organizing branches with a skewed distribution so that the branch instruction is taken more often than not, unless branch 2 is executed so seldom, that misprediction does not matter .

Likewise, you's ,,,,,,,,,,,,,,,,,,

MOV ECX, [N]

L: MOV [EDI], EAX

Add Edi, 4

Dec ECX

Jnz L

IF N IS High, THE JNZ INSTRUCTION Here Will Be Taken More Offen Than NOT, AND Never Fall Through TWICE IN SUCCESSION.

Consider the situation where a branch is taken every second time. The first time it jumps the BTB entry will go into state 3, and will then alternate between state 2 and 3. It is predicted to jump all the time, which gives 50% mispredictions Assume Now That It Deviates from this regular pattern and falls throughs.. The jump pattern is:

01010101010101010101010, WHERE 0 means nojump, AND 1 means jump.

^ The Extra Nojump is Idicated with a

^ Above. After this incident, the BTB entry will alternate between state 1 and 2, which gives 100% mispredictions. It will continue in this unfortunate mode until there is another deviation from the 0101 pattern. This is the worst case for this branch prediction mechanism.22.1.2 BTB is looking ahead (PPlain) The BTB mechanism is counting instruction pairs, rather than single instructions, so you have to know how instructions are pairing in order to analyze where a BTB entry is stored. The BTB entry for any AN UN Unpaired Instruction Counts as One Pair. EXAMPLE:

SHR EAX, 1

MOV EBX, [ESI]

CMP EAX, EBX

JB L

Here Shr Pairs with Mov, And Cmp Pairs with JB. The BTB Entry for JB L is Thr Eax, 1 Instruction. When this btb entry is met, and if it is in state 2 or 3, then the Pentium will read the target address from the BTB entry, and load the instructions following L into the pipeline. This happens before the branch instruction has been decoded, so the Pentium relies solely on the information in the BTB when doing this.

You may remember, that instructions are seldom pairing the first time they are executed (see chapter 8). If the instructions above are not pairing, then the BTB entry should be attached to the address of the CMP instruction, and this entry would be wrong on the next execution, when instructions are pairing. However, in most cases the PPlain is smart enough to not make a BTB entry when there is an unused pairing opportunity, so you do not get a BTB entry until the second execution, and hence you will not get a prediction until the third execution. (in the rare case, where every second instruction is a single-byte instruction, you may get a BTB entry on the first execution which becomes invalid in the second execution, but since the instruction it is attached to will then go to the V-pipe, it is ignored and gives no penalty. A BTB entry is only read if it is attached to the address of a U-pipe instruction) .A BTB entry is identified by its Set-Value Which Is Equal to Bits 0-5 of the Address I T is attached to. BITS 6-31 Are Tag. Addresses Which Are Spaced a Multiple of 64 Bytes Apart Will Have The Same Set-Value. You Can Have No More Than Four BTB Entries with the Same Set -value. If you want to check whether your jump instructions contend for the same BTB entries, then you have to compare bits 0-5 of the addresses of the U-pipe instructions in the preceding instruction pairs. This is very tedious, and I Have Never Heard of Anybody Doing So. There Are no Tools Available to do this job for you.

22.1.3 Consesecutive Branches (PPLAIN)

When a jump is mispredicted, then the pipeline gets flushed. If the next instruction pair executed also contains a control transfer instruction, then the PPlain will not load its target because it can not load a new target while the pipeline is being flushed. The result is that the second jump instruction is predicted to fall through regardless of the state of its BTB entry. Therefore, if the second jump is also taken, then you will get another penalty. The state of the BTB entry for the second jump instruction does get correctly updated, though. If you have a long chain of control transfer instructions, and the first jump in the chain is mispredicted, then the pipeline will get flushed all the time, and you will get nothing but mispredictions until you meet an instruction pair which Does Not Jump. The Most Extreme Case of this Is A Loop Which Jumps To Itself: It Will Get A Misprediction Penalty For Each Iteration.this is Not The Only Problem with Conncutive Control Transfer Instructions. Another problem is that you can have another branch instruction between a BTB entry and the control transfer instruction it belongs to If the first branch instruction jumps to somewhere else, then strange things may happen Consider this example..:

SHR EAX, 1

MOV EBX, [ESI]

CMP EAX, EBX

JB L1

JMP L2

L1: MOV EAX, EBX

Inc EBX

WHEN JB L1 Falls Through, He Will Get A BTB Entry For Jmp L2 Attached to The Address of Cmp Eax, EBX. But What Will Happen WHEN JB L1 LATER IS TAKEN? AT TIME WHEN TAKEN? AT TIME WHEN THE BTB Entry for JMP L2 Is Read, the processor does not know that the next instruction pair does not contain a jump instruction, so it will actually predict the instruction pair MOV EAX, EBX / INC EBX to jump to L2. The penalty for predicting non-jump instructions to jump is 3 clock cycles. The BTB entry for JMP L2 will get its state decremented, because it is applied to something which does not jump. If we keep going to L1, then the BTB entry for JMP L2 will be decremented to state 1 and 0, so that the problem will disappear until next time JMP L2 is executed.The penalty for predicting the non-jumping instructions to jump only occurs when the jump to L1 is predicted. In the case that JB L1 is mispredictedly jumping, then the pipeline gets flushed And we't get the false l2 target loaded, so in this case we will not SEE The Penalty of Predickning The Non-Jumping Instructions To Jump, But We do Get The BTB Entry for JMP L2 Decrement.

Suppose, now, that we replace the INC EBX instruction above with another jump instruction. This third jump instruction will then use the same BTB entry as JMP L2 with the possible penalty of predicting a wrong target, (unless it happens to also have L2 as Target).

To Summarize, Consecutive Jumps CAN Lead to the Following Problems:

failure to load a jump target when the pipeline is being flushed by a preceding mispredicted jump. a BTB entry being mis-applied to non-jumping instructions and predicting them to jump. a second consequence of the above is that a mis-applied BTB entry will get its state decremented, possibly leading to a later misprediction of the jump it belongs to. Even unconditional jumps can be predicted to fall through for this reason. two jump instructions may share the same BTB entry, leading to the prediction of a wrong target .All this message, SO You Should Definitely Avoid Having An Instruction Pair Containing a Jump Immediately After Another Poorly Predictable Control Transfer Instruction Or ITS Target.

IT IS Time for Another Illustrative Example:

Call P

Test Eax, EAX

JZ L2

L1: MOV [EDI], EBX

Add Edi, 4

Dec EAX

JNZ L1

L2: Call P

THIS LOOKS LIKE A QUITE NICE AND NORMAL PIECE OF CODE: A Function Call, a loop Which is bypasssed WHEN THE Count Is ZERO, AND ANOTHER FUNCTION CALL. How Many PROBLEMS CAN you spot in this program?

First, we may note that the function P is called alternatingly from two different locations. This means that the target for the return from P will be changing all the time. Consequently, the return from P will always be mispredicted.

Assume, now, that EAX is zero. The jump to L2 will not have its target loaded because the mispredicted return caused a pipeline flush. Next, the second CALL P will also fail to have its target loaded because JZ L2 caused a pipeline flush. Here we have the situation where a chain of consecutive jumps makes the pipeline flush repeatedly because the first jump was mispredicted. The BTB entry for JZ L2 is stored at the address of P's return instruction. This BTB entry will now be mis-applied to whatever comes after the second CALL P, but that does not give a penalty because the pipeline is flushed by the mispredicted second return.Now, let's see what happens if EAX has a nonzero value the next time: JZ L2 is always predicted to fall through because of the flush. The second CALL P has a BTB entry at the address of TEST EAX, EAX. This entry will be mis-applied to the MOV / ADD pair, predicting it to jump to P. This causes a flush which prevents JNZ L1 from loading its target. If we have been her e before, then the second CALL P will have another BTB entry at the address of DEC EAX. On the second and third iteration of the loop, this entry will also be mis-applied to the MOV / ADD pair, until it has had its state decremented to 1 or 0. This will not cause a penalty on the second iteration because the flush from JNZ L1 prevents it from loading its false target, but on the third iteration it will. The subsequent iterations of the loop have no penalties, but when it exits, JNZ L1 is mispredicted. The flush would now prevent CALL P from loading its target, were it not for the fact that the BTB entry for CALL P has already been destroyed by being mis-applied several times.

We can Improve this code by Putting in Some Nop's to Separate All Consecutive Jumps: Call P

Test Eax, EAX

NOP

JZ L2

L1: MOV [EDI], EBX

Add Edi, 4

Dec EAX

JNZ L1

L2: NOP

NOP

Call P

The extra NOP's cost 2 clock cycles, but they save much more. Furthermore, JZ L2 is now moved to the U-pipe which reduces its penalty from 4 to 3 when mispredicted. The only problem that remains is that the returns from P are always Mispredic THIS Problem CAN Only Be Solved by Replacing The Call To P by An Inline Macro (if You Have Enough Cache).

The lesson to learn from this example is that you should always look carefully for consecutive jumps and see if you can save time by inserting some NOP's. You should be particularly aware of those situations where misprediction is unavoidable, such as loop exits and returns from procedures Which Are Called from Varying Locations. If You Have Something Useful to Put in, Instead of The...

Multiway branches (case statements) may be implemented either as a tree of branch instructions or as a list of jump addresses. If you choose to use a tree of branch instructions, then you have to include some NOP's or other instructions to separate the consecutive branches ......................................... ..

22.1.4 Tight Loops (PPLAIN)

In a small loop you will often access the same BTB entry repeatedly with small intervals. This never causes a stall. Rather than waiting for a BTB entry to be updated, the PPlain somehow bypasses the pipeline and gets the resulting state from the last jump before . it has been written to the BTB This mechanism is almost transparent to the user, but it does in some cases have funny effects: You can see a branch prediction going from state 0 to state 1, rather than to state 3, if the zero has not yet been written to the BTB. This happens if the loop has no more than four instruction pairs. In loops with only two instruction pairs you may sometimes have state 0 for two consecutive iterations without going out of the BTB. In such small loops it also happens in rare cases that the prediction uses the state resulting from two iterations ago, rather than from the last iteration. These funny effects will usually not have any negative effects on performance.22.2 Branch prediction in PMMX, PPro, PII and PIII

22.2.1 BTB Organization (PMMX, PPRO, PII AND PIII)

The branch target buffer (BTB) of the PMMX has 256 entries organized as 16 ways * 16 sets. Each entry is identified by bits 2-31 of the address of the last byte of the control transfer instruction it belongs to. Bits 2-5 define the set, and bits 6-31 are stored in the BTB as a tag. Control transfer instructions which are spaced 64 bytes apart have the same set-value and may therefore occasionally push each other out of the BTB. Since there are 16 ways Per Set, this won't happen too offten.

The branch target buffer (BTB) of the PPro, PII and PIII has 512 entries organized as 16 ways * 32 sets. Each entry is identified by bits 4-31 of the address of the last byte of the control transfer instruction it belongs to. bits 4-8 define the set, and all bits are stored in the BTB as a tag. Control transfer instructions which are spaced 512 bytes apart have the same set-value and may therefore occasionally push each other out of the BTB. Since there are 16 ways per set, this will not happen too often.The PPro, PII and PIII allocate a BTB entry to any control transfer instruction the first time it is executed. The PMMX allocates it the first time it jumps. A branch instruction which never Jumps Will Stay Out of the btb on the pmmx. As soon as it has jumped, it will stay in the btb, eveniff it never jumps again.

22.2.2 Misprediction Penalty (PMMX, PPRO, PII AND PIII)

In the PMMX, The Penalty for Misprediction of a Conditional Jump IS 4 Clocks IN The U-PIPE, AND 5 Clocks IT IT IS Executed in The V-PIPE. For All Other Control Transfer Instructions IT IS 4 Clocks.

In the PPro, PII and PIII, the misprediction penalty is very high due to the long pipeline. A misprediction usually costs between 10 and 20 clock cycles. It is therefore very important to be aware of poorly predictable branches when running on PPro, PII and PIII.

22.2.3 Pattern Recognition for Conditional Jumps (PMMX, PPRO, PII AND PIII)

These processors have an advanced pattern recognition mechanism which will correctly predict a branch instruction which, for example, is taken every fourth time and falls through the other three times. In fact, they can predict any repetitive pattern of jumps and nojumps with a period of Up to FIVE, AND MANY PATTERNS WITH HIGHER Periods.The Mechanism Is A So-Called "Two-Level Adaptive Branch Prediction Scheme", Invented By T.-Y. Yeh and Yn Patt. it is based on the same kind of two- bit counters as described above for the PPlain (but without the assymmetry flaw). The counter is incremented when the jump is taken and decremented when not taken. There is no wrap-around when counting up from 3 or down from 0. A branch instruction is predicted to be taken when the corresponding counter is in state 2 or 3, and to fall through when in state 0 or 1. An impressive improvement is now obtained by having sixteen such counters for each BTB entry. It selects one of these sixteen counters Based on the history of the branch instruction for the last four executions. If, for example, the branch instruction jumps once and then falls through three times, then you have the history bits 1000 (1 = jump, 0 = nojump). This will make it Use counter Number 8 (1000 binary = 8) for predicing the next time and update counter 8 afterwards.

If the sequence 1000 is always followed by a 1, then counter number 8 will soon end up in its highest state (state 3) so that it will always predict a 1000 sequence to be followed by a 1. It will take two deviations from this pattern to change the prediction. The repetitive pattern 100010001000 will have counter 8 in state 3, and counter 1, 2 and 4 in state 0. The other twelve counters will be unused.22.2.4 Perfectly predicted patterns (PMMX, PPro, PII and PIII)

A Repetitive Branch Pattern Is Predicted Perfectly by this Mechanism if Every PERIOD IS UNIQUE. BELOW IS A List of Repetitive Branch Patterns Which Are PredicTED Perfect

period perfectly predicted patterns 1-5all6000011, 000101, 000111, 00101170000101, 0000111, 0001011800001011, 00001111, 00010011, 00010111, 001011019000010011, 000010111, 000100111, 000101101100000100111, 0000101101, 0000101111, 0000110111, 0001010011, 00010111011100001001111, 00001010011, 00001011101, 0001010011112000010100111, 000010111101, 000011010111, 000100110111, 000100111011130000100110111, 0000100111011, 00001010011111400001001101111, 00001001111011, 00010011010111, 00010011101011, 00010110011101, 0001011010011115000010011010111, 000010011101011, 000010100110111, 000010100111011, 000010110011101, 000010110100111, 000010111010011, 000011010010111160000100110101111, 0000100111101011, 0000101100111101, 0000101101001111

When reading this table, you should be aware that if a pattern is predicted correctly than the same pattern reversed (read backwards) is also predicted correctly, as well as the same pattern with all bits inverted Example:. In the table we find the pattern : 0001011. Reversing this pattern gives: 1101000. Inverting all bits gives: 1110100. Both reversing and inverting: 0010111. These four patterns are all recognizable Rotating the pattern one place to the left gives:. 0010110. This is of course not a new pattern, only a phase shifted version of the same pattern. All patterns which can be derived from one of the patterns in the table by reversing, inverting and rotating are also recognizable. For reasons of brevity, these are not listed.It takes two periods for the pattern recognition mechanism to learn a regular repetitive pattern after the BTB entry has been allocated. The pattern of mispredictions in the learning period is not reproducible. This is probably because the BTB entry conta Ined Something Prior To Allocated According to a Random Scheme, There is Little Chance of Predicting What Happens During The Initial Learning Period.

22.2.5 Handling Deviations from A Regular Pattern (PMMX, PPRO, PII AND PIII)

The branch prediction mechanism is also extremely good at handling 'almost regular' patterns, or deviations from the regular pattern. Not only does it learn what the regular pattern looks like. It also learns what deviations from the regular pattern look like. If deviations are Always of the Same Type, The IT Will Remember What Coms After The Irregular Event, and The Deviation Will Cost Only One Misprediction.

EXAMPLE:

000111001011110111000101110010111000111000101110000

^ ^ In this sequence, a 0 means nojump, a 1 means jump. The mechanism learns that the repeated sequence is 000111. The first irregularity is an unexpected 0, which I have marked with a ^. After this 0 the next three jumps may be mispredicted, because it has not learned what comes after 0010, 0101, and 1011. After one or two irregularities of the same kind it has learned that after 0010 comes a 1, after 0101 comes 1, and after 1011 comes 1. This Means That After at Most Two Irregularity of the Same Kind, It Has Learned to Handle This Kind of Irregularity With Only One Misprediction.

The prediction mechanism is also very effective when alternating between two different regular patterns. If, for example, we have the pattern 000111 (with period 6) repeated many times, then the pattern 01 (period 2) many times, and then return to the 000111 pattern, then the mechanism does not have to relearn the 000111 pattern, because the counters used in the 000111 sequence have been left un-touched during the 01 sequence. After a few alternations between the two patterns, it has also learned to handle The Changes of Pattern with Only One Misprediction for Each Time The Pattern is Switch.

22.2.6 Patterns Which Are Not PredicTED Perfectly (PMMX, PPRO, PII AND PIII)

The SimpleSt Branch Pattern Which Cannot Be Predicted Perfectly Is A Branch Which is Taken on Every Pattern IS:

000001000001000001

^^ ^^ ^^

AB AB AB

The sequence 0000 is alternatingly followed by a 0, in the positions marked a above, and by a 1, in the positions marked b. This affects counter number 0 which will count up and down all the time. If counter 0 happens to start in state 0 or 1, then it will alternate between state 0 and 1. This will lead to a misprediction in position b. If counter 0 happens to start in state 3, then it will alternate between state 2 and 3 which will cause a misprediction in position a. The worst case is when it starts in state 2. It will alternate between state 1 and 2 with the unfortunate consequence that we get a misprediction both in position a and b. (This is analogous to the worst case for the PPlain explained above). Which of these four situations we will get depends on the history of the BTB entry prior to allocation to this branch. This is beyond our control because of the random allocation method.In principle, it is possible to avoid the worst case situation WHERE WE HAVE TWO MISPREDICTIONS PE r cycle by giving it an initial branch sequence which is specially designed for putting the counter in the desired state. Such an approach can not be recommended, however, because of the considerable extra code complexity required, and because whatever information we have put into the counter Is Likey to Be Lost During The next Timer Interrupt or Task Switch.

22.2.7 Completely Random Patterns (PMMX, PPRO, PII AND PIII)

............

The Following Table Lists The Experimental Fraction of Mispredictions for a Completely Random Sequence of Jumps and nojumps:

fraction of jumps / nojumps fraction of mispredictions 0.001 / 0.9990.0010010.01 / 0.990.01010.05 / 0.950.05250.10 / 0.900.1100.15 / 0.850.1710.20 / 0.800.2350.25 / 0.750.3000.30 / 0.700.3620.35 / 0.650.4180.40 / 0.600.4620.45 / 0.550.4900.50/0.500.500The fraction of mispredictions is slightly higher than it would be without pattern recognition because the processor keeps trying to find repeated patterns in a sequence which has no regularities.

22.2.8 Tight Loops (PMMX)

The branch prediction is not reliable in tiny loops where the pattern recognition mechanism does not have time to update its data before the next branch is met. This means that simple patterns, which would normally be predicted perfectly, are not recognized. Incidentally, some patterns which normally would not be recognized, are predicted perfectly in tight loops. for example, a loop which always repeats 6 times would have the branch pattern 111110 for the branch instruction at the bottom of the loop. This pattern would normally have one or two mispredictions per iteration, but in a tight loop it has none. The same applies to a loop which repeats 7 times. Most other repeat counts are predicted poorer in tight loops than normally. This means that a loop which iterates 6 or 7 times should preferably .

To find out whether a loop will behave as 'tight' on the PMMX you may follow the following rule of thumb:. Count the number of instructions in the loop If the number is 6 or less, then the loop will behave as tight If. you have more than 7 instructions, then you can be reasonably sure that the pattern recognition functions normally. strangely enough, it does not matter how many clock cycles each instruction takes, whether it has stalls, or whether it is paired or not. Complex integer instructions do not count. A loop can have lots of complex integer instructions and still behave as a tight loop. A complex integer instruction is a non-pairable integer instruction which always takes more than one clock cycle. Complex floating point instructions and MMX instructions still count as one. Note, that this rule of thumb is heuristic and not completely reliable. In important cases you may want to do your own testing. you can use performance monitor counter number 35H for the PMMX to count branch mispredictions. Test results may not be completely deterministic, because branch predictions may depend on the history of the BTB entry prior to allocation.Tight loops on PPro, PII and PIII are predicted normally, and take minimum two clock cycles per iteration.

22.2.9 Indirect Jumps and Calls (PMMX, PPRO, PII AND PIII)

............... ..

22.2.10 Jecxz and Loop (PMMX)

There is no pattern recognition for these two instructions in the PMMX. They are simply predicted to go the same way as last time they were executed. These two instructions should be avoided in time-critical code for PMMX. (In PPro, PII and PIII THEY Are Predicted Using Pattern Recognition, But The Loop Instruction Is Still Inferior To Dec ECX / JNZ). 22.2.11 Returns (PMMX, PPRO, PII AND PIII)

The PMMX, PPro, PII and PIII processors have a Return Stack Buffer (RSB) which is used for predicting return instructions. The RSB works as a First-In-Last-Out buffer. Each time a call instruction is executed, the corresponding return address is pushed into the RSB. and each time a return instruction is executed, a return address is pulled out of the RSB and used for prediction of the return. This mechanism makes sure that return instructions are correctly predicted when the same subroutine is called from Several DiffERENT LOCATIONS.

In order to make sure this mechanism works correctly, you must make sure that all calls and returns are matched. Never jump out of a subroutine without a return and never use a return as an indirect jump if speed is critical.

The RSB can hold four entries in the PMMX, sixteen in the PPro, PII and PIII. In the case where the RSB is empty, the return instruction is predicted in the same way as an indirect jump, ie it is expected to go to the Same Target As It Did Last Time.

On the PMMX, when subroutines are nested deeper than four levels then the innermost four levels use the RSB, whereas all subsequent returns from the outer levels use the simpler prediction mechanism as long as there are no new calls. A return instruction which uses the RSB still occupies a BTB entry. Four entries in the RSB of the PMMX does not sound of much, but it is probably sufficient. Subroutine nesting deeper than four levels is certainly not unusual, but only the innermost levels matter in terms of speed, except possibly for recursive procedures.On the PPro, PII and PIII, when subroutines are nested deeper than sixteen levels then the innermost 16 levels use the RSB, whereas all subsequent returns from the outer levels are mispredicted. Recursive subroutines should therefore not go deeper than 16 Levels.

22.2.12 static prediction in pmmx

A Control Transfer Instruction Which Has Not Been Seen Before or Which Is Not in The BTB IS ALWAYS PREDICTED TO Fall Through On The Pmmx. It Doesn't Matter WHether It Goes Forward or Backwards.

A branch instruction will not get a BTB entry if it always falls through. As soon as it is taken once, it will get into the BTB and stay there no matter how many times it falls through. A control transfer instruction can only go out of .

Any Control Transfer Instruction Which Jumps To The Address Immediately Following Itself Will Not get a btb entry. EXAMPLE:

JMP Short LL

LL:

This Instruction Will Never Get A BTB Entry and Therefore Always Have A Misprediction Penalty.

22.2.13 Static Prediction In PPRO, PII AND PIII

On PPro, PII and PIII, a control transfer instruction which has not been seen before or which is not in the BTB is predicted to fall through if it goes forwards, and to be taken if it goes backwards (eg a loop). Static prediction takes longer time than dynamic prediction on these processors.If your code is unlikely to be cached then it is preferred to have the most frequently executed branch fall through in order to improve prefetching.

22.2.14 Close Jumps (PMMX)

On The Pmmx, There Is A Risk That Two Control Transfer Instructions Will Share Too Close To each other. The obvious result is what.

The BTB entry for a control transfer instruction is identified by bits 2-31 of the address of the last byte in the instruction. If two control transfer instructions are so close together that they differ only in bits 0-1 of the address, then we Have The Problem of a Shared BTB Entry. EXAMPLE:

Call P

JNC Short L

If the last byte of the CALL instruction and the last byte of the JNC instruction lie within the same dword of memory, then we have the penalty. You have to look at the output list file from the assembler to see whether the two addresses are separated By A DWORD Boundary or Not. (A DWORD Boundary is an address divisible by 4).

There are various ways to solve this problem: 1. Move the code sequence a little up or down in memory so that you get a dword boundary between the two addresses 2. Change the short jump to a near jump (with 4 bytes displacement). so that the end of the instruction is moved further down. There is no way you can force the assembler to use anything but the shortest form of an instruction so you have to hard-code the near branch if you choose this solution. 3. Put in some instruction between the CALL and the JNC instructions. This is the easiest method, and the only method if you do not know where DWORD boundaries are because your segment is not dword aligned or because the code keeps moving up and down as you make Changes in The Preceding Code: Call P

Mov Eax, Eax; Two Bytes Filler to Be Safe

JNC Short L

If You Want to Avoid Problems on The PPLAIN TOO, THEN PUT IN Two Nop's Instead to Prevent Pairing (See Section 22.1.3 Above).

THE RET INSTRUCTION IS PARTICLARLARLARLARLY THIS This Problem Because IT IS Only One Byte Long:

Jnz next

RET

Here You May NEED UP THREE BYTES OF FILLERS:

Jnz next

NOP

Mov Eax, EAX

RET

22.2.15 Consecutive Calls or Returns (PMMX)

THE IS a Penalty WHEN THE FIRE INSTRUCTION PAIR FOLLOWING The Target Label of a Call Contains Another Call instruction or if a return Follows Immediately After ANOTHER RETURN. EXAMPLE:

Func1 Proc Near

NOP; Avoid Call After Call

NOP

Call Func2

Call Func3

Nop; Avoid Return AfTer Return

RET

Func1 ENDP

Two NOP's are required before CALL FUNC2 because a single NOP would pair with the CALL. One NOP is enough before the RET because RET is unpairable. No NOP's are required between the two CALL instructions because there is no penalty for call after return. (On the PPlain you would need two NOP's here too) .The penalty for chained calls only occurs when the same subroutines are called from more than one location (probably because the RSB needs updating). chained returns always have a penalty. There is sometimes a small Stall for a Jump after a call, but no penalty for return at call; call at return; jump, call, or return atter jump; or jump after return.

22.2.16 Chained Jumps (PPRO, PII AND PIII)

A jump, call, or return can not be executed in the first clock cycle after a previous jump, call, or return. Therefore, chained jumps will take two clock cycles for each jump, and you may want to make sure that the processor has something Else to do in Parallel. for the Same Reason, a Loop Will Take At Least Two Clock Cycles Per Iteration On these Processors.

22.2.17 Designing for Branch Predictabiligy (PMMX, PPRO, PII AND PIII)

Multiway branches (switch / case statements) are implemented either as an indirect jump using a list of jump addresses, or as a tree of branch instructions. Since indirect jumps are poorly predicted, the latter method may be preferred if easily predicted patterns can be expected And you have enough to use the the former method, the it is recommended That You Put the list of jump address.

You may want to reorganize your code so that branch patterns which are not predicted perfectly can be replaced by other patterns which are. Consider, for example, a loop which always executes 20 times. The conditional jump at the bottom of the loop is taken 19 times and falls through every 20'th time. This pattern is regular, but not recognized by the pattern recognition mechanism, so the fall-through is always mispredicted. You may make two nested loops by four and five, or unroll the loop by four and let it execute 5 times, in order to have only recognizable patterns. This kind of complicated schemes are only worth the extra code on the PPro, PII and PIII processors where mispredictions are very expensive. For higher loop counts there is no reason to do Anything About The Single Misprediction.22.3. Avoiding Jumps (All Processors)

There Can Be Many Reasons Why You May Want Reduce The Number of Jumps, Calls and Returns:

jump mispredictions are very expensive, there are various penalties for consecutive or chained jumps, depending on the processor, jump instructions may push one another out of the branch target buffer because of the random replacement algorithm, a return takes 2 clocks on PPlain and PMMX, calls and returns generate 4 uops on PPro, PII and PIII. on PPro, PII and PIII, instruction fetch may be delayed after a jump (chapter 15), and retirement may be slightly less effective for taken jumps then for other uops (chapter 18 ).

Calls and returns can be avoided by replacing small procedures with inline macros. And in many cases it is possible to reduce the number of jumps by restructuring your code. For example, a jump to a jump should be replaced by a jump to the final target . In some cases this is even possible with conditional jumps if the condition is the same or is known. A jump to a return can be replaced by a return. If you want to eliminate a return to a return, then you should not manipulate the stack pointer because that would interfere with the prediction mechanism of the return stack buffer. Instead, you can replace the preceding call with a jump. For example CALL PRO1 / RET can be replaced by JMP PRO1 if PRO1 ends with the same kind of RET. You May Also Eliminate A Jump by Dublicating The Code Jumped to. This Can Be Useful if You Have A Two-Way Branch Inside a loop or Before A Return. EXAMPLE:

A: CMP [EAX 4 * EDX], ECX

Je B

Call x

JMP C

B: Call Y

C: Inc EDX

JNZ A

MOV ESP, EBP

POP EBP

RET

THE JUMP TO C May Be Eliminated by Dublicating The Loop Epilog:

A: CMP [EAX 4 * EDX], ECX

Je B

Call x

Inc EDX

JNZ A

JMP D

B: Call Y

C: Inc EDX

JNZ A

D: MOV ESP, EBP

POP EBP

RET

The most often executed branch should come first here. The jump to D is outside the loop and therefore less critical. If this jump is executed so often that it needs optimizing too, then replace it with the three instructions following D.

22.4. Avoiding Conditional Jumps by Using Flags (All Processors)

The most important jumps to eliminate are conditional jumps, especially if they are poorly predictable. Sometimes it is possible to obtain the same effect as a branch by ingenious manipulation of bits and flags. For example you may calculate the absolute value of a signed number without Branching: CDQ

XOR EAX, EDX

Sub Eax, EDX

(On PPLAIN AND PMMX, USE MOV EDX, EAX / SAR EDX, 31 INSTEAD OF CDQ).

The Carry Flag Is Particularly Useful for this Kind of Tricks: Setting Carry if a value is Zero: CMP [Value], 1 setting carry if a value is not zero: xor Eax, Eax / Cmp Eax, [value] incrementing a counter Carry: ADC EAX, 0 SETTING A BIT for Each Time The Carry Is Set: RCL EAX, 1 Generating A Bit Mask IF Carry IS Set: SBB Eax, Eax Setting A Bit On AN Arbitrary All Bits ON AN Arbitrary Condition: xor eax, eax / setncond Al / Dec Eax (Remember to Reverse The Condition In the Last Example)

The Following Example Finds The Minimum of Two Unsigned NumBers: IF (B

SUB EBX, EAX

SBB ECX, ECX

And ECX, EBX

Add Eax, ECX

The next esample chooses between two numbers: if (a! = 0) a = b; Else A = C;

CMP EAX, 1

SBB EAX, EAX

XOR ECX, EBX

And Eax, ECX

XOR EAX, EBX

Whether or not such tricks are worth the extra code depends on how predictable a conditional jump would be, whether the extra pairing or scheduling opportunities of the branch-free code can be utilized, and whether there are other jumps following immediately after which could suffer the Penalties of Consecutive Jumps.

22.5. Replacing Conditional Jumps by Conditional Moves (PPRO, PII AND PIII)

The PPro, PII and PIII processors have conditional move instructions intended specifically for avoiding branches because branch misprediction is very time-consuming on these processors. There are conditional move instructions for both integer and floating point registers. For code that will run only on these processors you may replace poorly predictable branches with conditional moves whenever possible. If you want your code to run on all processors then you may make two versions of the most critical parts of the code, one for processors that support conditional move instructions and one for those that Don't (See Chapter 27.10 For How To Detect IF CONDitional Moves Are Supported).

The misprediction penalty for a branch may be so high that it is advantageous to replace it with conditional moves even when it costs several extra instructions. But a conditional move instruction has the disadvantage that it makes dependency chains longer. The conditional move waits for both register operands to be ready even though only one of them is needed A conditional move is waiting for three operands to be ready:.. the condition flag and the two move operands You have to consider if any of these three operands are likely to be delayed by dependency chains or cache misses. If the condition flag is available long before the move operands then you may as well use a branch, because a possible branch misprediction could be resolved while waiting for the move operands. In situations where you have to wait long for A Move Operand That May Not Be Needed After All, The Branch Will Be Faster Than The Conditional Move Despite a Possible Misprediction Penalty. The Opposite Situation IS WHE N The Condition Flag Is Delayed While Both Move Operands Areailable Early. in This Situation The Conditional Misprediction Is Likey.23. Reducing Code Size (All Processors)

AS Explained in Chapter 7, The Cache IS 8 OR 16 KB. If You Have Problems Keeping The Critical Parts of Your Code Wtem, Then You May Consider Reducing The size of your code.

32 bit code is usually bigger than 16 bit code because addresses and data constants take 4 bytes in 32 bit code and only 2 bytes in 16 bit code. However, 16 bit code has other penalties such as prefixes and problems with accessing adjacent words simultaneously ( see chapter 10.2 above). Some other methods for reducing the size or your code are discussed below.Both jump addresses, data addresses, and data constants take less space if they can be expressed as a sign-extended byte, ie if they are within The interval from -128 to 127.

For Jump Addresses this Means That SHORT JUMPS TAKE TWO BYTES OF CODE, WHEREAS JUMPS BEYOND 127 BYTES TAKE 5 BYTES IF UNCONDITIONAL AND 6 BYTES IF CONDITIONAL.

Likewise, Data Addresses Take Less Space If a Displacement Between -128 and 127. EXAMPLE: MOV EBX, DS: [100000] / Add EBX, DS: [100004]; 12 BYtes Reduce TO: MOV EAX, 100000 / MOV EBX, [EAX] / Add EBX, [EAX 4]; 10 BYTES

The advantage of using a pointer obviously increases if you use it many times. Storing data on the stack and using EBP or ESP as pointer will thus make your code smaller than if you use static memory locations and absolute addresses, provided of course that your data .

Data Constants May Also Take Less Space If 127. Most Instructions with Immediate Operands Have A Short Form Where The Operands Have A Sign-Extended Single Byte. Examples:

Push 200; 5 Bytes

Push 100; 2 Bytes

Add EBX, 128; 6 bytes

SUB EBX, -128; 3 bytes

THE MOSTANTANT INSTRUCTION WITH DOESN '的' s s s....

May Be Changed to:

XOR Eax, Eax; 2 Bytes

And

Mov Eax, 1; 5 Bytes

May Be Changed to:

XOR Eax, Eax / Inc Eax; 3 Bytes

OR:

Push 1 / Pop Eax; 3 Bytes

And

Mov Eax, -1; 5 Bytes

May Be Changed to:

OR Eax, -1; 3 bytes

If the same address or constant is used more than once then you may load it into a register. A MOV with a 4-byte immediate operand may sometimes be replaced by an arithmetic instruction if the value of the register before the MOV is known. Example :

MOV [MEM1], 200; 10 bytes

MOV [MEM2], 200; 10 bytes

MOV [MEM3], 201; 10 bytes

Mov Eax, 100; 5 Bytes

MOV EBX, 150; 5 bytes

Assuming That Mem1 and Mem3 Are Both With, THS BETES OF MEM2, this May Be change to:

Mov EBX, OFFSET MEM2; 5 BYTES

Mov Eax, 200; 5 Bytes

MOV [EBX MEM1-MEM2], EAX; 3 Bytes

MOV [EBX], EAX; 2 bytes

Inc Eax; 1 Byte

MOV [EBX MEM3-MEM2], EAX; 3 Bytes

Sub Eax, 101; 3 Bytes

Lea EBX, [EAX 50]; 3 Bytes

Be aware of the agi stall in the Lea Instruction (for PPLAIN AND PMMX).

You may also consider that different instructions have different lengths The following instructions take only one byte and are therefore very attractive:.. PUSH reg, POP reg, INC reg32, DEC reg32 INC and DEC with 8 bit registers take 2 bytes, so INC EAX Is Shorter Than Inc.

XCHG EAX, reg is also a single-byte instruction and thus takes less space than MOV EAX, reg, but it is slower.Some instructions take one byte less when they use the accumulator than when they use any other register Examples.:

Mov Eax, DS: [100000] Is Smaller Than Mov EBX, DS: [100000]

Add Eax, 1000 IS Smaller Than Add Ebx, 1000

Instructions with pointers take one byte less when they have only a base pointer (not ESP) and a displacement than when they have a scaled index register, or both base pointer and index register, or ESP as base pointer Examples.:

Mov Eax, [Array] [EBX] Is Smaller Than Mov Eax, [Array] [EBX * 4]

Mov Eax, [EBP 12] Is Smaller Than Mov Eax, [ESP 12]

INSTRUCTIONS with EBP AS BASE POINTER AND no Displacement and no index take one byte more Than Other registers:

Mov Eax, [EBX] Is Smaller Than Mov Eax, [EBP], but BUT

MOV EAX, [EBX 4] is Same Size AS MOV EAX, [EBP 4].

Instructions with a scaInd Index Pointer and no base point displacement, Even WHEN IT IS 0:

Lea Eax, [EBX EBX] IS SHORTER THAN LEA EAX, [2 * EBX].

24. Scheduling floating points (PPLAIN AND PMMX)

FLOATING Point Instructions Cannot Pair The Way Integer Instructions Case, Defined by The Following Rules:

the first instruction (executing in the U-pipe) must be FLD, FADD, FSUB, FMUL, FDIV, FCOM, FCHS, or FABS. the second instruction (in V-pipe) must be FXCH the instruction following the FXCH must be a FLOATING POINT INSTRUCTION, OTHERWISE THE FXCH WILL Pair Imperfectly and take an extra clock cycle.

This Special Pairing Is Important, AS Will Be Explained Shortly.

While Floating Point Instructions in General Cannot Be Paired, I. ONE INSTRUCTION HAS FINISHED. EXAMPLE: FADD ST (1), ST (0); Clock Cycle 1-3

Fadd ST (2), ST (0); Clock Cycle 2-4

Fadd St (3), ST (0); Clock Cycle 3-5

Fadd St (4), ST (0); Clock Cycle 4-6

Obviously, two instructions can not overlap if the second instruction needs the result of the first. Since almost all floating point instructions involve the top of stack register, ST (0), there are seemingly not very many possibilities for making an instruction independent of the result of previous instructions. The solution to this problem is register renaming. The FXCH instruction does not in reality swap the contents of two registers, it only swaps their names. Instructions which push or pop the register stack also work by renaming. Floating point register renaming has been highly optimized on the Pentiums so that a register may be renamed while in use Register renaming never causes stalls -. it is even possible to rename a register more than once in the same clock cycle, as for example when you pair FLD or FCOMPP WITH FXCH.

By The Proper Use of Fxch Instructions You Obtain A Lot of overlapping in Your Floating Point Code. EXAMPLE:

FLD [A1]; Clock Cycle 1

Fadd [A2]; Clock Cycle 2-4

FLD [B1]; Clock Cycle 3

Fadd [B2]; Clock Cycle 4-6

FLD [C1]; Clock Cycle 5

Fadd [C2]; Clock Cycle 6-8

FXCH ST (2); Clock Cycle 6

Fadd [A3]; Clock Cycle 7-9

FXCH ST (1); Clock Cycle 7

Fadd [B3]; Clock Cycle 8-10

FXCH ST (2); Clock Cycle 8

Fadd [C3]; Clock Cycle 9-11

FXCH ST (1); Clock Cycle 9

Fadd [A4]; Clock Cycle 10-12FXCH ST (2); Clock Cycle 10

Fadd [B4]; Clock Cycle 11-13

FXCH ST (1); Clock Cycle 11

Fadd [C4]; Clock Cycle 12-14

FXCH ST (2); Clock Cycle 12

In the above example we are interleaving three independent threads. Each FADD takes 3 clock cycles, and we can start a new FADD in each clock cycle. When we have started an FADD in the 'a' thread we have time to start two new FADD instructions in the 'b' and 'c' threads before returning to the 'a' thread, so every third FADD belongs to the same thread. We are using FXCH instructions every time to get the register that belongs to the desired thread into ST ( 0). As you can see in the example above, this generates a regular pattern, but note well that the FXCH instructions repeat with a period of two while the threads have a period of three. This can be quite confusing, so you have to 'Play Computer' in Order to Know Which Registers Are Where.

All versions of the instructions FADD, FSUB, FMUL, and FILD take 3 clock cycles and are able to overlap, so that these instructions may be scheduled using the method described above. Using a memory operand does not take more time than a register operand if The Memory Operand is in The Level 1 Cache and Properly Aligned.

By now you must be used to rules having exceptions, and the overlapping rule is no exception: You can not start an FMUL instruction one clock cycle after another FMUL instruction, because the FMUL circuitry is not perfectly pipelined It is recommended that you put another instruction. IN BETWEEN TWO FMUL's. EXAMPLE:

FLD [A1]; Clock Cycle 1

FLD [B1]; Clock Cycle 2

FLD [C1]; Clock Cycle 3

FXCH ST (2); Clock Cycle 3Fmul [A2]; Clock Cycle 4-6

FXCH; Clock Cycle 4

Fmul [b2]; Clock Cycle 5-7 (Stall)

FXCH ST (2); Clock Cycle 5

Fmul [C2]; Clock Cycle 7-9 (Stall)

FXCH; Clock Cycle 7

FSTP [A3]; Clock Cycle 8-9

FXCH; Clock Cycle 10 (Unpaired)

FSTP [B3]; Clock Cycle 11-12

FSTP [C3]; Clock Cycle 13-14

Here you have a stall before fmul [b2] and before fmul [c2] Because Another Fmul Started In The Preceding Clock Cycle. You Can Improve This Core by Putting Fld Instructions in Between The Fmul's:

FLD [A1]; Clock Cycle 1

Fmul [A2]; Clock Cycle 2-4

FLD [B1]; Clock Cycle 3

Fmul [b2]; Clock Cycle 4-6

FLD [C1]; Clock Cycle 5

Fmul [c2]; clock cycle 6-8

FXCH ST (2); Clock Cycle 6

FSTP [A3]; Clock Cycle 7-8

FSTP [B3]; Clock Cycle 9-10

FSTP [C3]; Clock Cycle 11-12

In Other Cases You May Put Fadd, Fsub, or Anything Else In Between Fmul's To Avoid The Stalls.

Overlapping floating point instructions requires of course that you have some independent threads that you can interleave. If you have only one big formula to execute, then you may compute parts of the formula in parallel to achieve overlapping. If, for example, you want to Add Six Numbers, THEN YOURS INTO THREADS with THREE NUMBERS in Each, Andd The Two Threads in the end:

FLD [A]; Clock Cycle 1

Fadd [b]; Clock Cycle 2-4

FLD [C]; Clock Cycle 3

Fadd [D]; Clock Cycle 4-6

FXCH; Clock Cycle 4

Fadd [e]; Clock Cycle 5-7

FXCH; Clock Cycle 5

Fadd [f]; Clock Cycle 7-9 (Stall)

Fadd; Clock Cycle 10-12 (Stall)

Here WE HAVE A One Clock Stall Before Fadd [F] Because It Is Waiting for the Result of Fadd [D] And A Two Clock Stall Before The Last Fadd Because It Is Waiting for The Result of Fadd [f]. The Latter Stall Can BE Hidden by Filling in Some Integer Instructions, But The First Stall CAN NOT BECAUSE An Integer Instruction At this place would make the fxch pair imperfect.

The first stall can be avoided by having three threads rather than two, but that would cost an extra FLD so we do not save anything by having three threads rather than two unless there are at least eight numbers to add.

Not all floating point instructions can overlap. And some floating point instructions can overlap more subsequent integer instructions than subsequent floating point instructions. The FDIV instruction, for example, takes 39 clock cycles. All but the first clock cycle can overlap with integer instructions, but Only the last two clock cycles can overlap with floating point instructions. EXAMPLE:

FDIV; Clock Cycle 1-39 (U-PIPE)

FXCH; Clock Cycle 1-2 (V-PIPE, IMPERFECT PAIRING)

SHR EAX, 1; Clock Cycle 3 (U-PIPE)

INC EBX; Clock Cycle 3 (V-PIPE)

CMC; Clock Cycle 4-5 (non-pairable)

Fadd [x]; Clock Cycle 38-40 (U-PIPE, WAITING WHILE FPU BUSY)

FXCH; Clock Cycle 38 (V-PIPE)

Fmul [Y]; Clock Cycle 40-42 (U-PIPE, WAITIF RESULT OF FDIV)

The first FXCH pairs with the FDIV, but takes an extra clock cycle because it is not followed by a floating point instruction. The SHR / INC pair starts before the FDIV is finished, but has to wait for the FXCH to finish. The FADD has to wait till clock 38 because new floating point instructions can only execute during the last two clock cycles of the FDIV. The second FXCH pairs with the FADD. The FMUL has to wait for the FDIV to finish because it uses the result of the division. If you have nothing else to put in after a floating point instruction with a large integer overlap, such as FDIV or FSQRT, then you may put in a dummy read from an address which you expect to need later in the program to make sure it is In The Level One Cache. EXAMPLE:

FDIV QWORD PTR [EBX]

CMP [ESI], ESI

Fmul Qword PTR [ESI]

Here We Use the interest overlap to pre-loading the value at [ESI] INTO The Cache While The FDIV IS Being Compute (We Don't Care What The Result of the CMP IS).

Chapter 28 Gives a Complete Listing of Floating Point Instructions, And What The can Pair or Overlap with.

There is no penalty for using a memory operand on floating point instuctions because the arithmetic unit is one step later in the pipeline than the read unit. The tradeoff of this comes when you store floating point data to memory. The FST or FSTP instruction with a memory operand takes two clock cycles in the execution stage, but it needs the data one clock earlier so you will get a one clock stall if the value to store is not ready one clock cycle in advance. This is analogous to an AGI stall. Example :

FLD [A1]; Clock Cycle 1

Fadd [A2]; Clock Cycle 2-4

FLD [B1]; Clock Cycle 3

Fadd [B2]; Clock Cycle 4-6

FXCH; Clock Cycle 4FSTP [A3]; Clock Cycle 6-7

FSTP [B3]; Clock Cycle 8-9

The FSTP [a3] stalls for one clock cycle because the result of FADD [a2] is not ready in the preceding clock cycle. In many cases you can not hide this type of stall without scheduling your floating point code into four threads or putting some integer INSTRUCTIONS in Between. The Two Clock Cycles in The Execution Stage of The FST (P) Instruction Cannot Pair or Overlap with any subssequent instructions.

Instructions with Integer Operands Such As Fiadd, Fisub, Fimul, FiDiv, Ficom May BE Split Up INTO Simple Operations in Order To Improve Overlapping. Example:

Fild [A]; Clock Cycle 1-3

Fimul [b]; Clock Cycle 4-9

Split Up Into:

Fild [A]; Clock Cycle 1-3

Fild [b]; Clock Cycle 2-4

Fmul; Clock Cycle 5-7

In this Example, You Save Two Clocks by overlapping the two.

25. Loop Optimization (All Processors)

When analyzing a program you often find that most of the time consumption lies in the innermost loop. The way to improve the speed is to carefully optimize the most time-consuming loop using assembly language. The rest of the program may be left in high- Level language.

In all the following examples it is assumed that all data are in the level 1 cache. If the speed is limited by cache misses then there is no reason to optimize the instructions. Rather, you should concentrate on organizing your data in a way that minimizes Cache Misses (See Chapter 7).

25.1. Loops in pplain and pmmx

A loop generally contains a counter controlling how many times to iterate, and often array access reading or writing one array element for each iteration. I have chosen as example a procedure which reads integers from an array, changes the sign of each integer, and stores The results in another array.ac Language Code for this Procedure Would Be:

Void ChangeSign (int * a, int * b, int N) {

INT I;

For (i = 0; i

Translating to assembly, we might Write the Procedure Like this:

EXAMPLE 1.1:

_ChangeSign Proc Near

PUSH ESI

Push EDI

A EQU DWORD PTR [ESP 12]

B EQU DWORD PTR [ESP 16]

N EQU DWORD PTR [ESP 20]

MOV ECX, [N]

Jecxz L2

MOV ESI, [A]

Mov EDI, [B]

CLD

L1: LODSD

NEG EAX

Stosd

LOOP L1

L2: POP EDI

POP ESI

Ret; (no extra pop if _cdecl calling convention)

_ChangeSign Endp

This Looks Like a Nice Solution, But it is not optimal because IT Uses Slow Non-Pairable Instructions. It Takes 11 Clock Cycles Per orthy. All Data Are..

Using Pairable Instructions Only (PPLAIN and PMMX)

EXAMPLE 1.2:

MOV ECX, [N]

MOV ESI, [A]

Test ECX, ECX

JZ Short L2

Mov EDI, [B]

L1: MOV EAX, [ESI]; U

XOR EBX, EBX; V (PAIRS)

Add ESI, 4; U

SUB EBX, EAX; V (PAIRS)

MOV [EDI], EBX; U

Add EDI, 4; V (PAIRS)

Dec ECX; U

JNZ L1; V (PAIRS)

L2:

Here we have used pairable instructions only, and scheduled the instructions so that everything pairs. It now takes only 4 clock cycles per iteration. We could have obtained the same speed without splitting the NEG instruction, but the other unpairable instructions should be split up. Using The Same Register for counter and index

EXAMPLE 1.3:

MOV ESI, [A]

Mov EDI, [B]

MOV ECX, [N]

XOR EDX, EDX

Test ECX, ECX

JZ Short L2

L1: MOV EAX, [ESI 4 * EDX]; U

NEG EAX; U

MOV [EDI 4 * EDX], EAX; U

INC EDX; V (PAIRS)

CMP EDX, ECX; U

JB L1; V (PAIRS)

L2:

Letting the counter end at zero (PPLAIN and PMMX)

We want to get rid of the CMP instruction in example 1.3 by letting the counter end at zero and use the zero flag for detecting when we are finished as we did in example 1.2. One way to do this would be to execute the loop backwards taking the last array elements first. However, data caches are optimized for accessing data forwards, not backwards, so if cache misses are likely, then you should rather start the counter at -N and count through negative values up to zero. The base registers should Then Point to the end of the arrays rather Than the beginning:

EXAMPLE 1.4:

MOV ESI, [A]

MOV EAX, [N]

Mov EDI, [B]

XOR ECX, ECX

Lea ESI, [ESI 4 * EAX]; Point To End of Array A

SUB ECX, EAX; -N

Lea EDI, [EDI 4 * EAX]; Point to End of Array Bjz Short L2

L1: MOV EAX, [ESI 4 * ECX]; U

NEG EAX; U

MOV [EDI 4 * ECX], EAX; U

INC ECX; V (PAIRS)

JNZ L1; U

L2:

We are now down at five instructions in the loop body but it still takes 4 clocks because of poor pairing. (If the addresses and sizes of the arrays are constants we may save two registers by substituting A SIZE A for ESI and B SIZE B for edi). Now Let's See How We can IMPROVE PAIRING.

Pairing Calculation with loop overhead (PPLAIN and PMMX)

We may want to improve pairing by intermingling calculations with the loop control instructions. If we want to put something in between INC ECX and JNZ L1, it has to be something that does not affect the zero flag. The MOV [EDI 4 * ECX], EBX INSTRUCTION AFTER INC ECX Would Generate An Agi Delay, So We Have to Be More Ingenious:

EXAMPLE 1.5:

MOV EAX, [N]

XOR ECX, ECX

SHL EAX, 2; 4 * N

JZ Short L3

MOV ESI, [A]

Mov EDI, [B]

SUB ECX, EAX; - 4 * N

Add ESI, ESI; Point TO End of Array A

Add Edi, EAX; Point to End of Array B

JMP Short L2

L1: MOV [EDI ECX-4], EAX; U

L2: MOV EAX, [ESI ECX]; V (PAIRS)

XOR EAX, -1; U

Add ECX, 4; V (PAIRS)

Inc Eax; u

JNC L1; V (PAIRS)

MOV [EDI ECX-4], EAX

L3:

I have used a different way to calculate the negative of EAX here: inverting all bits and adding one The reason why I am using this method is that I can use a dirty trick with the INC instruction:. INC does not change the carry flag , whereas ADD does. I am using ADD rather than INC to increment my loop counter and testing the carry flag rather than the zero flag. It is then possible to put the INC EAX in between without affecting the carry flag. You may think that we could have used LEA EAX, [EAX 1] here instead of INC EAX, at least that does not change any flags, but the LEA instruction would have an AGI stall so that's not the best solution. Note that the trick with the INC instruction not changing the carry flag is useful only on PPlain and PMMX, but will cause a partial flags stall on PPro, PII and PIII.I have obtained perfect pairing here and the loop now takes only 3 clock cycles. Whether you want to increment the LOOP Counter by 1 (as in example 1.4) or by 4 (as in example 1.5) IS A Matter of Taste, IT Makes No Difference In Loop Timing.

Overlapping the end of one Operation with the beginning of the next (PPLAIN AND PMMX)

The method used in example 1.5 is not very generally applicable so we may look for other methods of improving pairing opportunities. One way is to reorganize the loop so that the end of one operation overlaps with the beginning of the next. I will call this convoluting the loop. A convoluted loop has an unfinished operation at the end of each loop iteration which will be finished in the next run. Actually, example 1.5 did pair the last MOV of one iteration with the first MOV of the next, but we want to Explore this Method Further:

EXAMPLE 1.6:

MOV ESI, [A]

MOV EAX, [N]

Mov EDI, [B]

XOR ECX, ECX

Lea ESI, [ESI 4 * EAX]; Point To End of Array A

SUB ECX, EAX; -N

Lea EDI, [EDI 4 * EAX]; Point to End of Array B

JZ Short L3

XOR EBX, EBX

MOV EAX, [ESI 4 * ECX]

Inc ECX

JZ Short L2

L1: SUB EBX, EAX; U

MOV Eax, [ESI 4 * ECX]; V (PAIRS)

MOV [EDI 4 * ECX-4], EBX; U

INC ECX; V (PAIRS)

MOV EBX, 0; U

JNZ L1; V (PAIRS)

L2: SUB EBX, EAX

MOV [EDI 4 * ECX-4], EBX

L3:

Here we begin reading the second value before we have stored the first, and this of course improves pairing opportunities. The MOV EBX, 0 instruction has been put in between INC ECX and JNZ L1 not to improve pairing but to avoid AGI stall.

ROLLING OUT A LOOP (PPLAIN AND PMMX)

THE MOST Generally Applicable Way to Improve Pairing Opportunities Is To Do Two Operations for Each Run and Do Half As Many Runs. This is Called Rolling Out a loop:

EXAMPLE 1.7:

MOV ESI, [A]

MOV EAX, [N]

Mov EDI, [B]

XOR ECX, ECX

Lea ESI, [ESI 4 * EAX]; Point To End of Array A

SUB ECX, EAX; -N

Lea EDI, [EDI 4 * EAX]; Point to End of Array B

JZ Short L2

TEST Al, 1; Test IF N IS ODD

JZ Short L1

MOV EAX, [ESI 4 * ECX]; N Is Odd. Do the odd one

NEG EAX

MOV [EDI 4 * ECX], EAX

INC ECX; Make Counter Even

JZ Short L2; N = 1

L1: MOV EAX, [ESI 4 * ECX]; U

MOV EBX, [ESI 4 * ECX 4]; V (PAIRS)

NEG EAX; Uneg EBX; U

MOV [EDI 4 * ECX], EAX; U

MOV [EDI 4 * ECX 4], EBX; V (PAIRS)

Add ECX, 2; U

JNZ L1; V (PAIRS)

L2:

Now we are doing two operations in parallel which gives the best pairing opportunities. We have to test if N is odd and if so do one operation outside the loop because the loop can only do an even number of operations.

............................ ...

Reorganizing a loop to remove agi stall (PPLAIN and PMMX)

EXAMPLE 1.8:

MOV ESI, [A]

MOV EAX, [N]

Mov EDI, [B]

XOR ECX, ECX

Lea ESI, [ESI 4 * EAX]; Point To End of Array A

SUB ECX, EAX; -N

Lea EDI, [EDI 4 * EAX]; Point to End of Array B

JZ Short L3

TEST Al, 1; Test IF N IS ODD

JZ Short L2

MOV EAX, [ESI 4 * ECX]; N Is Odd. Do the odd one

Neg Eax; No Pairing Opportunity

MOV [EDI 4 * ECX-4], EAX

INC ECX; Make Counter Even

Jnz Short L2

NOP; Add Nop's if Jnz L2 Not Predictable

NOP

JMP Short L3; N = 1

L1: NEG EAX; U

NEG EBX; U

MOV [EDI 4 * ECX-8], EAX; U

MOV [EDI 4 * ECX-4], EBX; V (PAIRS)

L2: MOV EAX, [ESI 4 * ECX]; U

MOV EBX, [ESI 4 * ECX 4]; V (PAIRS)

Add ECX, 2; U

JNZ L1; V (PAIRS) NEG EAX

NEG EBX

MOV [EDI 4 * ECX-8], EAX

MOV [EDI 4 * ECX-4], EBX

L3:

The trick is to find a pair of instructions that do not use the loop counter as index and reorganize the loop so that the counter is incremented in the preceding clock cycle. We are now down at 5 clock cycles for two operations which is close to the Best Possible.

If data caching is critical, then you may improve the speed further by interleaving the A and B arrays into one structured array so that each B [i] comes immediately after the corresponding A [i]. If the structured array is aligned by at least 8 THEN B [I] Will Always Be in The Same Cache Line AS A [I], SO You WRIL NEVER HAVE A Cache Miss When Writing B [I]. This may of course Have a tradeoff in Other Parts of the Program So you Have to weigh the costs against the benefits.

Rolling Out by more more Than 2 (PPLAIN AND PMMX)

You may think of doing more than two operations per iteration in order to reduce the loop overhead per operation. But since the loop overhead in most cases can be reduced to only one clock cycle per iteration, then rolling out the loop by 4 rather than by 2 Would Only Save 1/4 Clock Cycle Per Operation, Which IS Hardly Worth The Effort. Only et...

The DrawBacks of Excessive Loop Unrolling Are:

You need to calculate N MODULO R, where R is the unrolling factor, and do N MODULO R operations before or after the main loop in order to make the remaining number of operations divisible by R. This takes a lot of extra code and poorly predictable branches. and the loop body of course also becomes bigger. A Piece of code usually takes much more time the first time it executes, and the penalty of first time execution is bigger the more code you have, especially if N is small. Excessive code Size Makes The Utilization of the Cache Less Effective.Handling Multiple 8 Or 16 Bit Operands Simultaneously IN 32 Bit Registers (PPLAIN and PMMX)

If you need to manipulate arrays of 8 or 16 bit operands, then there is a problem with unrolled loops because you may not be able to pair two memory access operations. For example MOV AL, [ESI] / MOV BL, [ESI 1 .

The Following Example Adds 2 To All Elements of an Array of Bytes:

EXAMPLE 1.9:

Mov ESI, [A]; Address of Byte Array

MOV ECX, [N]; Number of Elements in byte Array

Test ECX, ECX; Test IF N IS 0

JZ Short L2

MOV Eax, [ESI]; Read First Four Bytes

L1: MOV EBX, EAX; Copy Into EBX

And Eax, 7f7f7f7fh; get Lower 7 Bits of Each Byte in EAX

XOR EBX, EAX; Get The Highest Bit of Each Byte

Add Eax, 020202H; Add Desired Value To All FourTes

XOR EBX, EAX; Combine Bits Again

MOV Eax, [ESI 4]; Read next FourTes

MOV [ESI], EBX; Store Result

Add ESI, 4; Increment Pointersub ECX, 4; DECREMENT LOOP Counter

Ja L1; LOOP

L2:

This loop takes 5 clock cycles for every 4 bytes. The array should of course be aligned by 4. If the number of elements in the array is not divisible by four, then you may padd it in the end with a few extra bytes to make The Loop Will ALWAYS Read Past The Loop Will Always Read The end of the array, so you're on the side of the array is not placed at the end of a segment to avoid a general protection ire.

Note That i Have Masked Out The Highest Bit of Each Byte To Avoid A Possible Carry from Each Byte Id The next. I am Using Xor Rather Than Adding. I am Using Xor Rather Than Add by Avoid Carry.

The ADD ESI, 4 instruction could have been avoided by using the loop counter as index as in example 1.4. However, this would give an odd number of instructions in the loop body, so there would be one unpaired instruction and the loop would still take 5 clocks. Making the branch instruction unpaired would save one clock after the last operation when the branch is mispredicted, but we would have to spend an extra clock cycle in the prolog code to setup a pointer to the end of the array and calculate -N .

The next example finds the length of a zero-terminated string by search for the first byte of zero. It is faster Than Using Rep ScaSB:

EXAMPLE 1.10:

Strlen Proc Near

MOV EAX, [ESP 4]; Get Pointer

Mov EDX, 7

Add Edx, Eax; Pointer 7 Used in the end

Push EBX

MOV EBX, [EAX]; Read First 4 bytes

Add Eax, 4; Increment Pointerl1: Lea ECX, [EBX-01010101H]; Subtract 1 from Each Byte

XOR EBX, -1; Invert All Bytes

And ECX, EBX; and these TWO

MOV EBX, [EAX]; Read next 4 bytes

Add Eax, 4; Increment Pointer

And ECX, 80808080H; Test All Sign Bits

JZ L1; No ZERO BYTES, Continue Loop

Test ECX, 00008080H; Test First Two Bytes

Jnz Short L2

SHR ECX, 16; Not in the first 2 bytes

Add Eax, 2

L2: SHL CL, 1; Use Carry Flag To Avoid A Branch

POP EBX

SBB EAX, EDX; Compute Length

RET

Strlen ENDP

Again we have used the method of overlapping the end of one operation with the beginning of the next to improve pairing. I have not unrolled the loop because it is likely to repeat relatively few times. The string should of course be aligned by 4. The .

The loop body has an odd number of instructions so there is one unpaired. Making the branch instruction unpaired rather than one of the other instructions has the advantage that it saves 1 clock cycle when the branch is mispredicted.

The TEST ECX, 00008080H instruction is non-pairable. You could use the pairable instruction OR CH, CL here instead, but then you would have to put in a NOP or something to avoid the penalties of consecutive branches. Another problem with OR CH, CL Is That It Would Cause A Partial Register Stall ON A PPRO, PII AND PIII. SO I HAVE Chosen To Keep The Unpairable Test Instruction.

Handling 4 bytes simultaneously can be quite difficult. The code uses a formula which generates a nonzero value for a byte if, and only if, the byte is zero. This makes it possible to test all four bytes in one operation. This algorithm involves the subtraction of 1 from all bytes (in the LEA instruction). I have not masked out the highest bit of each byte before subtracting, as I did in the previous example, so the subtraction may generate a borrow to the next byte, but only if it is zero, and this is exactly the situation where we do not care what the next byte is, because we are searching forwards for the first zero. If we were searching backwards then we would have to re-read the dword after detecting a .

If you want to search for a byte value other than zero, the you make you are search for, and then used the method Above to search for zero.

LOOPS with MMX Operations (PMMX)

Returning to the problem of adding two to alltes in an Array, WE May Take Advantage of The MMX Instructions:

EXAMPLE 1.11:

.DATA

ALIGN 8

Addents DQ 0202020202020202H; Specify Byte To Add Eight Times

A DD?; Address of Byte Array

N DD?; Number of Items

.code

MOV ESI, [A]

MOV ECX, [N]

MOVQ MM2, [Addents]

JMP Short L2

; TOP of LOOP

L1: MOVQ [ESI-8], MM0; Store Result

L2: MOVQ MM0, MM2; LOAD Addentspaddb MM0, [ESI]; Add Eight Bytes in One Operation

Add ESI, 8

Dec ECX

JNZ L1

MOVQ [ESI-8], MM0; Store Last Result

EMMS

The Store Instruction is Moved to After the loop Control Instructions in Order To Avoid A Store Stall.

This loop takes 4 clocks because the PADDB instruction does not pair with ADD ESI, 8. (A MMX instruction with memory access can not pair with a non-MMX instruction or with another MMX instruction with memory access). We could get rid of ADD ESI, 8 by Using ECX AS INDEX, But That Would Give An Agi Stall.

Since the loop overhead is considierable we might want to unroll the loop:

EXAMPLE 1.12:

.DATA

ALIGN 8

Addents DQ 0202020202020202H; Specify Byte to Add Eight

Times

A DD?; Address of Byte Array

N DD?; Number of Items

.code

MOVQ MM2, [Addents]

MOV ESI, [A]

MOV ECX, [N]

MOVQ MM0, MM2

MOVQ MM1, MM2

L3: Paddb MM0, [ESI]

PADDB MM1, [ESI 8]

MOVQ [ESI], MM0

MOVQ MM0, MM2

MOVQ [ESI 8], MM1

MOVQ MM1, MM2

Add ESI, 16

Dec ECX

JNZ L3

EMMS

This Unrolled Loop Takes 6 Clocks Per Iteration for Adding 16 Bytes. The Paddb Instructions Are Not Paired. The Two Threads AreaTLLAVED TO AVOID A Store Stall.

Using The MMX Instructions Has A High Penalty if you are using floating point instructions Shortly AfterWards, So There May Still Be Situations Where You Want To Use 32 Bit Registers as in Example 1.9.

Loops with floating point operations (PPLAIN and PMMX)

THE Methods of Optimizing Floating Point Loops Are Basically The Same As For Integer Loops, Althought The Floating Point Instructions Are Overlapping Rather Than Pairing.consider The C Language Code:

INT I, N; Double * X; Double * Y; Double DA;

For (i = 0; i

This Piece Of Code (Called Daxpy) HAS Been Studied Extensively Because It is the key to solid linear equations.

EXAMPLE 1.13:

DSIZE = 8; Data Size

MOV Eax, [N]; Number of Elements

Mov ESI, [x]; Pointer TO

Mov EDI, [Y]; Pointer to Y

XOR ECX, ECX

Lea ESI, [ESI DSIZE * EAX]; Point to End of X

SUB ECX, EAX; -N

Lea EDI, [EDI DSIZE * EAX]; Point to End of Y

JZ Short L3; Test for n = 0

FLD DSize PTR [DA]

Fmul Dsize PTR [ESI DSIZE * ECX]; da * x [0]

JMP Short L2; Jump INTO LOOP

L1: Fld Dsize PTR [Da]

Fmul Dsize PTR [ESI DSIZE * ECX]; DA * X [i]

FXCH; GET OLD RESULT

FSTP DSIZE PTR [EDI DSIZE * ECX-DSIZE]; store y [i]

L2: fsubr Dsize PTR [EDI DSIZE * ECX]; Subtract from Y [I]

INC ECX; Increment Index

JNZ L1; LOOP

FSTP DSIZE PTR [EDI DSIZE * ECX-DSIZE]; Store Last Result

L3:

Here we are using the same methods as in example 1.6: Using the loop counter as index register and counting through negative values up to zero The end of one operation overlaps with the beginning of the next.The interleaving of floating point operations work perfectly here. : The 2 clock stall between FMUL and FSUBR is filled with the FSTP of the previous result The 3 clock stall between FSUBR and FSTP is filled with the loop overhead and the first two instructions of the next operation An AGI stall has been avoided by.. Reading the Only Parameter That Doesn't Depend On The Index in The First Clock Cycle After The INDEX HAS BEEN Increment.

THIS SOLUTION TAKES 6 Clock Cycles Per Operation, Which is Better Than The Unrolled Solution Published by Intel!

Unrolling floating point loops (PPLAIN and PMMX)

The daxpy loop unrolled by 3 is quite completed:

EXAMPLE 1.14:

DSIZE = 8; Data Size

IF DSIZE EQ 4

Shiftcount = 2

Else

Shiftcount = 3

ENDIF

MOV Eax, [N]; Number of Elements

Mov ECX, 3 * DSIZE; Counter Bias

SHL EAX, ShiftCount; Dsize * N

JZ L4; n = 0

Mov ESI, [x]; Pointer TO

SUB ECX, EAX; (3-N) * DSIZE

Mov EDI, [Y]; Pointer to Y

SUB ESI, ECX; End of Pointer - BIAS

Sub EDI, ECX

Test ECX, ECX

FLD DSize PTR [ESI ECX]; First X

JNS Short L2; Less Than 4 Operations

L1:; main loop

Fmul Dsize PTR [Da]

FLD DSize PTR [ESI ECX DSize]

Fmul Dsize PTR [Da]

FXCH

FSUBR DSIZE PTR [EDI ECX]

FXCH

FLD DSize PTR [ESI ECX 2 * Dsize] Fmul Dsize PTR [DA]

FXCH

Fsubr Dsize PTR [EDI ECX DSize]

FXCH ST (2)

FSTP DSIZE PTR [EDI ECX]

FSUBR DSIZE PTR [EDI ECX 2 * Dsize]

FXCH

FSTP DSIZE PTR [EDI ECX DSIZE]

FLD DSIZE PTR [ESI ECX 3 * Dsize]

FXCH

FSTP DSIZE PTR [EDI ECX 2 * Dsize]

Add ECX, 3 * DSIZE

JS L1; LOOP

L2: Fmul Dsize PTR [Da]; Finish Leftover Operation

FSUBR DSIZE PTR [EDI ECX]

SUB ECX, 2 * DSIZE; Change Pointer Bias

JZ Short L3; Finished

FLD DSize PTR [Da]; start next operation

Fmul DSize PTR [ESI ECX 3 * Dsize]

FXCH

FSTP DSIZE PTR [EDI ECX 2 * Dsize]

FSUBR DSIZE PTR [EDI ECX 3 * Dsize]

Add Ecx, 1 * Dsize

JZ Short L3; Finished

FLD DSize PTR [DA]

Fmul DSize PTR [ESI ECX 3 * Dsize]

FXCH

FSTP DSIZE PTR [EDI ECX 2 * Dsize]

FSUBR DSIZE PTR [EDI ECX 3 * Dsize]

Add Ecx, 1 * Dsize

L3: FSTP DSIZE PTR [EDI ECX 2 * Dsize]

L4:

The reason why I am showing you how to unroll a loop by 3 is not to recommend it, but to warn you how difficult it is! Be prepared to spend a considerable amount of time debugging and verifying your code when doing something like this. There are several problems to take care of: in most cases, you can not remove all stalls from a floating point loop unrolled by less than 4 unless you convolute it (ie there are unfinished operations at the end of each run which are being finished in the next run). The last FLD in the main loop above is the beginning of the first operation in the next run. It would be tempting here to make a solution which reads past the end of the array and then discards the extra value in the end, as in example 1.9 and 1.10, but that is not recommended in floating point loops because the reading of the extra value might generate a denormal operand exception in case the memory position after the array does not contain a valid floating point number. To avoid this , WE HAV e to do at least one more operation after the main loop.The number of operations to do outside an unrolled loop would normally be N MODULO R, where N is the number of operations, and R is the unrolling factor. But in the case of A CONVOLUTED LOOP, WE Have to Do One More, IE (N-1) MODULO R 1, for the Abovementied Reason.

Normally, we would prefer to do the extra operations before the main loop, but here we have to do them afterwards for two reasons: One reason is to take care of the leftover operand from the convolution The other reason is that calculating the number of. Extra Operations Requires A Division IF R IS NOT A POWER OF 2, AND A DIVISION IS TIME CONSUMING. DOING The Extra Operations After The loop Saves The Division.

The next problem is to calculate how to bias the loop counter so that it will change sign at the right time, and adjust the base pointers so as to compensate for this bias. Finally, you have to make sure the leftover operand from the convolution is Handled Correctly for All Values of N.

Now That i Have Scared You by DemonStrating How Difficult It Is To Unroll by 3, I Will Show You That IT IS MUCH Easier To Unroll by 4:

EXAMPLE 1.15:

DSIZE = 8; Data Size

MOV Eax, [N]; Number of Elements

Mov ESI, [x]; Pointer TO

Mov EDI, [Y]; Pointer to Y

XOR ECX, ECX

Lea ESI, [ESI DSIZE * EAX]; Point to End of X

SUB ECX, EAX; -N

Lea EDI, [EDI DSIZE * EAX]; Point to End of Y

TEST Al, 1; Test IF N IS ODD

JZ Short L1

FLD DSize PTR [Da]; Do The Odd Operation

Fmul Dsize PTR [ESI DSIZE * ECX]

FSUBR DSIZE PTR [EDI DSIZE * ECX]

Inc ECX; Adjust Counter

FSTP DSIZE PTR [EDI DSIZE * ECX-DSIZE]

L1: Test Al, 2; Test for Possibly 2 more Operations

JZ L2

FLD DSize PTR [Da]; N mod 4 = 2 or 3. Do Two More

Fmul Dsize PTR [ESI DSIZE * ECX]

FLD DSize PTR [DA]

Fmul Dsize PTR [ESI DSIZE * ECX DSIZE]

FXCH

FSUBR DSIZE PTR [EDI DSIZE * ECX]

FXCH

FSUBR DSIZE PTR [EDI DSIZE * ECX DSIZE]

FXCH

FSTP DSIZE PTR [EDI DSIZE * ECX]

FSTP DSIZE PTR [EDI DSIZE * ECX DSize] Add ECX, 2; Counter is now Divisible by 4

L2: Test ECX, ECX

JZ L4; No more Operations

L3:; main loop:

FLD DSize PTR [DA]

FLD DSIZE PTR [ESI DSIZE * ECX]

Fmul St, ST (1)

FLD DSize PTR [ESI DSIZE * ECX DSIZE]

Fmul St, ST (2)

FLD DSIZE PTR [ESI DSIZE * ECX 2 * Dsize]

Fmul St, ST (3)

FXCH ST (2)

FSUBR DSIZE PTR [EDI DSIZE * ECX]

FXCH ST (3)

Fmul Dsize PTR [ESI DSIZE * ECX 3 * Dsize]

FXCH

FSUBR DSIZE PTR [EDI DSIZE * ECX DSIZE]

FXCH ST (2)

FSUBR DSIZE PTR [EDI DSIZE * ECX 2 * Dsize]

FXCH

FSUBR DSIZE PTR [EDI DSIZE * ECX 3 * Dsize]

FXCH ST (3)

FSTP DSIZE PTR [EDI DSIZE * ECX]

FSTP DSIZE PTR [EDI DSIZE * ECX 2 * Dsize]

FSTP DSIZE PTR [EDI DSIZE * ECX DSIZE]

FSTP DSIZE PTR [EDI DSIZE * ECX 3 * Dsize]

Add ECX, 4; Increment Index by 4

JNZ L3; LOOP

L4:

It is usually quite easy to find a stall-free solution when unrolling by 4, and there is no need for convolution. The number of extra operations to do outside the main loop is N MODULO 4, which can be calculated easily without division, simply ......................

The tradeoff of loop unrolling is that the extra operations outside the loop are slower due to incomplete overlapping and possible branch mispredictions, and the first time penalty is higher because of increased code size.

As a general recommendation, I would say that if N is big or if convoluting the loop without unrolling can not remove enough stalls, then you should unroll critical integer loops by 2 and floating point loops by 4.25.2 Loops in PPro, PII and PIII

In the previous chapter (25.1) I explained how to use convolution and loop unrolling in order to improve pairing in PPlain and PMMX. On the PPro, PII and PIII there is no reason to do this thanks to the mechanism out-of-order execution . But There is...................

I have chosen the same example as in chapter 25.1 for the previous microprocessors: a procedure which reads integers from an array, changes the sign of each integer, and stores the results in another array.

A c Language Code for this Procedure Would Be:

Void ChangeSign (int * a, int * b, int N) {

INT I;

For (i = 0; i

Translating to assembly, we might Write the Procedure Like this:

EXAMPLE 2.1:

_ChangeSign Proc Near

PUSH ESI

Push EDI

A EQU DWORD PTR [ESP 12]

B EQU DWORD PTR [ESP 16]

N EQU DWORD PTR [ESP 20]

MOV ECX, [N]

Jecxz L2

MOV ESI, [A]

Mov EDI, [B]

CLD

L1: LODSD

NEG EAX

Stosd

LOOP L1

L2: POP EDI

POP ESI

RET

_ChangeSign Endp

This looks like a nice solution, but it is not optimal because it uses the non-optimal instructions LOOP, LODSD and STOSD that generate many uops. It takes 6-7 clock cycles per iteration if all data are in the level one cache. Avoiding THESE INSTRUCTIONS WE GET:

EXAMPLE 2.2:

MOV ECX, [N]

Jecxz L2

MOV ESI, [A]

Mov EDI, [B]

Align 16L1: MOV Eax, [ESI]; Len = 2, P2Resiweax

Add ESI, 4; LEN = 3, P01RWESIWF

NEG EAX; LEN = 2, p01rweaxwf

MOV [EDI], EAX; Len = 2, P4REAX, P3redi

Add EDI, 4; LEN = 3, P01Rwediwf

Dec ECX; LEN = 1, p01rwecxwf

JNZ L1; LEN = 2, P1RF

L2:

The comments are interpreted as follows: the MOV EAX, [ESI] instruction is 2 bytes long, it generates one uop for port 2 that reads ESI and writes to (renames) EAX This information is needed for analyzing the possible bottlenecks..

Let's first analyze the instruction decoding (chapter 14):.. One of the instructions generates 2 uops (MOV [EDI], EAX) This instruction must go into decoder D0 There are three decode groups in the loop so it can decode in 3 clock Cycles.

Next, let's look at the instruction fetch (chapter 15): If an ifetch boundary prevents the first three instructions from decoding together then there will be three decode groups in the last ifetch block so that the next iteration will have the ifetch block starting at the first instruction where we want it, and we will get a delay only in the first iteration. A worse situation would be a 16-byte boundary and an ifetch boundary in one of the last three instructions. According to the ifetch table, this will generate a delay of 1 clock and cause the next iteration to have its first ifetch block aligned by 16, so that the problem continues through all iterations. The result is a fetch time of 4 clocks per iteration rather than 3. There are two ways to prevent this situation: the first method is to control where the ifetch blocks lie on the first iteration; the second method is to control where the 16-byte boundaries are The latter method is the easiest Since the entire loop has only 15 b.. ytes of code you can avoid any 16-byte boundary by aligning the loop entry by 16, as shown above. This will put the entire loop into a single ifetch block so that no further analysis of instruction fetching is needed.The third problem to look AT is Register Read Stalls (Chapter 16). No register is read in this loop wels Being Written to at Least A Few Clock Cycles Before, So Thee Can Be No Register Read Stalls.

THE FOURTER 17). Counting the uops for the Different Ports We get: Port 0 OR 1: 4 UOPS Port 1 Only: 1 UOP Port 2: 1 UOP Port 3: 1 UOP Port 4: 1 UOP Assuming That THE UOPS That Can Go To Either Port 0 OR 1 Are Distributed Optimally, The Execution Time Will BE 2.5 Clocks Per Iteration.

The last analysis is retirement (chapter 18). Since the number of uops in the loop is not divisible by 3, the retirement slots will not be used optimally when the jump has to retire in the first slot. The time needed for retirement is the number of uops divided by 3, and rounded up to nearest integer. This gives 3 clocks for retirement.In conclusion, the loop above can execute in 3 clocks per iteration if the loop entry is aligned by 16. I am assuming that the conditional jump IS Predicted Every Time Except on the exit of the loop (Chapter 22.2).

USING THE SAME Register for counter and index and letting the counter End At Zero (PPRO, PII AND PIII)

EXAMPLE 2.3:

MOV ECX, [N]

MOV ESI, [A]

Mov EDI, [B]

Lea ESI, [ESI 4 * ECX]; Point To End of Array A

Lea EDI, [EDI 4 * ECX]; Point to End of Array B

NEG ECX; -N

JZ Short L2

ALIGN 16

L1: MOV EAX, [ESI 4 * ECX]; len = 3, P2Resirecxweax

NEG EAX; LEN = 2, p01rweaxwf

MOV [EDI 4 * ECX], EAX; LEN = 3, P4REAX, P3redirecx

INC ECX; LEN = 1, p01rwecxwf

JNZ L1; LEN = 2, P1RF

L2:

Here we have reduced the number of uops to 6 by using the same register as counter and index. The base point the the the end of the arrays soed to ing...............

Decoding: The Loop So IT Will Decode in 2 Clocks.

Instruction fetch:. A loop always takes at least one clock cycle more than the the number of 16 byte blocks Since there are only 11 bytes of code in the loop it is possible to have it all in one ifetch block By aligning the loop entry. by 16 we can make sure that we do not get more than one 16-byte block so that it is possible to fetch in 2 clocks.Register read stalls: The ESI and EDI registers are read, but not modified inside the loop They. will therefore be counted as permanent register reads, but not in the same triplet. Register EAX, ECX, and flags are modified inside the loop and read before they are written back so they will cause no permanent register reads. The conclusion is that there are No register read stalls.

Execution: Port 0 OR 1: 2 UOPS Port 1: 1 UOP Port 2: 1 UOP Port 3: 1 UOP Port 4: 1 UOP Execution Time: 1.5 Clocks.

Retirement: 6 uops = 2 clocks.

Conclusion: this loop takes Only 2 Clock Cycles Per ity.

If you use absolute addresses instead of esi and edi tells Because it cannot be contained in a single 16-byte block.

UNROLLING A LOOP (PPRO, PII AND PIII)

Doing more than one operation in each run and doing correspondingly fewer runs is called loop unrolling. In previous processors you would unroll loops to get parallel execution by pairing (chapter 25.1). In PPro, PII and PIII this is not needed because the out- of-order execution mechanism takes care of that. There is no need to use two different registers either, because register renaming takes care of this. The purpose of unrolling here is to reduce the loop overhead per iteration.

The Following Example Is The Same As Example 2.2, But Unrolled by 2, Which Means That You Do Two Operations Per Itexample 2.4:

MOV ECX, [N]

MOV ESI, [A]

Mov EDI, [B]

SHR ECX, 1; N / 2

JNC Short L1; Test IF N WAS ODD

Mov Eax, [ESI]; Do The Odd One First

Add ESI, 4

NEG EAX

MOV [EDI], EAX

Add Edi, 4

L1: Jecxz L3

ALIGN 16

L2: MOV Eax, [ESI]; len = 2, p2resiweax

NEG EAX; LEN = 2, p01rweaxwf

MOV [EDI], EAX; Len = 2, P4REAX, P3redi

MOV Eax, [ESI 4]; len = 3, p2resiweax

NEG EAX; LEN = 2, p01rweaxwf

MOV [EDI 4], EAX; LEN = 3, P4REAX, P3redi

Add ESI, 8; LEN = 3, P01RWESIWF

Add EDI, 8; LEN = 3, P01Rwediwf

Dec ECX; LEN = 1, p01rwecxwf

JNZ L2; LEN = 2, P1RF

L3:

In example 2.2 the loop overhead (ie adjusting pointers and counter, and jumping back) was 4 uops and the 'real job' was 4 uops. When unrolling the loop by two you do the 'real job' twice and the overhead once, so you get 12 uops in all. This reduces the overhead from 50% to 33% of the uops. Since the unrolled loop can do only an even number of operations you have to check if N is odd and if so do one operation outside the loop .

Analyzing instruction fetching in this loop we find that a new ifetch block begins in the ADD ESI, 8 instruction, forcing it into decoder D0. This makes the loop decode in 5 clock cycles and not 4 as we wanted. We can solve this problem by Coding the preceding instruction in a longer version. Change MoV [EDI 4], EAX TO:

MOV [EDI 9999], EAX; Make Instruction with long displaceorg $ -4

DD 4; Rewrite Displacement to 4

This will force a new ifetch block to begin at the long MOV [EDI 4], EAX instruction, so that decoding time is now down at 4 clocks. The rest of the pipeline can handle 3 uops per clock so that the expected execution time IS 4 Clocks Per ity, or 2 Clocks Per Operation.

Testing this solution shows that it actually takes a little more. My measurements showed approximately 4.5 clocks per iteration. This is probably due to a sub-optimal reordering of the uops. Possibly, the ROB does not find the optimal execution-order for the UOPS But Submits Theme. This Problem WAS NOT PREDICTED, AND ONLY TESTING CAN REVEAL SUCH A Problem. We May Help the Rob by doing some of the reordering manually:

EXAMPLE 2.5:

ALIGN 16

L2: MOV Eax, [ESI]; len = 2, p2resiweax

MOV EBX, [ESI 4]; len = 3, P2ResiWebx

NEG EAX; LEN = 2, p01rweaxwf

MOV [EDI], EAX; Len = 2, P4REAX, P3redi

Add ESI, 8; LEN = 3, P01RWESIWF

NEG EBX; LEN = 2, p01rwebxwf

MOV [EDI 4], EBX; LEN = 3, P4REBX, P3redi

Add EDI, 8; LEN = 3, P01Rwediwf

Dec ECX; LEN = 1, p01rwecxwf

JNZ L2; LEN = 2, P1RF

L3:

The loop now executes in 4 clocks per iteration. This solution also solves the problem with instruction fetch blocks. The cost is that we need an extra register because we can not take advantage of register renaming.

Rolling Out by More Than 2

Loop unrolling is recommended when the loop overhead constitutes a high proportion of the total execution time. In example 2.3 the overhead is only 2 uops, so the gain by unrolling is little, but I will show you how to unroll it anyway, just for the Exercise.The 'Real Job' IS 4 uops and the overhead 2. unrolling by Two We get 2 * 4 2 = 10 uops. The Rounded Up to an Integer, That IS 4 Clock Cycles. THIS CALCULATION SHOWS That Nothing is get by unrolling this by two. Unrolling by fours We get:

EXAMPLE 2.6:

MOV ECX, [N]

SHL ECX, 2; Number of Bytes to Handle

MOV ESI, [A]

Mov EDI, [B]

Add ESI, ECX; Point to End of Array A

Add Edi, Ecx; Point To End of Array B

NEG ECX; -4 * n

Test ECX, 4; Test IF N IS ODD

JZ Short L1

MOV EAX, [ESI ECX]; N IS ODD. Do The Odd ONE

NEG EAX

MOV [EDI ECX], EAX

Add ECX, 4

L1: Test ECX, 8; TEST IF N / 2 IS ODD

JZ Short L2

MOV EAX, [ESI ECX]; N / 2 IS ODD. DO Two Extra

NEG EAX

MOV [EDI ECX], EAX

MOV EAX, [ESI ECX 4]

NEG EAX

MOV [EDI ECX 4], EAX

Add ECX, 8

L2: JECXZ SHORT L4

ALIGN 16

L3: MOV EAX, [ESI ECX]; LEN = 3, P2Resirecxweax

NEG EAX; LEN = 2, p01rweaxwf

MOV [EDI ECX], EAX; LEN = 3, P4REAX, P3redirecx

MOV EAX, [ESI ECX 4]; len = 4, p2resirecxweax

NEG EAX; LEN = 2, p01rweaxwf

MOV [EDI ECX 4], EAX; LEN = 4, P4REAX, P3redirecxmov Eax, [ESI ECX 8]; len = 4, P2Resirecxweax

MOV EBX, [ESI ECX 12]; len = 4, p2resirecxweax

NEG EAX; LEN = 2, p01rweaxwf

MOV [EDI ECX 8], EAX; LEN = 4, P4REAX, P3redirecx

NEG EBX; LEN = 2, p01rweaxwf

MOV [EDI ECX 12], EBX; LEN = 4, P4REAX, P3redirecx

Add ECX, 16; Len = 3, P01RWecxWF

JS L3; LEN = 2, P1RF

L4:

The iFetch Blocks Are Where WE Want Them. Decode Time IS 6 Clocks.

Register read stalls is a problem here because ECX has retired near the end of the loop and we need to read both ESI, EDI, and ECX. The instructions have been reordered in order to avoid reading ESI near the bottom so that we can avoid a Register Read Stall. in Other Words, The Reason for REORDERING INSTRUCTIONS AND USE An Extra Register (EBX) IS NOT The Same AS in The PreviOS Example.

There Are 12 uops and the loop Executes in 6 Clocks Per ity, or 1.5 Clocks Per Operation.

It may be tempting to unroll loops by a high factor in order to get the maximum speed. But since the loop overhead in most cases can be reduced to something like one clock cycle per iteration then unrolling the loop by 4 rather than by 2 would save only 1/4 clock cycle per operation which is hardly worth the effort. only if the loop overhead is high compared to the rest of the loop and N is very big should you think of unrolling by 4. unrolling by more than 4 does not make Sense.

The DrawBacks of Excessive Loop Unrolling Are:

You need to calculate N MODULO R, where R is the unrolling factor, and do N MODULO R operations before or after the main loop in order to make the remaining number of operations divisible by R. This takes a lot of extra code and poorly predictable branches. and the loop body of course also becomes bigger. A Piece of code usually takes much more time the first time it executes, and the penalty of first time execution is bigger the more code you have, especially if N is small. Excessive code size makes the utilization of the code cache less effective.Using an unrolling factor which is not a power of 2 makes the calculation of N MODULO R quite difficult, and is generally not recommended unless N is known to be divisible by R. Example 1.14 shows HOW to unroll by 3.

Handling Multiple 8 or 16 Bit Operands Simultaneously IN 32 Bit Registers (PPRO, PII AND PIII)

IT IS Sometimes Possible to Handle Four Bytes At A Time In The Same 32 Bit Register. The Following Example Adds 2 To All Elements of An Array Of Bytes:

EXAMPLE 2.7:

Mov ESI, [A]; Address of Byte Array

MOV ECX, [N]; Number of Elements in byte Array

Jecxz L2

ALIGN 16

DB 7 DUP (90H); 7 NOP's for controling alignment

L1: MOV Eax, [ESI]; Read Four Bytes

MOV EBX, EAX; Copy Into EBX

And Eax, 7f7f7f7fh; get Lower 7 Bits of Each Byte in EAX

XOR EBX, EAX; Get The Highest Bit of Each Byte

Add Eax, 020202H; Add Desired Value To All FourTes

XOR EBX, EAX; Combine Bits Again

MOV [ESI], EBX; Store Result

Add ESI, 4; Increment Pointer

SUB ECX, 4; Decrement Loop Counter

Ja L1; LOOP

L2:

Note that I have masked out the highest bit of each byte to avoid a possible carry from each byte into the next one when adding. I am using XOR rather than ADD when putting in the high bit again to avoid carry. The array should of course be aligned by 4.This loop should ideally take 4 clocks per iteration, but it takes somewhat more due to the dependency chain and difficult reordering. On PII and PIII you can do the same more effectively using MMX registers.

The next esample finds the length of a zero-terminated string by search for the first byte of zero. It is much faster Than Using repne scaSB:

EXAMPLE 2.8:

_STRLEN PROC NEAR

Push EBX

MOV EAX, [ESP 8]; Get Pointer to String

Lea Edx, [EAX 3]; Pointer 3 Used in the End

L1: MOV EBX, [EAX]; Read First 4 bytes

Add Eax, 4; Increment Pointer

LEA ECX, [EBX-01010101H]; Subtract 1 from Each Byte

Not ebx; invert all bytes

And ECX, EBX; and these TWO

And ECX, 80808080H; Test All Sign Bits

JZ L1; No ZERO BYTES, Continue Loop

MOV EBX, ECX

SHR EBX, 16

Test ECX, 00008080H; Test First Two Bytes

CMOVZ ECX, EBX; Shift Right IF IN THE FIRST 2 BYTES

Lea EBX, [EAX 2]

CMOVZ EAX, EBX

SHL CL, 1; Use Carry Flag to Avoid Branch

SBB EAX, EDX; Compute Length

POP EBX

RET

_STRLEN ENDP

This loop Takes 3 Clocks for Each Iteration Testing 4 Bytes. The string shouth of course be aligned by 4. The code allow, so The string sale sale note beast at the end of a segment.

Handling 4 bytes simultaneously can be quite difficult. This code uses a formula which generates a nonzero value for a byte if, and only if, the byte is zero. This makes it possible to test all four bytes in one operation. This algorithm involves the subtraction of 1 from all bytes (in the LEA ECX instruction). I have not masked out the highest bit of each byte before subtracting, as I did in example 2.7, so the subtraction may generate a borrow to the next byte, but only if it is zero, and this is exactly the situation where we do not care what the next byte is, because we are searching forwards for the first zero. If we were searching backwards then we would have to re-read the dword after detecting a zero, and then test all four bytes to find the last zero, or use BSWAP to reverse the order of the bytes. If you want to search for a byte value other than zero, then you may XOR all four bytes with the value you are Searching for, and then use the method Above to search for zero.loops w ITH MMX Instructions (PII and PIII)

Using mmx instructions we can compare 8 bytes in one Operation:

EXAMPLE 2.9:

_STRLEN PROC NEAR

Push EBX

MOV EAX, [ESP 8]

Lea Edx, [EAX 7]

PXOR MM0, MM0

L1: MOVQ MM1, [EAX]; len = 3 p2reaxwmm1

Add Eax, 8; Len = 3 P01REAX

PCMPEQB MM1, MM0; LEN = 3 p01RMM0RMM1

MOVD EBX, MM1; LEN = 3 P01RMM1WEBX

PSRLQ MM1, 32; LEN = 4 P1RMM1

MOVD ECX, MM1; LEN = 3 p01rmm1wecx

OR ECX, EBX; LEN = 2 p01Recxrebxwf

JZ L1; LEN = 2 P1RF

MOVD ECX, MM1

Test EBX, EBX

CMOVZ EBX, ECX

Lea ECX, [EAX 4]

CMOVZ EAX, ECX

MOV ECX, EBX

SHR ECX, 16

Test BX, BXCMOVZ EBX, ECX

Lea ECX, [EAX 2]

CMOVZ EAX, ECX

SHR BL, 1

SBB EAX, EDX

EMMS

POP EBX

RET

_STRLEN ENDP

This loop has 7 uops for port 0 and 1 which gives an average execution time of 3.5 clocks per iteration. The measured time is 3.8 clocks which shows that the ROB handles the situation reasonably well despite a dependency chain that is 6 uops long. Testing 8 Bytes in Less Than 4 Clocks Is Incredibly Much Faster Than Repne ScaSB.

Loops with floating point instructions (PPRO, PII AND PIII)

The Methods for Optimizing Floating Point Loops Are Basical The Same As for Integer Loops, But You Should Be More Aware of Dependency Chains Because of the long latencies of instruction execution.

Consider The C Language Code:

INT I, N; Double * X; Double * Y; Double DA;

For (i = 0; i

This Piece Of Code (Called Daxpy) HAS Been Studied Extensively Because It is the key to solid linear equations.

EXAMPLE 2.10:

DSIZE = 8; Data Size (4 or 8)

MOV ECX, [N]; Number of Elements

Mov ESI, [x]; Pointer TO

Mov EDI, [Y]; Pointer to Y

JECXZ L2; TEST for n = 0

FLD DSize PTR [Da]; Load Da Outside Loop

ALIGN 16

DB 2 DUP (90H); 2 NOP's for alignment

L1: FLD DSize PTR [ESI]; len = 3 p2Resiwst0

Add ESI, DSIZE; LEN = 3 p01resi

Fmul St, ST (1); len = 2 p0rst0rst1

Fsubr Dsize PTR [EDI]; len = 3 p2redi, p0rst0

FSTP DSIZE PTR [EDI]; LEN = 3 P4RST0, P3redi

Add Edi, DSize; Len = 3 p01redi

Dec ECX; LEN = 1 p01Recxwf

JNZ L1; LEN = 2 p1rffstp st; Discard DA

L2:

The dependency chain is 10 clock cycles long, but the loop takes only 4 clocks per iteration because it can begin a new operation before the previous one is finished. The purpose of the alignment is to prevent a 16-byte boundary in the last ifetch block .

EXAMPLE 2.11:

DSIZE = 8; Data Size (4 or 8)

MOV ECX, [N]; Number of Elements

Mov ESI, [x]; Pointer TO

Mov EDI, [Y]; Pointer to Y

Lea ESI, [ESI DSIZE * ECX]; Point to End of Array

Lea EDI, [EDI DSIZE * ECX]; Point to End of Array

NEG ECX; -N

JZ Short L2; Test for n = 0

FLD DSize PTR [Da]; Load Da Outside Loop

ALIGN 16

L1: FLD DSize PTR [ESI DSIZE * ECX]; LEN = 3 P2ResirecxWST0

Fmul St, ST (1); len = 2 p0rst0rst1

FSUBR DSIZE PTR [EDI DSIZE * ECX]; len = 3 p2redirecx, p0rst0

FSTP DSIZE PTR [EDI DSIZE * ECX]; LEN = 3 P4RST0, P3redirecx

Inc ECX; LEN = 1 p01Recxwf

JNZ L1; LEN = 2 P1RF

FSTP st; discard da

L2:

Here we have used the same trick as in example 2.3. Ideally, this loop should take 3 clocks, but measurements say approximately 3.5 clocks due to the long dependency chain. Unrolling the loop does not save much.

Loops with XMM Instructions (PIII)

The xmm instructions on the piii allow you to operationing point number in parallel. The Operands Must be aligned by 16.

The DAXPY algorithm is not very suited for XMM instructions because the precision is poor, it may not be possible to align the operands by 16, and you need some extra code if the number of operations is not a multiple of four. I am showing the Code Here Anyway, Just To Give An Example of a Loop with XMM Instructions: Example 2.12:

MOV ECX, [N]; Number of Elements

Mov ESI, [x]; Pointer TO

Mov EDI, [Y]; Pointer to Y

SHL ECX, 2

Add ESI, ECX; Point To End of X

Add Edi, Ecx; Point To End of Y

NEG ECX; -4 * n

Mov Eax, [Da]; Load Da Outside loop

XOR EAX, 80000000H; Change Sign of Da

Push EAX

Movss XMM1, [ESP]; -DA

Add ESP, 4

Shufps XMM1, XMM1, 0; Copy -da To All Four Positions

CMP ECX, -16

Jg L2

L1: MOVAPS XMM0, [ESI ECX]; LEN = 4 2 * p2resirecxwxmm0

Add ECX, 16; LEN = 3 p01rwecxwf

Mulps XMM0, XMM1; LEN = 3 2 * p0rxmm0rxmm1

CMP ECX, -16; LEN = 3 p01Recxwf

AddPS XMM0, [EDI ECX-16]; len = 5 2 * p2redirecx, 2 * p1rxmm0

MOVAPS [EDI ECX-16], XMM0; LEN = 5 2 * P4RXMM0, 2 * P3redirecx

JNG L1; LEN = 2 P1RF

L2: JECXZ L4; Check if finished

Movaps XMM0, [ESI ECX]; 1-3 Operations Missing, Do 4 More

Mulps XMM0, XMM1

AddPS XMM0, [EDI ECX]

CMP ECX, -8

JG L3

Movlps [EDI ECX], XMM0; Store TWO More Results

Add ECX, 8

MOVHLPS XMM0, XMM0

L3: JECXZ L4

Movss [EDI ECX], XMM0; Store ONE More Result

L4:

The L1 loop takes 5-6 clocks for 4 operations. The ECX instructions have been placed before and after the MULPS XMM0, XMM1 instruction in order to avoid a register read port stall generated by the reading of the two parts of the XMM1 register together with ESI OR EDI IN The Rat. The Extra Code After L2 Takes Care of The Sitation WHERE N IS NOT DIVISIBLE BY 4. NOTE THATION CODE MAY READ THE DEND OF A and B. This May Delay The Last Operation If The Extra Memory Positions Read Do Not Contain Normal Floating Point Numbers. if Possible, Put in Some Dummy Extra To Make The Number of Operations Divisible by 4 and Leave Out The Extra Code After L2.

26. Problematic Instructions

26.1 XCHG (All Processors)

The XCHG register, [memory] instruction is dangerous. By default this instruction has an implicit LOCK prefix which prevents it from using the cache. This instruction is therefore very time consuming, and should always be avoided.

26.2 Rotates THROUGH Carry (All Processors)

RCR AND RCL with a Count Different from One Are Slow and Should Be Avoided.

26.3 String Instructions (All Processors)

String Instructions WITHOUT A Repeat Prefix Are Too Slow and Should Be Replaced by Simpler Instructions. The Same Applies to loop on all PROCESSORS AND TO JECXZ ON PPLAIN AND PMMX.

REP MOVSD AND REP Stosd Are QUITE FAST IF. ALWAYS Use The Dword Version IF Possible, And Make Sure That Both Source And Destination Are Are Aligned by 8.

Some Other Methods Of Moving Data Are Faster under Certain Conditions. See Chapter 27.8 for Details.

Note that while the REP MOVS instruction writes a word to the destination, it reads the next word from the source in the same clock cycle. You can have a cache bank conflict if bit 2-4 are the same in these two addresses. In other words, you will get a penalty of one clock extra per iteration if ESI (wordsize) -EDI is divisible by 32. The easiest way to avoid cache bank conflicts is to use the DWORD version and align both source and destination by 8. Never use MOVSB or MOVSW in optimized code, not even in 16 bit mode.REP MOVS and REP STOS can perform very fast by moving an entire cache line at a time on PPro, PII and PIII This happens only when the following conditions are met.:

both source and destination must be aligned by 8 direction must be forward (direction flag cleared) the count (ECX) must be greater than or equal to 64 the difference between EDI and ESI must be numerically greater than or equal to 32 the memory type for Both Source and Destination Must Be Either Writeback Or Write-Combining (You CAN Normal Assume this).

Under these conditions the number of uops issued is approximately 215 2 * ECX for REP MOVSD and 185 1.5 * ECX for REP STOSD, giving a speed of approximately 5 bytes per clock cycle for both instructions, which is almost 3 times as fast as .

The Byte and Word Versions Also Benefit from this Fast Mode, But They Are Less Effective Than The Dword Versions.

Rep Stosd is Optimal Under The Same Conditions as Rep Movsd.

REP LOADS, REP SCAS, and REP CMPS are not optimal, and may be replaced by loops. See example 1.10, 2.8 and 2.9 for alternatives to REPNE SCASB. REP CMPS may suffer cache bank conflicts if bit 2-4 are the same in ESI And EDI.

26.4 Bit test (all processors) BT, BTC, BTR, and BTS instructions should preferably be replaced by instructions like TEST, AND, OR, XOR, or shifts on PPlain and PMMX. On PPro, PII and PIII, bit tests with a memory Operand Should Be Avoided.

26.5 Integer Multiplication (All Processors)

An integer multiplication takes approximately 9 clock cycles on PPlain and PMMX and 4 on PPro, PII and PIII. It is therefore often advantageous to replace a multiplication by a constant with a combination of other instructions such as SHL, ADD, SUB, and LEA. Example: Imul Eax, 10 Can Be Replaced with MOV EBX, EAX / Add Eax, EAX / SHL EBX, 3 / Add Eax, EBX or Lea Eax, [EAX 4 * EAX] / Add Eax, EAX

Floating point multiplication is faster than integer multiplication on PPlain and PMMX, but the time spent on converting integers to float and converting the product back again is usually more than the time saved by using floating point multiplication, except when the number of conversions is low compared . MMX Multiplication IS Fast, But is Only Available with 16-bit Operands.

26.6 Wait Instruction (All Processors)

You can offen increase spesed by omitting the Wait instruction. The Wait Instruction Has Three Functions:

a. The old 8087 processor requires a Wait Before Every Floating Point Instruction To make Sure The Coprocessor is Ready to Receive IT.

B. Wait is buy for coordinating memory accept Between the floating point unit and the integer unit. Examples:

B.1. fistp [MEM32]

Wait; Wait for FPU to Write Before ..

Mov Eax, [MEM32]; Reading The Result with the INTEGER Unit

B.2. Fild [MEM32]

Wait; wait for fpu to read value ..

Mov [MEM32], EAX; Before Overwriting It with integer unitb.3. FLD Qword PTR [ESP]

Wait; Prevent An Accidental Interrupt from ..

Add ESP, 8; Overwriting Value on Stack

c. Wait is Sometimes Used to Check for Exceptions. It Will Generate An Interrupt IF An Unmasked Exception Bit In The Floating Point Status Word Has Been Set by a Preceding Floating Point Instruction.

Regarding a: The function in point a is never needed on any other processors than the old 8087. Unless you want your code to be compatible with the 8087 you should tell your assembler not to put in these WAIT's by specifying a higher processor A 8087. FLOATING Point Emulator Also Inserts Wait Instructions. You Should Therefore Tell your assembler not to generate Emulation code unless you need it.

Regarding b: WAIT instructions to coordinate memory access are definitely needed on the 8087 and 80287 but not on the Pentiums It is not quite clear whether it is needed on the 80387 and 80486. I have made several tests on these Intel processors and not been. able to provoke any error by omitting the WAIT on any 32 bit Intel processor, although Intel manuals say that the WAIT is needed for this purpose except after FNSTSW and FNSTCW. Omitting WAIT instructions for coordinating memory access is not 100% safe, even when writing 32 bit code, because the code may be able to run on the very rare combination of a 80386 main processor with a 287 coprocessor, which requires the WAIT. Also, I have no information on non-Intel processors, and I have not tested all Possible Hardware and Software Combinations, So There May Be Other Situations Where The Wait Is Needed.

If you want to be certain that your code will work on any 32 bit processor (including non-Intel processors) then I would recommend that you include the WAIT here in order to be safe.Regarding c: The assembler automatically inserts a WAIT for this purpose before the following instructions: FCLEX, FINIT, FSAVE, FSTCW, FSTENV, FSTSW You can omit the WAIT by writing FNCLEX, etc. My tests show that the WAIT is unneccessary in most cases because these instructions without WAIT will still generate an interrupt. On Exceptions Except for Fnclex and Fninit on The 80387. (The Iret From The Interrupt Points To The next Instruction).

Almost all other floating point instructions will also generate an interrupt if a previous floating point instruction has set an unmasked exception bit, so the exception is likely to be detected sooner or later anyway. You may insert a WAIT after the last floating point instruction in your PROGRAM to BE SURE To Catch All Exceptions.

You may still need the WAIT if you want to know exactly where an exception occurred in order to be able to recover from the situation Consider, for example, the code under b.3 above:. If you want to be able to recover from an Exception generated by the Fld Here, THEN You NEED The Wait Because An Interrupt After After Add ESP, 8 Would Overwrite the value to load. fnop may be faster Than Wait and serve the be faster.

26.7 FCOM FSTSW AX (All Processors)

The FNSTSW instruction is very slow on all processors. The PPro, PII and PIII processors have FCOMI instructions to avoid the slow FNSTSW. Using FCOMI instead of the common sequence FCOM / FNSTSW AX / SAHF will save you 8 clock cycles. You should therefore use FCOMI to Avoid Fnstsw Where Portible, Even In Cases Where IT COSTS Some Extra Code.on Processors WITHOUT FCOMI INSTRUCTIONS, The Usual Way of Doint Floating Point Comparisons IS:

FLD [A]

FCOMP [B]

FSTSW AX

SAHF

JB asmallerthanb

You May Improve This Code by Using Fnstsw Ax Rather Than Fstsw Ax and Test Ah Directly Rather Than Using The Non-Pairable Sahf (Tasm Version 3.0 Has A Bug with The Fnstsw Ax Instruction):

FLD [A]

FCOMP [B]

Fnstsw AX

SHR AH, 1

JC Asmallerthanb

Testing for Zero or Equality:

Ftst

Fnstsw AX

And Ah, 40h

Jnz iszero; (The Zero flag is inverted!)

Test if Greater:

FLD [A]

FCOMP [B]

Fnstsw AX

And Ah, 41H

JZ AgreatertHANB

Do Not Use Test Ah, 41H AS IT Is Not Pairable On PPLAIN AND PMMX.

On the PPlain and PMMX, the FNSTSW instruction takes 2 clocks, but it is delayed for an additional 4 clocks after any floating point instruction because it is waiting for the status word to retire from the pipeline. This delay comes even after FNOP which can not change . the status word, but not after integer instructions You can fill the latency between FCOM and FNSTSW with integer instructions taking up to four clock cycles A paired FXCH immediately after FCOM does not delay the FNSTSW, not even if the pairing is imperfect.:

FCOM; Clock 1

FXCH; Clock 1-2 (Imperfect Pairing)

INC DWORD PTR [EBX]; Clock 3-5

Fnstsw Ax; Clock 6-7

You may want to use FCOM rather than FTST here because FTST is not pairable. Remember to include the N in FNSTSW. FSTSW (without N) has a WAIT prefix which delays it further.It is sometimes faster to use integer instructions for comparing floating point Values, As Described in Chapter 27.6.

26.8 FPREM (All Processors)

. The FPREM and FPREM1 instructions are slow on all processors You may replace it by the following algorithm: Multiply by the reciprocal divisor, get the fractional part by subtracting the truncated value, then multiply by the divisor (see chapter 27.5 on how to truncate. )

Some documents say that these instructions may give incomplete reductions and that it is therefore necessary to repeat the FPREM or FPREM1 instruction until the reduction is complete. I have tested this on several processors beginning with the old 8087 and I have found no situation where a repetition of the fprem or fprem1 was needed.

26.9 FRNDINT (All Processors)

THIS ITRUCTION IS SLOW ON All Processors. Replace it by:

Fistp Qword PTR [Temp]

Fild Qword PTR [Temp]

This code is faster despite a possible penalty for attempting to read from [TEMP] before the write is finished. It is recommended to put other instructions in between in order to avoid this penalty. See chapter 27.5 on how to truncate.

26.10 FScale and Exponential Function (All Processors)

FSCALE is slow on all processors. Computing integer powers of 2 can be done much faster by inserting the desired power in the exponent field of the floating point number. To calculate 2N, where N is a signed integer, select from the examples below the one That Fits your Range of N:

For | N | <27-1 you can use sales:

MOV EAX, [N]

SHL EAX, 23

Add Eax, 3F800000HMOV DWORD PTR [TEMP], EAX

FLD DWORD PTR [TEMP]

For | N | <210-1 you can use double precision:

MOV EAX, [N]

SHL EAX, 20

Add Eax, 3FF00000H

Mov DWORD PTR [TEMP], 0

Mov DWORD PTR [Temp 4], EAX

FLD Qword PTR [Temp]

For | N | <214-1 Use long double precision:

MOV EAX, [N]

Add Eax, 00003FFFH

Mov DWORD PTR [TEMP], 0

MOV DWORD PTR [Temp 4], 80000000H

Mov DWORD PTR [Temp 8], EAX

FLD TBYTE PTR [TEMP]

Fscale is Offenential Functions. The Following Code Shows An Exponential Function Without The Slow Frnd And Fscale Instructions:

Extern "C" long double _cdecl exp (double x);

_exp proc near

Public _exp

FLDL2E

FLD Qword PTR [ESP 4]; X

Fmul; z = x * log2 (e)

Fist DWORD PTR [ESP 4]; Round (z)

SUB ESP, 12

Mov DWORD PTR [ESP], 0

MOV DWORD PTR [ESP 4], 80000000H

FISUB DWORD PTR [ESP 16]; Z - ROUND (Z)

MOV EAX, [ESP 16]

Add Eax, 3FFFH

MOV [ESP 8], EAX

Jle Short underflow

CMP EAX, 8000H

JGE Short overflow

F2xm1

FLD1

Fadd; 2 ^ (Z-Round (z))

FLD TBYTE PTR [ESP]; 2 ^ (Round (z))

Add ESP, 12

Fmul; 2 ^ z = e ^ x

RET

Underflow:

FSTP ST

Fldz; Return 0

Add ESP, 12

RET

Overflow:

PUSH 07F800000H; Infinity

FSTP ST

FLD DWORD PTR [ESP]; RETURN Infinity

Add ESP, 16

RET

_exp ENDP

26.11 FPTAN (All Processors)

According to the manuals, FPTAN returns two values X and Y and leaves it to the programmer to divide Y with X to get the result, but in fact it always returns 1 in X so you can save the division. My tests show that on all 32 bit Intel processors with floating point unit or coprocessor, FPTAN always returns 1 in X regardless of the argument. If you want to be absolutely sure that your code will run correctly on all processors, then you may test if X is 1, which is Faster Than Dividing with x. The y value may be very high, but never infinity, so you don't have to testiff.26.12 fsqrt (piii) in the argument is valid.26.12 fsqrt (PIII)

A fast way of calculating an approximate squareroot on the PIII is to multiply the reciprocal squareroot of x by x: SQRT (x) = x * RSQRT (x) The instruction RSQRTSS or RSQRTPS gives the reciprocal squareroot with a precision of 12 bits You. Can IMPROVE The Precision To 23 Bits by Using The Newton-Raphson Formula Described in Intel's Application Note APED INTEL'S Application Note AP-803: X0 = RSQRTSS (A) x1 = 0.5 * x0 * (3 - (a * x0)) * x0) Where x0 is THE FIRST Approximation To The Reciprocal Squareroot of A, AND X1 IS A Better Approximation. The Order of Evaluation IS IMPORTANT. You MUST USE This Formula Before Multiplying With Formula Before. Squareroot.

26.13 MOV [MEM], Acc (PPLAIN and PMMX)

THE INSTRUCTIONS MOV [MEM], Al Mov [MEM], AX MOV [MEM], EAX Are Treated by The Pairing Circumulator. Thus the Following Instructions do not Pair:

MOV [MyData], EAX

MOV EBX, EAX

This problem occurs only with the short form of the MOV instruction which can not have a base or index register, and which can only have the accumulator as source. You can avoid the problem by using another register, by reordering your instructions, by using a Pointer, or by hard-coding the general form of the mo Write the general form of mal form of mov [MEM], EAX:

DB 89H, 05H

DD Offset DS: MEM

IN 16 bit mode you can write the general form of mov [MEM], AX:

DB 89H, 06H

DW Offset DS: MEM

To Use al instead of (e) ax, you replace 89h with 88h

This flaw has not been fixed in the pmmx.

26.14 Test INSTRUCTION (PPLAIN and PMMX)

The Test Instruction with An Immediate Operand Is Only Pairable If The Destination IS Al, Ax, or Eax.

Test Register, Register and Test Register, Memory Is Always Pairable.

EXAMPLES:

Test ECX, ECX; Pairable

TEST [MEM], EBX; Pairable

Test EDX, 256; Not Pairable

Test DWORD PTR [EBX], 8000H; Not Pairable

To make it pairable, Use any of the folowing methods:

Mov Eax, [EBX] / TEST EAX, 8000H

MOV EDX, [EBX] / and EDX, 8000H

MOV Al, [EBX 1] / TEST Al, 80H

MOV Al, [EBX 1] / Test Al, Al; (Result In Sign Flag)

(The reason for this non-pairability is probably that the first byte of the 2-byte instruction is the same as for some other non-pairable instructions, and the processor can not afford to check the second byte too when determining pairability.)

26.15 bit scan (PPLAIN and PMMX)

BSF and BSR Are The Poorest Optimized Instructions on the PPLAIN AND PMMX, Taking Approximately 11 2 * N Clock Cycles, Where n is the number of zeros skipped.

The Following Code Emulates BSR ECX, EAX:

Test Eax, EAX

JZ Short BS1

MOV DWORD PTR [Temp], Eaxmov DWORD PTR [Temp 4], 0

Fild Qword PTR [Temp]

FSTP QWORD PTR [TEMP]

Wait; Wait Only Needed for Compatibility with Old 80287 Processor

MOV ECX, DWORD PTR [TEMP 4]

SHR ECX, 20; ISOLATE EXPONENT

SUB ECX, 3ffh; Adjust

Test Eax, Eax; Clear Zero Flag

BS1:

The Following Code Emulates BSF ECX, EAX:

Test Eax, EAX

JZ Short BS2

XOR ECX, ECX

MOV DWORD PTR [Temp 4], ECX

SUB ECX, EAX

And Eax, ECX

Mov DWORD PTR [TEMP], EAX

Fild Qword PTR [Temp]

FSTP QWORD PTR [TEMP]

Wait; Wait Only Needed for Compatibility with Old 80287 Processor

MOV ECX, DWORD PTR [TEMP 4]

SHR ECX, 20

Sub ECX, 3FFH

Test Eax, Eax; Clear Zero Flag

BS2:

................... ..

26.16 FLDCW (PPRO, PII AND PIII)

The PPRO, PII AND PIII HAVE A Serious Stall After The Fldcw Instruction IF FLLOWED by Any Floating Point Instruction Which Reads The Control Word (Which Almost All Floating Point Instructions Do).

When C or C code is compiled it often generates a lot of FLDCW instructions because conversion of floating point numbers to integers is done with truncation while other floating point instructions use rounding. After translation to assembly, you can improve this code by using rounding instead of Truncation WHERE POSSIBLE, or by Moving The Fldcw Out of a loop Where truncation is needed inside the loop.

See Chapter 27.5 on How to Convert Floating Point Numbers to Integers Whitout Changing The Control Word.

27. Special Topics27.1 Lea Instruction (All Processors)

The LEA instruction is useful for many purposes because it can do a shift, two additions, and a move in just one instruction taking one clock cycle Example:. LEA EAX, [EBX 8 * ECX-1000] is much faster than MOV EAX , ECX / SHL EAX, 3 / Add Eax, EBX / SUB EAX, 1000 The Lea Instruction Can Also BE Used to do an address. The Source and Destination Need Not Have The Same Word Size, So Lea EAX , [Bx] is a Possible Replacement for Movzx Eax, BX, Although Suboptimal On Most Processors.

You Must Be Aware, HoWever, That The Lea Instruction Will Suffer Agi Stall on The PPLAIN AND PMMX IT Uses A Base or Index Register Which Has Been Written To in The Preceding Clock Cycle.

Since the LEA instruction is pairable in the v-pipe on PPlain and PMMX and shift instructions are not, you may use LEA as a substitute for a SHL by 1, 2, or 3 if you want the instruction to execute in the V-pipe .

The 32 bit processors have no documented addressing mode with a scaled index register and nothing else, so an instruction like LEA EAX, [EAX * 2] is actually coded as LEA EAX, [EAX * 2 00000000] with an immediate displacement of 4 bytes. you may reduce the instruction size by instead writing LEA EAX, [EAX EAX] or even better ADD EAX, EAX. The latter code can not have an AGI delay in PPlain and PMMX. If you happen to have a register which is zero (Like a loop counter a loop), the you may use it as a base register to reduce the code size:

Lea Eax, [EBX * 4]; 7 BYTES

Lea EAX, [ECX EBX * 4]; 3 bytes

27.2 DiVision (All Processors)

Division is quite time consuming. On PPro, PII and PIII an integer division takes 19, 23, or 39 clocks for byte, word, and dword divisors respectively. On PPlain and PMMX an unsigned integer division takes approximately the same, while a signed integer division takes somewhat more. It is therefore preferable to use the smallest operand size possible that will not generate an overflow, even if it costs an operand size prefix, and use unsigned division if possible.Integer division by a constant (all processors)

Integer Division By A Power of Two Can Be Done by Shifting Right. Dividing An Unsigned Integer by 2N:

SHR EAX, N

Dividing a signed integer by 2n:

CDQ

And EDX, (1 shl n) -1; or SHR EDX, 32-N

Add Eax, EDX

Sar Eax, N

The SHR Alternative Is Shorter Than The and if n> 7, but can on or Go To Execution Port 0 (U or V-PIPE).

Dividing by a constant can be done by multiplying with the reciprocal. To calculate the unsigned integer division q = x / d, you first calculate the reciprocal of the divisor, f = 2r / d, where r defines the position of the binary decimal point (Radix Point). The Multiply X with f and Shift Right R positions. The maximum value of R IS 32 B, WHERE B Is The Number of Binary Digits IN D Minus 1. (b Is The Highest Integer for Which 2B <= d). USE R = 32 B TO Cover The Maximum Range for The Value of The Dividend X.

This method needs some refinement in order to compensate for rounding errors The following algorithm will give you the correct result for unsigned integer division with truncation, i.e. the same result as the DIV instruction gives (Thanks to Terje Mathisen who invented this method).:

B = (The Number of Significant Bits IN D) - 1

R = 32 B

f = 2R / d

IF f is an integer life d is a power of 2: goto case A.

If f is not an integer, then check if the Fractional Part of F is <0.5

IF The Fractional Part of F <0.5: Goto Case B.

IF The Fractional Part of F> 0.5: Goto Case C.

Case A: (D = 2b)

Result = x SHR B

Case B: (Fractional Part of F <0.5)

Round f Down to nearest Integer

Result = ((x 1) * f) SHR R

Case C: (FRActional Part of F> 0.5)

Round f Up to nearest Integer

Result = (x * f) SHR R

Example: Assume That You Want to Divide By 5. 5 = 00000101b. B = (Number of Significant Binary Digits) - 1 = 2 r = 32 2 = 34 f = 234/5 = 3435973836.8 = 0cccccccc.ccc ... Hexadecimal) The Fractional Part Is Greater Than A Half: Use Case C. Round f Up to 0cccccccdh.

The Following Code Divides EAX by 5 and Returns The Result in Edx:

Mov edx, 0cccccccdh

Mul EDX

SHR EDX, 2

Afx Contains The Product Shifted Right 32 Places. Since R = 34 You Have to Shift. To Divide By 10 You Just Change The Last Line to SHR EDX, 3.

In Case B you Would Have:

INC EAX

Mov EDX, F

Mul EDX

SHR EDX, B

This Code Works for All Values of X Except 0ffffffh Which Gives Zero Because of overflow in the incswures = 0ffffffh is Possible, The Change The code TO:

Mov EDX, F

Add Eax, 1

JC DOVERFL

Mul EDX

Doverfl: SHR EDX, B

IF The Value of X Is Limited, Then You May USE A Lower Value of R, I.E. Fewer Digits. There Can Be Several Reasons To Use a Lower Value of R:

You May SET R = 32 to Avoid The SHR EDX, B in The end. You May SET R = 16 b AND Use a Multiplication Instruction That Gives A 32 Bit Result Rather Than 64 Bits. This will free the Edx Register: Imul EAX , 0CCCDH / SHR EAX, 18 YOU May Choose a Value of R That Gives Case C Rather Than Case B in Order To Avoid The Inc Eax Instructionthe Maximum Value for x in these case is at Least 2R-B, Sometimes Higher. You have to To Do a systematic testiff.

You May Want To Replace The Slow Multiplication Instruction with Faster Instructions as Explained in Chapter 26.5.

The following example divides EAX by 10 and returns the result in EAX. I have chosen r = 17 rather than 19 because it happens to give a code, which is easier to optimize, and covers the same range for x. F = 217/10 = 3333H, Case B: Q = (x 1) * 3333h:

Lea EBX, [EAX 2 * EAX 3]

Lea ECX, [EAX 2 * EAX 3]

SHL EBX, 4

MOV EAX, ECX

SHL ECX, 8

Add Eax, EBX

SHL EBX, 8

Add Eax, ECX

Add Eax, EBX

SHR EAX, 17

A Systematic Test Shows That this Code Works Correctly for ALL X <10004H.

Repeated Integer Division By The Same Value (All Processors)

If the divisor is not known at assembly time, but you are dividing repeatedly with the same divisor, then you may use the same method as above. The code has to distinguish between case A, B and C and calculate f before doing the divisions.

The code that follows shows how to do multiple divisions with the same divisor (unsigned division with truncation). First call SET_DIVISOR to specify the divisor and calculate the reciprocal, then call DIVIDE_FIXED for each value to divide by the same divisor..data

Reciprocal_divisor dd?; Rounded Reciprocal Divisor

CASE A: -1, Case B: 1, Case C: 0

Bshift dd?; Number of Bits in Divisor - 1

.code

Set_divisor proc Near; Divisor in EAX

Push EBX

MOV EBX, EAX

BSR ECX, EAX; B = Number of Bits in Divisor - 1

Mov EDX, 1

JZ Error; Error: Divisor is Zero

SHL EDX, CL; 2 ^ B

MOV [Bshift], ECX; Save B

CMP EAX, EDX

MOV EAX, 0

Je Short Case_a; Divisor Is A Power Of 2

DIV EBX; 2 ^ (32 b) / D

SHR EBX, 1; Divisor / 2

XOR ECX, ECX

CMP EDX, EBX; Compare RemainDer with Divisor / 2

SetBe Cl; 1 if Case B

MOV [Correction], ECX; Correction for Rounding Errors

XOR ECX, 1

Add Eax, Ecx; Add 1 IF Case C

MOV [Reciprocal_Divisor], Eax; Rounded Reciprocal Divisor

POP EBX

RET

Case_a: MOV [Correction], - 1; Remember That We Have Case A

POP EBX

RET

Set_divisor endp

Divide_fixed proc near; Dividend in Eax, Result in Eax

MOV EDX, [Correction]

MOV ECX, [Bshift]

Test EDX, EDX

JS Short Dshift; Divisor Is Power of 2

Add Eax, EDX; CORRECT for Rounding Error

JC Short Doverfl; Correct For Overflow

Mul [Reciprocal_Divisor]; Multiply with Reciprocal Divisor

MOV EAX, EDX

Dshift: SHR Eax, Cl; Adjust for Number of Bits

RET

Doverfl: MOV Eax, [Reciprocal_Divisor]; DIVIDEND = 0FFFFFFFH

SHR EAX, Cl; Do Division by Shifting

RET

Divide_fixed endp

This Code Gives The Same Result As The Div Instruction for 0 <= x <232, 0

If PowerS of 2 Occur So So Seldom That It Is Not Worth Optimizing for Theme

IF The processimizor.

FLOATING POINT DIVISION (All Processors)

. Floating point division takes 38 or 39 clock cycles for the highest precision You can save time by specifying a lower precision in the floating point control word (On PPlain and PMMX, only FDIV and FIDIV are faster at low precision; on PPro, PII and PIII, this Also Applies To Fsqrt. No Other Instructions Can Be Speeded Up this Way).

Parallel Division (PPLAIN and PMMX)

. On PPlain and PMMX, it is possible to do a floating point division and an integer division in parallel to save time On PPro, PII and PIII this is not possible, because integer division and floating point division use the same circuitry Example:. A = A1 / A2; B = B1 / B2

Fild [b1]

Fild [b2]

Mov Eax, [A1]

MOV EBX, [A2]

CDQ

Fdiv

Div EBX

Fistp [b]

MOV [A], EAX

(Make Sure You Set The Floating Point Control Word To The Desired Rounding Method)

Using reciprocal instruction for fast division (PIII) On PIII you can use the fast reciprocal instruction RCPSS or RCPPS on the divisor and then multiply with the dividend. However, the precision is only 12 bits. You can increase the precision to 23 bits by using The Newton-Raphson Method Described in Intel's Application Note AP-803: X0 = RCPSS (D) x1 = X0 * (2 - D * x0) = 2 * x0 - d * x0 * x0 Where x0 is the first approximation to the Reciprocal Of The Divisor, D, AND X1 IS A Better Approximation. You MUST USE This Formula Before Multiplying with The DivideD:

Movaps XMM1, [DiviSors]; Load Divisors

RCPS XMM0, XMM1; ApproxImate Reciprocal

Mulps XMM1, XMM0; Newton-Raphson Formula

Mulps XMM1, XMM0

AddPS XMM0, XMM0

SUBPS XMM0, XMM1

Mulps XMM0, [DIVIDENDS]; Results in XMM0

This makes four divisions in 18 clock cycles with a precision of 23 bits. Increasing the precision further by repeating the Newton-Raphson formula in the floating point registers is possible, but not very advantageous.

If you want to use this method for integer divisions then you have to check for rounding errors. The following code makes four divisions with truncation on packed word size integers in approximately 42 clock cycles. It gives exact results for 0 <= dividend <7FFFFH and 0

MOVQ MM1, [DiviSors]; Load Four Divisors

MOVQ MM2, [Dividends]; Load Four Dividends

Punpckhwd MM4, MM1; Unpack Divisors to DWORDS

PSRAD MM4, 16

PUNPCKLWD MM3, MM1

PSRAD MM3, 16

CVTPI2PS XMM1, MM4; Convert Divisors to Float, Upper Two Operands

Movlhps XMM1, XMM1

CVTPI2PS XMM1, MM3; Convert Lower Two Operands

Punpckhwd MM4, MM2; Unpack Dividends to DWORDSPSRAD MM4, 16

PUNPCKLWD MM3, MM2

PSRAD MM3, 16

CVTPI2PS XMM2, MM4; Convert DivideS to float, Upper Two Operands

Movlhps XMM2, XMM2

CVTPI2PS XMM2, MM3; Convert Lower TWO Operands

RCPS XMM0, XMM1; Approximate Reciprocal of Divisors

Mulps XMM1, XMM0; Improve Precision with Newton-Raphson Method

PCMPEQW MM4, MM4; Make Four Integer 1's in the meantime

PSRLW MM4, 15

Mulps XMM1, XMM0

AddPS XMM0, XMM0

SUBPS XMM0, XMM1; Reciprocal Divisors with 23 Bit Precision

Mulps XMM0, XMM2; Multiply with Divide

CVTTPS2PI MM0, XMM0; Truncate Lower TWO RESULTS

MOVHLPS XMM0, XMM0

CVTTPS2PI MM3, XMM0; Truncate Upper Two Results

PackssDW MM0, MM3; Pack The Four Results INTO MM0

MOVQ MM3, MM1; Multiply Results with DiviSors ...

Pmullw MM3, MM0; To Check for Rounding Errors

Paddsw MM0, MM4; Add 1 To Compensate for Later Subtraction

Paddsw MM3, MM1; Add Divisor. This SHOULD BE> DIVIDEND

PCMPGTW MM3, MM2; Check IF TOO SMALL

Paddsw mm0, mm3; Subtract 1 if not Too Small

MOVQ [quotients], mm0; save the four results

This Code Checks if the result is too small and makes the appropriate correary to check if the result is too big.

Avoiding Division (All Processors)

Obviously, you should always try to minimize the number of divisions. Floating point division with a constant or repeated division with the same value should of course be done by multiplying with the reciprocal. But there are many other situations where you can reduce the number of Divisions. for example: if (a / b> c) ... Can Be Rewritten As if (a> b * c) ... WHEN B Is posTIVE, AND THE OPPOSITE WHEN B IS NEGATIVE.A / B C / D Can BE REWRITTEN AS (A * D C * B) / (b * d)

IF you are selling integer division, the you sell be aware That The Rounding Errors May Be Different When You Rewrite the formulas.

27.3 Freeing floating point registers (All Processors)

You have to free all used floating point registers before except for any register buying for the result.

The fastest way of freeing one register is FSTP ST The fastest way of freeing two registers is FCOMPP on PPlain and PMMX;. On PPro, PII and PIII you may use either FCOMPP or two times FSTP ST, whichever fits best into the decoding sequence.

IT is not recommended to use ffree.

27.4 Transitions Between Floating Point and MMX Instructions (PMMX, PII and PIII)

You Must Issue An Emms Instruction After your Last MMX INSTRUCTION IF The is a Possibility That Floating Point Code Follows Later.

On PMMX there is a high penalty for switching between floating point and MMX instructions. The first floating point instruction after an EMMS takes approximately 58 clocks extra, and the first MMX instruction after a floating point instruction takes approximately 38 clocks extra.

ON PII AND PIII THERE IS No Such Penalty. The Delay After Emms Can Be Hidden by Putting In Integer Instructions Between Emms and The First Floating Point Instruction.

27.5 Converting from Floating Point To Integer (All Processors) All convers from floating point to integer, and vice versa, must go via a memory location:

Fistp DWORD PTR [TEMP]

Mov Eax, [TEMP]

On PPro, PII and PIII, this code is likely to have a penalty for attempting to read from [TEMP] before the write to [TEMP] is finished because the FIST instruction is slow (see chapter 17). It does not help to put in a WAIT (see chapter 26.6). It is recommended that you put in other instructions between the write to [TEMP] and the read from [TEMP] if possible in order to avoid this penalty. This applies to all the examples that follow .

The specifications for the C and C language requires that conversion from floating point numbers to integers use truncation rather than rounding. The method used by most C libraries is to change the floating point control word to indicate truncation before using an FISTP instruction and changing it back again afterwords. This method is very slow on all processors. On PPro, PII and PIII, the floating point control word can not be renamed, so all subsequent floating point instructions must wait for the FLDCW instruction to retire.

Whenever you have a conversion from floating point to integer in C or C , you should think of whether you can use rounding to nearest integer instead of truncation. If your standard library does not have a fast round function then make your own using the code Examples listed below.

................... ..

You may use various tricks for truncating without changing the control word, as illustrated in the examples below. These examples presume that the control word is set to default, i.e. rounding to nearest or even.Rounding to nearest or even

; extern "C" int Round (double x);

_Round Proc Near

Public _Round

FLD Qword PTR [ESP 4]

Fistp DWORD PTR [ESP 4]

MOV EAX, DWORD PTR [ESP 4]

RET

_Round Endp

Truncation Towards Zero

EXTERN "C" Int truncate (double x);

_Truncate Proc Near

Public _truncate

FLD Qword PTR [ESP 4]; X

SUB ESP, 12; Space for Local Variables

Fist dword PTR [ESP]; Rouded Value

FST DWORD PTR [ESP 4]; Float Value

FISUB DWORD PTR [ESP]; Subtract Rounded Value

FSTP DWORD PTR [ESP 8]; DIFCERENCE

POP Eax; Rounded Value

POP ECX; Float Value

POP EDX; DIFCERENCE (FLOAT)

Test ECX, ECX; Test Sign of X

JS Short Negative

Add EDX, 7FFFFFFFFH; Product Carry if Difference <-0

SBB EAX, 0; Subtract 1 IF X-ROUND (x) <-0

RET

NEGATIVE:

XOR ECX, ECX

Test EDX, EDX

Setg cl; 1 if Difference> 0

Add Eax, Ecx; Add 1 IF X-Round (x)> 0

RET

_TRUNCATE ENDP

Truncation Towards Minus Infinity

; extern "c" int ifloor (double x);

_IFLOOR Proc Near

Public _iFloor

FLD Qword PTR [ESP 4]; X

SUB ESP, 8; Space for Local Variables

Fist dword PTR [ESP]; Rouded Value

FISUB DWORD PTR [ESP]; Subtract Rounded Value

FSTP DWORD PTR [ESP 4]; DIFCERENCE

POP Eax; Rounded ValuePop Edx; Difference (Float)

Add EDX, 7FFFFFFFFH; Product Carry if Difference <-0

SBB EAX, 0; Subtract 1 IF X-ROUND (x) <-0

RET

_IFLOOR ENDP

THESE Procedures Work for -231

The PIII has instructions for truncation of single precision floating point numbers:. CVTTSS2SI and CVTTPS2PI These instructions are very useful if the single precision is satisfactory, but if you are converting a float with higher precision to single precision in order to use these truncation instructions then You Have The Problem That The Number May Be Rounded Up in The Conversion To Single Precision.

Alternative to fistp instruction (PPLAIN and PMMX)

Converting a floating point number to integer is normally done like this:

Fistp DWORD PTR [TEMP]

Mov Eax, [TEMP]

An Alternative Method IS:

.DATA

ALIGN 8

TEMP DQ?

Magic DD 59C00000H; F.P. Representation Of 2 ^ 51 2 ^ 52

.Code

Fadd [Magic]

FSTP QWORD PTR [TEMP]

MOV Eax, DWORD PTR [TEMP]

Adding the 'magic number' of 251 252 has the effect that any integer between -231 and 231 will be aligned in the lower 32 bits when storing as a double precision floating point number. The result is the same as you get with FISTP for all rounding methods except truncation towards zero. The result is different from FISTP if the control word specifies truncation or in case of overflow. You may need a WAIT instruction for compatibility with the old 80287 processor, see chapter 26.6.

This method is not faster than using FISTP, but it gives better scheduling opportunities on PPlain and PMMX because there is a 3 clock void between FADD and FSTP which may be filled with other instrucions. You may multiply or divide the number by a power of 2 in the same operation by doing the opposite to the magic number. You may also add a constant by adding it to the magic number, which then has to be double precision.27.6 Using integer instructions to do floating point operations (all processors)

Integer instructions are generally faster than floating point instructions, so it is often advantageous to use integer instructions for doing simple floating point operations The most obvious example is moving data Example:.. FLD QWORD PTR [ESI] / FSTP QWORD PTR [EDI] Change TO: MOV EAX, [ESI] / MOV EBX, [ESI 4] / MOV [EDI], EAX / MOV [EDI 4], EBX

Testing if a floating point value is zero:

The floating point value of zero is usually represented as 32 or 64 bits of zero, but there is a pitfall here: The sign bit may be set Minus zero is regarded as a valid floating point number, and the processor may actually generate a zero! . with the sign bit set if for example multiplying a negative number with zero So if you want to test if a floating point number is zero, you should not test the sign bit Example:. FLD DWORD PTR [EBX] / FTST / FNSTSW AX / aND AH, 40H / JNZ IsZero Use integer instructions instead, and shift out the sign bit: MOV EAX, [EBX] / ADD EAX, EAX / JZ IsZero If the floating point number is double precision (QWORD) then you only have to Test Bit 32-62. if The Lower Half Will Also BE ZERO IT IS A NORMAL FLOATING POINT NUMBER.

Testing if NEGATIVE:

A floating point number is negative if the sign is set and at Least One Other Bit IS Set. EXAMPLE: MOV EAX, [Numbertotest] / CMP Eax, 80000000H / Ja IsnegativeManipulating The Sign bit:

You can change the sign of a floating point number simply by flipping the sign bit. EXAMPLE: XOR BYTE PTR [A] (Type A) - 1, 80H

LIKEWISE You Get The Absolute Value of a Floating Point Number by Simply Anding Out The Sign Bit.

Comparing number:

Floating point numbers are stored in a unique format which allows you to use integer instructions for comparing floating point numbers, except for the sign bit. If you are certain that two floating point numbers both are normal and positive then you may simply compare them as integers EX / AND AH, 1 / JNZ S: MOV EAX, [A] / MOV EBX, [B] / CMP EAX, EBX / JB Asmallerthanb this Method ONLY Works if The Two Numbers Have The Same Precision and You Are Certain That None of The Numbers Have The Sign Bit Set.

If Negative Numbers Are Possible, THEN You Have To Convert The Negative NumBers To 2-Complement, And Do A Signed Compare:

Mov Eax, [A]

MOV EBX, [B]

MOV ECX, EAX

MOV EDX, EBX

SAR ECX, 31; COPY SIGN BIT

And Eax, 7FFFFFFFH; Remove Sign Bit

Sar EDX, 31

And EBX, 7FFFFFFH

XOR EAX, ECX; Make 2-Complement IF Sign Bit WAS SET

XOR EBX, EDX

Sub Eax, ECX

SUB EBX, EDX

CMP EAX, EBX

Jl asmallerthanb; Signed Comparison

This Method Works for All Normal floating point numbers, incru DING -0.

27.7 Using floating point instructions to do integer Operations (PPLAIN and PMMX)

Integer Multiplication (PPLAIN and PMMX)

Floating point multiplication is faster than integer multiplication on the PPlain and PMMX, but the price for converting integer factors to float and converting the result back to integer is high, so floating point multiplication is only advantageous if the number of conversions needed is low compared with the number of multiplications. (It may be tempting to use denormal floating point operands to save some of the conversions here, but the handling of denormals is very slow, so this is not a good idea!) On the PMMX, MMX multiplication instructions are faster than integer multiplication, and can be pipelined to a throughput of one multiplication per clock cycle, so this may be the best solution for doing fast multiplication on the PMMX, if you can live with 16 bit precision.

Integer Multiplication IS Faster Than Floating Point On PPRO, PII AND PIII.

Integer Division (PPLAIN and PMMX)

Floating point division is not faster than integer division, but you can do other integer operations (including integer division, but not integer multiplication) while the floating point unit is working on the division (See example above).

Converting Binary To Decimal NumBers (All Processors)

27.8 Moving Blocks of Data (All Processors)

..................... ..

ON PPLAIN AND PMMX it is faster to move 8 bytes at a time Using floating point registers if the destination is not in the cache:

Top: Fild Qword Ptr [ESI]

Fild Qword PTR [ESI 8]

FXCH

Fistp Qword PTR [EDI] Fistp Qword PTR [EDI 8]

Add ESI, 16

Add EDI, 16

Dec ECX

JNZ TOP

The source and destination should of course be aligned by 8. The extra time used by the slow FILD and FISTP instructions is compensated for by the fact that you only have to do half as many write operations. Note that this method is only advantageous on the PPlain and PMMX and only if the destination is not in the level 1 cache. You can not use FLD and FSTP (without I) on arbitrary bit patterns because denormal numbers are handled slowly and certain bit patterns are not preserved unchanged.

On the PMMX Processor it is faster to use mmx instructions to move extreme bytes at a time if the destination is not in the cache:

Top: MOVQ MM0, [ESI]

MOVQ [EDI], MM0

Add ESI, 8

Add EDI, 8

Dec ECX

JNZ TOP

There is no need to unroll this loop or Optimize It Further if Cache Misses Are Expected, Because Memory Access Is The Bottleneck Here, NOT INSTRUCTION EXECUTION.

ON PPRO, PII AND PIII Processors The Rep Movsd Instruction IS Particularly Fast When The Following Conditions Are Met (See Chapter 26.3):

On the PII it is faster to use MMX registers if the above conditions are not met and the destination is likely to be in the level 1 cache. The loop may be rolled out by two, and the source and destination should of course be aligned by 8.

On the Piii The Fastest Way of Moving Data Is To Use The Movaps Instruction IF The Above Condition Is in The Level 1 Or Level 2 Cache: Sub EDI, ESI

Top: MOVAPS XMM0, [ESI]

MOVAPS [ESI EDI], XMM0

Add ESI, 16

Dec ECX

JNZ TOP

Unlike Fld, Movaps Can Handle Any Bit Pattern WITHOUT Problems. Remember That Source And Destination Must Be Aligned by 16.

If the number of bytes to move is not divisible by 16 then you may round up to the nearest number divisible by 16 and put some extra space at the end of the destination buffer to receive the superfluous bytes. If this is not possible then you have To move the remaining bytes by Other Methods.

On the PIII you also have the option of writing directly to RAM memory without involving the cache by using the MOVNTQ or MOVNTPS instruction. This can be useful if you do not want the destination to go into a cache. MOVNTPS is only slightly faster than Movntq.

27.9 Self-Modifying Code (All Processors)

The penalty for executing a piece of code immediately after modifying it is approximately 19 clocks for PPlain, 31 for PMMX, and 150-300 for PPro, PII and PIII. The 80486 and earlier processors require a jump between the modifying and the modified code in Order to flush the cache.

To get permission to modify code in a protected operating system you need to call special system functions: In 16-bit Windows call ChangeSelector, in 32-bit Windows call VirtualProtect and FlushInstructionCache (or put the code in a data segment).

Self-modifying code is not considered Good Programming Practice, But it. be justified means.

27.10 Detecting Processor Type (All Processors)

I think it is fairly obvious by now that what is optimal for one microprocessor may not be optimal for another. You may make the most critical parts of you program in different versions, each optimized for a specific microprocessor and selecting the desired version at run time after detecting which microprocessor the program is running on. If you are using instructions that are not supported by all microprocessors (ie conditional moves, FCOMI, MMX and XMM instructions) then you must first check if the program is running on a microprocessor that supports these Instructions. The Subroutine Below Checks The Type of Microprocessor and the Features Support .; Define CPUID INSTRUCTION IF NOT KNOWN BY Assembler:

CPUID Macro

DB 0FH, 0A2H

ENDM

; C Prototype:

(EXTERN "C" long int detectProcessor (Void);

Return Value:

(BITS 8-11 = Family (5 for PPLAIN AND PMMX, 6 for PPRO, PII AND PIII)

; bit 0 = floating point instructions supported. INSTRUCTIONS Supported

; Bit 15 = Conditional Move and FCOMI INSTRUCTIONS Supported

; bit 23 = mmx instructions supported

; Bit 25 = XMM Instructions Supported

_DetectProcessor Proc Near

Public_DetectProcessor

Push EBX

PUSH ESI

Push EDI

Push EBP

Detect IF CPUID Instruction Supported by Microprocessor:

Pushfd

POP EAX

MOV EBX, EAX

XOR Eax, 1 shl 21; Check IF CPUID BIT CAN TOGGLE

Push EAX

POPFD

Pushfd

POP EAX

XOR EAX, EBX

And Eax, 1 shl 21

JZ Short Dpend; CPUID Instruction Not Supported

XOR EAX, EAX

CPUID; Get Number of CPUID Functions

Test Eax, EAX

JZ Short Dpend; CPUID Function 1 NOT Supported

MOV Eax, 1

CPUID; Get Family and Featuresand EAX, 000000F00H; Family

And EDX, 0FFFFF0FFH; Features Flags

OR EAX, EDX; Combine Bits

Dpend: POP EBP

POP EDI

POP ESI

POP EBX

RET

_DetectProcessor ENDP

Note that some operating systems do not allow XMM instructions Information on how to check for operating system support of XMM instructions can be found in Intel's application note AP-900:. "Identifying support for Streaming SIMD Extensions in the Processor and Operating System" More. Information ON Microprocessor Identification Can Be Found In Intel's Application Note AP-485: "Intel Processor Identification and The CPUID Instruction".

To Code The Conditional Move, MMX, XMM instructions etc. On an asseber That Doesn't Have these Instructions Use the macros at www.agner.org/assem/macros.zip

28. List of instruction Timings for PPLAIN AND PMMX

28.1 Integer Instructions

Explanations: Operands: r = register, m = memory, i = immediate data, sr = segment register m32 = 32 bit memory operand, etc.

CLOCK CYCLES: The Numbers Are Minimum Values. Cache Misses, Misalignment, And Exceptions May Increase The Clock Count ConsideLy.

Pairability: u = pairable in u-pipe, v = pairable in v-pipe, uv = pairable in Either Pipe, np = not pairable.

Instruction Operands Clock cycles Pairability NOP 1uvMOVr / m, r / m / i1uvMOVr / m, sr1npMOVsr, r / m> = 2 b) npMOVm, accum1uv h) XCHG (E) AX, r2npXCHGr, r3npXCHGr, m> 15npXLAT 4npPUSHr / i1uvPOPr1uvPUSHm2npPOPm3npPUSHsr1 b NPPOPSR> = 3 b) NPPUSHF 3-5NPPOPF 4-6nppusha POPA 5-9 I) NPPUSHAD POPAD 5nplahf Sahf 2NPMovsx Movzxr, R / M3 a) NPLEAR, M1UVLDS LES LGS LGS LSSM4 C) NpAdd Sub and OR, R / I1UVAdd SUB AND OR XORr, m2uvADD SUB AND OR XORm, r / i3uvADC SBBr, r / i1uADC SBBr, m2uADC SBBm, r / i3uCMPr, r / i1uvCMPm, r / i2uvTESTr, r1uvTESTm, r2uvTESTr, i1f) TESTm, i2npINC DECr1uvINC DECm3uvNEG NOTr / m1 / 3npMUL IMULr8 / r16 / m8 / m1611npMUL IMULall other versions9 d) npDIVr8 / m817npDIVr16 / m1625npDIVr32 / m3241npIDIVr8 / m822npIDIVr16 / m1630npIDIVr32 / m3246npCBW CWDE 3npCWD CDQ 2npSHR SHL SAR SALr, i1uSHR SHL SAR SALm, i3uSHR SHL SAR SALr / m, CL4 / 5npROR ROL RCR RCLR / M, 11/3 uror Rolr / M, I (> <1) 1 / 3n Pror Rolr / m, CL4 / 5nPRCR RCLR / M, I (> <1) 8 / 10nPrCr RCLR / M, CL7 / 9NPSHLD SHRDR , I / CL4 a) NpsHLD SHRDM, I / CL5 A) NPBTR, R / I4 A) N PBTM, I4 A) NPBTM, I9 a) NPBTR BTS BTCR, R / I7 A) NPBTR BTS BTCM, I8 A) NPBTR BTS BTCM, R14 A) NPBSF BSRR, R / M7-73 a) Npsetccr / m1 / 2 a) Npjmp Callshort / NEAR1 E) VJMP Callfar> = 3 E) NPConditional Jumpshort / NPCAL1 / 4/5/6 E) VCALL JMPR / M2 / 5 Enpretn 2/5 Enpretni3 / 6 E) NPRETF 4/7 E) NPRETFI5 / 8 E ) Npj (e) cxzshrt4-11 e) NPLOOPSHORT5-10 E) NPBOOPSHORT5-10 E) NPBoundr, M8NPCLC STC CMC CLDS 2NPCLI STI 6-9NPLODS 2NPREP LODS 7 3 * NG) NPSTOS 3NPREP STOS 10 NG) NPMOVS 4NPREP MOVS 12 NG) NPSCAS 4NPREP (N) E SCAS 9 4 * NG) NPCMPS 5NPREP (N) E CMPS 8 4 * NG) NPBSWAP 1 a) NPCPUID 13-16 A) NPRDTSC 6-13 A) J) NP

Notes: a) this instruction has a 0FH prefix which takes one clock cycle extra to decode on a PPlain unless preceded by a multicycle instruction (see chapter 12) b) versions with FS and GS have a 0FH prefix see note a c... Versions WITH SS, FS, AND GS HAVE A 0FH Prefix. See Note A. D) Versions WITH TWO OPERANDS AND NO IMMEDITE HAVE A 0FH Prefix, See Note A. E) SEE Chapter 22 f) Only Pairable If Register IS Accumulator. see chapter 26.14. g) add one clock cycle for decoding the repeat prefix unless preceded by a multicycle instruction (such as CLD. see chapter 12). h) pairs as if it were writing to the accumulator. see chapter 26.14. i) 9 If SP Divisible by 4. See 10.2 J) ON PPLAIN: 6 in Priviled or Real Mode, 11 in Nonpriviled, Error In Virtual Mode. on PMMX: 8 and 13 Clocks Respectively.28.2 Floating Point Instructions

EXPLANATIONS: OPERANDS: R = Register, M = Memory, M32 = 32 Bit Memory Operand, ETC.

CLOCK CYCLES: The Numbers Are Minimum Values. Cache Misses, Misalignment, DenorMal Operands, And Exceptions May Increase The Clock Count consireRably.

Pairability: = Pairable with fxch, np = not pairable with fxch.

I-OV: OVERLAP with INTEGER INSTRUCTIONS. I-OV = 4 Means That The Last Four Clock Cycles CAN overlap with subsequent Integer Instructions.

FP-OV: Overlap with floating point instructions. fp-ov = 2 Means That The Last Two Clock Cycles CAN overlap with subsequent floating point instructioneds (Wait IS Considered A Floating Point Instruction "

Instruction Operand Clock cycles Pairability i-ov fp-ov FLDr / m32 / m641 00FLDm803np00FBLDm8048-58np00FST (P) r1np00FST (P) m32 / m642 m) np00FST (P) m803 m) np00FBSTPm80148-154np00FILDm3np22FIST (P) m6np00FLDZ FLD1 2np00FLDPI FLDL2E etc . 5 s) NP22FNSTSWAX / M166 Q) NP00FLDCWM168NP00FNSTCWM162NP00FADD (P) R / M3 22FSUB (R) (P) R / M3 22Fmul (P) R / M3 22 N) FDIV (R) (P) R / M19 / 33/39 p) 38 o) 2FCHS Fabs 1 00Fcom (p) (p) fucomr / m1 00fiadd Fisub (R) M6NP22FIMULM6NP22FIDIV (R) M22 / 36/42 p) NP38 o) 2FICOMM4NP00ftst 1NP00FXAM 17-21NP40FPREM 16- 64np22FPREM1 20-70np22FRNDINT 9-20np00FSCALE 20-32np50FXTRACT 12-66np00FSQRT 70np69 o) 2FSIN FCOS 65-100 r) np22FSINCOS 89-112 r) np22F2XM1 53-59 r) np22FYL2X 103 r) np22FYL2XP1 105 r) np22FPTAN 120-147 r) np36 o) 0FPATAN 112-134 r) np22FNOP 1np00FXCHr1np00FINCSTP FDECSTP 2np00FFREEr2np00FNCLEX 6-9np00FNINIT 12-22np00FNSAVEm124-300np00FRSTORm70-95np00WAIT 1np00Notes: m) The value to store is needed one clock cycle in advance n) 1 if the overlapping instruction is also an FMUL.. o) Cannot Overlap Integ er multiplication instructions. p) FDIV takes 19, 33, or 39 clock cycles for 24, 53, and 64 bit precision respectively. FIDIV takes 3 clocks more. The precision is defined by bit 8-9 of the floating point control word. q ) The first 4 clock cycles can overlap with preceding integer instructions. See chapter 26.7. r) clock counts are typical. Trivial cases may be faster, extreme cases may be slower. s) may be up to 3 clocks more when output needed for FST , FCHS, or Fabs.

28.3 mmx instructions (pmmx)

A list of MMX instruction timings is not needed because they all take one clock cycle, except the MMX multiply instructions which take 3. MMX multiply instructions can be overlapped and pipelined to yield a throughput of one multiplication per clock cycle.The EMMS instruction takes only one clock cycle, but the first floating point instruction after an EMMS takes approximately 58 clocks extra, and the first MMX instruction after a floating point instruction takes approximately 38 clocks extra. There is no penalty for an MMX instruction after EMMS on the PMMX (but A Possible Small Penalty on the PII and PIII).

There is no penalty for using a memory operand in an MMX instruction because the MMX arithmetic unit is one step later in the pipeline than the load unit. But the penalty comes when you store data from an MMX register to memory or to a 32 bit register : The Data Have To Be Ready One Clock Cycle In Advance. This is analogous to the floating point store instructions.

All MMX INSTRUCTIONS Except Emms Are Pairable In Either Pipe. Pairing Rules for MMX Instructions Are Described in Chapter 10.

29. List of instruction Timings and Micro-Op Breakdown for PPRO, PII and PIII

Explanations: Operands: r = register, m = memory, i = immediate data, sr = segment register, m32 = 32 bit memory operand, etc.

Micro-Ops: The Number of Micro-Ops That The INSTRUCTION GENERATES for Each Execution Port. P0: Port 0: Alu, ETC. P1: Port 1: Alu, Jumps P01: Instructions That Can Go To Either Port 0 OR 1, Whichever IS Vacant First. P2: Port 2: Load Data, ETC. P3: Port 3: Address Generation For Store P4: Port 4: Store Data

Delay: This is the delay that the instruction generates in a dependency chain (This is not the same as the time spent in the execution unit Values may be inaccurate in situations where they can not be measured exactly, especially with memory operands.) The.. numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs and infinity increase the delays by 50-150 clocks, except in XMM move, shuffle and boolean instructions Floating point overflow, underflow, denormal or NAN results give a similar delay.Throughput:.. The maximum throughput for several instructions of the same kind for example, a throughput of 1/2 for FMUL means that a new FMUL instruction can Start Executing Every 2 Clock Cycles.

29.1 Integer instructionsInstructionOperandsmicro-opsdelaythroughput p0p1p01p2p3p4 NOP 1 MOVr, r / i 1 MOVr, m 1 MOVm, r / i 11 MOVr, sr 1 MOVm, sr 1 11 MOVsr, r8 5 MOVsr, m71 8 MOVSX MOVZXr, r 1 MOVSX MOVZXr, m 1 cmovcr, R1 1 Cmovccr, M1 11 XCHGR, R 3 XCHGR, M 4111High B) xLAT 11 Pushr / I 1 11 POPR 11 POP (e) SP 21 PUSHM 1111 POPM 5111 Pushsr 2 11 POPSR 81 PUSHF (D) 3 11 11 POPF (d) 10 61 Pusha (d) 2 88 POPA (D) 28 lahf sahf 1 Lear, M1 1 C) LDS LES LFS LGS LSSM 83 Add Sub and OR, R / I 1 Add Sub and ORR, M 11 Add Sub and ORM, R / I 1111 ADC SBBR, R / I 2 ADC SBBR, M 21 ADC SBBM, R / I 3111 CMP TESTR, R / I 1 CMP TESTM, R / I 11 INC DEC NEG NOTR 1 Inc Dec Dec NEG NOTM 1111 AAS DAA DAS 1 AAD 1 2 4 AAM 112 15 MUL Imulr, (R), (i) 1 41 / 1mul Imul (r), M1 1 41 / 1DIV IDIVR163 1 231 / 21DIV IDIVR323 1 391 / 37DIV IDIVM82 11 191 / 12DIV IDIVM162 11 231 / 21D IV IDIVM322 11 391 / 37CBW CWDE 1 CWD CDQ 1 Shr SHL Sar Rolr, I / CL1 Shr SHL Sar Rolm, I / CL1 111 RCR RCLR, 11 1 RCR RCLR8, I / CL4 4 RCR RCLR16 / 32, I / CL3 3 RCR RCLM, 11 2111 RCR RCLM8, I / CL4 3111 RCR RCLM16 / 32, I / CL4 2111 SHLD SHRDR, R, I / CL2 SHRDM, R, I / CL2 1111 BTR, R / I 1 BTM, R / I1 61 BTR BTS BTCR, R / I 1 BTR BTS BTCM, R / I1 6111 BSF BSRR, R 11 BSF BSRR,

M 111 setccr 1 setccm 1 11 JMPSHORT / NEAR 1 1 1 / 2JMPM (Near) 1 1 1 / 2JMPM (FAR) 212 Conditional Jumpshort / Near 1 1 / 2Callnear 11 11 1 / 2Callfar28122 Callr 12 11 1 / 2Callm (Near) 14111 1 / 2CallM (FAR) 28222 Retn 121 1/2 Retni 131 1/2 RETF 233 RETFI233 J (E) CXZSHORT 11 Loopshort218 Loop (N) Eshort218 Enteri, 0 12 11 ENTERA, BCA. 18 4B B-12B Leave 21 Boundr, M7 62 CLC STC CMC 1 CLD 4 CLI 9 STI 17 INTO 5 LODS 2 Rep Stos CA. 5N a) MOVS 1311 Rep MOVS CA. 6N A) SCAS 12 Rep (N) E SCAS 12 7N CMPS 42 rep (N) E CMPS 12 9N BSWAP 1 1 CPUID 23-48 RDTSC 31 in 18> 300 OUT 18> 300 Prefetchnta d) M 1 prefetcht1 d) m 1 prefetcht2 d) M 1 sfnce d) 1 1 1 / 6Notes: a) Faster Under Certain Conditions: See Chapter 26.3. b) See Chapter 26.1 C) 3 IF constant without base or index register d) PIII ONLY.

29.2 Floating point instructionsInstructionOperandsmicro-opsdelaythroughput p0p1p01p2p3p4 FLDr1 FLDm32 / 64 1 1 FLDm802 2 FBLDm8038 2 FST (P) r1 FST (P) m32 / m64 111 FSTPm802 22 FBSTPm80165 22 FXCHr 03/1 f) FILDm3 1 5 FIST (P) m2 115 FLDZ 1 FLD1 FLDPI FLDL2E ETC.2 FCMOVCCR2 2 FNSTSWAX3 7 FNSTSWM161 11 FLDCWM161 11 10 FNSTCWM161 11 FADD (P) FSUB (R) (P) R1 31 / 1FADD (P) FSUB (R) (P) M1 1 3-41 / 1Fmul (P) R1 51/2 g) Fmul (p) M1 1 5-61 / 2 g) FDIV (R) (P) R1 38 H) 1 / 37FDIV (R) (P) M1 1 38 h) 1 / 37FABS 1 FCHS 3 2 FCOM (P) fucomr1 1 FCOM (P) fucomm1 1 1 FCOMPFMPP 1 1 1 FCOMI (P) fucomi (P) R1 1 FCOMI (P) fucomi (P) M1 1 1 FIADD FISUB (R) M6 1 FIMULM6 1 FIDIV (R) M6 1 FICOM (P) M6 1 ftst 1 1 fxam 1 2 fprem 23 fprem1 33 frndint 30 fscale 56 Fxtract 15 FSQRT 1 69E, i) FSIN FCOS 17-97 27-103E) fsincos 18-110 29 -130E) F2XM1 17-48 66E) FYL2X 36-54 103E) FYL2XP1 31-53 98-1 07E) FPTAN 21-102 13-143E) Fpatan 25-86 44-143e) Fnop 1 FincStp fdecstp 1 ffreer1 ffreepr2 fnclex 3 Fninit 13 Fnsave 141 FRSTOR 72 WAIT 2

Notes: e) not pipelined f) FXCH generates 1 micro-op that is resolved by register renaming without going to any port g) FMUL uses the same circuitry as integer multiplication Therefore, the combined throughput of mixed floating point and integer multiplications is.. 1 FMUL 1 IMUL per 3 clock cycles h) FDIV delay depends on precision specified in control word:. precision 64 bits gives delay 38, precision 53 bits gives delay 32, precision 24 bits gives delay 18. Division by a power of 2 takes 9 clocks. Throughput is 1 / (delay-1). i) faster for lower precision.29.3 MMX instructions (PII and PIII) InstructionOperandsmicro-opsdelaythroughput p0p1p01p2p3p4 MOVD MOVQr, r 1 2 / 1MOVD MOVQr64, m32 / 64 1 1 / 1MOVD MOVQm32 / 64, R64 1 1 1 / 1PADD PSUB PCMPR64, R64

1 1 / 1Padd PSUB PCMPR64, M64 11 1 / 1pmul PMADDR64, R641 31 / 1Pmul Pmaddr64, M641 1 31 / 1Pand Pandn PXRR64, R64 1 2 / 1Pand Pandn Porpxorr64, M64 11 1 / 1PSRA PSRL PSLLR64, R64 / I 1 1 / 1PSRA PSRL PSLLR64, M64 1 1 1 1 / 1Pack PUNPCKR64, R64 1 1 1Pack PUNPCKR64, M64 1 1 1 / 1EMMS 11 6 K) MASKMOVQ D) R64, R64 1 1 12-81 / 30-1 / 2pmovMskb D) R32 , R64 1 1 1 / 1Movntq D) M64, R64 1 1 1/30-1 / 1PSHUFW D) R64, R64, I 1 1 1 1 1 PSHUFW D) R64, M64, I 1 1 2 1 / 1PEXTRW D) R32, R64, I 1 1 2 1 / 1Pisrw D) R64, R32, I 1 1 1 1 Pisrw D) R64, M16, I 1 1 2 1 / 1PAVGB PAVGW D) R64, R64 1 2 / 1PAVGB PAVGW D) R64, M64 1 2 1 / 1PminUB PMAXUB PMINSW PMAXSW D) R64, R64 1 1 2 / 1Pminub PMaxub Pminsw Pmaxsw D) R64, M64 1 1 2 1 / 1Pmulhuw D) R64, R64 1 3 1 / 1Pmulhuw D) R64, M64 1 1 4 1 / 1PSADBW D) R64, R64 2 1 5 1 / 2PSADBW D) R64, M64 2 1 1 6 1/2

Notes:. D) PIII only k) you may hide the delay by inserting other instructions between EMMS and any subsequent floating point instruction.29.4 XMM instructions (PIII) InstructionOperandsmicro-opsdelaythroughput p0 p1 p01 p2 p3 p4 MOVAPSr128, r128 2 11 / 1MOVAPSr128, m128 2 21 / 2MOVAPSm128, r128 2231 / 2MOVUPSr128, m128 4 21 / 4MOVUPSm128, r128 1 4431 / 4MOVSSr128, r128 1 11 / 1MOVSSr128, m32 11 11 / 1MOVSSm32, r128 1111 / 1MOVHPS MOVLPSr128, m64 1 11 / 1MOVHPS MOVLPSm64, r128 1111 / 1MOVLHPS MOVHLPSr128, r128 1 11 / 1MOVMSKPSr32, r1281 11 / 1MOVNTPSm128, r128 2 2 1 / 15-1 / 2CVTPI2PSr128, r64 2 31 / 1CVTPI2PSr128, m64 2 1 41 / 2CVTPS2PI CVTTPS2PIr64, r128 2 31 / 1CVTPS2PIr64, m128 1 2 41 / 1CVTSI2SSr128, r32 2 1 41 / 2CVTSI2SSr128, m32 2 2 51 / 2CVTSS2SI CVTTSS2SIr32, r128 1 1 31 / 1CVTSS2SIr32, m128 1 2 41 / 2ADDPS SUBPSr128, r128 2 31 / 2ADDPS SUBPSr128, m128 2 2 31 / 2ADDSS SUBSSr128, r128 1 31 / 1Addss Subssr128, M32 1 1 31 / 1mulpsr128, R1282 41 / 2mulpsr128, M1282 2 41 / 2mulssr128, R1281 41 / 1MULSSr128, m321 1 41 / 1DIVPSr128, r1282 481 / 34DIVPSr128, m1282 2 481 / 34DIVSSr128, r1281 181 / 17DIVSSr128, m321 1 181 / 17ANDPS ANDNPS ORPS XORPSr128, r128 2 21 / 2ANDPS ANDNPS ORPS XORPSr128, m128 2 2 21 / 2MAXPS Minpsr128, R128 2 31/2

MAXPS MINPSr128, m128 2 2 31 / 2MAXSS MINSSr128, r128 1 31 / 1MAXSS MINSSr128, m32 1 1 31 / 1CMPccPSr128, r128 2 31 / 2CMPccPSr128, m128 2 2 31 / 2CMPccSSr128, r128 1 31 / 1CMPccSSr128, m32 1 1 31 / 1COMISS UCOMISSr128, r128 1 11 / 1COMISS UCOMISSr128, m32 1 1 11 / 1SQRTPSr128, r1282 561 / 56SQRTPSr128, m1282 2 571 / 56SQRTSSr128, r1282 301 / 28SQRTSSr128, m322 1 311 / 28RSQRTPSr128, r1282 21 / 2RSQRTPSr128, m1282 2 31 / 2RSQRTSSr128, r1281 11 / 1RSQRTSSr128, m321 1 21 / 1RCPPSr128, r1282 21 / 2RCPPSr128, m1282 2 31 / 2RCPSSr128, r1281 11 / 1RCPSSr128, m321 1 21 / 1SHUFPSr128, r128, i 21 21 / 2SHUFPSr128, m128, i 2 2 21 / 2UNPCKHPS UNPCKLPSr128, R128 22 31 / 2Unpckhps unpcklpsr128, M128 2 2 31 / 2LDMXCSRM3211 151 / 15stmxcsrm326 71/9FXSAVEM4096116 62 FXRSTORM409689 6830. Testing speed

The Pentium family of processors have an internal 64 bit clock counter which can be read into EDX: EAX using the instruction RDTSC (read time stamp counter) This is very useful for testing exactly how many clock cycles a piece of code takes..

The program below is useful for measuring the number of clock cycles a piece of code takes. The program executes the code to test 10 times and stores the 10 clock counts. The program can be used in both 16 and 32 bit mode on the PPlain and PMMX:

************ TEST Program for PPLAIN and PMMX: *********************

ITer EQU 10; Number of Items

Overhead EQU 15; 15 for PPLAIN, 17 for PMMX

RDTSC Macro; Define RDTSC Instruction

DB 0FH, 31H

ENDM

; ************ DATA segment: ***********************

.Data; Data segment

ALIGN 4

Counter DD 0; Loop Counter

Tics DD 0; Temporary Storage of ClockResultList Dd Iter DUP (0); List of Test RESULTS

*********** Code segment: ********************

.Code; segment

Begin: MOV [counter], 0; reset loop counterter

Testloop:; test loop

************ Do any initializations here: ********************

Finit

*********** * End of initialization *********************

RDTSC; Read Clock Counter

MOV [Tics], EAX; Save Count

Non-Pairable Filler

Rept 8

NOP; Eight Nop's to Avoid Shadowing Effect

ENDM

*********** Put instructions to test he: ******************************

FLDPI; this is only an esample

FSQRT

RCR EBX, 10

FSTP ST

**************** * End of instructions to test *****************************

CLC; non-pairable filler with shadow

RDTSC; Read Counter Again

Sub Eax, [Tics]; Compute Difference

Sub Eax, Overhead; Subtract Clocks Used by Fillers ETC.

Mov Edx, [Counter]; Loop Counter

Mov [ResultList] [EDX], EAX; Store Result In Table

Add Edx, Type ResultList; Increment Counter

MOV [counter], edx; store counter

CMP EDX, ITER * (Type Resultlist)

JB Testloop; Repeat ITER Times

; Insert Here Code To Read Out The VALUES in Resultlist

The 'filler' instructions before and after the piece of code to test are are included in order to get consistent results on the PPlain. The CLD is a non-pairable instruction which has been inserted to make sure the pairing is the same the first time as the subsequent times. The eight NOP instructions are inserted to prevent any prefixes in the code to test to be decoded in the shadow of the preceding instructions on the PPlain. Single byte instructions are used here to obtain the same pairing the first time as the subsequent times. The CLC after the code to test is a non-pairable instruction which has a shadow under which the 0FH prefix of the RDTSC can be decoded so that it is independent of any shadowing effect from the code to test on the PPlain.On The PMMX you may want to insert XOR EAX, EAX / CPUID before the instructions to test if you want the FIFO instruction buffer to be empty, or some time-consuming instruction (f.ex. CLI or AAD) if you want the FIFO buffer To Be Full (CPUID H As No Shadow Under Which Prefixes of Subsequent Instructions CAN Decode.

On the PPro, PII and PIII you have to insert XOR EAX, EAX / CPUID before and after each RDTSC to prevent it from executing in parallel with anything else, and remove the filler instructions. (CPUID is a serializing instruction which means that it flushes The Pipeline and Waits for All Pending Operations To Finish Before Proceeding. This is useful for testing purposes.)

The RDTSC instruction can not execute in virtual mode on the PPlain and PMMX, so if you are running DOS programs you must run in real mode. (Press F8 while booting and select "safe mode command prompt only" or "bypass startup files").

The complete test program is available from www.agner.org/assem/.The Pentium processors have special performance monitor counters which can count events such as cache misses, misalignments, various stalls, etc. Details about how to use the performance monitor counters are Not Covered by this Manual But Can Be Found In "Intel Architecture Software Developer's Manual", Vol. 3, Appendix A.

31. Comparison of the Different Microprocessors

The Following Table Summarizes Some Important Differences Between The Microprocessors in The Pentium Family:

PPlain PMMX PPro PII PIII code cache, kb81681616data cache, kb81681616built in level 2 cache, kb00256512 *) 512 *) MMX instructionsnoyesnoyesyesXMM instructionsnonononoyesconditional move instructruct.nonoyesyesyesout of order executionnonoyesyesyesbranch predictionpoorgoodgoodgoodgoodbranch target buffer entries256256512512512return stack buffer size04161616branch misprediction penalty3-44-510-2010-2010 -20Partial Register Stall00555FMUL LATENCY33555FMUL THROUGHPUT1 / 21/21/21/21 / 2IMUL LATENCY99444IMUL THROUGHPUT1 / 91/91/11/11/1

*) Celeron: 0-128, Xeon: 512 or More, Many Other Variants Available. ON Some Versions The Level 2 Cache Runs At Half Speed.

Comments to the Table: Code Cache Size Is Important if The Critical Part of Your Program Is Not Limited to a Small Memory Space.

Data Cache Size Is Important for All Programs That Handle More Than Small Amounts of Data In The Critical Part.

MMX and XMM instructions are useful for programs that handle massively parallel data, such as sound and image processing. In other applications it may not be possible to take advantage of the MMX and XMM instructions.

Conditional move instructructions are useful for avoiding poorly predictable conditional jumps.Out of order execution improves performance, especially on non-optimized code. It includes automatic instruction reordering and register renaming.

Processors with a good branch prediction method....................... ..

A Return Stack Buffer Improves Prediction of Return Instructions When a Subroutine IS Called Alternatingly from Different Locations.

Partial Register Stalls Make Handling of Mixed Data Sizes (8, 16, 32 Bit) More Difficult.

The latency of a multiplication instruction is the time it takes in a dependency chain. A throughput of 1/2 means that the execution can be pipelined so that a new multiplication can begin every second clock cycle. This defines the speed for handling parallel data.

Most of the Optimizations Described in This Document Have Little or No Negative Effects On Other Microprocessors, include Non-Intel Processors, But There is.

Scheduling floating point code for the PPlain and PMMX often requires a lot of extra FXCH instructions. This will slow down execution on older microprocessors, but not on the Pentium family and advanced non-Intel processors.

转载请注明原文地址:https://www.9cbs.com/read-51833.html

9cbs

New Post(0)