Optimization of floating point instructions

zhaozj, 2021-02-16

Compilers can now optimize floating-point instructions themselves. Even so, I recommend VC: in my view its optimizer is better, and it makes better use of the pipelines of the Pentium-series processors.

· Optimization

· Understand how your compiler handles floating-point instructions. You cannot write an entire program in floating-point assembly; most code is still written in a high-level language.

· Find the critical parts of the program, such as the code that runs most often; these are the places that truly affect efficiency.

· Separate dependent instructions, so that an instruction does not immediately follow the one whose result it needs.

· Pay attention to the program's demand for memory bandwidth, and reduce it where possible.

· Check whether long-latency instructions are used frequently, and reduce their use.

· Unless higher precision is necessary, use lower precision. For many instructions lower precision is faster, and it also saves memory.

· Keep results within the representable range of the chosen precision; results that fall outside it incur a large overhead.

· Make good use of the FXCH instruction to optimize the pipeline.

· Unroll loops when necessary, and reorder the instructions.

· Change the data access pattern so that, as far as possible, data is found in the cache.
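The last guideline, about data access patterns, can be illustrated with a minimal C sketch. It assumes a matrix stored row by row in one flat array; the function names and the layout are hypothetical and chosen only for illustration. Both functions compute the same sum, but the first walks memory sequentially, which tends to be friendlier to the cache, while the second jumps `cols` elements on every step.

```c
#include <assert.h>
#include <stddef.h>

/* Sum a rows x cols matrix stored row by row in a flat array.
   The inner loop touches consecutive addresses. */
float sum_row_order(const float *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    float sum = 0.0f;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            sum += m[r * cols + c];
    return sum;
}

/* Same elements, but the inner loop strides by cols elements,
   so consecutive accesses are far apart in memory. */
float sum_column_order(const float *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    float sum = 0.0f;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            sum += m[r * cols + c];
    return sum;
}
```

How large the difference is depends on the matrix size and the cache, so it is worth measuring rather than assuming.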

• Improving parallelism

The Pentium II and Pentium III have only one floating-point unit, and it is pipelined, so arranging the computation sensibly improves parallelism. To do this you must know which instructions are pipelined and what the latencies of the common instructions are. Consider the following statements:

    a = b + c + d;
    e = f + g + h;

The most straightforward implementation using floating-point instructions is:

    FLD   b
    FADD  c
    FADD  d
    FSTP  a
    FLD   f
    FADD  g
    FADD  h
    FSTP  e

Almost every instruction in the code above depends on the result of the previous one, so the pipeline stalls again and again. Now look at the following version:

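    FLD   b            ; ST(0) = b
    FADD  c            ; ST(0) = b+c
    FLD   f            ; ST(0) = f, ST(1) = b+c
    FADD  g            ; ST(0) = f+g
    FXCH  ST(1)        ; ST(0) = b+c, ST(1) = f+g
    FADD  d            ; ST(0) = b+c+d
    FLD   h            ; ST(0) = h, ST(1) = b+c+d, ST(2) = f+g
    FADDP ST(2), ST    ; after the pop: ST(0) = b+c+d, ST(1) = f+g+h
    FSTP  a            ; a = b+c+d
    FSTP  e            ; e = f+g+h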

The code above uses the FXCH instruction, a key instruction in floating-point optimization. FADD has a latency of three clock cycles, and the instruction sequence above hides almost all of that latency.

• The FXCH instruction

On the Pentium II and Pentium III processors, the FXCH instruction costs no extra clock cycles. You can use it to reach elements deeper in the stack, which makes the floating-point stack more flexible to use.
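To see how FXCH lets two independent computations share the register stack, here is a toy C model of the stack. It is only a sketch: the `x87_*` helpers are hypothetical names invented for this illustration, they model the stack contents only, and they say nothing about timing. The model checks that one possible interleaving of a = b + c + d and e = f + g + h produces the same results as computing the two sums one after the other.

```c
#include <assert.h>

/* Toy model of the x87 register stack; reg[depth - 1] plays the role of ST(0). */
typedef struct {
    double reg[8];
    int depth;
} FpuStack;

/* Pointer to ST(i). */
double *x87_st(FpuStack *s, int i)
{
    assert(i >= 0 && i < s->depth);
    return &s->reg[s->depth - 1 - i];
}

/* FLD mem: push a value. */
void x87_fld(FpuStack *s, double v) { assert(s->depth < 8); s->reg[s->depth++] = v; }

/* FADD mem: ST(0) = ST(0) + mem. */
void x87_fadd(FpuStack *s, double v) { *x87_st(s, 0) += v; }

/* FXCH ST(i): swap ST(0) and ST(i). */
void x87_fxch(FpuStack *s, int i)
{
    double t = *x87_st(s, 0);
    *x87_st(s, 0) = *x87_st(s, i);
    *x87_st(s, i) = t;
}

/* FSTP mem: return ST(0) and pop it. */
double x87_fstp(FpuStack *s) { double v = *x87_st(s, 0); s->depth--; return v; }

/* One dependency chain after the other. */
void sums_serial(double b, double c, double d, double f, double g, double h,
                 double *a, double *e)
{
    FpuStack s = { {0}, 0 };
    x87_fld(&s, b); x87_fadd(&s, c); x87_fadd(&s, d); *a = x87_fstp(&s);
    x87_fld(&s, f); x87_fadd(&s, g); x87_fadd(&s, h); *e = x87_fstp(&s);
}

/* The two chains interleaved; FXCH brings the other partial sum to ST(0). */
void sums_interleaved(double b, double c, double d, double f, double g, double h,
                      double *a, double *e)
{
    FpuStack s = { {0}, 0 };
    x87_fld(&s, b);   /* ST(0) = b                    */
    x87_fadd(&s, c);  /* ST(0) = b+c                  */
    x87_fld(&s, f);   /* ST(0) = f,     ST(1) = b+c   */
    x87_fadd(&s, g);  /* ST(0) = f+g                  */
    x87_fxch(&s, 1);  /* ST(0) = b+c,   ST(1) = f+g   */
    x87_fadd(&s, d);  /* ST(0) = b+c+d                */
    x87_fxch(&s, 1);  /* ST(0) = f+g,   ST(1) = b+c+d */
    x87_fadd(&s, h);  /* ST(0) = f+g+h                */
    x87_fxch(&s, 1);  /* ST(0) = b+c+d, ST(1) = f+g+h */
    *a = x87_fstp(&s);
    *e = x87_fstp(&s);
}
```

In the interleaved version no addition uses the result of the addition issued immediately before it, which is the property the reordering is after.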

· Loop unrolling has the following benefits:

· It reduces the number of branches, so the cost of branching becomes less significant.

· It lets otherwise idle registers be put to full use, which speeds up the computation.

· It allows better instruction scheduling: dependencies between neighbouring instructions are reduced, which leaves more room to optimize for the pipeline and makes it easier to order instructions to suit the decoding and prefetch requirements.

Loop unrolling is more than duplicating a few instructions. To get high performance you need to redesign the algorithm so that it uses as many of the available resources as possible.
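As a minimal sketch of unrolling combined with a small redesign of the algorithm, the C functions below sum an array. The names and the unroll factor of four are assumptions made for illustration. The unrolled version keeps four independent partial sums, so consecutive additions do not depend on each other. Because floating-point addition is not associative, the two versions can differ in the last bits for general input.

```c
#include <assert.h>
#include <stddef.h>

/* Plain loop: every addition depends on the previous one. */
float sum_simple(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Unrolled by four with four independent accumulators. */
float sum_unrolled4(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)      /* leftover elements when n is not a multiple of 4 */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```

Whether this is faster than what the compiler already generates depends on the compiler and its settings, so it is worth checking the generated code.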

• Latency of floating-point instructions

Many floating-point instructions have a latency of more than one clock cycle, but because the Pentium II and Pentium III can execute instructions out of order, these delays are not always visible. However, if an instruction has a long latency, we must pay attention to it. The latencies of some instructions, and ways of dealing with them, are discussed below.

• Floating-point stores

A floating-point store must wait one extra clock cycle for its operand. After an FLD, an FST must wait one clock cycle. Instructions such as FMUL and FADD normally have a latency of three clock cycles, and an FST that follows them must wait one additional cycle, so it sees a delay of four clock cycles. See the examples below.

· Store depends on the previous load:

    FLD   mem1     ; 1      FLD takes 1 clock
                   ; 2      FST waits, schedule something here
    FST   mem2     ; 3, 4   FST takes 2 clocks

    FADD  mem1     ; 1      FADD takes 3 clocks
                   ; 2      FADD, schedule something here
                   ; 3      FADD, schedule something here
                   ; 4      FST waits, schedule something here
    FST   mem2     ; 5, 6   FST takes 2 clocks

· Store does not depend on the previous load:

    FLD   mem1     ; 1
    FLD   mem2     ; 2
    FXCH  ST(1)    ; 2
    FST   mem3     ; 3      stores the value loaded from mem1

· A register may be used immediately after it has been loaded (with FLD):

    FLD   mem1     ; 1
    FADD  mem2     ; 2, 3, 4

• Arithmetic instructions

Common instructions such as FADD, FSUB and FMUL have a latency of three clock cycles; if you need their result, place at least two independent instructions after them first. For long-latency instructions such as FDIV and FSQRT, consider placing integer instructions after them, and try to use such instructions as little as possible. Floating-point division is extremely time-consuming; AMD even uses iterative methods to compute division and square roots.
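One common way to use fewer divisions is to divide once and multiply many times. The C sketch below uses hypothetical function names and assumes that a slightly different rounding is acceptable: multiplying by a reciprocal is not always bit-identical to dividing, although it is exact when the divisor is a power of two.

```c
#include <assert.h>
#include <stddef.h>

/* One division per element. */
void scale_by_division(float *x, size_t n, float d)
{
    assert(d != 0.0f);
    for (size_t i = 0; i < n; i++)
        x[i] /= d;
}

/* One division in total, then one multiplication per element. */
void scale_by_reciprocal(float *x, size_t n, float d)
{
    assert(d != 0.0f);
    const float r = 1.0f / d;   /* the only division */
    for (size_t i = 0; i < n; i++)
        x[i] *= r;
}
```

If bit-exact agreement with division matters, keep the division and instead schedule independent work after it.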

• Integer and floating-point multiplication

Integer multiply instructions such as MUL and IMUL are executed in the floating-point unit, so they cannot run in parallel with floating-point instructions. Although the nominal throughput of floating-point instructions is one per clock cycle, an FMUL can be issued only every two clock cycles, so two FMULs written back to back incur a delay of one clock cycle. Note that the sequence FMUL / FXCH / FMUL has the same effect.

When reprinting, please cite the original address: https://www.9cbs.com/read-28205.html
