MMX and SSE Optimization - SSE

xiaoxiao2021-04-11  618

The previous part covered MMX optimization techniques for integer computation. However, the truly demanding workloads, graphics and sound processing, are mostly floating-point operations, and the demand for floating-point performance keeps growing. In response, Intel added the SSE instructions for floating-point operations to the Pentium III, so any program that uses SSE instructions must run on a Pentium III, an Athlon XP, or a later processor.

SSE defines eight new 128-bit registers, XMM0-XMM7, twice the width of the 64-bit MMX registers. Each register holds four 32-bit floating-point numbers. Because these are new registers, SSE code avoids the switching between the MMX registers and the original floating-point registers, so it executes more efficiently. It is worth noting that SSE can also operate on 16-bit and 8-bit integers, but that is not the mainstream use of SSE.

A quick aside on Intel Compiler 8.0: this compiler is really strong. My personal impression is that its output is about 10-20% faster than that of Visual C 6.0 SP6, and it can optimize for different CPUs. If you have a P4-series CPU, add the options /fast /QxW /Qip /Qunroll40 when compiling and the results may surprise you. If you read the user manual and change your program according to its advice, you will gain even more. I recommend this compiler to everyone who worships ultimate optimization. Enough digression; back to SSE, starting with a simple example:

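Using inline assembly in VC:

    // MOVAPS requires 16-byte aligned memory operands
    __declspec(align(16)) float a[] = {1.0f, 2.0f, 3.0f, 4.0f};
    __declspec(align(16)) float b[] = {5.0f, 6.0f, 7.0f, 8.0f};
    __asm {
        lea    ecx, a        ; ecx = address of a
        lea    edx, b        ; edx = address of b
        movaps xmm0, [ecx]   ; load the four floats of a
        movaps xmm1, [edx]   ; load the four floats of b
        addps  xmm0, xmm1    ; xmm0 = a + b, lane by lane
        movaps [ecx], xmm0   ; store the sums back into a
    }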

As with MMX, you can also use SSE without writing assembly. Include one header file and call the intrinsics directly from C:

    #include <xmmintrin.h>

    __m128 a = _mm_set_ps(1, 2, 3, 4);
    __m128 b = _mm_set_ps(5, 6, 7, 8);
    a = _mm_add_ps(a, b);

This time I find the intrinsics more convenient, because they also provide many composite operations built from several instructions.
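To show how the intrinsics scale beyond a single register, here is a minimal sketch that adds two float arrays four elements at a time. The function name add_arrays and its parameters are my own choices for illustration, not something from the original text, and the unaligned load and store are used so that the caller does not have to guarantee alignment.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: dst[i] = a[i] + b[i], four floats per step. */
void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);             /* MOVUPS: no alignment required */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));  /* ADDPS */
    }
    for (; i < n; ++i)                               /* scalar tail for leftover elements */
        dst[i] = a[i] + b[i];
}
```

One detail worth knowing when reading the example above it: _mm_set_ps takes its arguments from the highest lane down, so _mm_set_ps(1, 2, 3, 4) places 4 in the lowest 32 bits.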

The SSE instructions are described below. For a more complete and detailed description, download the instruction manual from Intel. The following part is referenced from:


ADDPS

Format: ADDPS xmm1, xmm2/m128

Function: Adds two sets of single-precision numbers

Algorithm:
DEST[31-0] = DEST[31-0] + SRC/m128[31-0];
DEST[63-32] = DEST[63-32] + SRC/m128[63-32];
DEST[95-64] = DEST[95-64] + SRC/m128[95-64];
DEST[127-96] = DEST[127-96] + SRC/m128[127-96];


ADDSS

Format: ADDSS xmm1, xmm2/m32

Function: Adds the low single-precision numbers

Algorithm:
DEST[31-0] = DEST[31-0] + SRC/m32[31-0];
DEST[63-32] = DEST[63-32];
DEST[95-64] = DEST[95-64];
DEST[127-96] = DEST[127-96];

ANDNPS

Format: ANDNPS xmm1, xmm2/m128

Function: Inverts xmm1 bitwise (NOT), then ANDs the result with xmm2/m128

Algorithm:
DEST[127-0] = NOT(DEST[127-0]) AND SRC/m128[127-0];


ANDPS

Format: ANDPS xmm1, xmm2/m128

Function: Bitwise logical AND of the two operands

Algorithm:
DEST[127-0] = DEST[127-0] AND SRC/m128[127-0];
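The logical instructions are mostly used for bit tricks on floating-point data. As a small sketch, assuming the usual IEEE 754 layout where bit 31 of each lane is the sign bit, ANDNPS can compute an absolute value by clearing that bit. The helper name abs_ps is hypothetical.

```c
#include <xmmintrin.h>

/* Hypothetical helper: |v| in each lane, obtained by clearing the sign bit. */
__m128 abs_ps(__m128 v)
{
    const __m128 sign_mask = _mm_set1_ps(-0.0f);  /* only bit 31 set in each lane */
    return _mm_andnot_ps(sign_mask, v);           /* ANDNPS: (NOT mask) AND v */
}
```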


CMPPS

Format: CMPPS xmm1, xmm2/m128, imm8

Function: Compares the values of the two operands, using a different comparison depending on the value of imm8

imm8 = 0: equal (==); imm8 = 1: less than (<); imm8 = 2: less than or equal (<=); imm8 = 3: unordered; imm8 = 4: not equal (!=); imm8 = 5: not less than (!<); imm8 = 6: not less than or equal (!<=); imm8 = 7: ordered.

Algorithm:
IF (imm8 = 0) THEN OP = "EQ";
ELSEIF (imm8 = 1) THEN OP = "LT";
ELSEIF (imm8 = 2) THEN OP = "LE";
ELSEIF (imm8 = 3) THEN OP = "UNORD";
ELSEIF (imm8 = 4) THEN OP = "NE";
ELSEIF (imm8 = 5) THEN OP = "NLT";
ELSEIF (imm8 = 6) THEN OP = "NLE";
ELSEIF (imm8 = 7) THEN OP = "ORD";
FI

CMP0 = DEST[31-0] OP SRC/m128[31-0];
CMP1 = DEST[63-32] OP SRC/m128[63-32];
CMP2 = DEST[95-64] OP SRC/m128[95-64];
CMP3 = DEST[127-96] OP SRC/m128[127-96];

IF (CMP0 = TRUE) THEN DEST[31-0] = 0xFFFFFFFF; ELSE DEST[31-0] = 0x00000000; FI
IF (CMP1 = TRUE) THEN DEST[63-32] = 0xFFFFFFFF; ELSE DEST[63-32] = 0x00000000; FI
IF (CMP2 = TRUE) THEN DEST[95-64] = 0xFFFFFFFF; ELSE DEST[95-64] = 0x00000000; FI
IF (CMP3 = TRUE) THEN DEST[127-96] = 0xFFFFFFFF; ELSE DEST[127-96] = 0x00000000; FI

Others: You can also use the following more readable instructions

Instruction               Implementation
CMPEQPS xmm1, xmm2        CMPPS xmm1, xmm2, 0
CMPLTPS xmm1, xmm2        CMPPS xmm1, xmm2, 1
CMPLEPS xmm1, xmm2        CMPPS xmm1, xmm2, 2
CMPUNORDPS xmm1, xmm2     CMPPS xmm1, xmm2, 3
CMPNEQPS xmm1, xmm2       CMPPS xmm1, xmm2, 4
CMPNLTPS xmm1, xmm2       CMPPS xmm1, xmm2, 5
CMPNLEPS xmm1, xmm2       CMPPS xmm1, xmm2, 6
CMPORDPS xmm1, xmm2       CMPPS xmm1, xmm2, 7

CMPSS

Format: CMPSS xmm1, xmm2/m32, imm8

Function: Compares the low single-precision numbers

Algorithm: The algorithm is similar to CMPPS, but it applies only to DEST[31-0].

You can also use the following more readable instructions.

Instruction               Implementation
CMPEQSS xmm1, xmm2        CMPSS xmm1, xmm2, 0
CMPLTSS xmm1, xmm2        CMPSS xmm1, xmm2, 1
CMPLESS xmm1, xmm2        CMPSS xmm1, xmm2, 2
CMPUNORDSS xmm1, xmm2     CMPSS xmm1, xmm2, 3
CMPNEQSS xmm1, xmm2       CMPSS xmm1, xmm2, 4
CMPNLTSS xmm1, xmm2       CMPSS xmm1, xmm2, 5
CMPNLESS xmm1, xmm2       CMPSS xmm1, xmm2, 6
CMPORDSS xmm1, xmm2       CMPSS xmm1, xmm2, 7
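Because the compare instructions produce an all-ones or all-zeros mask in each lane rather than setting flags, they combine naturally with the logical instructions to choose between values without branching. The following is only a sketch of that idea; the helper name select_smaller is hypothetical.

```c
#include <xmmintrin.h>

/* Hypothetical helper: per lane, returns a if a < b, otherwise b. */
__m128 select_smaller(__m128 a, __m128 b)
{
    __m128 mask   = _mm_cmplt_ps(a, b);      /* CMPLTPS: all ones where a < b */
    __m128 from_a = _mm_and_ps(mask, a);     /* keep a where the mask is set */
    __m128 from_b = _mm_andnot_ps(mask, b);  /* keep b where the mask is clear */
    return _mm_or_ps(from_a, from_b);        /* merge the two halves */
}
```

For this particular comparison MINPS would do the same job in one instruction; the mask pattern is more useful when the selected values are not the ones being compared.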


COMISS

Format: COMISS xmm1, xmm2/m32

Function: Compares the low single-precision numbers and sets the EFLAGS status flags

Algorithm:
OF = 0; SF = 0; AF = 0;
IF ((DEST[31-0] UNORD SRC/m32[31-0]) = TRUE) THEN ZF = 1; PF = 1; CF = 1;
ELSEIF ((DEST[31-0] GTRTHAN SRC/m32[31-0]) = TRUE) THEN ZF = 0; PF = 0; CF = 0;
ELSEIF ((DEST[31-0] LESSTHAN SRC/m32[31-0]) = TRUE) THEN ZF = 0; PF = 0; CF = 1;
ELSE ZF = 1; PF = 0; CF = 0;
FI


CVTPI2PS

Format: CVTPI2PS xmm, mm/m64

Function: Converts two 32-bit integers to floating-point numbers

Algorithm:
DEST[31-0] = (float)(SRC/m64[31-0]);
DEST[63-32] = (float)(SRC/m64[63-32]);
DEST[95-64] = DEST[95-64];
DEST[127-96] = DEST[127-96];

CVTPS2PI

Format: CVTPS2PI mm, xmm/m64

Function: Converts the low two floating-point numbers to 32-bit integers

Algorithm:
DEST[31-0] = (int)(SRC/m64[31-0]);
DEST[63-32] = (int)(SRC/m64[63-32]);


CVTSI2SS

Format: CVTSI2SS xmm, r/m32

Function: Converts a 32-bit integer to a floating-point number and stores it in the low position

Algorithm:
DEST[31-0] = (float)(r/m32);
DEST[63-32] = DEST[63-32];
DEST[95-64] = DEST[95-64];
DEST[127-96] = DEST[127-96];

CVTSS2SI

Format: CVTSS2SI r32, xmm/m32

Function: Converts the low floating-point number to a 32-bit integer

Algorithm:
r32 = (int)(SRC/m32[31-0]);


CVTTPS2PI

Format: CVTTPS2PI mm, xmm/m64

Function: Converts the low two floating-point numbers to 32-bit integers, with truncation

Algorithm:
DEST[31-0] = (int)(SRC/m64[31-0]);
DEST[63-32] = (int)(SRC/m64[63-32]);


CVTTSS2SI

Format: CVTTSS2SI r32, xmm/m32

Function: Converts the low floating-point number to a 32-bit integer, with truncation

Algorithm:
r32 = (int)(SRC/m32[31-0]);
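The pseudocode for the CVT and CVTT forms looks identical, so the difference is easy to miss: the CVT forms round according to the rounding mode in MXCSR, while the CVTT forms always truncate toward zero. A minimal sketch, assuming MXCSR is still at its default round-to-nearest-even setting; the two function names are hypothetical.

```c
#include <xmmintrin.h>

/* CVTSS2SI: rounds according to MXCSR (round-to-nearest-even by default). */
int to_int_rounded(float x)
{
    return _mm_cvtss_si32(_mm_set_ss(x));
}

/* CVTTSS2SI: always truncates toward zero, regardless of MXCSR. */
int to_int_truncated(float x)
{
    return _mm_cvttss_si32(_mm_set_ss(x));
}
```

With the default rounding mode, 2.7f becomes 3 with the first function and 2 with the second, which matches what a C cast (int)x produces.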


DIVPS

Format: DIVPS xmm1, xmm2/m128

Function: Divides single-precision numbers

Algorithm:
DEST[31-0] = DEST[31-0] / (SRC/m128[31-0]);
DEST[63-32] = DEST[63-32] / (SRC/m128[63-32]);
DEST[95-64] = DEST[95-64] / (SRC/m128[95-64]);
DEST[127-96] = DEST[127-96] / (SRC/m128[127-96]);


DIVSS

Format: DIVSS xmm1, xmm2/m32

Function: Divides the low single-precision numbers

Algorithm:
DEST[31-0] = DEST[31-0] / (SRC/m32[31-0]);
DEST[63-32] = DEST[63-32];
DEST[95-64] = DEST[95-64];
DEST[127-96] = DEST[127-96];


EMMS

Format: EMMS

Function: Empties the MMX state by marking every entry of the floating-point tag word as empty

Algorithm:
FPUTagWord <- 0xFFFF;



FXRSTOR

Format: FXRSTOR m512byte

Function: Restores the FP, MMX, and SSE state from m512byte

Algorithm:
FP and MMX state and Streaming SIMD Extension state = m512byte;


FXSAVE

Format: FXSAVE m512byte

Function: Saves the FP, MMX, and SSE state to m512byte

Algorithm:
m512byte = FP and MMX state and Streaming SIMD Extension state;


LDMXCSR

Format: LDMXCSR m32

Function: Loads the SSE control and status register (MXCSR) from m32

Algorithm:
MXCSR = m32;


MAXPS

Format: MAXPS xmm1, xmm2/m128

Function: Returns the larger of each pair of single-precision numbers

Algorithm:
IF (DEST[31-0] = NaN) THEN DEST[31-0] = SRC[31-0];
ELSEIF (SRC[31-0] = NaN) THEN DEST[31-0] = SRC[31-0];
ELSEIF (DEST[31-0] > SRC/m128[31-0]) THEN DEST[31-0] = DEST[31-0];
ELSE DEST[31-0] = SRC/m128[31-0];
FI
IF (DEST[63-32] = NaN) THEN DEST[63-32] = SRC[63-32];
ELSEIF (SRC[63-32] = NaN) THEN DEST[63-32] = SRC[63-32];
ELSEIF (DEST[63-32] > SRC/m128[63-32]) THEN DEST[63-32] = DEST[63-32];
ELSE DEST[63-32] = SRC/m128[63-32];
FI
IF (DEST[95-64] = NaN) THEN DEST[95-64] = SRC[95-64];
ELSEIF (SRC[95-64] = NaN) THEN DEST[95-64] = SRC[95-64];
ELSEIF (DEST[95-64] > SRC/m128[95-64]) THEN DEST[95-64] = DEST[95-64];
ELSE DEST[95-64] = SRC/m128[95-64];
FI
IF (DEST[127-96] = NaN) THEN DEST[127-96] = SRC[127-96];
ELSEIF (SRC[127-96] = NaN) THEN DEST[127-96] = SRC[127-96];
ELSEIF (DEST[127-96] > SRC/m128[127-96]) THEN DEST[127-96] = DEST[127-96];
ELSE DEST[127-96] = SRC/m128[127-96];
FI

MAXSS

Format: MAXSS xmm1, xmm2/m32

Function: Returns the larger of the low single-precision numbers

Algorithm: Similar to MAXPS above; the difference is that it operates only on DEST[31-0]


MINPS

Format: MINPS xmm1, xmm2/m128

Function: Returns the smaller of each pair of single-precision numbers



MINSS

Format: MINSS xmm1, xmm2/m32

Function: Returns the smaller of the low single-precision numbers
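A common use of the MAX and MIN instructions together is clamping values into a range without any branches. The sketch below assumes lo <= hi and that the input contains no NaN; the helper name clamp_ps is hypothetical.

```c
#include <xmmintrin.h>

/* Hypothetical helper: clamps every lane of v into [lo, hi]. Assumes lo <= hi. */
__m128 clamp_ps(__m128 v, float lo, float hi)
{
    v = _mm_max_ps(v, _mm_set1_ps(lo));     /* MAXPS: raise values below lo */
    return _mm_min_ps(v, _mm_set1_ps(hi));  /* MINPS: lower values above hi */
}
```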



MOVAPS

Format: MOVAPS xmm1, xmm2/m128 or MOVAPS xmm2/m128, xmm1

Function: Transfers aligned data

Algorithm:
IF (destination = DEST) THEN
    IF (SRC = m128) THEN (* load instruction *)
        DEST[127-0] = m128;
    ELSE (* move instruction *)
        DEST[127-0] = SRC[127-0];
    FI
ELSE
    IF (destination = m128) THEN (* store instruction *)
        m128 = SRC[127-0];
    ELSE (* move instruction *)
        DEST[127-0] = SRC[127-0];
    FI
FI

MOVHLPS


Format: MOVHLPS xmm1, xmm2

Function: Moves the high two numbers of the source to the low two positions of the destination

Algorithm:
DEST[127-64] = DEST[127-64];
DEST[63-0] = SRC[127-64];

MOVHPS

Format: MOVHPS xmm, m64 or MOVHPS m64, xmm

Function: Transfers the high 64 bits of data

Algorithm:
IF (destination = DEST) THEN (* load instruction *)
    DEST[127-64] = m64;
    DEST[31-0] = DEST[31-0];
    DEST[63-32] = DEST[63-32];
ELSE (* store instruction *)
    m64 = SRC[127-64];
FI

MOVLPS

Format: MOVLPS xmm, m64 or MOVLPS m64, xmm

Function: Transfers the low 64 bits of data

Algorithm:
IF (destination = DEST) THEN (* load instruction *)
    DEST[63-0] = m64;
    DEST[95-64] = DEST[95-64];
    DEST[127-96] = DEST[127-96];
ELSE (* store instruction *)
    m64 = SRC[63-0];
FI

MOVLHPS


Format: MOVLHPS xmm1, xmm2

Function: Moves the low two numbers of the source to the high two positions of the destination

Algorithm:
DEST[127-64] = SRC[63-0];
DEST[63-0] = DEST[63-0];



MOVMSKPS

Format: MOVMSKPS r32, xmm

Function: Moves the sign-bit mask of the four numbers into a 32-bit register

Algorithm:
r32[0] = SRC[31];
r32[1] = SRC[63];
r32[2] = SRC[95];
r32[3] = SRC[127];
r32[7-4] = 0x0;
r32[15-8] = 0x00;
r32[31-16] = 0x0000;
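Collecting the four sign bits into an integer is what lets ordinary C code branch on the result of a packed compare, since a compare mask has its sign bit set exactly where the condition holds. A small sketch of that pattern; the helper name any_below is hypothetical.

```c
#include <xmmintrin.h>

/* Hypothetical helper: nonzero if any of the four lanes of v is below limit. */
int any_below(__m128 v, float limit)
{
    __m128 mask = _mm_cmplt_ps(v, _mm_set1_ps(limit));  /* all ones where v < limit */
    return _mm_movemask_ps(mask) != 0;                  /* MOVMSKPS: one bit per lane */
}
```

Bit 0 of the returned mask corresponds to the lowest lane, so the value can also be used to find out which lanes matched.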


MOVNTPS

Format: MOVNTPS m128, xmm

Function: Writes the data directly to memory with a non-temporal hint, reducing the pressure on the cache

Algorithm:
m128 = SRC;
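Non-temporal stores suit large buffers that are written once and not read back soon. The sketch below fills such a buffer; it assumes that dst is 16-byte aligned and that n is a multiple of 4, and the helper name fill_stream is hypothetical. Whether this is actually faster than ordinary stores depends on the buffer size and the CPU, so measure before relying on it.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical helper: fills dst with value using non-temporal stores.
   Assumes dst is 16-byte aligned and n is a multiple of 4. */
void fill_stream(float *dst, size_t n, float value)
{
    const __m128 v = _mm_set1_ps(value);
    size_t i;
    for (i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, v);  /* MOVNTPS: store that bypasses the cache */
    _mm_sfence();                   /* order the streamed stores before later stores */
}
```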


MOVSS

Format: MOVSS xmm1, xmm2/m32 or MOVSS xmm2/m32, xmm1

Function: Transfers the lowest single-precision number

Algorithm:
IF (destination = DEST) THEN
    IF (SRC = m32) THEN (* load instruction *)
        DEST[31-0] = m32;
        DEST[63-32] = 0x00000000;
        DEST[95-64] = 0x00000000;
        DEST[127-96] = 0x00000000;
    ELSE (* move instruction *)
        DEST[31-0] = SRC[31-0];
        DEST[63-32] = DEST[63-32];
        DEST[95-64] = DEST[95-64];
        DEST[127-96] = DEST[127-96];
    FI
ELSE
    IF (destination = m32) THEN (* store instruction *)
        m32 = SRC[31-0];
    ELSE (* move instruction *)
        DEST[31-0] = SRC[31-0];
        DEST[63-32] = DEST[63-32];
        DEST[95-64] = DEST[95-64];
        DEST[127-96] = DEST[127-96];
    FI
FI

MOVUPS

Format: MOVUPS xmm1, xmm2/m128 or MOVUPS xmm2/m128, xmm1

Function: Transfers unaligned data

Algorithm:
IF (destination = xmm) THEN
    IF (SRC = m128) THEN (* load instruction *)
        DEST[127-0] = m128;
    ELSE (* move instruction *)
        DEST[127-0] = SRC[127-0];
    FI
ELSE
    IF (destination = m128) THEN (* store instruction *)
        m128 = SRC[127-0];
    ELSE (* move instruction *)
        DEST[127-0] = SRC[127-0];
    FI
FI


MULPS

Format: MULPS xmm1, xmm2/m128

Function: Multiplies single-precision numbers

Algorithm:
DEST[31-0] = DEST[31-0] * SRC/m128[31-0];
DEST[63-32] = DEST[63-32] * SRC/m128[63-32];
DEST[95-64] = DEST[95-64] * SRC/m128[95-64];
DEST[127-96] = DEST[127-96] * SRC/m128[127-96];


MULSS

Format: MULSS xmm1, xmm2/m32

Function: Multiplies the low single-precision numbers

Algorithm:
DEST[31-0] = DEST[31-0] * SRC/m32[31-0];
DEST[63-32] = DEST[63-32];
DEST[95-64] = DEST[95-64];
DEST[127-96] = DEST[127-96];


ORPS

Format: ORPS xmm1, xmm2/m128

Function: Bitwise logical OR of the two operands

Algorithm:
DEST[127-0] = DEST[127-0] OR SRC/m128[127-0];

RCPPS

Format: RCPPS xmm1, xmm2/m128

Function: Computes the approximate reciprocals

Algorithm:
DEST[31-0] = APPROX(1.0 / (SRC/m128[31-0]));
DEST[63-32] = APPROX(1.0 / (SRC/m128[63-32]));
DEST[95-64] = APPROX(1.0 / (SRC/m128[95-64]));
DEST[127-96] = APPROX(1.0 / (SRC/m128[127-96]));

RCPSS

Format: RCPSS xmm1, xmm2/m32

Function: Computes the approximate reciprocal of the lowest number

Algorithm:
DEST[31-0] = APPROX(1.0 / (SRC/m32[31-0]));
DEST[63-32] = DEST[63-32];
DEST[95-64] = DEST[95-64];
DEST[127-96] = DEST[127-96];


RSQRTPS

Format: RSQRTPS xmm1, xmm2/m128

Function: Computes the approximate reciprocals of the square roots

Algorithm:
DEST[31-0] = APPROX(1.0 / SQRT(SRC/m128[31-0]));
DEST[63-32] = APPROX(1.0 / SQRT(SRC/m128[63-32]));
DEST[95-64] = APPROX(1.0 / SQRT(SRC/m128[95-64]));
DEST[127-96] = APPROX(1.0 / SQRT(SRC/m128[127-96]));


RSQRTSS

Format: RSQRTSS xmm1, xmm2/m32

Function: Computes the approximate reciprocal of the square root of the lowest number

Algorithm:
DEST[31-0] = APPROX(1.0 / SQRT(SRC/m32[31-0]));
DEST[63-32] = DEST[63-32];
DEST[95-64] = DEST[95-64];
DEST[127-96] = DEST[127-96];
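The APPROX results are fast but only accurate to roughly 12 bits, so they are often refined with one Newton-Raphson step, y' = y * (1.5 - 0.5 * x * y * y), which roughly doubles the number of correct bits. The following is a sketch of that well-known refinement, not something taken from the original article; the helper name rsqrt_refined is hypothetical, and the inputs are assumed to be positive, finite numbers.

```c
#include <xmmintrin.h>

/* Hypothetical helper: approximate 1/sqrt(x) per lane, refined with one
   Newton-Raphson step. Assumes every lane of x is positive and finite. */
__m128 rsqrt_refined(__m128 x)
{
    const __m128 half         = _mm_set1_ps(0.5f);
    const __m128 three_halves = _mm_set1_ps(1.5f);
    __m128 y   = _mm_rsqrt_ps(x);                 /* RSQRTPS: initial estimate */
    __m128 xyy = _mm_mul_ps(_mm_mul_ps(x, y), y); /* x * y * y */
    /* y * (1.5 - 0.5 * x * y * y) */
    return _mm_mul_ps(y, _mm_sub_ps(three_halves, _mm_mul_ps(half, xyy)));
}
```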


SHUFPS

Format: SHUFPS xmm1, xmm2/m128, imm8

Function: Shuffles the numbers: the low two results are selected from xmm1 and the high two from xmm2/m128, as directed by imm8

Algorithm:
FP_SELECT = (imm8 >> 0) AND 0x3;
IF (FP_SELECT = 0) THEN DEST[31-0] = DEST[31-0];
ELSEIF (FP_SELECT = 1) THEN DEST[31-0] = DEST[63-32];
ELSEIF (FP_SELECT = 2) THEN DEST[31-0] = DEST[95-64];
ELSE DEST[31-0] = DEST[127-96];
FI

FP_SELECT = (imm8 >> 2) AND 0x3;
IF (FP_SELECT = 0) THEN DEST[63-32] = DEST[31-0];
ELSEIF (FP_SELECT = 1) THEN DEST[63-32] = DEST[63-32];
ELSEIF (FP_SELECT = 2) THEN DEST[63-32] = DEST[95-64];
ELSE DEST[63-32] = DEST[127-96];
FI

FP_SELECT = (imm8 >> 4) AND 0x3;
IF (FP_SELECT = 0) THEN DEST[95-64] = SRC/m128[31-0];
ELSEIF (FP_SELECT = 1) THEN DEST[95-64] = SRC/m128[63-32];
ELSEIF (FP_SELECT = 2) THEN DEST[95-64] = SRC/m128[95-64];
ELSE DEST[95-64] = SRC/m128[127-96];
FI

FP_SELECT = (imm8 >> 6) AND 0x3;
IF (FP_SELECT = 0) THEN DEST[127-96] = SRC/m128[31-0];
ELSEIF (FP_SELECT = 1) THEN DEST[127-96] = SRC/m128[63-32];
ELSEIF (FP_SELECT = 2) THEN DEST[127-96] = SRC/m128[95-64];
ELSE DEST[127-96] = SRC/m128[127-96];
FI
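SSE has no instruction that adds the four lanes of one register together, so shuffles are the usual way to do it. The sketch below is one common arrangement using SHUFPS and MOVHLPS; the helper name horizontal_sum is hypothetical. The _MM_SHUFFLE macro from xmmintrin.h builds imm8 from four 2-bit selectors, listed from the highest result lane down to the lowest.

```c
#include <xmmintrin.h>

/* Hypothetical helper: returns a0 + a1 + a2 + a3 for v = [a0, a1, a2, a3]
   (lanes listed from lane 0 upward). */
float horizontal_sum(__m128 v)
{
    /* SHUFPS: swap neighbouring lanes, giving [a1, a0, a3, a2]. */
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(v, shuf);  /* [a0+a1, a0+a1, a2+a3, a2+a3] */
    shuf = _mm_movehl_ps(shuf, sums);   /* MOVHLPS: lane 0 now holds a2+a3 */
    sums = _mm_add_ss(sums, shuf);      /* lane 0 = (a0+a1) + (a2+a3) */
    return _mm_cvtss_f32(sums);         /* extract lane 0 */
}
```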


SQRTPS

Format: SQRTPS xmm1, xmm2/m128

Function: Computes the square roots

Algorithm:
DEST[31-0] = SQRT(SRC/m128[31-0]);
DEST[63-32] = SQRT(SRC/m128[63-32]);
DEST[95-64] = SQRT(SRC/m128[95-64]);
DEST[127-96] = SQRT(SRC/m128[127-96]);


SQRTSS

Format: SQRTSS xmm1, xmm2/m32

Function: Computes the square root of the lowest number

Algorithm:
DEST[31-0] = SQRT(SRC/m32[31-0]);
DEST[63-32] = DEST[63-32];
DEST[95-64] = DEST[95-64];
DEST[127-96] = DEST[127-96];


STMXCSR

Format: STMXCSR m32

Function: Stores the SSE control and status register (MXCSR) to m32

Algorithm:
m32 = MXCSR;
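From C, STMXCSR and LDMXCSR are reachable through _mm_getcsr and _mm_setcsr. As a sketch of a read-modify-write of the control word, the function below toggles the flush-to-zero bit, which makes denormal results become zero and can speed up code that produces many tiny values. The helper name set_flush_to_zero is hypothetical; the _MM_FLUSH_ZERO_ON and _MM_FLUSH_ZERO_MASK constants come from xmmintrin.h.

```c
#include <xmmintrin.h>

/* Hypothetical helper: sets or clears the flush-to-zero bit in MXCSR and
   returns the previous MXCSR value so the caller can restore it later. */
unsigned int set_flush_to_zero(int enable)
{
    unsigned int old = _mm_getcsr();  /* STMXCSR */
    unsigned int updated = enable
        ? (old | (unsigned int)_MM_FLUSH_ZERO_ON)
        : (old & ~(unsigned int)_MM_FLUSH_ZERO_MASK);
    _mm_setcsr(updated);              /* LDMXCSR */
    return old;
}
```

Restoring the saved value with _mm_setcsr when the hot loop is finished keeps the change from leaking into unrelated code.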


SUBPS

Format: SUBPS xmm1, xmm2/m128

Function: Subtracts single-precision numbers

Algorithm:
DEST[31-0] = DEST[31-0] - SRC/m128[31-0];
DEST[63-32] = DEST[63-32] - SRC/m128[63-32];
DEST[95-64] = DEST[95-64] - SRC/m128[95-64];
DEST[127-96] = DEST[127-96] - SRC/m128[127-96];


SUBSS

Format: SUBSS xmm1, xmm2/m32

Function: Subtracts the low single-precision numbers

Algorithm:
DEST[31-0] = DEST[31-0] - SRC/m32[31-0];
DEST[63-32] = DEST[63-32];
DEST[95-64] = DEST[95-64];
DEST[127-96] = DEST[127-96];


UCOMISS

Format: UCOMISS xmm1, xmm2/m32

Function: Performs an unordered compare of the low single-precision numbers and sets the EFLAGS status flags

Algorithm:
OF = 0; SF = 0; AF = 0;
IF ((DEST[31-0] UNORD SRC/m32[31-0]) = TRUE) THEN ZF = 1; PF = 1; CF = 1;
ELSEIF ((DEST[31-0] GTRTHAN SRC/m32[31-0]) = TRUE) THEN ZF = 0; PF = 0; CF = 0;
ELSEIF ((DEST[31-0] LESSTHAN SRC/m32[31-0]) = TRUE) THEN ZF = 0; PF = 0; CF = 1;
ELSE ZF = 1; PF = 0; CF = 0;
FI

UNPCKHPS

Format: UNPCKHPS xmm1, xmm2/m128

Function: Interleaves the high two numbers of the two operands

Algorithm:
DEST[31-0] = DEST[95-64];
DEST[63-32] = SRC/m128[95-64];
DEST[95-64] = DEST[127-96];
DEST[127-96] = SRC/m128[127-96];


UNPCKLPS

Format: UNPCKLPS xmm1, xmm2/m128

Function: Interleaves the low two numbers of the two operands

Algorithm:
DEST[31-0] = DEST[31-0];
DEST[63-32] = SRC/m128[31-0];
DEST[95-64] = DEST[63-32];
DEST[127-96] = SRC/m128[63-32];
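The unpack instructions are the building blocks for rearranging data between rows and columns. As a sketch, a 4x4 matrix can be transposed with four unpacks followed by MOVLHPS and MOVHLPS. The helper name transpose4x4 is hypothetical, and the matrix is assumed to be stored row by row in 16 consecutive floats.

```c
#include <xmmintrin.h>

/* Hypothetical helper: transposes a 4x4 row-major float matrix in place.
   Rows are called a, b, c, d in the comments; m must hold 16 floats. */
void transpose4x4(float *m)
{
    __m128 r0 = _mm_loadu_ps(m + 0);
    __m128 r1 = _mm_loadu_ps(m + 4);
    __m128 r2 = _mm_loadu_ps(m + 8);
    __m128 r3 = _mm_loadu_ps(m + 12);

    __m128 t0 = _mm_unpacklo_ps(r0, r1);  /* UNPCKLPS: a0 b0 a1 b1 */
    __m128 t1 = _mm_unpacklo_ps(r2, r3);  /*           c0 d0 c1 d1 */
    __m128 t2 = _mm_unpackhi_ps(r0, r1);  /* UNPCKHPS: a2 b2 a3 b3 */
    __m128 t3 = _mm_unpackhi_ps(r2, r3);  /*           c2 d2 c3 d3 */

    _mm_storeu_ps(m + 0,  _mm_movelh_ps(t0, t1));  /* a0 b0 c0 d0 */
    _mm_storeu_ps(m + 4,  _mm_movehl_ps(t1, t0));  /* a1 b1 c1 d1 */
    _mm_storeu_ps(m + 8,  _mm_movelh_ps(t2, t3));  /* a2 b2 c2 d2 */
    _mm_storeu_ps(m + 12, _mm_movehl_ps(t3, t2));  /* a3 b3 c3 d3 */
}
```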


XORPS

Format: XORPS xmm1, xmm2/m128

Function: Bitwise logical exclusive OR of the two operands

Algorithm:
DEST[127-0] = DEST[127-0] XOR SRC/m128[127-0];
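XORPS is handy for flipping the sign of floating-point values, since XOR with a mask that has only bit 31 set inverts just the sign bit. A minimal sketch, assuming the usual IEEE 754 single-precision layout; the helper name negate_ps is hypothetical.

```c
#include <xmmintrin.h>

/* Hypothetical helper: negates every lane of v by flipping its sign bit. */
__m128 negate_ps(__m128 v)
{
    const __m128 sign_mask = _mm_set1_ps(-0.0f);  /* only bit 31 set in each lane */
    return _mm_xor_ps(v, sign_mask);              /* XORPS */
}
```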

