The previous part covered MMX optimization techniques for integer computation. However, the graphics and sound processing that really consume computing power are mostly floating-point work, and the demand for floating-point performance keeps rising. Intel therefore added the SSE instructions for floating-point operations in the Pentium III, so every program that uses SSE instructions must run on a Pentium III, an Athlon XP, or a later processor.
SSE defines eight new 128-bit registers, XMM0-XMM7, twice as wide as the 64-bit MMX registers. Each register holds four 32-bit floating-point numbers. Because these are new registers, there is no need to switch between the MMX registers and the original floating-point registers, which improves execution efficiency. It is worth noting that SSE also includes some instructions for 16-bit and 8-bit integers, but this is not the mainstream use of SSE.
By the way, a word about Intel Compiler 8.0: this compiler is really powerful. My personal impression is that its code runs about 10-20% faster than code from Visual C 6.0 SP6, and it can optimize for different CPUs. If you have a P4-series CPU, adding the options /fast /QxW /Qip /Qunroll40 when compiling can give surprisingly good results, and if you read the user manual and adapt your program following the methods it describes, you will gain even more. I recommend this compiler to everyone who pursues extreme optimization. Enough digression; back to SSE, starting with a simple example:
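Using inline assembly in VC:

    __declspec(align(16)) float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};  // MOVAPS needs 16-byte alignment
    __declspec(align(16)) float b[4] = {5.0f, 6.0f, 7.0f, 8.0f};
    __asm {
        lea    ecx, a          // ecx = address of a
        lea    edx, b          // edx = address of b
        movaps xmm0, [ecx]     // load the four floats of a
        movaps xmm1, [edx]     // load the four floats of b
        addps  xmm0, xmm1      // four additions at once
        movaps [ecx], xmm0     // store the sums back into a
    }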
As with MMX, you can also use SSE without writing assembly: include a single header file and call the intrinsics directly from C.

    #include <xmmintrin.h>

    __m128 a = _mm_set_ps(1, 2, 3, 4);
    __m128 b = _mm_set_ps(5, 6, 7, 8);
    a = _mm_add_ps(a, b);

This time I find the intrinsics more convenient, because they also provide many composite operations built from several instructions.
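To see what the intrinsics version actually computes, here is a minimal sketch that stores the result back to memory. The function name add_example is only an illustration. Note that _mm_set_ps takes its arguments from the highest lane down, so the lowest lane of a holds 4.0f.

```c
#include <xmmintrin.h>

/* Adds (1,2,3,4) and (5,6,7,8) as in the text and writes the four sums
   to out[0..3] in memory order (lowest lane first). */
void add_example(float out[4])
{
    /* Arguments run from lane 3 down to lane 0: lane 0 of a is 4.0f. */
    __m128 a = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
    __m128 b = _mm_set_ps(5.0f, 6.0f, 7.0f, 8.0f);
    a = _mm_add_ps(a, b);
    _mm_storeu_ps(out, a);   /* unaligned store, so out needs no special alignment */
}
```

If the memory order matters, _mm_setr_ps takes the same values in lowest-lane-first order.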
The list below covers the SSE instructions fairly thoroughly; for complete details, download the instruction manual from Intel. The following part is based on: http://dwbclz.myetang.com/articles/piii/sse-ins-ref.html
ADDPS
Format: ADDPS xmm1, xmm2/m128
Function: Adds two sets of four single-precision numbers
Algorithm:
DEST[31-0] = DEST[31-0] + SRC/m128[31-0];
DEST[63-32] = DEST[63-32] + SRC/m128[63-32];
DEST[95-64] = DEST[95-64] + SRC/m128[95-64];
DEST[127-96] = DEST[127-96] + SRC/m128[127-96];
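In practice ADDPS is used in a loop over arrays, four elements per step. The following is a sketch under simple assumptions: the name add_arrays is hypothetical, unaligned loads and stores are used so the caller need not align anything, and a scalar loop handles the leftover elements.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* dst[i] = x[i] + y[i] for i in [0, n). Pointers need not be aligned. */
void add_arrays(float *dst, const float *x, const float *y, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(vx, vy));   /* one ADDPS, four sums */
    }
    for (; i < n; ++i)          /* scalar tail for the last n % 4 elements */
        dst[i] = x[i] + y[i];
}
```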
ADDSS
Format: ADDSS xmm1, xmm2/m32
Function: Adds the lowest single-precision numbers; the upper three are unchanged
Algorithm:
DEST[31-0] = DEST[31-0] + SRC/m32[31-0];
DEST[63-32] = DEST[63-32];
DEST[95-64] = DEST[95-64];
DEST[127-96] = DEST[127-96];
ANDNPS
Format: ANDNPS xmm1, xmm2/m128
Function: Inverts xmm1 bitwise (NOT), then ANDs it with xmm2/m128
Algorithm:
DEST[127-0] = NOT (DEST[127-0]) AND SRC/m128[127-0];
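A typical use of ANDNPS is clearing the sign bit to get an absolute value. This is a small sketch, and abs_ps is a name chosen here for illustration. It relies on the fact that -0.0f has only bit 31 set.

```c
#include <xmmintrin.h>

/* Returns |v| lane by lane by clearing each sign bit. */
__m128 abs_ps(__m128 v)
{
    const __m128 sign_mask = _mm_set1_ps(-0.0f);   /* only bit 31 set in each lane */
    return _mm_andnot_ps(sign_mask, v);            /* (NOT sign_mask) AND v */
}
```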
ANDPS
Format: ANDPS xmm1, xmm2/m128
Function: Bitwise logical AND of the two operands
Algorithm:
DEST[127-0] = DEST[127-0] AND SRC/m128[127-0];
CMPPS
Format: CMPPS xmm1, xmm2/m128, imm8
Function: Compares the four pairs of numbers, using the comparison selected by imm8
imm8 = 0: equal (==); imm8 = 1: less than (<); imm8 = 2: less than or equal (<=); imm8 = 3: unordered; imm8 = 4: not equal (!=); imm8 = 5: not less than (!<); imm8 = 6: not less than or equal (!<=); imm8 = 7: ordered
Algorithm:
IF (imm8 = 0) THEN OP = "EQ";
ELSEIF (imm8 = 1) THEN OP = "LT";
ELSEIF (imm8 = 2) THEN OP = "LE";
ELSEIF (imm8 = 3) THEN OP = "UNORD";
ELSEIF (imm8 = 4) THEN OP = "NE";
ELSEIF (imm8 = 5) THEN OP = "NLT";
ELSEIF (imm8 = 6) THEN OP = "NLE";
ELSEIF (imm8 = 7) THEN OP = "ORD";
FI
CMP0 = DEST[31-0] OP SRC/m128[31-0];
CMP1 = DEST[63-32] OP SRC/m128[63-32];
CMP2 = DEST[95-64] OP SRC/m128[95-64];
CMP3 = DEST[127-96] OP SRC/m128[127-96];
IF (CMP0 = TRUE) THEN DEST[31-0] = 0xFFFFFFFF; ELSE DEST[31-0] = 0x00000000; FI
IF (CMP1 = TRUE) THEN DEST[63-32] = 0xFFFFFFFF; ELSE DEST[63-32] = 0x00000000; FI
IF (CMP2 = TRUE) THEN DEST[95-64] = 0xFFFFFFFF; ELSE DEST[95-64] = 0x00000000; FI
IF (CMP3 = TRUE) THEN DEST[127-96] = 0xFFFFFFFF; ELSE DEST[127-96] = 0x00000000; FI
Others: The following more readable instructions can be used instead
Instruction                  Implemented as
CMPEQPS xmm1, xmm2           CMPPS xmm1, xmm2, 0
CMPLTPS xmm1, xmm2           CMPPS xmm1, xmm2, 1
CMPLEPS xmm1, xmm2           CMPPS xmm1, xmm2, 2
CMPUNORDPS xmm1, xmm2        CMPPS xmm1, xmm2, 3
CMPNEQPS xmm1, xmm2          CMPPS xmm1, xmm2, 4
CMPNLTPS xmm1, xmm2          CMPPS xmm1, xmm2, 5
CMPNLEPS xmm1, xmm2          CMPPS xmm1, xmm2, 6
CMPORDPS xmm1, xmm2          CMPPS xmm1, xmm2, 7
CMPSS
Format: CMPSS xmm1, xmm2/m32, imm8
Function: Compares the lowest single-precision numbers
Algorithm: Similar to CMPPS, but applied only to DEST[31-0].
The following more readable instructions can also be used.
Instruction                  Implemented as
CMPEQSS xmm1, xmm2           CMPSS xmm1, xmm2, 0
CMPLTSS xmm1, xmm2           CMPSS xmm1, xmm2, 1
CMPLESS xmm1, xmm2           CMPSS xmm1, xmm2, 2
CMPUNORDSS xmm1, xmm2        CMPSS xmm1, xmm2, 3
CMPNEQSS xmm1, xmm2          CMPSS xmm1, xmm2, 4
CMPNLTSS xmm1, xmm2          CMPSS xmm1, xmm2, 5
CMPNLESS xmm1, xmm2          CMPSS xmm1, xmm2, 6
CMPORDSS xmm1, xmm2          CMPSS xmm1, xmm2, 7
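The all-ones or all-zeros masks produced by the packed comparisons are usually combined with ANDPS, ANDNPS, and ORPS to choose between two values without branching. Here is a minimal sketch of that pattern; select_lt is a hypothetical helper name.

```c
#include <xmmintrin.h>

/* Lane by lane: (a < b) ? x : y, without branching. */
__m128 select_lt(__m128 a, __m128 b, __m128 x, __m128 y)
{
    __m128 mask = _mm_cmplt_ps(a, b);            /* all ones where a < b, else all zeros */
    return _mm_or_ps(_mm_and_ps(mask, x),        /* keep x where the mask is set */
                     _mm_andnot_ps(mask, y));    /* keep y everywhere else */
}
```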
COMISS
Format: COMISS xmm1, xmm2/m32
Function: Compares the lowest single-precision numbers and sets the EFLAGS flags
Algorithm:
OF = 0; SF = 0; AF = 0;
IF ((DEST[31-0] UNORD SRC/m32[31-0]) = TRUE) THEN ZF = 1; PF = 1; CF = 1;
ELSEIF ((DEST[31-0] GTRTHAN SRC/m32[31-0]) = TRUE) THEN ZF = 0; PF = 0; CF = 0;
ELSEIF ((DEST[31-0] LESSTHAN SRC/m32[31-0]) = TRUE) THEN ZF = 0; PF = 0; CF = 1;
ELSE ZF = 1; PF = 0; CF = 0;
FI
CVTPI2PS
Format: CVTPI2PS xmm, mm/m64
Function: Converts two 32-bit integers to floating-point numbers
Algorithm:
DEST[31-0] = (float) (SRC/m64[31-0]);
DEST[63-32] = (float) (SRC/m64[63-32]);
DEST[95-64] = DEST[95-64];
DEST[127-96] = DEST[127-96];
CVTPS2PI
Format: CVTPS2PI mm, xmm/m64
Function: Converts the low two floating-point numbers to integers
Algorithm:
DEST[31-0] = (int) (SRC/m64[31-0]);
DEST[63-32] = (int) (SRC/m64[63-32]);
CVTSI2SS
Format: CVTSI2SS xmm, r/m32
Function: Converts a 32-bit integer to a floating-point number and stores it in the lowest position
Algorithm:
DEST[31-0] = (float) (r/m32);
DEST[63-32] = DEST[63-32];
DEST[95-64] = DEST[95-64];
DEST[127-96] = DEST[127-96];
CVTSS2SI
Format: CVTSS2SI r32, xmm/m32
Function: Converts the lowest floating-point number to a 32-bit integer
Algorithm:
r32 = (int) (SRC/m32[31-0]);
CVTTPS2PI
Format: CVTTPS2PI mm, xmm/m64
Function: Converts the low two floating-point numbers to integers, with truncation
Algorithm:
DEST[31-0] = (int) (SRC/m64[31-0]);
DEST[63-32] = (int) (SRC/m64[63-32]);
CVTTSS2SI
Format: CVTTSS2SI r32, xmm/m32
Function: Converts the lowest floating-point number to an integer, with truncation
Algorithm:
r32 = (int) (SRC/m32[31-0]);
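The pseudocode for CVTSS2SI and CVTTSS2SI looks identical, so a short sketch may help show the difference: the first rounds according to the current MXCSR rounding mode (round-to-nearest-even by default), the second always truncates toward zero. The two function names are only illustrative.

```c
#include <xmmintrin.h>

/* CVTSS2SI: rounds according to MXCSR (round-to-nearest-even by default). */
int round_to_int(float f)
{
    return _mm_cvtss_si32(_mm_set_ss(f));
}

/* CVTTSS2SI: always truncates toward zero, like a C cast. */
int truncate_to_int(float f)
{
    return _mm_cvttss_si32(_mm_set_ss(f));
}
```

With the default rounding mode, round_to_int(2.7f) gives 3 while truncate_to_int(2.7f) gives 2.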
DIVPS
Format: DIVPS xmm1, xmm2/m128
Function: Divides four single-precision numbers
Algorithm:
DEST[31-0] = DEST[31-0] / (SRC/m128[31-0]);
DEST[63-32] = DEST[63-32] / (SRC/m128[63-32]);
DEST[95-64] = DEST[95-64] / (SRC/m128[95-64]);
DEST[127-96] = DEST[127-96] / (SRC/m128[127-96]);
DIVSS
Format: DIVSS xmm1, xmm2/m32
Function: Divides the lowest single-precision numbers
Algorithm:
DEST[31-0] = DEST[31-0] / (SRC/m32[31-0]);
DEST[63-32] = DEST[63-32];
DEST[95-64] = DEST[95-64];
DEST[127-96] = DEST[127-96];
EMMS
Format: EMMS
Function: Empties the MMX state so that the floating-point registers can be used again
Algorithm:
FPUTagWord <- 0xFFFF
FXRSTOR
Format: FXRSTOR m512byte
Function: Restores the FP, MMX, and SSE state from m512byte
Algorithm:
FP and MMX state and Streaming SIMD Extension state = m512byte;
FXSAVE
Format: FXSAVE m512byte
Function: Saves the FP, MMX, and SSE state to m512byte
Algorithm:
m512byte = FP and MMX state and Streaming SIMD Extension state;
LDMXCSR
Format: LDMXCSR m32
Function: Loads the SSE control/status word (MXCSR)
Algorithm:
MXCSR = m32;
MAXPS
Format: MAXPS xmm1, xmm2/m128
Function: Returns the maximum of each pair of numbers
Algorithm:
IF (DEST[31-0] = NaN) THEN DEST[31-0] = SRC[31-0];
ELSEIF (SRC[31-0] = NaN) THEN DEST[31-0] = SRC[31-0];
ELSEIF (DEST[31-0] > SRC/m128[31-0]) THEN DEST[31-0] = DEST[31-0];
ELSE DEST[31-0] = SRC/m128[31-0];
FI
IF (DEST[63-32] = NaN) THEN DEST[63-32] = SRC[63-32];
ELSEIF (SRC[63-32] = NaN) THEN DEST[63-32] = SRC[63-32];
ELSEIF (DEST[63-32] > SRC/m128[63-32]) THEN DEST[63-32] = DEST[63-32];
ELSE DEST[63-32] = SRC/m128[63-32];
FI
IF (DEST[95-64] = NaN) THEN DEST[95-64] = SRC[95-64];
ELSEIF (SRC[95-64] = NaN) THEN DEST[95-64] = SRC[95-64];
ELSEIF (DEST[95-64] > SRC/m128[95-64]) THEN DEST[95-64] = DEST[95-64];
ELSE DEST[95-64] = SRC/m128[95-64];
FI
IF (DEST[127-96] = NaN) THEN DEST[127-96] = SRC[127-96];
ELSEIF (SRC[127-96] = NaN) THEN DEST[127-96] = SRC[127-96];
ELSEIF (DEST[127-96] > SRC/m128[127-96]) THEN DEST[127-96] = DEST[127-96];
ELSE DEST[127-96] = SRC/m128[127-96];
FI
MAXSS
Format: MAXSS xmm1, xmm2/m32
Function: Returns the maximum of the lowest numbers
Algorithm: Similar to MAXPS, except that it operates only on DEST[31-0]
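MINPS
Format: MINPS xmm1, xmm2/m128
Function: Returns the minimum of each pair of numbers
Algorithm: Similar to MAXPS, with the comparison reversed so that the smaller number is kept
MINSS
Format: MINSS xmm1, xmm2/m32
Function: Returns the minimum of the lowest numbers
Algorithm: Similar to MINPS, except that it operates only on DEST[31-0]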
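MAXPS and MINPS together give a branch-free clamp, which is common in graphics and audio code. The sketch below assumes the element count is a multiple of 4; clamp_array is a hypothetical name.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Clamps data[0..n) to [lo, hi] in place; n is assumed to be a multiple of 4. */
void clamp_array(float *data, size_t n, float lo, float hi)
{
    const __m128 vlo = _mm_set1_ps(lo);
    const __m128 vhi = _mm_set1_ps(hi);
    for (size_t i = 0; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);
        v = _mm_max_ps(v, vlo);    /* raise anything below lo */
        v = _mm_min_ps(v, vhi);    /* lower anything above hi */
        _mm_storeu_ps(data + i, v);
    }
}
```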
MOVAPS
Format: MOVAPS xmm1, xmm2/m128 or MOVAPS xmm2/m128, xmm1
Function: Transfers aligned data (memory operands must be 16-byte aligned)
Algorithm:
IF (destination = DEST) THEN
  IF (SRC = m128) THEN (* load instruction *) DEST[127-0] = m128;
  ELSE (* move instruction *) DEST[127-0] = SRC[127-0];
  FI;
ELSE
  IF (destination = m128) THEN (* store instruction *) m128 = SRC[127-0];
  ELSE (* move instruction *) DEST[127-0] = SRC[127-0];
  FI;
FI;
MOVHLPS
Format: MOVHLPS xmm1, xmm2
Function: Moves the high two numbers of the source to the low two positions of the destination
Algorithm:
DEST[127-64] = DEST[127-64];
DEST[63-0] = SRC[127-64];
MOVHPS
Format: MOVHPS xmm, m64 or MOVHPS m64, xmm
Function: Transfers the high 64 bits
Algorithm:
IF (destination = DEST) THEN (* load instruction *)
  DEST[127-64] = m64;
  DEST[31-0] = DEST[31-0];
  DEST[63-32] = DEST[63-32];
ELSE (* store instruction *)
  m64 = SRC[127-64];
FI;
MOVLPS
Format: MOVLPS xmm, m64 or MOVLPS m64, xmm
Function: Transfers the low 64 bits
Algorithm:
IF (destination = DEST) THEN (* load instruction *)
  DEST[63-0] = m64;
  DEST[95-64] = DEST[95-64];
  DEST[127-96] = DEST[127-96];
ELSE (* store instruction *)
  m64 = SRC[63-0];
FI
MOVLHPS
Format: MOVLHPS xmm1, xmm2
Function: Moves the low two numbers of the source to the high two positions of the destination
Algorithm:
DEST[127-64] = SRC[63-0];
DEST[63-0] = DEST[63-0];
MOVMSKPS
Format: MOVMSKPS r32, xmm
Function: Copies the sign bit of each of the four numbers into the low bits of a 32-bit register
Algorithm:
r32[0] = SRC[31];
r32[1] = SRC[63];
r32[2] = SRC[95];
r32[3] = SRC[127];
r32[7-4] = 0x0;
r32[15-8] = 0x00;
r32[31-16] = 0x0000;
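MOVMSKPS is the usual bridge from a SIMD result back to ordinary integer code, either to test signs directly or to check a comparison mask. A small sketch follows; both function names are made up for illustration.

```c
#include <xmmintrin.h>

/* Returns a 4-bit mask; bit i is set when the sign bit of lane i is set
   (negative values, and also -0.0f). */
int negative_lanes(__m128 v)
{
    return _mm_movemask_ps(v);
}

/* Returns 1 if any lane of a is greater than the matching lane of b. */
int any_greater(__m128 a, __m128 b)
{
    return _mm_movemask_ps(_mm_cmpgt_ps(a, b)) != 0;
}
```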
MOVNTPS
Format: MOVNTPS m128, xmm
Function: Stores data directly to memory, bypassing the cache to reduce cache pressure
Algorithm:
m128 = SRC;
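MOVNTPS is worthwhile mainly when writing a large buffer that will not be read again soon. The following is a rough sketch under two assumptions: the destination is 16-byte aligned and the element count is a multiple of 4. The name fill_stream is hypothetical.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Fills dst[0..n) with value using non-temporal stores.
   Assumes dst is 16-byte aligned and n is a multiple of 4. */
void fill_stream(float *dst, size_t n, float value)
{
    const __m128 v = _mm_set1_ps(value);
    for (size_t i = 0; i + 4 <= n; i += 4)
        _mm_stream_ps(dst + i, v);   /* MOVNTPS: write around the cache */
    _mm_sfence();                    /* order the streamed stores before later stores */
}
```

For small buffers that are reused immediately, ordinary stores are usually the better choice, since the data would otherwise have to be fetched from memory again.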
MOVSS
Format: MOVSS xmm1, xmm2/m32 or MOVSS xmm2/m32, xmm1
Function: Transfers the lowest single-precision number
Algorithm:
IF (destination = DEST) THEN
  IF (SRC = m32) THEN (* load instruction *)
    DEST[31-0] = m32;
    DEST[63-32] = 0x00000000;
    DEST[95-64] = 0x00000000;
    DEST[127-96] = 0x00000000;
  ELSE (* move instruction *)
    DEST[31-0] = SRC[31-0];
    DEST[63-32] = DEST[63-32];
    DEST[95-64] = DEST[95-64];
    DEST[127-96] = DEST[127-96];
  FI
ELSE
  IF (destination = m32) THEN (* store instruction *)
    m32 = SRC[31-0];
  ELSE (* move instruction *)
    DEST[31-0] = SRC[31-0];
    DEST[63-32] = DEST[63-32];
    DEST[95-64] = DEST[95-64];
    DEST[127-96] = DEST[127-96];
  FI
FI
MOVUPS
Format: MOVUPS xmm1, xmm2/m128 or MOVUPS xmm2/m128, xmm1
Function: Transfers unaligned data
Algorithm:
IF (destination = xmm) THEN
  IF (SRC = m128) THEN (* load instruction *) DEST[127-0] = m128;
  ELSE (* move instruction *) DEST[127-0] = SRC[127-0];
  FI
ELSE
  IF (destination = m128) THEN (* store instruction *) m128 = SRC[127-0];
  ELSE (* move instruction *) DEST[127-0] = SRC[127-0];
  FI
FI
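One case where MOVUPS cannot be avoided is reading from two addresses that are one float apart, since at most one of them can be 16-byte aligned. The sketch below computes adjacent differences this way; adjacent_diff is a name chosen here, not something from the original article.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* out[i] = x[i + 1] - x[i] for every i with i + 1 < n. */
void adjacent_diff(float *out, const float *x, size_t n)
{
    size_t i = 0;
    /* x + i and x + i + 1 differ by 4 bytes, so both loads use MOVUPS. */
    for (; i + 5 <= n; i += 4) {
        __m128 lo = _mm_loadu_ps(x + i);
        __m128 hi = _mm_loadu_ps(x + i + 1);
        _mm_storeu_ps(out + i, _mm_sub_ps(hi, lo));
    }
    for (; i + 1 < n; ++i)      /* scalar tail */
        out[i] = x[i + 1] - x[i];
}
```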
MULPS
Format: MULPS xmm1, xmm2/m128
Function: Multiplies four single-precision numbers
Algorithm:
DEST[31-0] = DEST[31-0] * SRC/m128[31-0];
DEST[63-32] = DEST[63-32] * SRC/m128[63-32];
DEST[95-64] = DEST[95-64] * SRC/m128[95-64];
DEST[127-96] = DEST[127-96] * SRC/m128[127-96];
MULSS
Format: MULSS xmm1, xmm2/m32
Function: Multiplies the lowest single-precision numbers
Algorithm:
DEST[31-0] = DEST[31-0] * SRC/m32[31-0];
DEST[63-32] = DEST[63-32];
DEST[95-64] = DEST[95-64];
DEST[127-96] = DEST[127-96];
ORPS
Format: ORPS xmm1, xmm2/m128
Function: Bitwise logical OR
Algorithm:
DEST[127-0] = DEST[127-0] OR SRC/m128[127-0];
RCPPS
Format: RCPPS xmm1, xmm2/m128
Function: Computes approximate reciprocals
Algorithm:
DEST[31-0] = APPROX (1.0 / (SRC/m128[31-0]));
DEST[63-32] = APPROX (1.0 / (SRC/m128[63-32]));
DEST[95-64] = APPROX (1.0 / (SRC/m128[95-64]));
DEST[127-96] = APPROX (1.0 / (SRC/m128[127-96]));
RCPSS
Format: RCPSS xmm1, xmm2/m32
Function: Computes the approximate reciprocal of the lowest number
Algorithm:
DEST[31-0] = APPROX (1.0 / (SRC/m32[31-0]));
DEST[63-32] = DEST[63-32];
DEST[95-64] = DEST[95-64];
DEST[127-96] = DEST[127-96];
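The reciprocal approximation is fast but only accurate to roughly 12 bits. When more precision is needed, a common technique (not described in the list above) is to refine the estimate with one Newton-Raphson step, y1 = y0 * (2 - x * y0). A minimal sketch, with reciprocal_nr as an illustrative name:

```c
#include <xmmintrin.h>

/* Approximates 1/x per lane: RCPPS estimate plus one Newton-Raphson step,
   y1 = y0 * (2 - x * y0). */
__m128 reciprocal_nr(__m128 x)
{
    __m128 y0 = _mm_rcp_ps(x);                                     /* rough estimate */
    __m128 t  = _mm_sub_ps(_mm_set1_ps(2.0f), _mm_mul_ps(x, y0));  /* 2 - x * y0 */
    return _mm_mul_ps(y0, t);                                      /* refined estimate */
}
```

The same idea applies to RSQRTPS, with the step y1 = y0 * (1.5 - 0.5 * x * y0 * y0).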
RSQRTPS
Format: RSQRTPS xmm1, xmm2/m128
Function: Computes approximate reciprocals of the square roots
Algorithm:
DEST[31-0] = APPROX (1.0 / SQRT (SRC/m128[31-0]));
DEST[63-32] = APPROX (1.0 / SQRT (SRC/m128[63-32]));
DEST[95-64] = APPROX (1.0 / SQRT (SRC/m128[95-64]));
DEST[127-96] = APPROX (1.0 / SQRT (SRC/m128[127-96]));
RSQRTSS
Format: RSQRTSS xmm1, xmm2/m32
Function: Computes the approximate reciprocal of the square root of the lowest number
Algorithm:
DEST[31-0] = APPROX (1.0 / SQRT (SRC/m32[31-0]));
DEST[63-32] = DEST[63-32];
DEST[95-64] = DEST[95-64];
DEST[127-96] = DEST[127-96];
SHUFPS
Format: SHUFPS xmm1, xmm2/m128, imm8
Function: Shuffles the numbers: the low two results are selected from xmm1 and the high two from xmm2/m128, as directed by imm8
Algorithm:
FP_SELECT = (imm8 >> 0) AND 0x3;
IF (FP_SELECT = 0) THEN DEST[31-0] = DEST[31-0];
ELSEIF (FP_SELECT = 1) THEN DEST[31-0] = DEST[63-32];
ELSEIF (FP_SELECT = 2) THEN DEST[31-0] = DEST[95-64];
ELSE DEST[31-0] = DEST[127-96];
FI
FP_SELECT = (imm8 >> 2) AND 0x3;
IF (FP_SELECT = 0) THEN DEST[63-32] = DEST[31-0];
ELSEIF (FP_SELECT = 1) THEN DEST[63-32] = DEST[63-32];
ELSEIF (FP_SELECT = 2) THEN DEST[63-32] = DEST[95-64];
ELSE DEST[63-32] = DEST[127-96];
FI
FP_SELECT = (imm8 >> 4) AND 0x3;
IF (FP_SELECT = 0) THEN DEST[95-64] = SRC/m128[31-0];
ELSEIF (FP_SELECT = 1) THEN DEST[95-64] = SRC/m128[63-32];
ELSEIF (FP_SELECT = 2) THEN DEST[95-64] = SRC/m128[95-64];
ELSE DEST[95-64] = SRC/m128[127-96];
FI
FP_SELECT = (imm8 >> 6) AND 0x3;
IF (FP_SELECT = 0) THEN DEST[127-96] = SRC/m128[31-0];
ELSEIF (FP_SELECT = 1) THEN DEST[127-96] = SRC/m128[63-32];
ELSEIF (FP_SELECT = 2) THEN DEST[127-96] = SRC/m128[95-64];
ELSE DEST[127-96] = SRC/m128[127-96];
FI
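SHUFPS is easier to follow with a couple of concrete uses. The sketch below shows two common ones, broadcasting one lane and summing the four lanes, using the _MM_SHUFFLE macro from xmmintrin.h to build imm8. The function names are illustrative only.

```c
#include <xmmintrin.h>

/* Copies lane 2 of v into all four lanes. */
__m128 broadcast_lane2(__m128 v)
{
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));
}

/* Returns v[0] + v[1] + v[2] + v[3]. */
float horizontal_sum(__m128 v)
{
    __m128 swapped = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); /* (v1, v0, v3, v2) */
    __m128 pairs   = _mm_add_ps(v, swapped);        /* (v0+v1, v0+v1, v2+v3, v2+v3) */
    __m128 high    = _mm_movehl_ps(pairs, pairs);   /* upper pair moved to the low lanes */
    return _mm_cvtss_f32(_mm_add_ss(pairs, high));  /* (v0+v1) + (v2+v3) */
}
```

_MM_SHUFFLE lists its selectors from the highest result lane down to the lowest, matching the bit layout of imm8.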
SQRTPS
Format: SQRTPS xmm1, xmm2/m128
Function: Computes four square roots
Algorithm:
DEST[31-0] = SQRT (SRC/m128[31-0]);
DEST[63-32] = SQRT (SRC/m128[63-32]);
DEST[95-64] = SQRT (SRC/m128[95-64]);
DEST[127-96] = SQRT (SRC/m128[127-96]);
SQRTSS
Format: SQRTSS xmm1, xmm2/m32
Function: Computes the square root of the lowest number
Algorithm:
DEST[31-0] = SQRT (SRC/m32[31-0]);
DEST[63-32] = DEST[63-32];
DEST[95-64] = DEST[95-64];
DEST[127-96] = DEST[127-96];
STMXCSR
Format: STMXCSR m32
Function: Stores the SSE control/status word (MXCSR)
Algorithm:
m32 = MXCSR;
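STMXCSR and LDMXCSR are typically used as a save/modify/restore pair, for example to change the rounding mode temporarily. Here is a minimal sketch using the _mm_getcsr and _mm_setcsr intrinsics; the function name is hypothetical, and the volatile variables are only there to discourage the compiler from folding or moving the conversion.

```c
#include <xmmintrin.h>

/* Converts f to int with the MXCSR rounding mode temporarily set to
   round-toward-zero, then restores the previous control word. */
int convert_round_toward_zero(float f)
{
    volatile float in = f;                          /* keep the conversion at run time */
    unsigned int saved = _mm_getcsr();              /* STMXCSR: save the control word */
    _MM_SET_ROUNDING_MODE(_MM_ROUND_TOWARD_ZERO);   /* LDMXCSR with new rounding bits */
    volatile int out = _mm_cvtss_si32(_mm_set_ss(in));
    _mm_setcsr(saved);                              /* LDMXCSR: restore the old state */
    return out;
}
```

For a plain truncating conversion, CVTTSS2SI does the same job without touching MXCSR; the point here is only the save and restore pattern.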
SUBPS
Format: SUBPS xmm1, xmm2/m128
Function: Subtracts four single-precision numbers
Algorithm:
DEST[31-0] = DEST[31-0] - SRC/m128[31-0];
DEST[63-32] = DEST[63-32] - SRC/m128[63-32];
DEST[95-64] = DEST[95-64] - SRC/m128[95-64];
DEST[127-96] = DEST[127-96] - SRC/m128[127-96];
SUBSS
Format: SUBSS xmm1, xmm2/m32
Function: Subtracts the lowest single-precision numbers
Algorithm:
DEST[31-0] = DEST[31-0] - SRC/m32[31-0];
DEST[63-32] = DEST[63-32];
DEST[95-64] = DEST[95-64];
DEST[127-96] = DEST[127-96];
UCOMISS
Format: UCOMISS xmm1, xmm2/m32
Function: Compares the lowest single-precision numbers and sets the EFLAGS flags (unordered compare: unlike COMISS, a quiet NaN operand does not raise an invalid-operation exception)
Algorithm:
OF = 0; SF = 0; AF = 0;
IF ((DEST[31-0] UNORD SRC/m32[31-0]) = TRUE) THEN ZF = 1; PF = 1; CF = 1;
ELSEIF ((DEST[31-0] GTRTHAN SRC/m32[31-0]) = TRUE) THEN ZF = 0; PF = 0; CF = 0;
ELSEIF ((DEST[31-0] LESSTHAN SRC/m32[31-0]) = TRUE) THEN ZF = 0; PF = 0; CF = 1;
ELSE ZF = 1; PF = 0; CF = 0;
FI
UNPCKHPS
Format: UNPCKHPS xmm1, xmm2/m128
Function: Interleaves the high two numbers of the two operands
Algorithm:
DEST[31-0] = DEST[95-64];
DEST[63-32] = SRC/m128[95-64];
DEST[95-64] = DEST[127-96];
DEST[127-96] = SRC/m128[127-96];
UNPCKLPS
Format: UNPCKLPS xmm1, xmm2/m128
Function: Interleaves the low two numbers of the two operands
Algorithm:
DEST[31-0] = DEST[31-0];
DEST[63-32] = SRC/m128[31-0];
DEST[95-64] = DEST[63-32];
DEST[127-96] = SRC/m128[63-32];
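A typical job for the two unpack instructions is merging separate x and y arrays into interleaved pairs. The sketch below assumes the element count is a multiple of 4; interleave_xy is a name invented for this example.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Writes xy = {x[0], y[0], x[1], y[1], ...}; n is assumed to be a multiple of 4. */
void interleave_xy(float *xy, const float *x, const float *y, size_t n)
{
    for (size_t i = 0; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(xy + 2 * i,     _mm_unpacklo_ps(vx, vy));  /* x0 y0 x1 y1 */
        _mm_storeu_ps(xy + 2 * i + 4, _mm_unpackhi_ps(vx, vy));  /* x2 y2 x3 y3 */
    }
}
```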
XORPS
Format: XORPS xmm1, xmm2/m128
Function: Bitwise logical exclusive OR (XOR)
Algorithm:
DEST[127-0] = DEST[127-0] XOR SRC/m128[127-0];
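Because XOR flips exactly the bits that are set in the mask, XORPS with a sign-bit mask negates floating-point numbers without any arithmetic. A short sketch, with negate_ps as an illustrative name:

```c
#include <xmmintrin.h>

/* Negates every lane by flipping its sign bit. */
__m128 negate_ps(__m128 v)
{
    const __m128 sign_mask = _mm_set1_ps(-0.0f);   /* only bit 31 set in each lane */
    return _mm_xor_ps(v, sign_mask);
}
```

XORing a register with itself is also the usual way to zero it, which is what _mm_setzero_ps generally compiles to.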