MMX development documentation
I MMX Introduction
Intel's MMX™ technology is an extension of the Intel Architecture (IA) instruction set. It uses Single Instruction, Multiple Data (SIMD) techniques to process multiple data elements in parallel, which speeds up multimedia and communication software. The MMX instruction set adds 57 new opcodes and a new 64-bit quadword data type.
MMX technology improves many applications, such as motion video, video conferencing, two-dimensional graphics, and 3D graphics. Almost every application with repetitive, sequential integer calculations can benefit from it. Performance improves for 8-bit, 16-bit and 32-bit data elements. One MMX instruction operates on 8 bytes at a time, and two instructions can complete in one clock cycle, so up to 16 data elements can be processed per clock cycle. MMX technology also frees processor cycles for other work. Applications that previously needed additional hardware can now run in software alone. Lower processor utilization makes room for a higher degree of concurrency, which many of today's operating systems exploit. In Intel's benchmark analyses, some functions ran 50% to 400% faster. A gain of this order is comparable to moving to a new processor generation. In software kernels the speedup is larger still, at three to five times the original speed.
MMX disadvantages: MMX arithmetic instructions operate on packed data, so MMX code needs many more pack and unpack instructions than ordinary assembly. If the data is not laid out neatly, packing costs a lot of time. Because of this packing requirement, MMX instructions are not general-purpose, which is a major drawback. MMX instructions are most effective on 16-bit data, and processing 8-bit data takes some tricks. For 32-bit data, MMX instructions give little speedup once the time spent on packing is taken into account.
II MMX Basic Instruction Set
For details, please refer to Chapter 5 of the "Intel Architecture MMX Technology Programmer's Reference Manual".
2.1 Copy instructions
MOVQ: 64-bit data copy. If the memory address is 8-byte aligned, it is performed as one 64-bit write; otherwise as two 32-bit writes.
MOVD: 32-bit data copy. Note: when copying from memory into an MMX register, the upper 32 bits of the MMX register are cleared.
2.2 Pack instructions
The pack and unpack instructions are unique to MMX, so they deserve special attention. They fall into two basic classes: unsigned and signed. They are introduced below:
1 PUNPCKLBW / PUNPCKLWD / PUNPCKLDQ (L means the low-order half; BW is 8-bit, WD is 16-bit, DQ is 32-bit): these simply interleave the low-order elements of two MMX registers into one 64-bit value, so they cannot convert long data to short data.
2 PACKUSWB converts signed 16-bit data into unsigned 8-bit data with saturation, so two MMX registers can be combined into one 64-bit value.
3 PACKSSWB / PACKSSDW convert 16-bit data to 8-bit and 32-bit data to 16-bit respectively; all of the data is signed.
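To make the behavior of these instructions concrete, here is a small portable C sketch that models PUNPCKLBW and PACKUSWB on plain arrays. It is a scalar model, not MMX code: the function names `model_punpcklbw`, `model_packuswb` and `saturate_u8` are hypothetical, and index 0 is assumed to be the least significant element of the register.

```c
#include <stdint.h>

/* Scalar model of PUNPCKLBW dst, src: interleave the low four bytes of
   dst and src. Index 0 is the least significant byte of the register. */
void model_punpcklbw(uint8_t dst[8], const uint8_t src[8])
{
    uint8_t out[8];
    for (int i = 0; i < 4; i++) {
        out[2 * i]     = dst[i];  /* even positions come from dst */
        out[2 * i + 1] = src[i];  /* odd positions come from src  */
    }
    for (int i = 0; i < 8; i++)
        dst[i] = out[i];
}

/* Clamp a signed 16-bit value to the unsigned 8-bit range 0..255. */
static uint8_t saturate_u8(int16_t v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}

/* Scalar model of PACKUSWB dst, src: the four signed words of dst fill
   the low four result bytes, the four words of src the high four. */
void model_packuswb(uint8_t out[8], const int16_t dst[4], const int16_t src[4])
{
    for (int i = 0; i < 4; i++) {
        out[i]     = saturate_u8(dst[i]);
        out[i + 4] = saturate_u8(src[i]);
    }
}
```

Unpacking bytes against an all-zero source, as in `model_punpcklbw(a, zero)`, is the usual way to widen unsigned bytes to words; PACKUSWB then narrows the results back after the arithmetic.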
2.3 Arithmetic instructions
Addition: PADDB / PADDW / PADDD: addition without overflow protection. On overflow, the high bits beyond the element range are discarded; B, W and D denote 8-bit, 16-bit and 32-bit addition. PADDSB / PADDSW: signed addition with overflow protection (saturation). For 16-bit data the result is 0x7FFF on overflow and 0x8000 on underflow. PADDUSB / PADDUSW: unsigned addition with overflow protection. For 16-bit data the result is 0xFFFF on overflow; for the corresponding subtraction it is 0x0 on underflow. The subtraction instructions work the same way; replace ADD with SUB.
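The difference between wraparound and saturating addition is easiest to see on a single 16-bit lane. The following C sketch models one element of PADDW, PADDSW, PADDUSW and PSUBUSW; the function names are hypothetical, and a real MMX instruction applies the same rule to all four words of the register at once.

```c
#include <stdint.h>

/* One lane of PADDW: wraparound, bits above bit 15 are discarded. */
uint16_t lane_paddw(uint16_t a, uint16_t b)
{
    return (uint16_t)(a + b);
}

/* One lane of PADDSW: signed addition with saturation. */
int16_t lane_paddsw(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) return INT16_MAX;  /* overflow  -> 0x7FFF */
    if (sum < INT16_MIN) return INT16_MIN;  /* underflow -> 0x8000 */
    return (int16_t)sum;
}

/* One lane of PADDUSW: unsigned addition with saturation. */
uint16_t lane_paddusw(uint16_t a, uint16_t b)
{
    uint32_t sum = (uint32_t)a + (uint32_t)b;
    return sum > UINT16_MAX ? UINT16_MAX : (uint16_t)sum;  /* -> 0xFFFF */
}

/* One lane of PSUBUSW: unsigned subtraction, clamped at zero. */
uint16_t lane_psubusw(uint16_t a, uint16_t b)
{
    return a > b ? (uint16_t)(a - b) : 0;
}
```

For example, `lane_paddw(0xFFFF, 2)` wraps to 1, while `lane_paddusw(0xFFFF, 2)` stays at 0xFFFF.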
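Multiplication: PMULLW / PMULHW multiply four pairs of 16-bit values. PMULLW keeps the low 16 bits of each result, and PMULHW keeps the high 16 bits. PMADDWD is the multiply-and-add instruction: it multiplies four pairs of 16-bit values and adds adjacent products to give two 32-bit results.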
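As a rough illustration of the multiply instructions, the C sketch below models one lane of PMULLW and PMULHW and the whole of PMADDWD. The function names are hypothetical, index 0 is assumed to be the least significant word, and the PMADDWD model ignores the single case where all four inputs of a pair are 0x8000, in which the hardware result wraps around.

```c
#include <stdint.h>

/* One lane of PMULLW: low 16 bits of the signed 16x16 product. */
uint16_t lane_pmullw(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    return (uint16_t)((uint32_t)p & 0xFFFFu);
}

/* One lane of PMULHW: high 16 bits of the signed 16x16 product.
   p - lo is an exact multiple of 65536, so the division is exact. */
int16_t lane_pmulhw(int16_t a, int16_t b)
{
    int32_t p  = (int32_t)a * (int32_t)b;
    int32_t lo = (int32_t)((uint32_t)p & 0xFFFFu);
    return (int16_t)((p - lo) / 65536);
}

/* Model of PMADDWD: multiply four word pairs, then add adjacent
   products to form two 32-bit sums. */
void model_pmaddwd(int32_t out[2], const int16_t a[4], const int16_t b[4])
{
    for (int i = 0; i < 2; i++)
        out[i] = (int32_t)a[2 * i] * b[2 * i]
               + (int32_t)a[2 * i + 1] * b[2 * i + 1];
}
```

A full 32-bit product can be rebuilt by interleaving the PMULLW and PMULHW results with PUNPCKLWD, which is why the two instructions are usually used together.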
2.4 Logic instructions, shift instructions, and the EMMS instruction
See the "Intel Architecture MMX Technology Programmer's Reference Manual".
III Classic MMX Processing Strategies
1 Data input and output:
For input, the classic approach is to load a whole array into an MMX register at once. This takes advantage of MMX's 64-bit reads and writes and improves performance. Output works the same way: store the whole 64-bit MMX register at once.
If the data really cannot be handled as an array, use the shift instructions. For example, to copy the four 16-bit values in an MMX register to separate memory variables (or 16-bit general-purpose registers) X1, X2, X3, X4:
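MOVD EAX, MM1 // EAX = words 1 and 0 of MM1
PSRLQ MM1, 32 // move words 3 and 2 into the low half
MOVD EBX, MM1 // EBX = words 3 and 2
MOV X1, AX // X1 = word 0
MOV X2, BX // X2 = word 2
SHR EAX, 16
SHR EBX, 16
MOV X3, AX // X3 = word 1
MOV X4, BX // X4 = word 3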
As this shows, input and output become very cumbersome when the data is not in array form.
2 Data packing, absolute values, and similar methods:
For details, please refer to Chapter 5 of the "Intel Architecture MMX Technology Developer's Manual".
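IV Custom instruction combinations
1 Shifting 8-bit unsigned data:
The MMX instruction set has no shift instruction for 8-bit data, but we sometimes need one. The following two instructions do the job. (The mask is written inline for brevity; PAND takes its source operand from an MMX register or from memory, so the constant has to be stored in one of those.)
PSRLQ MM0, 1 // shift all 64 bits right by one
PAND MM0, 0x7F7F7F7F7F7F7F7F // clear the bit that crossed into each byte from its neighbor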
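The same shift-then-mask idea can be tried out in portable C on a 64-bit integer standing in for the MMX register. This is a sketch under that assumption; the function name `shift_bytes_right` and its generalization to any shift count from 0 to 7 are not from the original text.

```c
#include <stdint.h>

/* Shift each of the eight bytes packed in v right by n bits (0 <= n <= 7).
   The 64-bit shift plays the role of PSRLQ, the mask the role of PAND. */
uint64_t shift_bytes_right(uint64_t v, unsigned n)
{
    /* Replicate the per-byte mask (0x7F for n = 1) into all eight bytes. */
    uint64_t lane_mask = (uint64_t)(0xFFu >> n) * UINT64_C(0x0101010101010101);
    return (v >> n) & lane_mask;
}
```

With n = 1 the mask is 0x7F7F7F7F7F7F7F7F, which matches the two-instruction sequence above.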
2 How to prevent overflow during calculation:
For example, in the calculation (x1 + x2 + 1) >> 1, the sum x1 + x2 can overflow an 8-bit element, so we have to use an alternative such as (x1 >> 1) + (x2 >> 1). This form is inexact. It can be used when high precision is not needed, but if an error of 1 in the result cannot be tolerated, correct it as follows. Here MM0 and MM1 are taken to hold copies of x1 and x2, and MMX stands for the register that holds (x1 >> 1) + (x2 >> 1):
PAND MM0, 0x0101010101010101 // keep the lowest bit of each byte
PAND MM1, 0x0101010101010101 // keep the lowest bit of each byte
POR MM0, MM1
PADDUSB MMX, MM0 // correct the result
(x1 >> 2) + (x2 >> 2): this treatment is general. Here MMX holds (x1 >> 2) + (x2 >> 2):
PAND MM0, 0x0303030303030303 // keep the lowest two bits of each byte
PAND MM1, 0x0303030303030303 // keep the lowest two bits of each byte
PADDUSB MM0, MM1
PSRLQ MM0, 2
PAND MM0, 0x3F3F3F3F3F3F3F3F
PADDUSB MMX, MM0 // correct the result
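Both corrections can be checked in portable C, with a 64-bit integer standing in for an MMX register holding eight bytes. This is a sketch under that assumption; the names `LANES`, `shr_bytes`, `avg_round_bytes` and `quarter_sum_bytes` are hypothetical.

```c
#include <stdint.h>

/* Replicate one byte value into all eight byte lanes. */
#define LANES(b) ((uint64_t)(b) * UINT64_C(0x0101010101010101))

/* Per-byte logical shift right by n (0..7): 64-bit shift plus mask. */
static uint64_t shr_bytes(uint64_t v, unsigned n)
{
    return (v >> n) & LANES(0xFFu >> n);
}

/* Per-byte (x1 + x2 + 1) >> 1 without any intermediate 9-bit sum. */
uint64_t avg_round_bytes(uint64_t x1, uint64_t x2)
{
    uint64_t halves = shr_bytes(x1, 1) + shr_bytes(x2, 1); /* each lane <= 254 */
    uint64_t fix    = (x1 | x2) & LANES(0x01);             /* 1 if either is odd */
    return halves + fix;                                   /* each lane <= 255 */
}

/* Per-byte (x1 + x2) >> 2, again without overflowing a lane. */
uint64_t quarter_sum_bytes(uint64_t x1, uint64_t x2)
{
    uint64_t quarters = shr_bytes(x1, 2) + shr_bytes(x2, 2);    /* <= 126 */
    uint64_t low      = (x1 & LANES(0x03)) + (x2 & LANES(0x03)); /* <= 6   */
    return quarters + shr_bytes(low, 2);                         /* carry of the low bits */
}
```

The first function works because the dropped low bits contribute a rounding carry exactly when at least one input is odd, which is what the POR of the two low bits computes. Since no lane can exceed 255, plain addition is enough in this model, and the saturation of PADDUSB never triggers.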
3 Sign extension:
MM0: *, *, a, b => after sign extension MM0: (sign of a) a, (sign of b) b
PXOR MM1, MM1 // MM1 = 0
PCMPGTW MM1, MM0 // compare 0 > MM0 per word; MM1: (*), (*), (sign of a), (sign of b)
PUNPCKLWD MM0, MM1 // interleave each low word with its sign mask
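The sign-extension idiom (build a mask that is all ones for negative words, then interleave it above the data) can be modeled in C as follows. This is a scalar sketch with a hypothetical function name; index 0 is assumed to be the least significant word, so `in[0]` plays the role of b and `in[1]` the role of a.

```c
#include <stdint.h>

/* Sign-extend the two low 16-bit words of in[] to 32 bits each.
   out[] receives four words: low word, its sign mask, next word, its
   sign mask. out must not alias in. */
void sign_extend_low_words(uint16_t out[4], const uint16_t in[4])
{
    uint16_t sign[4];

    /* Compare-greater-than against zero: 0xFFFF where the word is
       negative when read as signed, 0x0000 otherwise. */
    for (int i = 0; i < 4; i++)
        sign[i] = (in[i] & 0x8000u) ? 0xFFFFu : 0x0000u;

    /* Unpack low words: interleave data words with their sign masks. */
    out[0] = in[0];  out[1] = sign[0];
    out[2] = in[1];  out[3] = sign[1];
}
```

Reading `out[0]` and `out[1]` together as one 32-bit value gives the sign-extended b, and `out[2]` with `out[3]` gives the sign-extended a.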
4 Packing with shift and logic instructions
In addition to the basic pack instructions, we can use the shift instructions together with PAND and POR to achieve the effect of packing. The shift mainly serves to create empty positions, so that POR MM0, MM1 can merge MM0 and MM1.
For example: MM0 (*, *, a, b), MM1 (0, 0, c, d)
PSLLQ MM0, 32 // MM0 = (a, b, 0, 0)
POR MM0, MM1 // MM0 = (a, b, c, d)
Of course, this example could also be implemented with an ordinary unpack instruction, but in some more complex processing this approach is necessary.
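For reference, the shift-and-OR merge looks like this in portable C, with a 64-bit integer standing in for each MMX register. The function name `merge_low_halves` is hypothetical, and words are listed most significant first, as in the text.

```c
#include <stdint.h>

/* mm0 holds (*, *, a, b) and mm1 holds (0, 0, c, d), most significant
   word first. Returns (a, b, c, d). */
uint64_t merge_low_halves(uint64_t mm0, uint64_t mm1)
{
    mm0 <<= 32;        /* like PSLLQ MM0, 32: (a, b, 0, 0) */
    return mm0 | mm1;  /* like POR MM0, MM1:  (a, b, c, d) */
}
```

The merge is only correct when the upper half of mm1 is already zero; otherwise a PAND with a suitable mask would be needed first.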
In short, combine the existing MMX instructions flexibly to build the operations you need.
V MMX programming experience
The purpose of using MMX technology is to increase speed, so we have to pay special attention to making the code as efficient as possible. Here are some points that need attention.
1 Increase memory access throughput as much as possible. Consider the code below:
for (j = 0; j < h; j++)
{
    d[0] = s[0]; d[1] = s[1]; d[2] = s[2]; d[3] = s[3];
    d[4] = s[4]; d[5] = s[5]; d[6] = s[6]; d[7] = s[7];
    d[8] = s[8]; d[9] = s[9]; d[10] = s[10]; d[11] = s[11];
    d[12] = s[12]; d[13] = s[13]; d[14] = s[14]; d[15] = s[15];
    s += lx2;
    d += lx;
}
__asm
{
    PUSHF
    MOV EDX, DWORD PTR h // row count
    XOR ECX, ECX // row counter
    MOV ESI, DWORD PTR s // source pointer
    MOV EDI, DWORD PTR d // destination pointer
    MOV EAX, lx2 // source stride
    MOV EBX, lx // destination stride
again:
    MOVQ MM0, QWORD PTR [ESI]
    MOVQ MM1, QWORD PTR [ESI + 8]
    MOVQ QWORD PTR [EDI], MM0
    MOVQ QWORD PTR [EDI + 8], MM1
    ADD ESI, EAX
    ADD EDI, EBX
    ADD ECX, 1
    CMP ECX, EDX
    JL again
    EMMS
    POPF
}
Simply replacing the sixteen 8-bit writes with 64-bit writes improved the measured speed by 25%. For the same reason, several MOVQ instructions should be written together wherever possible, which gained a further 5%. The original C code is itself quite efficient. It does not address the array with [][] but advances the pointers with s += lx2, which turns two-dimensional array addressing into one-dimensional addressing. Reducing addressing complexity in this way is also an effective technique. Another point: changing the original element-by-element assignment to memcpy() increases the speed by approximately 10%. This is again a matter of improving data throughput.
2 Points that need attention:
1. Use static variables as much as possible. Accessing such variables is very fast, about as fast as using an immediate.
2. Since there is only one MMX shifter unit, the shift, pack and unpack instructions cannot be paired with each other.
3. Do not write EAX and then use AX immediately, and do not write MM1 and then use it immediately.
4. Reading a register and then writing it is fine: MOVQ MM2, MM1 followed by MOVQ MM1, MM3 can be issued back to back.
5. Write several (four or more) MOVQ instructions together where possible, provided that they do not use the same MMX register.
6. Memory access through addressing such as MOV EAX, [ESI] (or [ESI + 2 * EAX]) is particularly slow.
7. Loops are also slow (MOV CX, N followed by LOOP is very slow); unroll the loop if possible.
8. Complete operations in registers as much as possible and avoid accessing memory.
9. Accessing variables by name, especially static variables, is very fast.
10. The slowdown from accessing data that is not aligned on an 8-byte boundary is greater than the slowdown from instructions that fail to pair.
11. Consequently, the traditional optimization of precomputing an array and turning the operation into a table lookup will sometimes reduce speed with MMX technology. (If a lookup table really is needed, addressing it as a base address plus an offset is recommended.)
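Returning to the row copy discussed under point 1 of this section, the memcpy() variant mentioned there might look like the following C sketch. The function name `copy_rows16` and the use of `size_t` for the strides are assumptions for illustration; the speed figures quoted in the text apply to the author's measurements, not to this sketch.

```c
#include <stddef.h>
#include <string.h>

/* Copy h rows of 16 bytes each. After every row the source pointer
   advances by lx2 bytes and the destination pointer by lx bytes, as in
   the loop shown earlier in this section. */
void copy_rows16(unsigned char *d, const unsigned char *s,
                 int h, size_t lx, size_t lx2)
{
    for (int j = 0; j < h; j++) {
        memcpy(d, s, 16);  /* one block copy instead of 16 byte stores */
        s += lx2;
        d += lx;
    }
}
```

A compiler is free to expand the fixed-size memcpy() into wide loads and stores, which is the same effect the hand-written MOVQ version aims for.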