Recently I started learning assembly language in order to optimize my engine. I have been at it on and off for more than 20 days and have picked up some experience that I would like to share with you :)

Graphics operations are resource-hungry tigers: they eat a large share of the CPU's clock cycles. To speed up graphics processing in my game engine, I rewrote some of the graphics operations in assembly and used the amazingly powerful MMX instructions. Why do I call them powerful? You will see in a moment. MMX instructions are well suited to bulk operations on long runs of data. An MMX processor provides eight 64-bit registers, and MMX instructions operate on packed groups of data. Graphics processing, which used to handle one pixel at a time, is therefore an ideal fit.

Below I use my alpha blending code to show how powerful MMX is :) I once wrote an article about 16-bit alpha blending on my homepage. At that time I did not know assembly, so the sample program was written in C. When alpha blending a 912x720 area, it reached only 3-4 FPS, which has no practical value. Now that I wield the sword of MMX, after some effort the frame rate has soared to 14 FPS, about 4 times faster! At 640x480, full-screen alpha blending can exceed 20 frames per second!

Why such a big gain? As I said at the start, the MMX registers are 64 bits wide, and there are eight of them, MM0 through MM7. MMX instructions operate on these registers as packed bytes, words, doublewords, or a single quadword. This is the key! In 16-bit alpha blending, one pixel occupies 2 bytes, so one MMX register holds 4 pixels, and a single blending calculation blends 4 pixels at once. That is the reason for the 4x improvement!
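To make the "4 pixels in one register" idea concrete, here is a small sketch in portable C that imitates it with an ordinary 64-bit integer. It is not MMX code: the function names `pack4`, `lane`, and `add_words` are made up for this illustration, and the lane-by-lane loop only mimics what a single packed-word instruction would do in one step.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four 16-bit pixels into one 64-bit value, the way one 64-bit
   load would bring them into an MMX register (pixel 0 in the low word). */
uint64_t pack4(const uint16_t px[4])
{
    return (uint64_t)px[0]
         | (uint64_t)px[1] << 16
         | (uint64_t)px[2] << 32
         | (uint64_t)px[3] << 48;
}

/* Read one 16-bit lane (0..3) back out of the packed value. */
uint16_t lane(uint64_t packed, int i)
{
    assert(i >= 0 && i < 4);
    return (uint16_t)(packed >> (16 * i));
}

/* Add two packed values word by word. Each lane wraps around on its
   own, so a carry never spills into the neighbouring pixel. A packed
   word add does this for all four lanes at once; here it is a loop. */
uint64_t add_words(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t sum = (uint16_t)(lane(a, i) + lane(b, i));
        result |= (uint64_t)sum << (16 * i);
    }
    return result;
}
```

Packing the pixels {1, 2, 3, 4} this way gives 0x0004000300020001, which is the layout the packed-word instructions work on.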
OK, let's take a look at the actual code:

WDDest and WDRES are the destination and source image data pointers. __depth holds the alpha blending depth, with the value 0x0001000100010001 * nDepth. NMMXCount records how many MMX operations are needed. NOT_MMX_POINT is the remainder of the copy rectangle's width divided by 4: if the width is not a multiple of 4 pixels, the leftover pixels have to be processed with ordinary assembly code. Here I only compute this value and do not implement that handling, because there are not enough registers and it would require PUSH and POP. __Mask64 is a mask of type __int64 with the value 0x0001000100010001 * (Short) m_ncolorKey. __mask, 0x001F001F001F001F, is used to filter out the unwanted color bits (because the image is in 555 format).

Note that each MOVQ moves 4 pixels into an MMX register, and they are then operated on as packed words. Before MMX, pixels in the transparent color could simply be left unprocessed; now that 4 pixels are handled at a time, they have to be dealt with explicitly. To keep the transparent parts from affecting the destination I use a little trick; the details are in my code comments. This is something I came up with myself :) To see the complete code, download my complete japplib.
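For readers without the library at hand, the arithmetic can be sketched in plain C. This is only an illustration under assumptions of my own, not the MMX routine from japplib: the names `blend555` and `blend_row` are hypothetical, the depth is assumed to run from 0 (destination only) to 32 (source only), and the select-by-mask step for the color key is a common technique that may differ from the trick used in the real code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Blend one 555 pixel. ASSUMPTION: depth ranges from 0 to 32.
   A source pixel equal to color_key leaves the destination unchanged. */
uint16_t blend555(uint16_t dst, uint16_t src, unsigned depth,
                  uint16_t color_key)
{
    assert(depth <= 32);
    unsigned out = 0;
    for (int shift = 0; shift <= 10; shift += 5) {
        unsigned d = (dst >> shift) & 0x1F;   /* isolate one 5-bit channel */
        unsigned s = (src >> shift) & 0x1F;
        unsigned c = (d * (32u - depth) + s * depth) >> 5;
        out |= c << shift;
    }
    /* Branch-free color key: keep is all ones where src is transparent,
       so the destination passes through and the blended value is dropped. */
    unsigned keep = (src == color_key) ? 0xFFFFu : 0u;
    return (uint16_t)(((dst & keep) | (out & ~keep)) & 0xFFFFu);
}

/* Blend one row: full groups of 4 pixels first (one 64-bit register's
   worth each), then the leftover width % 4 pixels one by one. */
void blend_row(uint16_t *dst, const uint16_t *src, size_t width,
               unsigned depth, uint16_t color_key)
{
    size_t groups = width / 4;   /* number of 4-pixel steps */
    size_t rest   = width % 4;   /* pixels that do not fill a group */
    size_t i = 0;
    for (size_t g = 0; g < groups; g++)
        for (int k = 0; k < 4; k++, i++)
            dst[i] = blend555(dst[i], src[i], depth, color_key);
    for (size_t r = 0; r < rest; r++, i++)
        dst[i] = blend555(dst[i], src[i], depth, color_key);
}
```

In an MMX version the inner loop over 4 pixels collapses into a handful of packed-word instructions, which is where the speedup comes from; the leftover loop is the part that still needs ordinary code.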