Since March this year, when I started using a Pentium 200 MMX CPU, I have been thinking about how to use MMX technology to speed up alpha blending, especially in the high-color modes that are now in common use. The conclusion of a discussion on a foreign game-programming mailing list was that MMX is not well suited to 16-bit alpha blending. Let's look at what MMX adds to the ordinary instruction set to understand that argument.
The advantages of MMX are these: its registers are 64 bits wide and support packed operation, so a single instruction can work on 8 bytes, 4 words, or 2 doublewords held in one register, which is convenient for bulk data processing; packed comparisons produce a mask for several elements at once, which helps when handling pixels of the transparent (color-key) color; and an MMX CPU has 8 MMX registers, which to some extent makes up for the 80x86's shortage of registers.
However, it also has many shortcomings: some arithmetic operations are simply not available; the instructions do not affect the flags; constants cannot be given as immediate operands; and the instruction set as a whole is rather sparse (even a NOT operation cannot be done directly).
When the color depth is 24 or 32 bits, R, G, and B each occupy 8 bits, so the packed multiply instructions in MMX can do the alpha blending arithmetic directly. (The plain packed multiplies are just PMULHW and PMULLW, and both multiply packed 16-bit words.) This article is about fast alpha blending in 16-bit color, so that case is not discussed here.
In 16-bit color, red, green, and blue take 5 or 6 bits each and are hard to separate, so these MMX features are difficult to exploit. Of course, another option is the ARGB 4444 layout, with a 4-bit alpha channel and half a byte per color component, where a similar method can be used.
Readers who saw the 16-bit alpha blending optimization I wrote up last year will probably think of moving that algorithm over to MMX. You may already have guessed that this is where the article is heading. The only problem is that we have to face the various defects of the MMX instruction set, which show up one by one in actual programming. Below, Yunfeng introduces the algorithm, along with a few MMX tricks (a separate article on MMX programming techniques will follow).
Let's first look at how the previous algorithm might be optimized further:
The key to alpha blending in 16-bit color is how to separate R, G, and B so that the results of the subsequent multiplications do not interfere with one another.
I proposed expanding the 16-bit rrrrrggggggbbbbb into the 32-bit form 00000gggggg00000rrrrr000000bbbbb, that is, moving green into the high 16 bits. This leaves a gap of 5 to 6 bits between the color components. More than 5 bits of alpha levels would be meaningless anyway, so as long as the alpha value is in the range 0~31, the products of the three components cannot interfere with one another through carries. The cost is that every operation first needs a shift to extend 16 bits to 32 bits, then an AND to zero the bits in between, and the result needs an equally involved inverse operation to get from 32 bits back to 16.
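The expansion described above can be sketched in C roughly as follows. This is only a minimal illustration, not the code from the original library: the names `spread565`, `fold565`, and `blend565` are made up for this example, and the blend is written as `src * alpha + dst * (32 - alpha)` with alpha in 32nds, which is one common way to use the gaps.

```c
#include <assert.h>
#include <stdint.h>

/* Spread an RGB565 pixel into 00000gggggg00000rrrrr000000bbbbb.
   (Hypothetical helper name, chosen for this sketch.) */
static uint32_t spread565(uint16_t c)
{
    uint32_t x = c;
    return (x | (x << 16)) & 0x07E0F81Fu;
}

/* Fold the spread form back into a 16-bit RGB565 pixel. */
static uint16_t fold565(uint32_t x)
{
    x &= 0x07E0F81Fu;
    return (uint16_t)(x | (x >> 16));
}

/* Blend src over dst. alpha is the weight of src in 32nds, 0..31. */
uint16_t blend565(uint16_t src, uint16_t dst, unsigned alpha)
{
    assert(alpha <= 31);
    uint32_t s = spread565(src);
    uint32_t d = spread565(dst);
    /* One multiply per operand scales all three components at once.
       Each product grows by at most 5 bits, which the gaps absorb. */
    uint32_t mixed = (s * alpha + d * (32u - alpha)) >> 5;
    return fold565(mixed);
}
```

With this weighting a fully opaque source is not reachable (alpha stops at 31/32), which matches the 0~31 range given in the text.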
The improved idea is to split two pixels at once, interleaved. The 32-bit word rrrrrggggggbbbbbRRRRRGGGGGGBBBBB is separated into two parts, rrrrr000000bbbbb00000GGGGGG00000 and 00000gggggg00000RRRRR000000BBBBB, and the first part is shifted right by 5 bits to give 00000rrrrr000000bbbbb00000GGGGGG. Each of the two numbers then carries three color components through the arithmetic at the same time. Afterwards, one set of results is shifted right by 5 bits and can then be merged with the other set. This saves several shift operations, and the data can be read 4 bytes at a time and written 4 bytes at a time, so at first glance it looks very efficient. On a traditional 80x86, however, two things get in the way:
First, the CPU does not have enough registers. This method needs four 32-bit registers just to hold the data. EAX, EBX, ECX, and EDX are just enough, but that means the alpha blending cannot be written directly inside the blit loop; it has to be written as a subroutine and called. (Not that it cannot be written, of course. If anyone has done it, I would like to read the code. I have left an interface for it in the Wind Soul game library and described in the comments how the function should be written.)
Second, in 2D games alpha blending is mostly used to draw bitmaps that are not regular rectangles, so there is a transparent-color test as well, and that is not easy to do when pixels are processed two at a time. (It is not impossible, but the code gets longer. :-( )
MMX provides 8 registers and packed comparison, which make up for exactly these two points, and with 64-bit registers 4 pixels can be processed at once. So for now we use MMX to implement the new idea. (If you are interested in using this approach to blend 2 pixels at a time with the traditional instruction set, and have written actual code, please contact me. I would very much like to optimize the non-MMX alpha blending in Wind Soul further.)
With MMX the principle is much the same (fairly simple?): after reading the source and destination pixels, split them into 4 pieces of data held in four registers, blend them as two pairs (each such pair covering 6 color components), and finally merge the results of the two pairs. From here on, though, we run into the shortcomings of MMX. The 8 registers are not enough :-( MMX instructions cannot take a 64-bit constant as an immediate operand, so the masks used for splitting have to be kept in registers.
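The interleaved two-pixel split, which the MMX version widens to four pixels, can be modelled in plain C as a rough sketch. The function name `blend565_x2`, the mask names, and the `src * alpha + dst * (32 - alpha)` weighting are assumptions made for this example. It is meant to show where the shift by 5 goes, not to reproduce the register scheduling discussed above.

```c
#include <assert.h>
#include <stdint.h>

#define MASK_A 0xF81F07E0u  /* r and b of the high pixel, G of the low pixel */
#define MASK_B 0x07E0F81Fu  /* g of the high pixel, R and B of the low pixel */

/* Blend two RGB565 pixels packed in one 32-bit word.
   alpha is the weight of src in 32nds, 0..31. */
uint32_t blend565_x2(uint32_t src, uint32_t dst, unsigned alpha)
{
    assert(alpha <= 31);
    unsigned inv = 32u - alpha;

    /* Part A is shifted right by 5 first, so every field has 5 or
       more free bits above it to hold the product. */
    uint32_t sa = (src & MASK_A) >> 5;
    uint32_t da = (dst & MASK_A) >> 5;
    /* Part B already has its gaps above each field. */
    uint32_t sb = src & MASK_B;
    uint32_t db = dst & MASK_B;

    /* Part A: the weighted sum lands back on the original field
       positions, so only the mask is needed. */
    uint32_t ra = (sa * alpha + da * inv) & MASK_A;
    /* Part B: the weighted sum must be shifted right by 5 before masking. */
    uint32_t rb = ((sb * alpha + db * inv) >> 5) & MASK_B;

    return ra | rb;
}
```

Only one of the two partial results needs the final shift, which is where the saving over the one-pixel version comes from.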
If there were more registers, the inverses of the masks could be kept in registers too, but unfortunately there are none to spare :-(
To handle transparency, a packed compare of the source pixels against the transparent color yields a mask. The blended pixels and the original destination pixels (which must be kept as a backup; I reserved a register for them) are then combined with that mask by logical operations to produce the data finally written to the destination. This takes a lot of NOT operations, and Intel did not provide one in the MMX instruction set @#$%^&! We have to make do with PANDN (invert, then AND). (For example, PCMPEQW MM0, MM0 (comparing a register with itself always gives equal ;-) produces the constant FFFFFFFFFFFFFFFF, and PANDN MM1, MM0 then inverts MM1.)
The MMX packed multiplies can no longer be used here (MMX cannot multiply 32-bit numbers), so the multiplication has to be implemented with shifts and adds. That means writing one blend function per alpha value. Finally, a function pointer array is built with each alpha blend function stored in it, and the appropriate function is called as needed :-)
In Wind Soul 0.07 the alpha blending algorithm was changed once more (0.06 used the algorithm above; 0.07 does not). My thanks to netizen T & P (Tapu@371.net) for the new idea.
For alpha blending with only a few levels, such as 8, a simpler method can be used. Notice that at 50% alpha, R = (R1 + R2) / 2, which is approximately R1 / 2 + R2 / 2. So R, G, and B can all be handled in one go: all that is needed after the shift is a simple AND (0rrrrrggggggbbbb & 0111101111101111 = 0rrrr0ggggg0bbbb), and then the two shifted values are added to complete the blend at alpha = 50%. This method avoids splitting and restoring the data, so it is faster. Early versions of Wind Soul already gave 50% alpha this special treatment.
It does introduce an error, though: the shift discards the lowest bit of each component, a deviation of 1/32 or 1/64 for each color.
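The 50% blend and the color-key selection can be put together in a small portable C model that treats a 64-bit integer as four 16-bit pixels, the way an MMX register would hold them. This is only a sketch: `cmpeq16` and `pandn` are hypothetical stand-ins written here to mimic what PCMPEQW and PANDN compute, and `blend50_keyed_x4` is a name invented for the example.

```c
#include <stdint.h>

/* 0111101111101111 repeated in every 16-bit lane. */
#define HALF_MASK 0x7BEF7BEF7BEF7BEFull

/* Model of PCMPEQW: a lane becomes 0xFFFF where a == b, else 0. */
static uint64_t cmpeq16(uint64_t a, uint64_t b)
{
    uint64_t out = 0;
    for (int lane = 0; lane < 4; ++lane) {
        int shift = lane * 16;
        if (((a >> shift) & 0xFFFF) == ((b >> shift) & 0xFFFF))
            out |= 0xFFFFull << shift;
    }
    return out;
}

/* Model of PANDN dst, src: (NOT dst) AND src. */
static uint64_t pandn(uint64_t dst, uint64_t src)
{
    return ~dst & src;
}

/* Blend four RGB565 source pixels 50/50 onto dst.
   Source pixels equal to key are treated as transparent. */
uint64_t blend50_keyed_x4(uint64_t src, uint64_t dst, uint16_t key)
{
    /* Broadcast the key colour into all four lanes. */
    uint64_t key4 = (uint64_t)key * 0x0001000100010001ull;
    uint64_t transparent = cmpeq16(src, key4);

    /* Halve every component of both operands, then add. The mask
       clears the bits that the shift pushed across field boundaries. */
    uint64_t mixed = ((src >> 1) & HALF_MASK) + ((dst >> 1) & HALF_MASK);

    /* Keep dst where the source is transparent, mixed elsewhere. */
    return (transparent & dst) | pandn(transparent, mixed);
}
```

In this model a white source over a black destination gives 0x7BEF per pixel rather than an exact half, which is the small shift error mentioned above.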
Next, the 50% case can be extended to 25%, 12.5%, or even smaller steps. Consider R1 * 25% + R2 * 75%: it equals R2 + R1 * 25% - R2 * 25% = R2 + R1 / 4 - R2 / 4. Dividing by 4 works on the same principle as dividing by 2: (rrrrrggggggbbbbb >> 2) & 0011100111100111. By the same reasoning, X * 37.5% + Y * 62.5% = (X + Y) / 2 + Y / 8 - X / 8, and so on. Using only shifts, additions, and subtractions, all the color components can be blended at the same time.
Now for the shortcomings of this method. First there is the error problem: each shift can introduce an error of up to about 1/32, and several operations can add their errors together, so the alpha levels cannot be divided very finely. Finer alpha levels also need more operations, and the advantage of working directly on unsplit pixels may be lost. Worse still, if MMX is used for acceleration, the masks for the AND operations should be kept in registers (MMX has no immediate operands, and fetching a mask from memory may miss the cache and be slower, which costs too much in large-scale blending). There are only 8 MMX registers, so with several masks they are clearly not enough, and this is not a good approach when many levels are needed.
For the new alpha sprites in Wind Soul 0.07, this change of algorithm brought a gain of about 10%, while the loss of image quality is almost unnoticeable :-)
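As a sketch of how the shift-based levels and the function pointer array might fit together, here is an 8-level version in C. The function names and the table layout are invented for this example. The mask for dividing by 8 (0001100011100011) is not given in the text; it is derived here by the same rule as the masks for 2 and 4.

```c
#include <assert.h>
#include <stdint.h>

/* Per-component x/2, x/4 and x/8 on a packed RGB565 pixel. */
static uint16_t half565(uint16_t c)    { return (c >> 1) & 0x7BEF; }
static uint16_t quarter565(uint16_t c) { return (c >> 2) & 0x39E7; }
static uint16_t eighth565(uint16_t c)  { return (c >> 3) & 0x18E3; }

typedef uint16_t (*blend_fn)(uint16_t src, uint16_t dst);

/* Level k gives the source a weight of k/8. */
static uint16_t blend_0(uint16_t s, uint16_t d) { (void)s; return d; }
static uint16_t blend_1(uint16_t s, uint16_t d)
{ return (uint16_t)(d - eighth565(d) + eighth565(s)); }
static uint16_t blend_2(uint16_t s, uint16_t d)
{ return (uint16_t)(d - quarter565(d) + quarter565(s)); }
static uint16_t blend_3(uint16_t s, uint16_t d)   /* 37.5% src, 62.5% dst */
{ return (uint16_t)(half565(s) + half565(d) + eighth565(d) - eighth565(s)); }
static uint16_t blend_4(uint16_t s, uint16_t d)
{ return (uint16_t)(half565(s) + half565(d)); }
static uint16_t blend_5(uint16_t s, uint16_t d)   /* 62.5% src, 37.5% dst */
{ return (uint16_t)(half565(s) + half565(d) + eighth565(s) - eighth565(d)); }
static uint16_t blend_6(uint16_t s, uint16_t d)
{ return (uint16_t)(s - quarter565(s) + quarter565(d)); }
static uint16_t blend_7(uint16_t s, uint16_t d)
{ return (uint16_t)(s - eighth565(s) + eighth565(d)); }

/* One entry per alpha level, indexed by the level number. */
static const blend_fn blend_table[8] = {
    blend_0, blend_1, blend_2, blend_3,
    blend_4, blend_5, blend_6, blend_7
};

uint16_t blend_level(uint16_t src, uint16_t dst, unsigned level)
{
    assert(level < 8);
    return blend_table[level](src, dst);
}
```

Each function touches the pixel only with shifts, masks, adds, and subtracts, so no splitting or merging is needed. The three different masks are exactly what becomes a burden when they all have to live in MMX registers.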
Finally, a word on bitmaps with an alpha channel. Every pixel may have a different alpha value, so the bitmap structure should be laid out sensibly. Storing the alpha value together with the color information is not a good bargain, because it gets in the way of fast processing. Instead, the alpha values of all pixels can be stored together. For 16-bit color, a reasonable number of alpha levels is 16 or fewer, which lets each byte hold two alpha values. One register serves as a pointer into the alpha area: read the alpha value for the current pixel and call the corresponding blend function. Because every pixel of such a bitmap may have a different alpha value, several pixels cannot be processed at once. Yunfeng has found another way to speed this up; for the details, wait for the next installment ^_^
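A separate alpha plane with two 4-bit values per byte might be read as in the following sketch. The nibble order (even pixels in the low nibble) and all the names are assumptions made for this example, and for brevity a single parameterized blend stands in for the per-level functions that the text dispatches to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Alpha of pixel i from a plane holding two 4-bit values per byte.
   Assumption: even pixels use the low nibble, odd pixels the high one. */
static unsigned alpha_at(const uint8_t *plane, size_t i)
{
    uint8_t byte = plane[i / 2];
    return (i & 1) ? (unsigned)(byte >> 4) : (unsigned)(byte & 0x0F);
}

/* Blend via the 32-bit spread form. alpha is the weight of src
   in 16ths, 0..15. */
static uint16_t blend565_16(uint16_t src, uint16_t dst, unsigned alpha)
{
    assert(alpha <= 15);
    uint32_t s = (src | ((uint32_t)src << 16)) & 0x07E0F81Fu;
    uint32_t d = (dst | ((uint32_t)dst << 16)) & 0x07E0F81Fu;
    uint32_t m = ((s * alpha + d * (16u - alpha)) >> 4) & 0x07E0F81Fu;
    return (uint16_t)(m | (m >> 16));
}

/* Blend a row of n source pixels onto dst, one alpha per pixel. */
void blend_row(const uint16_t *src, const uint8_t *alpha,
               uint16_t *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = blend565_16(src[i], dst[i], alpha_at(alpha, i));
}
```

The loop handles one pixel per iteration, which is the limitation noted above: with a different alpha for every pixel, the pixels cannot simply be batched.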