32-bit code optimization common sense
There are plenty of articles about code optimization. Unfortunately, I haven't read most of them, even though they sit at my bedside (whenever I try to read one, I just can't help it... heh heh)...
The meaning of code optimization:
The goal of code optimization is, of course, code that is both small and fast. In general, though, the two are like the fish and the bear's paw: you can rarely have both. We usually settle for a compromise, and which side to favor depends on our actual needs.
Still, there is some common sense worth keeping in mind. Let's go through the situations we run into most often:
1. Clearing a register to 0

I absolutely don't want to see the following:

1)  mov eax, 00000000h      ; 5 bytes

It looks logical enough, but you should know the better-optimized forms:

2)  sub eax, eax            ; 2 bytes
3)  xor eax, eax            ; 2 bytes

Look at the byte counts and you will understand why. Besides the size, there is no loss in speed: they are just as fast. But do you prefer XOR or SUB? I prefer XOR, for a very simple reason: I'm not good at math... But Microsoft likes SUB... and we know Windows is slow... (Oh, of course that's not the real reason!)
2. Testing whether a register is 0

I don't want to see the following code either:

1)  cmp eax, 00000000h      ; 5 bytes
    je  _label_             ; 2/6 bytes (short/near)

[* Note: many instructions have a shorter encoding for EAX, so use EAX whenever you can. For example, cmp eax, 12345678h takes 5 bytes; with any other register it takes 6 bytes. *]

So a simple comparison costs 7/11 bytes. No, no, no. Try the following:

2)  or  eax, eax            ; 2 bytes
    je  _label_             ; 2/6 bytes (short/near)

3)  test eax, eax           ; 2 bytes
    je  _label_             ; 2/6 bytes (short/near)

Now it is only 4/8 bytes, a saving of 3 bytes either way. The next question is whether you prefer OR or TEST. I personally prefer TEST, because TEST doesn't write its result to any register, which usually means faster execution on the Pentium. Don't celebrate too soon, though, because there is an even better trick. If all you need is to check whether EAX is zero, look at the following. Isn't it more inspired?

4)  xchg eax, ecx           ; 1 byte
    jecxz _label_           ; 2 bytes

With a short jump this saves 1 byte compared with 2) and 3). Keep in mind that XCHG swaps the two registers, so the old value of ECX ends up in EAX.
3. Testing whether a register is 0FFFFFFFFh

Some APIs return -1 (0FFFFFFFFh), so how do you test for this value? You might write this:

1)  cmp eax, 0FFFFFFFFh     ; 5 bytes
    je  _label_             ; 2/6 bytes

Hey, don't do that; think before you write code. Here is another way:

2)  inc eax                 ; 1 byte
    je  _label_             ; 2/6 bytes
    dec eax                 ; 1 byte

This saves 3 bytes and executes faster.
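The INC/JE/DEC trick can be checked with a small model. This is only a sketch in Python, not part of the original article: registers are modeled as unsigned 32-bit integers, and the helper names (inc32, dec32, is_minus_one) are made up for illustration.

```python
MASK32 = 0xFFFFFFFF  # registers are modeled as unsigned 32-bit integers


def inc32(value):
    """Model `inc eax`: return (new value, ZF)."""
    result = (value + 1) & MASK32
    return result, result == 0


def dec32(value):
    """Model `dec eax`: return (new value, ZF)."""
    result = (value - 1) & MASK32
    return result, result == 0


def is_minus_one(eax):
    """Model `inc eax / je _label_ / dec eax`.

    Returns (jump taken?, value left in EAX).
    """
    eax, zf = inc32(eax)
    if zf:
        return True, eax       # je taken: EAX is now 0, not -1
    eax, _ = dec32(eax)        # fall-through path restores the original value
    return False, eax
```

One thing the model makes visible: on the taken branch EAX holds 0 rather than 0FFFFFFFFh, so code at the jump target should not expect the original value.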
4. Setting a register to 0FFFFFFFFh

Suppose you are the author of an API: how do you return -1? Like this?

1)  mov eax, 0FFFFFFFFh     ; 5 bytes

After reading the above you won't be that XXX, will you? Take a look:

2)  xor eax, eax            ; 2 bytes (or sub eax, eax)
    dec eax                 ; 1 byte

That saves a word (2 bytes)! Another way to write it:

3)  stc                     ; 1 byte
    sbb eax, eax            ; 2 bytes

Sometimes this can be trimmed by one more byte, namely when the carry flag is already known to be set:

    jnc _label_
    sbb eax, eax            ; 2 bytes only!
_label_:
    ...

Why do we use ASM? This is the reason.
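Why SBB EAX, EAX produces -1 when the carry flag is set may not be obvious at first sight. The following Python sketch (an illustration added here, with made-up helper names) models the arithmetic: SBB computes dest - src - CF, so subtracting a register from itself leaves just -CF.

```python
MASK32 = 0xFFFFFFFF


def sbb32(dest, src, cf):
    """Model `sbb dest, src`: dest - src - CF, truncated to 32 bits."""
    return (dest - src - int(cf)) & MASK32


def xor_dec():
    """Model `xor eax, eax / dec eax`."""
    eax = 0                      # xor eax, eax
    return (eax - 1) & MASK32    # dec eax


def stc_sbb(eax):
    """Model `stc / sbb eax, eax` for any starting value of EAX."""
    cf = True                    # stc
    return sbb32(eax, eax, cf)   # sbb eax, eax
```

Whatever EAX held before, stc_sbb returns 0FFFFFFFFh, the same value xor_dec produces. With CF clear, the same SBB would give 0.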
5. Clearing a register and loading a value into its low word

1)  xor eax, eax                    ; 2 bytes
    mov ax, word ptr [esi+xx]       ; 4 bytes

No! This is probably how most beginners write it. I certainly did, until I read Benny's article and decided to rewrite it:

2)  movzx eax, word ptr [esi+xx]    ; 4 bytes

A gain of 2 bytes! Likewise, the following

3)  xor eax, eax                    ; 2 bytes
    mov al, byte ptr [esi+xx]       ; 3 bytes

becomes:

4)  movzx eax, byte ptr [esi+xx]    ; 4 bytes

and

5)  xor eax, eax                    ; 2 bytes
    mov ax, bx                      ; 3 bytes

becomes:

6)  movzx eax, bx                   ; 3 bytes

We should use MOVZX wherever possible, because it is not slow and usually saves bytes.
6. About PUSH

The following optimizes code size only, because register operations are always faster than memory operations.

1)  mov eax, 50h            ; 5 bytes

This is one word (2 bytes) smaller:

2)  push 50h                ; 2 bytes
    pop eax                 ; 1 byte

When the operand fits in a single (signed) byte, PUSH takes only 2 bytes; otherwise it takes 5 bytes. Remember that!

Next problem: push seven zeros onto the stack.

3)  push 0                  ; 2 bytes
    push 0                  ; 2 bytes
    push 0                  ; 2 bytes
    push 0                  ; 2 bytes
    push 0                  ; 2 bytes
    push 0                  ; 2 bytes
    push 0                  ; 2 bytes

That takes 14 bytes, which is clearly unsatisfactory. Optimize it:

4)  xor eax, eax            ; 2 bytes
    push eax                ; 1 byte
    push eax                ; 1 byte
    push eax                ; 1 byte
    push eax                ; 1 byte
    push eax                ; 1 byte
    push eax                ; 1 byte
    push eax                ; 1 byte

It can be made even more compact, although it will be slower:

5)  push 7                  ; 2 bytes
    pop ecx                 ; 1 byte
_label_:
    push 0                  ; 2 bytes
    loop _label_            ; 2 bytes

That saves 7 bytes compared with 3)...

Sometimes you need to move a value from one memory location to another while preserving all registers:

6)  push eax                ; 1 byte
    mov eax, [ebp+xxxx]     ; 6 bytes
    mov [ebp+xxxx], eax     ; 6 bytes
    pop eax                 ; 1 byte

Try PUSH and POP instead:
7)  push dword ptr [ebp+xxxx]   ; 6 bytes
    pop  dword ptr [ebp+xxxx]   ; 6 bytes

7. Multiplication

EAX already holds a value and you want to multiply it by 28h. How do you write that?

1)  mov ecx, 28h            ; 5 bytes
    mul ecx                 ; 2 bytes

A better way is the following:

2)  push 28h                ; 2 bytes
    pop ecx                 ; 1 byte
    mul ecx                 ; 2 bytes

Wow, this one is even better:

3)  imul eax, eax, 28h      ; 3 bytes

Intel doesn't add new instructions to new CPUs as decoration; they are there for you to use.
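The MUL and IMUL versions are interchangeable only as far as EAX is concerned. The sketch below (Python, added for illustration, helper names invented) models both: MUL ECX writes the 64-bit product to EDX:EAX, while the three-operand IMUL keeps only the low 32 bits and leaves EDX and ECX alone.

```python
MASK32 = 0xFFFFFFFF


def mul_ecx(eax, ecx):
    """Model `mul ecx`: unsigned EAX * ECX, returned as (EDX, EAX)."""
    product = eax * ecx
    return (product >> 32) & MASK32, product & MASK32


def imul_eax_imm(eax, imm):
    """Model `imul eax, eax, imm`: only the low 32 bits of the product are kept."""
    return (eax * imm) & MASK32
```

The low 32 bits of a product are the same for signed and unsigned multiplication, which is why the EAX results agree. If the code that follows relies on EDX holding the high half, or on ECX holding 28h, the IMUL form is not a drop-in replacement.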
8. String operations

How do you fetch a byte (or word, or dword) from memory? The fast way:

1)  mov al/ax/eax, [esi]    ; 2/3/2 bytes
    inc esi                 ; 1 byte

The small way:

2)  lodsb/w/d               ; 1 byte (2 for lodsw, which needs a 66h prefix in 32-bit code)

I prefer LODS because it is small, even though it is slower.

How do you find the end of a string? JQwerty's method:

3)  lea esi, [ebp+asciz]    ; 6 bytes
s_check:
    lodsb                   ; 1 byte
    test al, al             ; 2 bytes
    jne s_check             ; 2 bytes

Super's method:

4)  lea edi, [ebp+asciz]    ; 6 bytes
    xor al, al              ; 2 bytes
s_check:
    scasb                   ; 1 byte
    jne s_check             ; 2 bytes

Which one is faster? Super's is faster on the 386, JQwerty's is faster on the Pentium. The choice is yours.
9. Complex ...

Suppose you have a table of DWORDs, EBX points to the start of the table, ECX is an index into it, and you want to add 1 to the DWORD at that index. Here is one way:

1)  pushad                          ; 1 byte
    imul ecx, ecx, 4                ; 3 bytes
    add ebx, ecx                    ; 2 bytes
    inc dword ptr [ebx]             ; 2 bytes
    popad                           ; 1 byte

This can be optimized quite a bit, yet nobody seems to use it:

2)  inc dword ptr [ebx+4*ecx]       ; 3 bytes

One instruction saves 6 bytes, is faster and is easier to read, yet nobody seems to use it... why? You can have an immediate displacement too:

3)  pushad                          ; 1 byte
    imul ecx, ecx, 4                ; 3 bytes
    add ebx, ecx                    ; 2 bytes
    add ebx, 1000h                  ; 6 bytes
    inc dword ptr [ebx]             ; 2 bytes
    popad                           ; 1 byte

Optimized:

4)  inc dword ptr [ebx+4*ecx+1000h] ; 7 bytes

Saves 8 bytes!

Now look at the LEA instruction. What can it do for us?

    lea eax, [12345678h]

What ends up in EAX? The correct answer is 12345678h. Now suppose EBP = 1:

    lea eax, [ebp+12345678h]

The result is 12345679h... heh. Compare:

    lea eax, [ebp+12345678h]        ; 6 bytes
    ==========================
    mov eax, 12345678h              ; 5 bytes
    add eax, ebp                    ; 2 bytes

5)  Take a look:

    mov eax, 12345678h              ; 5 bytes
    add eax, ebp                    ; 2 bytes
    imul ecx, 4                     ; 3 bytes
    add eax, ecx                    ; 2 bytes

6)  Use LEA for the calculation and you gain in size:

    lea eax, [ebp+ecx*4+12345678h]  ; 7 bytes

On top of that, LEA is faster and doesn't affect the flags. Remember the following format; using it wherever you can saves time and space:

    opcode [base + index*scale + displacement]
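What LEA computes is just the address arithmetic base + index*scale + displacement, without touching memory. Here is a minimal Python sketch of that calculation (added for illustration; the function names are invented, and registers are modeled as unsigned 32-bit integers).

```python
MASK32 = 0xFFFFFFFF


def lea(base=0, index=0, scale=1, disp=0):
    """Model `lea reg, [base + index*scale + disp]` for 32-bit registers."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 only supports scale factors 1, 2, 4 and 8")
    return (base + index * scale + disp) & MASK32


def mov_add_imul_add(ebp, ecx, disp):
    """The four-instruction sequence from example 5), step by step."""
    eax = disp                      # mov eax, 12345678h
    eax = (eax + ebp) & MASK32      # add eax, ebp
    ecx = (ecx * 4) & MASK32        # imul ecx, 4
    return (eax + ecx) & MASK32     # add eax, ecx
```

For example, lea(base=1, disp=0x12345678) gives 0x12345679, and lea(base=0x1000, index=3, scale=4, disp=0x12345678) matches mov_add_imul_add(0x1000, 3, 0x12345678). Unlike the longer sequence, the real LEA also leaves ECX and the flags unchanged.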
10. The following is about optimizing relocation in virus code; the faint of heart may skip it...

This code should look familiar:

1)  call gdelta
gdelta:
    pop ebp
    sub ebp, offset gdelta

In the code that follows, we use the delta to avoid relocation problems:

2)  lea eax, [ebp+variable]

Instructions like this are unavoidable whenever memory data is referenced, so optimizing them pays off many times over. Open SoftICE, TRW, OllyDbg or some other debugger and look:

3)  lea eax, [ebp+401000h]          ; 6 bytes

Now suppose it looked like this instead:

4)  lea eax, [ebp+10h]              ; 3 bytes

In other words, if the displacement added to EBP fits in one byte, the whole instruction takes only 3 bytes. So change the initial code to:

5)  call gdelta
gdelta:
    pop ebp

Then, in some cases, our instruction is only 3 bytes long, which saves 3 bytes. Let's take a look:

6)  lea eax, [ebp+variable-gdelta]  ; 3 bytes

This is equivalent to the version above, but it saves 3 bytes. See CIH...
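That the two delta schemes reach the same address can be checked with plain arithmetic. The Python sketch below is an illustration added here; the addresses are invented example values and the function names are hypothetical. "Link" addresses are the ones the assembler assumed, "runtime" addresses are where the code actually ended up.

```python
MASK32 = 0xFFFFFFFF


def classic(runtime_gdelta, link_gdelta, link_variable):
    """call gdelta / pop ebp / sub ebp, offset gdelta / lea eax, [ebp+variable]"""
    ebp = (runtime_gdelta - link_gdelta) & MASK32           # EBP = delta
    return (ebp + link_variable) & MASK32


def short_form(runtime_gdelta, link_gdelta, link_variable):
    """call gdelta / pop ebp / lea eax, [ebp+variable-gdelta]"""
    ebp = runtime_gdelta                                    # EBP = runtime address of gdelta
    return (ebp + (link_variable - link_gdelta)) & MASK32


def fits_disp8(displacement):
    """True when a displacement can be encoded as one signed byte."""
    return -128 <= displacement <= 127
```

With link_gdelta = 0x401005, link_variable = 0x401045 and runtime_gdelta = 0x7F0005, both functions return 0x7F0045. The 3-byte encoding is only available when variable - gdelta passes fits_disp8, that is, when the data sits within 127 bytes after (or 128 bytes before) the gdelta label.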
11. Other tips

If EAX is less than 80000000h, clearing EDX:
--------------------------------------------
1)  xor edx, edx            ; 2 bytes, but faster
2)  cdq                     ; 1 byte, but slower

I have always used CDQ. Why not? It's small...
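The condition on EAX matters because CDQ does not clear EDX; it sign-extends EAX into it. A one-function Python sketch (added for illustration, the name cdq is simply borrowed from the mnemonic):

```python
def cdq(eax):
    """Model `cdq`: EDX is filled with copies of bit 31 of EAX."""
    return 0xFFFFFFFF if eax & 0x80000000 else 0x00000000
```

So cdq(0x7FFFFFFF) yields 0, but cdq(0x80000000) yields 0FFFFFFFFh, which is why the trick is only safe when EAX is known to be below 80000000h.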
In cases like the following, avoid ESP and EBP and use another register:
-----------------------------------------------------------------------
1)  mov eax, [ebp]          ; 3 bytes
2)  mov eax, [esp]          ; 3 bytes
3)  mov eax, [ebx]          ; 2 bytes
Need to reverse the order of the 4 bytes in a register? Use BSWAP:
------------------------------------------------------------------
    mov eax, 12345678h      ; 5 bytes
    bswap eax               ; 2 bytes

EAX = 78563412h now.
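For readers who want to double-check the byte order, here is a tiny Python model of BSWAP (an illustrative sketch, not part of the original text):

```python
def bswap(value):
    """Model `bswap eax`: reverse the four bytes of a 32-bit value."""
    return int.from_bytes((value & 0xFFFFFFFF).to_bytes(4, "little"), "big")
```

bswap(0x12345678) gives 0x78563412, and applying it twice returns the original value.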
Wanna save some bytes replacin' CALL?
-------------------------------------
1)  call _label_            ; 5 bytes
    ret                     ; 1 byte

2)  jmp _label_             ; 2/5 bytes (short/near)

If you are only optimizing and the CALL is immediately followed by RET, with no parameters to pass, try using JMP instead of CALL.
How to save time when comparing a register with memory:
-------------------------------------------------------
1)  cmp reg, [mem]          ; slower
2)  cmp [mem], reg          ; 1 cycle faster
Multiplying and dividing by a power of 2: how to save time and space?
---------------------------------------------------------------------
1)  mov eax, 1000h
    mov ecx, 4              ; 5 bytes
    xor edx, edx            ; 2 bytes
    div ecx                 ; 2 bytes

2)  shr eax, 2              ; 3 bytes (4 = 2^2, so shift by 2)

3)  mov ecx, 4              ; 5 bytes
    mul ecx                 ; 2 bytes

4)  shl eax, 2              ; 3 bytes (4 = 2^2, so shift by 2)
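The shift count is the exponent, not the factor: shifting by n bits corresponds to a factor of 2**n, so multiplying or dividing by 4 is a shift by 2, while a shift by 4 corresponds to 16. A short Python sketch of the unsigned 32-bit arithmetic (added for illustration, helper names invented):

```python
MASK32 = 0xFFFFFFFF


def div_unsigned(eax, divisor):
    """Model `xor edx, edx / div ecx`: quotient left in EAX."""
    return (eax & MASK32) // divisor


def shr(eax, count):
    """Model `shr eax, count`."""
    return (eax & MASK32) >> count


def mul_low(eax, factor):
    """Model `mul ecx`, keeping only the low dword that ends up in EAX."""
    return (eax * factor) & MASK32


def shl(eax, count):
    """Model `shl eax, count`."""
    return (eax << count) & MASK32
```

For EAX = 1000h, div_unsigned(0x1000, 4) and shr(0x1000, 2) both give 400h. Note that the shifts leave EDX untouched, whereas DIV and MUL overwrite it.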
The LOOP instruction
--------------------
1)  dec ecx                 ; 1 byte
    jne _label_             ; 2/6 bytes (short/near)

2)  loop _label_            ; 2 bytes

Now look at this:

3)  je  $+5                 ; 2 bytes
    dec ecx                 ; 1 byte
    jne _label_             ; 2 bytes

4)  loopxx _label_          ; 2 bytes (xx = e, ne, z or nz)

LOOP is small, but on the 486 and later CPUs it is slower...
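The branch decisions of these sequences can be compared in a small model. The Python sketch below is an added illustration with invented function names; each function returns the new ECX and whether the jump to _label_ is taken.

```python
MASK32 = 0xFFFFFFFF


def loop(ecx):
    """Model `loop _label_` (same decision as `dec ecx / jne _label_`)."""
    ecx = (ecx - 1) & MASK32
    return ecx, ecx != 0


def loopne(ecx, zf):
    """Model `loopne _label_`: always decrements, jumps if ECX != 0 and ZF = 0."""
    ecx = (ecx - 1) & MASK32
    return ecx, (ecx != 0 and not zf)


def je_dec_jne(ecx, zf):
    """Model `je $+5 / dec ecx / jne _label_`."""
    if zf:
        return ecx, False          # je skips both dec and jne, ECX is untouched
    ecx = (ecx - 1) & MASK32
    return ecx, ecx != 0
```

The two forms make the same jump decision, with one difference worth knowing: when ZF is set, LOOPNE has still decremented ECX, while the three-instruction sequence has not. The real LOOP instructions also leave the flags alone, which DEC does not.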
Compare:
--------
1)  push eax                ; 1 byte
    push ebx                ; 1 byte
    pop eax                 ; 1 byte
    pop ebx                 ; 1 byte

2)  xchg eax, ebx           ; 1 byte

3)  xchg ecx, edx           ; 2 bytes

If you only want to move a value, use MOV; it executes better on the Pentium:

4)  mov ecx, edx            ; 2 bytes
Compare:
--------
1)  Unoptimized:

lbl1:
    mov al, 5               ; 2 bytes
    stosb                   ; 1 byte
    mov eax, [ebx]          ; 2 bytes
    stosb                   ; 1 byte
    ret                     ; 1 byte
lbl2:
    mov al, 6               ; 2 bytes
    stosb                   ; 1 byte
    mov eax, [ebx]          ; 2 bytes
    stosb                   ; 1 byte
    ret                     ; 1 byte
    ---------               ; 14 bytes

2)  Optimized:

lbl1:
    mov al, 5               ; 2 bytes
lbl:
    stosb                   ; 1 byte
    mov eax, [ebx]          ; 2 bytes
    stosb                   ; 1 byte
    ret                     ; 1 byte
lbl2:
    mov al, 6               ; 2 bytes
    jmp lbl                 ; 2 bytes
    ---------               ; 11 bytes

When reading a constant variable, try to define it directly inside the instruction:
------------------------------------------------------------------------------------
1)  ...
    mov [ebp+variable], eax         ; 6 bytes
    ...
    variable dd 12345678h           ; 4 bytes

2)  Optimized:

    mov eax, 12345678h              ; 5 bytes
    variable = dword ptr $ - 4
    ...
    mov [ebp+variable], eax         ; 6 bytes
Oh, I haven't seen such interesting code in a long time. The prerequisite is that the code section is made writable when you build the program.

Finally, let me introduce the undocumented instruction SALC, which debuggers now support. What it does: if CF is 1, AL is set to 0FFh; otherwise AL is set to 0.
------------------------------------------------------------------------
1)  jc  _lbl1               ; 2 bytes
    mov al, 0               ; 2 bytes
    jmp _end                ; 2 bytes
_lbl1:
    mov al, 0FFh            ; 2 bytes
_end:
    ...

2)  db 0D6h                 ; salc, 1 byte ;)
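As a quick sanity check of the equivalence, here is a Python sketch of both versions (added for illustration; the function names are made up):

```python
def salc(cf):
    """Model SALC: AL = 0FFh if CF is set, otherwise 00h."""
    return 0xFF if cf else 0x00


def branchy(cf):
    """Model the jc / mov al, 0 / jmp / mov al, 0FFh sequence from 1)."""
    if cf:              # jc _lbl1
        return 0xFF     # _lbl1: mov al, 0FFh
    return 0x00         # mov al, 0
```

Both return the same AL for either state of the carry flag; the branchy version takes 8 bytes, SALC takes 1.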
End.