32-bit code optimization common sense

xiaoxiao2021-03-06 50

Preface: a little common sense about code optimization is very helpful when writing Win32 ASM.

:)

Crossbow's writing on code optimization is extensive. Unfortunately I haven't read most of it, even though it sits at my bedside (every time I try to read it I can't help yawning... heh).

The goal of code optimization is, of course, code that is both small and fast. In general, though, the two are like the fish and the bear's paw: you cannot have both. We usually look for a balance, and which side to favor depends on our actual needs. Still, there is some common sense we should keep in mind. Let's go through the specific situations we meet most often.

1. Clearing a register to 0

I absolutely don't want to see this:

1)  mov   eax, 00000000h            ; 5 bytes

It looks perfectly logical, but you should know the more optimized forms:

2)  sub   eax, eax                  ; 2 bytes
3)  xor   eax, eax                  ; 2 bytes

The byte counts explain why. There is no loss of speed either; they are equally fast. So do you prefer XOR or SUB? I prefer XOR because I'm not very good at math... Microsoft likes SUB, and we all know Windows is slow... (of course that is not the real reason XD).

2. Testing whether a register is 0

I don't want to see this either:

1)  cmp   eax, 00000000h            ; 5 bytes
    je    _label_                   ; 2/6 bytes (short/near)

[Note: many instructions have a shorter encoding for EAX, so use EAX as much as possible. For example, cmp eax, 12345678h is 5 bytes; with any other register it is 6 bytes.]

A simple comparison takes 7/11 bytes. No, no, no. Try the following:

2)  or    eax, eax                  ; 2 bytes
    je    _label_                   ; 2/6 bytes (short/near)
3)  test  eax, eax                  ; 2 bytes
    je    _label_                   ; 2/6 bytes (short/near)

Only 4/8 bytes, so we save 3 bytes in both cases. The next question is whether you like OR or TEST. Personally I prefer TEST, because TEST does not write to any register, which usually gives faster execution on the Pentium.

Don't celebrate too early, because there is something even better. If you only want to test EAX, look at this:

4)  xchg  eax, ecx                  ; 1 byte
    jecxz _label_                   ; 2 bytes

In the short-jump case we save 1 more byte. Note that JECXZ only has a short form, and that after the XCHG the tested value lives in ECX.

3. Testing whether a register is 0FFFFFFFFh

Some APIs return -1, so how do we test for that value? You might write:

1)  cmp   eax, 0FFFFFFFFh           ; 5 bytes
    je    _label_                   ; 2/6 bytes

Think about it for a moment and you get this instead:

2)  inc   eax                       ; 1 byte
    je    _label_                   ; 2/6 bytes
    dec   eax                       ; 1 byte (restores EAX on the fall-through path)

This saves 3 bytes and executes faster.

4. Setting a register to 0FFFFFFFFh

If you were the author of the API, how would you return -1? Like this?

1)  mov   eax, 0FFFFFFFFh           ; 5 bytes

Surely not. Take a look at:

2)  xor   eax, eax  (or sub eax, eax)   ; 2 bytes
    dec   eax                       ; 1 byte

That saves a word (2 bytes)! There is also:

3)  stc                             ; 1 byte
    sbb   eax, eax                  ; 2 bytes

When the carry flag is already known to be set, this can be 1 byte shorter still:

    jnc   _label_
    sbb   eax, eax                  ; 2 bytes only!
_label_:
    ...

Why use ASM? This is the reason.

5. Clearing a register and loading a word or byte into its low part

1)  xor   eax, eax                  ; 2 bytes
    mov   ax, word ptr [esi+xx]     ; 4 bytes

This is probably how most beginners write it. I certainly did, until I decided to rewrite it:

2)  movzx eax, word ptr [esi+xx]    ; 4 bytes

That gains 2 bytes! Likewise:

3)  xor   eax, eax                  ; 2 bytes
    mov   al, byte ptr [esi+xx]     ; 3 bytes

becomes:

4)  movzx eax, byte ptr [esi+xx]    ; 4 bytes

We should use MOVZX wherever we can, since it is not slower and usually saves bytes:

5)  xor   eax, eax                  ; 2 bytes
    mov   ax, bx                    ; 3 bytes
6)  movzx eax, bx                   ; 3 bytes

6. About PUSH

The following optimizations are about code size, keeping in mind that a register operation always beats a memory operation.

1)  mov   eax, 50h                  ; 5 bytes

This is one word (2 bytes) smaller:

2)  push  50h                       ; 2 bytes
    pop   eax                       ; 1 byte

When the operand fits in 1 byte, PUSH is only 2 bytes; otherwise it is 5 bytes. Remember that!

Next question: push seven zeros onto the stack.

3)  push  0                         ; 2 bytes
    push  0                         ; 2 bytes
    push  0                         ; 2 bytes
    push  0                         ; 2 bytes
    push  0                         ; 2 bytes
    push  0                         ; 2 bytes
    push  0                         ; 2 bytes

That takes 14 bytes, which is obviously unsatisfactory. Optimize it:

4)  xor   eax, eax                  ; 2 bytes
    push  eax                       ; 1 byte
    push  eax                       ; 1 byte
    push  eax                       ; 1 byte
    push  eax                       ; 1 byte
    push  eax                       ; 1 byte
    push  eax                       ; 1 byte
    push  eax                       ; 1 byte

It can be made more compact still, at the cost of speed:

5)  push  7                         ; 2 bytes
    pop   ecx                       ; 1 byte
_label_:
    push  0                         ; 2 bytes
    loop  _label_                   ; 2 bytes

This saves 7 bytes compared with 3).

Sometimes you need to move a value from one memory address to another while preserving all registers:

6)  push  eax                       ; 1 byte
    mov   eax, [ebp+xxxx]           ; 6 bytes
    mov   [ebp+xxxx], eax           ; 6 bytes
    pop   eax                       ; 1 byte

Try PUSH and POP instead:

7)  push  dword ptr [ebp+xxxx]      ; 6 bytes
    pop   dword ptr [ebp+xxxx]      ; 6 bytes

7. Multiplication

EAX already holds a value and you want to multiply it by 28h. How do you write it?

1)  mov   ecx, 28h                  ; 5 bytes
    mul   ecx                       ; 2 bytes

A better way:

2)  push  28h                       ; 2 bytes
    pop   ecx                       ; 1 byte
    mul   ecx                       ; 2 bytes

Wow, this one is better still:

3)  imul  eax, eax, 28h             ; 3 bytes

Intel did not add new instructions to its newer CPUs as decoration; they are there for you to use.

8. String operations

How do you fetch a byte from memory? The fast solution:

1)  mov   al/ax/eax, [esi]          ; 2/3/2 bytes
    inc   esi                       ; 1 byte (use add esi, 2 or add esi, 4 for the word/dword forms)

The small solution:

2)  lodsb/lodsw/lodsd               ; 1/2/1 bytes (lodsw needs an operand-size prefix in 32-bit code)

I prefer LODS because it is small, although it is slower.

How do you reach the end of a string? JQwerty's method:

3)  lea   esi, [ebp+asciiz]         ; 6 bytes
s_check:
    lodsb                           ; 1 byte
    test  al, al                    ; 2 bytes
    jne   s_check                   ; 2 bytes

Super's method:

4)  lea   edi, [ebp+asciiz]         ; 6 bytes
    xor   al, al                    ; 2 bytes
s_check:
    scasb                           ; 1 byte
    jne   s_check                   ; 2 bytes

Which one? Super's is faster on the 386 and below, JQwerty's is faster otherwise, and both are the same size (11 bytes). The choice is yours.

9. Complex addressing

Suppose you have a DWORD table, EBX points to the start of the table, ECX is the index, and you want to add 1 to the selected DWORD. Look at this:

1)  pushad                          ; 1 byte
    imul  ecx, ecx, 4               ; 3 bytes
    add   ebx, ecx                  ; 2 bytes
    inc   dword ptr [ebx]           ; 2 bytes
    popad                           ; 1 byte

It can be optimized, though it seems hardly anyone does it:

2)  inc   dword ptr [ebx+4*ecx]     ; 3 bytes

One instruction saves 6 bytes, runs faster, and is easier to read, yet nobody seems to use it... why?

You can add a displacement as well:

3)  pushad                          ; 1 byte
    imul  ecx, ecx, 4               ; 3 bytes
    add   ebx, ecx                  ; 2 bytes
    add   ebx, 1000h                ; 6 bytes
    inc   dword ptr [ebx]           ; 2 bytes
    popad                           ; 1 byte

The optimized form is:

4)  inc   dword ptr [ebx+4*ecx+1000h]   ; 7 bytes

That saves 8 bytes!

Now look at what the LEA instruction can do.

    lea   eax, [12345678h]

What ends up in EAX? The correct answer is 12345678h. Now suppose EBP = 1:

    lea   eax, [ebp+12345678h]

The result is 12345679h. Compare:

    lea   eax, [ebp+12345678h]      ; 6 bytes
    ===================================
    mov   eax, 12345678h            ; 5 bytes
    add   eax, ebp                  ; 2 bytes

5)  Take a look at this:

    mov   eax, 12345678h            ; 5 bytes
    add   eax, ebp                  ; 2 bytes
    imul  ecx, 4                    ; 3 bytes
    add   eax, ecx                  ; 2 bytes

6)  Letting LEA do the calculation pays off in size:

    lea   eax, [ebp+ecx*4+12345678h]    ; 7 bytes

And speed? The LEA version is faster too, and it does not affect the flags. Remember the following format; using it wherever you can saves both time and space:

    OPCODE [base + index*scale + displacement]

10. Virus relocation

The following is about optimizing virus relocation code. If that scares you, please skip ahead...

You should be familiar with this code:

1)  call  gdelta
gdelta:
    pop   ebp
    sub   ebp, offset gdelta

In the code that follows, the delta is used to avoid relocation problems:

2)  lea   eax, [ebp+variable]

Addressing like this is unavoidable whenever data in memory is referenced, so optimizing it pays off many times over. Open SoftICE, TRW or OllyDbg and take a look:

3)  lea   eax, [ebp+401000h]        ; 6 bytes

Now suppose it were this instead:

4)  lea   eax, [ebp+10h]            ; 3 bytes

That is, if the displacement added to EBP fits in 1 byte, the whole instruction is only 3 bytes. So change the initial code to:

5)  call  gdelta
gdelta:
    pop   ebp

In some cases our instructions are then only 3 bytes long, saving 3 bytes each. Take a look:

6)  lea   eax, [ebp+variable-gdelta]    ; 3 bytes

This is equivalent to the version above, but saves 3 bytes. Take a look at CIH...

11. Other tips

If EAX is less than 80000000h, clearing EDX:
--------------------------------------------------
1)  xor   edx, edx                  ; 2 bytes, but faster
2)  cdq                             ; 1 byte, but slower

I always use CDQ. Why not? It is smaller...

In the following cases, avoid ESP and EBP where you can and use other registers:
--------------------------------------------------
1)  mov   eax, [ebp]                ; 3 bytes
2)  mov   eax, [esp]                ; 3 bytes
3)  mov   eax, [ebx]                ; 2 bytes

Reversing the order of the 4 bytes in a register? Use BSWAP:
--------------------------------------------------
    mov   eax, 12345678h            ; 5 bytes
    bswap eax                       ; 2 bytes
                                    ; eax = 78563412h now

Wanna save some bytes replacing CALL?
--------------------------------------------------
1)  call  _label_                   ; 5 bytes
    ret                             ; 1 byte
2)  jmp   _label_                   ; 2/5 bytes (short/near)

If the goal is only optimization and no parameters need to be passed, use JMP in place of CALL followed by RET.

How to save time when comparing reg/mem:
--------------------------------------------------
1)  cmp   reg, [mem]                ; slower
2)  cmp   [mem], reg                ; 1 cycle faster

Multiplying and dividing by powers of 2: how to save time and space?
--------------------------------------------------
1)  mov   eax, 1000h
    mov   ecx, 4                    ; 5 bytes
    xor   edx, edx                  ; 2 bytes
    div   ecx                       ; 2 bytes
2)  shr   eax, 2                    ; 3 bytes (divide by 4 = shift right by 2)
3)  mov   ecx, 4                    ; 5 bytes
    mul   ecx                       ; 2 bytes
4)  shl   eax, 2                    ; 3 bytes (multiply by 4 = shift left by 2)

The LOOP instruction
--------------------------------------------------
1)  dec   ecx                       ; 1 byte
    jne   _label_                   ; 2/6 bytes (short/near)
2)  loop  _label_                   ; 2 bytes

And again:

3)  je    $+5                       ; 2 bytes
    dec   ecx                       ; 1 byte
    jne   _label_                   ; 2 bytes
4)  loopxx _label_ (xx = e, ne, z or nz)    ; 2 bytes

LOOP is small, but on the 486 and above it runs slower...

Compare:
--------------------------------------------------
1)  push  eax                       ; 1 byte
    push  ebx                       ; 1 byte
    pop   eax                       ; 1 byte
    pop   ebx                       ; 1 byte
2)  xchg  eax, ebx                  ; 1 byte
3)  xchg  ecx, edx                  ; 2 bytes

If you only want to move a value, use MOV; it executes better on the Pentium:

4)  mov   ecx, edx                  ; 2 bytes

Compare:
--------------------------------------------------
1)  Unoptimized:

lbl1:
    mov   al, 5                     ; 2 bytes
    stosb                           ; 1 byte
    mov   eax, [ebx]                ; 2 bytes
    stosb                           ; 1 byte
    ret                             ; 1 byte
lbl2:
    mov   al, 6                     ; 2 bytes
    stosb                           ; 1 byte
    mov   eax, [ebx]                ; 2 bytes
    stosb                           ; 1 byte
    ret                             ; 1 byte
    ---------
                                    ; 14 bytes

2)  Optimized:

lbl1:
    mov   al, 5                     ; 2 bytes
lbl:
    stosb                           ; 1 byte
    mov   eax, [ebx]                ; 2 bytes
    stosb                           ; 1 byte
    ret                             ; 1 byte
lbl2:
    mov   al, 6                     ; 2 bytes
    jmp   lbl                       ; 2 bytes
    ---------
                                    ; 11 bytes

When reading a constant variable, try defining it directly inside the instruction:
--------------------------------------------------
1)  ...
    mov   [ebp+variable], eax       ; 6 bytes
    ...
    ...
variable dd 12345678h               ; 4 bytes

2)  Optimized to:

    mov   eax, 12345678h            ; 5 bytes
variable = dword ptr $ - 4
    ...
    mov   [ebp+variable], eax       ; 6 bytes

I haven't seen code this interesting in a long time. The precondition is that the code section is built with the write attribute set, because the instruction's immediate operand is overwritten at run time.

Finally, let me introduce the undocumented instruction SALC, which debuggers now support. What it means: if CF is set, AL is set to 0FFh; otherwise AL is set to 0.
--------------------------------------------------
1)  jc    _lbl1                     ; 2 bytes
    mov   al, 0                     ; 2 bytes
    jmp   _end                      ; 2 bytes
_lbl1:
    mov   al, 0FFh
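Several of the tricks above rest on small arithmetic identities for 32-bit registers: INC wrapping 0FFFFFFFFh to 0, SBB of a register with itself producing 0 or -1, CDQ copying the sign bit, BSWAP reversing bytes, and shifts standing in for MUL/DIV by powers of two. The following is only a minimal Python sketch that models those identities so they can be checked by hand. The function names (`inc`, `sbb_same`, `cdq`, `bswap`, `shr`, `shl`) are invented for this illustration and are not assembler syntax or part of any library.

```python
MASK = 0xFFFFFFFF  # model a 32-bit register as an int in 0..0xFFFFFFFF


def inc(value):
    """Model `inc eax`: return (result, zero_flag)."""
    result = (value + 1) & MASK
    return result, result == 0


def sbb_same(carry):
    """Model `sbb eax, eax`: eax - eax - CF, i.e. 0 or 0FFFFFFFFh."""
    return (0 - carry) & MASK


def cdq(eax):
    """Model `cdq`: EDX becomes 32 copies of EAX's sign bit."""
    return MASK if eax & 0x80000000 else 0


def bswap(value):
    """Model `bswap eax`: reverse the four bytes of the register."""
    return int.from_bytes(value.to_bytes(4, "little"), "big")


def shr(value, count):
    """Model `shr eax, count`: unsigned divide by 2**count."""
    return (value & MASK) >> count


def shl(value, count):
    """Model `shl eax, count`: multiply by 2**count, truncated to 32 bits."""
    return (value << count) & MASK
```

For example, `inc(0xFFFFFFFF)` reports a zero result with the zero flag set, which is why INC followed by JE can replace a compare against -1, and `cdq(eax)` is 0 exactly when EAX is below 80000000h, which is the condition given for using CDQ in place of XOR EDX, EDX.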

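The addressing format [base + index*scale + displacement] and the 1-byte operand rule for PUSH and for EBP-relative LEA can also be modelled in a few lines. This is a hedged sketch, not an encoder: `effective_address` and `fits_disp8` are hypothetical helper names, and the only facts assumed are that the address wraps at 32 bits, that the scale is 1, 2, 4 or 8, and that the short forms take a signed 8-bit value.

```python
MASK = 0xFFFFFFFF  # addresses wrap at 32 bits


def effective_address(base=0, index=0, scale=1, disp=0):
    """Compute base + index*scale + disp the way LEA would, modulo 2**32."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    return (base + index * scale + disp) & MASK


def fits_disp8(value):
    """True when value fits the signed 8-bit short form (-128..127)."""
    return -128 <= value <= 127
```

With EBP = 1, `effective_address(base=1, disp=0x12345678)` gives 12345679h, matching the LEA example in the text. `fits_disp8(0x10)` is true while `fits_disp8(0x401000)` is false, which is the difference between the 3-byte and 6-byte LEA forms in the relocation section.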
When reposting, please cite the original address: https://www.9cbs.com/read-85894.html

