This article comes from the AOGO compilation station:
http://www.aogosoft.com/
Author: Benny / 29A
Translation and rewriting: hume / cold rain
[Note: this is not a parrot-like, word-for-word translation; I try to convey the meaning of the original article as I understand it.]
There are plenty of articles about code optimization. Unfortunately I have not read most of them, even though they sit by my bedside (whenever I try to read one, I can't help yawning ... heh heh). This article is shorter.
The meaning of code optimization:
The goal of code optimization is of course code that is both small and fast, but in general the two are like the fish and the bear's paw: you cannot have both. We usually look for a compromise between them, and which side to favor depends on our actual needs.
But there is some common knowledge we should keep in mind. Let's go through the specific situations we run into most often:
1. Clearing a register
I absolutely don't want to see the following:
1) mov eax, 00000000h              ; 5 bytes
The line above looks perfectly logical, but you should realize that there are more optimized ways:
2) sub eax, eax                    ; 2 bytes
3) xor eax, eax                    ; 2 bytes
Look at the byte counts and you will understand why. Besides, nothing is lost in speed; they are just as fast. But do you prefer XOR or SUB? I prefer XOR, and the reason is very simple: because I am not very much ...
However, Microsoft seems to prefer SUB ... and we know that Windows is slow ... (oh, of course that is not the real reason X-D!)
2. Testing whether a register is 0
I don't want to see the following code:
1) cmp eax, 00000000h              ; 5 bytes
   je _label_                      ; 2/6 bytes (short/near)
[* Note: many instructions have a shorter encoding for EAX, so use EAX as much as possible. For example, cmp eax, 12345678h is 5 bytes;
with any other register it is 6 bytes. *]
As we can see, a simple comparison takes 7/11 bytes. No, no, no; try the following:
2) or eax, eax                     ; 2 bytes
   je _label_                      ; 2/6 bytes (short/near)
3) test eax, eax                   ; 2 bytes
   je _label_                      ; 2/6 bytes (short/near)
Only 4/8 bytes. See how many bytes we saved: 3 bytes, for the short and the near jump alike. The next question is whether you prefer OR or TEST. I personally like TEST, because TEST does not write its result to any register, which is usually faster on the Pentium.
Don't celebrate too early, because there is something even better. If you want to test EAX and can afford to swap it with ECX (XCHG exchanges the two registers), look below. Isn't it more inspiring?
4) xchg eax, ecx                   ; 1 byte
   jecxz _label_                   ; 2 bytes
For a short jump, we save 1 more byte ...
3. Testing whether a register is 0FFFFFFFFh
Some APIs return -1, so how do you test for this value? You may have written this:
1) cmp eax, 0FFFFFFFFh             ; 5 bytes
   je _label_                      ; 2/6 bytes
Hey, don't do this; think a little when you write code. Here is another way:
2) inc eax                         ; 1 byte
   je _label_                      ; 2/6 bytes
   dec eax                         ; 1 byte
This saves 3 bytes and executes faster.
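To see why the INC trick works, here is a minimal Python sketch that models the 32-bit wraparound. The helper names (inc32, dec32, is_minus_one) are made up for this illustration and do not come from the original article, and only the zero flag is modeled.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide


def inc32(value):
    """Model of INC r32: returns (result, zero_flag)."""
    result = (value + 1) & MASK32
    return result, result == 0


def dec32(value):
    """Model of DEC r32: returns (result, zero_flag)."""
    result = (value - 1) & MASK32
    return result, result == 0


def is_minus_one(eax):
    """INC EAX / JE _label_ / DEC EAX, expressed in Python.

    Returns (jump_taken, eax_afterwards).
    """
    eax, zf = inc32(eax)
    if zf:
        # JE is taken: EAX was 0FFFFFFFFh and now holds 0.
        return True, eax
    # Fall-through path: DEC restores the original value.
    eax, _ = dec32(eax)
    return False, eax
```

Note that on the taken branch EAX holds 0, not -1, so the code at _label_ must not rely on the old value.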
4. Setting a register to 0FFFFFFFFh
Suppose you are the author of an API. How do you return -1? Like this?
1) mov eax, 0FFFFFFFFh             ; 5 bytes
Writing it like that again? Take a look at:
2) xor eax, eax / sub eax, eax     ; 2 bytes
   dec eax                         ; 1 byte
That saves a word (2 bytes)! Another way to write it:
3) stc                             ; 1 byte
   sbb eax, eax                    ; 2 bytes
Sometimes this saves 1 more byte, when the carry flag is already known to be set:
   jnc _label_
   sbb eax, eax                    ; 2 bytes only!
_label_: ...
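The SBB variant relies on "subtract with borrow" of a register from itself, which leaves only the borrow. A small Python sketch of that arithmetic follows; sbb32 and xor_dec are illustrative names, not something from the article, and flags other than the incoming carry are ignored.

```python
MASK32 = 0xFFFFFFFF


def sbb32(dest, src, carry):
    """Model of SBB dest, src: dest - src - CF, truncated to 32 bits."""
    return (dest - src - (1 if carry else 0)) & MASK32


def xor_dec():
    """Model of XOR EAX, EAX followed by DEC EAX."""
    eax = 0                      # xor eax, eax
    return (eax - 1) & MASK32    # dec eax wraps around to 0FFFFFFFFh
```

Whatever EAX held before, sbb32(eax, eax, True) gives 0FFFFFFFFh and sbb32(eax, eax, False) gives 0, which is why the STC can be dropped when the carry flag is already set.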
Why do we use ASM? This is the reason.
5. Clearing a register and loading a value into its low word
1) xor eax, eax                    ; 2 bytes
   mov ax, word ptr [esi+xx]       ; 4 bytes
No. This is probably how most beginners write it, and I certainly did too; after reading Benny's article I decided to rewrite it as:
2) movzx eax, word ptr [esi+xx]    ; 4 bytes
We gain 2 bytes!
Likewise, the following
3) xor eax, eax                    ; 2 bytes
   mov al, byte ptr [esi+xx]       ; 3 bytes
should be changed to:
4) movzx eax, byte ptr [esi+xx]    ; 4 bytes
And this
5) xor eax, eax                    ; 2 bytes
   mov ax, bx                      ; 3 bytes
becomes:
6) movzx eax, bx                   ; 3 bytes
We should use MOVZX as much as possible, because it is not slow to execute and usually saves bytes ...
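A rough Python model of why the XOR is needed in the long version and why MOVZX replaces both instructions. The function names are hypothetical and only register contents are modeled.

```python
def mov_ax(eax, word):
    """Model of MOV AX, word: only the low 16 bits of EAX are replaced."""
    return (eax & 0xFFFF0000) | (word & 0xFFFF)


def xor_then_mov_ax(eax, word):
    """Model of XOR EAX, EAX followed by MOV AX, word."""
    eax = eax ^ eax              # clears all 32 bits
    return mov_ax(eax, word)


def movzx_word(word):
    """Model of MOVZX EAX, word: zero-extends 16 bits to 32 bits."""
    return word & 0xFFFF
```

Without the XOR, mov_ax(0xDEADBEEF, 0x1234) leaves 0DEAD1234h in EAX; the two-instruction version and MOVZX both give 00001234h.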
6. About PUSH. The following optimizes code size only, because register operations are always faster than memory operations.
1) mov eax, 50h                    ; 5 bytes
This is smaller by a word (2 bytes):
2) push 50h                        ; 2 bytes
   pop eax                         ; 1 byte
When the operand fits in 1 byte, PUSH is only 2 bytes; otherwise it is 5 bytes. Remember!
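A small sketch of that size rule in Python. It assumes 32-bit code and the two standard encodings of PUSH with an immediate (opcode 6Ah plus a sign-extended byte, or opcode 68h plus a full dword); the function names are invented for the example.

```python
def push_imm_size(value):
    """Size in bytes of PUSH imm in 32-bit code.

    6A ib (2 bytes) when the value fits in a signed byte,
    68 id (5 bytes) otherwise.
    """
    value &= 0xFFFFFFFF
    # Reinterpret the 32-bit pattern as a signed number.
    signed = value - (1 << 32) if value & 0x80000000 else value
    return 2 if -128 <= signed <= 127 else 5


def push_pop_load_size(value):
    """Size of PUSH imm followed by POP r32 (POP r32 is 1 byte)."""
    return push_imm_size(value) + 1
```

Because the byte is sign-extended, "fits in 1 byte" means the range -128 to 127: push 50h is 2 bytes, but push 80h already needs the 5-byte form, and the PUSH/POP pair then loses to the plain 5-byte MOV.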
Next problem: push seven zeros onto the stack.
3) push 0                          ; 2 bytes
   push 0                          ; 2 bytes
   push 0                          ; 2 bytes
   push 0                          ; 2 bytes
   push 0                          ; 2 bytes
   push 0                          ; 2 bytes
   push 0                          ; 2 bytes
This takes 14 bytes, which is obviously unsatisfactory. Optimize it:
4) xor eax, eax                    ; 2 bytes
   push eax                        ; 1 byte
   push eax                        ; 1 byte
   push eax                        ; 1 byte
   push eax                        ; 1 byte
   push eax                        ; 1 byte
   push eax                        ; 1 byte
   push eax                        ; 1 byte
It can be made even more compact, at the cost of speed:
5) push 7                          ; 2 bytes
   pop ecx                         ; 1 byte
_label_: push 0                    ; 2 bytes
   loop _label_                    ; 2 bytes
This saves 7 bytes compared with the first version ...
Sometimes you need to move a value from one memory address to another while preserving all registers:
6) push eax                        ; 1 byte
   mov eax, [ebp+xxxx]             ; 6 bytes
   mov [ebp+xxxx], eax             ; 6 bytes
   pop eax                         ; 1 byte
Try PUSH and POP instead:
7) push dword ptr [ebp+xxxx]       ; 6 bytes
   pop dword ptr [ebp+xxxx]        ; 6 bytes
7. Multiplication
Suppose EAX already holds a value and you want to multiply it by 28h. How do you write it?
1) mov ecx, 28h                    ; 5 bytes
   mul ecx                         ; 2 bytes
A better way is as follows:
2) push 28h                        ; 2 bytes
   pop ecx                         ; 1 byte
   mul ecx                         ; 2 bytes
Wow, this is even better:
3) imul eax, eax, 28h              ; 3 bytes
The new instructions Intel provides in its newer CPUs are not decoration; they are there for you to use.
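One thing worth checking before swapping MUL for the three-operand IMUL is what each leaves behind. The Python sketch below models only the register results (no flags); mul32 and imul32_imm are illustrative names, not from the article.

```python
MASK32 = 0xFFFFFFFF


def mul32(eax, operand):
    """Model of MUL r32: unsigned EDX:EAX = EAX * operand.

    Returns (eax, edx) after the instruction.
    """
    product = (eax & MASK32) * (operand & MASK32)
    return product & MASK32, (product >> 32) & MASK32


def imul32_imm(src, imm):
    """Model of IMUL r32, r32, imm: only the low 32 bits are kept."""
    return (src * imm) & MASK32
```

The low 32 bits agree, so the IMUL form is a drop-in replacement whenever the high half in EDX is not needed; MUL additionally overwrites EDX, while this IMUL form leaves EDX alone.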
8. String operations
How do you fetch a byte from memory?
The fast way:
1) mov al/ax/eax, [esi]            ; 2/3/2 bytes
   inc esi                         ; 1 byte
The small way:
2) lodsb/w/d                       ; 1 byte
I prefer LODS because it is small, even though it is slower.
How do you reach the end of a string?
JQwerty's method:
3) lea esi, [ebp+asciz]            ; 6 bytes
s_check: lodsb                     ; 1 byte
   test al, al                     ; 2 bytes
   jne s_check                     ; 2 bytes
Super's method:
4) lea edi, [ebp+asciz]            ; 6 bytes
   xor al, al                      ; 2 bytes
s_check: scasb                     ; 1 byte
   jne s_check                     ; 2 bytes
Which one to choose? Super's is faster on the 386, JQwerty's is faster on the Pentium; the choice is yours.
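Both loops stop in the same place. Here is a hedged Python walk-through, treating memory as a bytes object and assuming the direction flag is clear so the pointer moves upward; scan_lodsb and scan_scasb are names invented for this sketch.

```python
def scan_lodsb(memory, esi):
    """LODSB / TEST AL, AL / JNE loop: returns ESI when the loop exits."""
    while True:
        al = memory[esi]   # LODSB loads the byte at [ESI] ...
        esi += 1           # ... and advances ESI
        if al == 0:        # TEST AL, AL sets ZF, so JNE falls through
            return esi


def scan_scasb(memory, edi):
    """XOR AL, AL then SCASB / JNE loop: returns EDI when the loop exits."""
    al = 0
    while True:
        equal = memory[edi] == al   # SCASB compares AL with the byte at [EDI] ...
        edi += 1                    # ... and advances EDI
        if equal:                   # ZF set, so JNE falls through
            return edi
```

In both cases the pointer ends up one byte past the terminating zero, not on it.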
9. Complex ...
Suppose you have a table of DWORDs, EBX points to the start of the table, ECX is an index into it, and you want to add 1 to the DWORD at that index. See how it is done:
1) pushad                          ; 1 byte
   imul ecx, ecx, 4                ; 3 bytes
   add ebx, ecx                    ; 2 bytes
   inc dword ptr [ebx]             ; 2 bytes
   popad                           ; 1 byte
It can be optimized, but it seems that nobody does it:
2) inc dword ptr [ebx+4*ecx]       ; 3 bytes
One instruction saves 6 bytes and is faster and easier to read, yet nobody seems to use it ... why?
You can also add an immediate displacement:
3) pushad                          ; 1 byte
   imul ecx, ecx, 4                ; 3 bytes
   add ebx, ecx                    ; 2 bytes
   add ebx, 1000h                  ; 6 bytes
   inc dword ptr [ebx]             ; 2 bytes
   popad                           ; 1 byte
Optimized:
4) inc dword ptr [ebx+4*ecx+1000h] ; 7 bytes
This saves 8 bytes!
Now look at the LEA instruction. What can we do with it?
   lea eax, [12345678h]
What ends up in EAX? The correct answer is 12345678h.
Suppose EBP = 1:
   lea eax, [ebp+12345678h]
The result is 12345679h .... Compare:
   lea eax, [ebp+12345678h]        ; 6 bytes
   ==========================
   mov eax, 12345678h              ; 5 bytes
   add eax, ebp                    ; 2 bytes
5) Take a look at this:
   mov eax, 12345678h              ; 5 bytes
   add eax, ebp                    ; 2 bytes
   imul ecx, 4                     ; 3 bytes
   add eax, ecx                    ; 2 bytes
6) Using LEA for this kind of calculation pays off nicely in size:
   lea eax, [ebp+ecx*4+12345678h]  ; 7 bytes
LEA is also faster, and it does not affect the flags ... Remember the following format; it saves time and space in many places:
   opcode [base + index*scale + displacement]
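The address form is plain arithmetic, which a few lines of Python can mirror. This is only a sketch: lea and mov_add_imul_add are made-up names, the scale check reflects the four scale factors the x86 encoding allows, and results wrap at 32 bits.

```python
MASK32 = 0xFFFFFFFF


def lea(base=0, index=0, scale=1, disp=0):
    """Effective address: base + index*scale + displacement (mod 2**32)."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 only encodes scale factors 1, 2, 4 and 8")
    return (base + index * scale + disp) & MASK32


def mov_add_imul_add(ebp, ecx, disp):
    """The four-instruction sequence that the single LEA replaces."""
    eax = disp & MASK32            # mov eax, disp
    eax = (eax + ebp) & MASK32     # add eax, ebp
    ecx = (ecx * 4) & MASK32       # imul ecx, 4
    eax = (eax + ecx) & MASK32     # add eax, ecx
    return eax
```

With EBP = 1, lea(base=1, disp=0x12345678) gives 12345679h. Unlike LEA, the long sequence also destroys ECX and changes the flags.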
10. The following is about optimizing virus relocation code; if viruses scare you, feel free to skip it ...
You should be familiar with the code below:
1) call gdelta
gdelta: pop ebp
   sub ebp, offset gdelta
In the later code, we use the delta to avoid relocation problems:
2) lea eax, [ebp+variable]
Such instructions are unavoidable whenever memory data is referenced. If they can be optimized, the gain is multiplied many times over. Open your SICE, TRW or OllyDbg and look:
3) lea eax, [ebp+401000h]          ; 6 bytes
If instead it were the following:
4) lea eax, [ebp+10h]              ; 3 bytes
That is, if the displacement after EBP fits in 1 byte, the whole instruction is only 3 bytes.
So modify the initial sequence to:
5) call gdelta
gdelta: pop ebp
Now, in some cases, the instruction is only 3 bytes, saving 3 bytes. Let's take a look:
6) lea eax, [ebp+variable-gdelta]  ; 3 bytes
This is equivalent to the version above but saves 3 bytes; see CIH ...
11. Other tips:
If EAX is less than 80000000h, clearing EDX:
--------------------------------------------------
1) xor edx, edx                    ; 2 bytes, but faster
2) cdq                             ; 1 byte, but slower
I always use CDQ. Why not? It is smaller ...
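The condition on EAX matters because CDQ copies the sign bit of EAX into every bit of EDX. A one-function Python model (the name cdq_edx is just for this example):

```python
def cdq_edx(eax):
    """Model of CDQ: returns the value EDX receives for a given EAX."""
    return 0xFFFFFFFF if eax & 0x80000000 else 0
```

So CDQ only stands in for xor edx, edx while bit 31 of EAX is clear; from 80000000h upward it fills EDX with 0FFFFFFFFh instead.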
For addressing, avoid ESP and EBP as the base register where you can; use other registers:
--------------------------------------------------
1) mov eax, [ebp]                  ; 3 bytes
2) mov eax, [esp]                  ; 3 bytes
3) mov eax, [ebx]                  ; 2 bytes
Reversing the order of the 4 bytes in a register? Use BSWAP:
--------------------------------------------------
   mov eax, 12345678h              ; 5 bytes
   bswap eax                       ; 2 bytes
   ; eax = 78563412h now
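The same byte reversal can be checked in Python with the built-in integer/bytes conversions. The function name bswap32 is chosen for this sketch only.

```python
def bswap32(value):
    """Model of BSWAP r32: reverse the order of the four bytes."""
    raw = (value & 0xFFFFFFFF).to_bytes(4, "little")  # lowest byte first
    return int.from_bytes(raw, "big")                 # read back reversed
```

Applying it twice returns the original value, which is handy as a sanity check.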
Wanna save some bytes replacing CALL?
--------------------------------------------------
1) call _label_                    ; 5 bytes
   ret                             ; 1 byte
2) jmp _label_                     ; 2/5 bytes (short/near)
If you are only after size and no parameters need to be passed, try JMP instead of CALL followed by RET.
How to save time when comparing a register with memory:
--------------------------------------------------
1) cmp reg, [mem]                  ; slower
2) cmp [mem], reg                  ; 1 cycle faster
Multiplying or dividing by a power of 2: how to save time and space?
--------------------------------------------------
1) mov eax, 1000h                  ; 5 bytes
   mov ecx, 4                      ; 5 bytes
   xor edx, edx                    ; 2 bytes
   div ecx                         ; 2 bytes
2) shr eax, 2                      ; 3 bytes
3) mov ecx, 4                      ; 5 bytes
   mul ecx                         ; 2 bytes
4) shl eax, 2                      ; 3 bytes
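A quick check of the arithmetic in Python: a shift by n bits multiplies or divides an unsigned value by 2 to the power n, so a factor of 4 corresponds to a shift count of 2, and a shift count of 4 corresponds to a factor of 16. The helper names are illustrative and only the EAX result is modeled.

```python
MASK32 = 0xFFFFFFFF


def shr32(value, count):
    """Model of SHR r32, count: logical shift right of an unsigned value."""
    return (value & MASK32) >> count


def shl32(value, count):
    """Model of SHL r32, count: bits shifted past bit 31 are lost."""
    return (value << count) & MASK32


def div32(eax, divisor):
    """Model of XOR EDX, EDX / DIV r32: the unsigned quotient left in EAX."""
    return (eax & MASK32) // divisor


def mul32_low(eax, factor):
    """Model of MUL r32: the low 32 bits of the product (the EAX half)."""
    return ((eax & MASK32) * factor) & MASK32
```

Keep in mind that the shifts leave EDX untouched, whereas DIV puts the remainder there and MUL the high half of the product.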
The LOOP instruction
--------------------------------------------------
1) dec ecx                         ; 1 byte
   jne _label_                     ; 2/6 bytes (short/near)
2) loop _label_                    ; 2 bytes
Look at this one too:
3) je $+5                          ; 2 bytes
   dec ecx                         ; 1 byte
   jne _label_                     ; 2 bytes
4) loopxx _label_ (xx = e, ne, z or nz) ; 2 bytes
LOOP is smaller, but it executes more slowly on the 486 and above ...
Comparison (exchanging two registers):
--------------------------------------------------
1) push eax                        ; 1 byte
   push ebx                        ; 1 byte
   pop eax                         ; 1 byte
   pop ebx                         ; 1 byte
2) xchg eax, ebx                   ; 1 byte
3) xchg ecx, edx                   ; 2 bytes
If you just want to move a value, use MOV; it executes better on the Pentium:
4) mov ecx, edx                    ; 2 bytes
Comparison (sharing common code):
--------------------------------------------------
1) Not optimized:
lbl1: mov al, 5                    ; 2 bytes
      stosb                        ; 1 byte
      mov eax, [ebx]               ; 2 bytes
      stosb                        ; 1 byte
      ret                          ; 1 byte
lbl2: mov al, 6                    ; 2 bytes
      stosb                        ; 1 byte
      mov eax, [ebx]               ; 2 bytes
      stosb                        ; 1 byte
      ret                          ; 1 byte
      ---------
      14 bytes
2) Optimized:
lbl1: mov al, 5                    ; 2 bytes
lbl:  stosb                        ; 1 byte
      mov eax, [ebx]               ; 2 bytes
      stosb                        ; 1 byte
      ret                          ; 1 byte
lbl2: mov al, 6                    ; 2 bytes
      jmp lbl                      ; 2 bytes
      ---------
      11 bytes
When reading a variable that has a constant initial value, try defining it directly inside the instruction:
--------------------------------------------------
1) mov eax, [ebp+variable]         ; 6 bytes
   ...
   mov [ebp+variable], eax         ; 6 bytes
   ...
   ...
variable dd 12345678h              ; 4 bytes
2) Optimized:
   mov eax, 12345678h              ; 5 bytes
variable = dword ptr $ - 4
   ...
   ...
   mov [ebp+variable], eax         ; 6 bytes
Oh, I haven't seen such interesting code in a long time. The precondition is that the code section must be given the write attribute when the program is built.
Finally, let me introduce the undocumented instruction SALC, which debuggers now support ... What it does: if CF is 1, AL is set to 0FFh; otherwise AL is set to 0.
--------------------------------------------------
1) jc _lbl1                        ; 2 bytes
   mov al, 0                       ; 2 bytes
   jmp _end                        ; 2 bytes
_lbl1: mov al, 0FFh                ; 2 bytes
_end: ...
2) salc                            ; db 0D6h, 1 byte
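For comparison, a tiny Python model of both versions. It assumes the behavior described above (AL becomes 0FFh when CF is set, 0 otherwise, and the rest of EAX is untouched); salc and branchy are names invented for the sketch.

```python
def salc(eax, carry):
    """Model of SALC: AL = 0FFh if CF is set, else 0; upper bits of EAX kept."""
    al = 0xFF if carry else 0x00
    return (eax & 0xFFFFFF00) | al


def branchy(eax, carry):
    """Model of the JC / MOV AL, 0 / JMP / MOV AL, 0FFh sequence."""
    if carry:                              # jc _lbl1
        return (eax & 0xFFFFFF00) | 0xFF   # mov al, 0FFh
    return eax & 0xFFFFFF00                # mov al, 0
```

Both give the same AL for either state of the carry flag, in 1 byte instead of 8.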