[Repost] Common-sense optimization of 32-bit code

xiaoxiao2021-03-06  81

This article comes from the AOGO assembly site:

http://www.aogosoft.com/

Author: Benny / 29A

Translated and adapted by: hume / cold rain

[Note: This is not a word-for-word translation; I have tried to convey the meaning of the original article as I understand it.]

There are plenty of articles about code optimization. Unfortunately, I have not read most of them, even though they sit at my bedside (whenever I try to read one, I can't help yawning ... hehe). This article is shorter.

What code optimization means:

The goal of code optimization is of course code that is both small and fast, but in general you cannot have both (as the saying goes, you cannot have both the fish and the bear's paw). We usually settle for a compromise between the two, and which way to lean depends on our actual needs.

Still, some common sense is worth keeping in mind. Let's go through the specific situations we run into most often:

1. Clearing a register to 0

I absolutely don't want to see the following:

1) MOV EAX, 00000000H 5 bytes

That looks perfectly logical, but you should realize that there are better-optimized ways:

2) SUB EAX, EAX 2 bytes

3) XOR EAX, EAX 2 bytes

The byte counts on the right show why you should write it this way. There is no speed penalty either: both are just as fast. (Unlike MOV, both also modify the flags.) So do you prefer XOR or SUB? I prefer XOR, and the reason is very simple: because I am not very ...

However, Microsoft prefers SUB ... and we all know that Windows is slow ... (OK, of course that is not the real reason X-D!)
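The byte counts can be checked by writing out the machine code. The Python sketch below lists one standard encoding of each instruction (assemblers may emit the equivalent 2B C0 / 33 C0 forms instead) and models the result on a 32-bit value. The names `ENCODINGS`, `clear_with_xor` and `clear_with_sub` are made up for this illustration.

```python
# One common machine-code encoding for each way of clearing EAX.
ENCODINGS = {
    "MOV EAX, 0": bytes.fromhex("B800000000"),  # B8 + 32-bit immediate
    "SUB EAX, EAX": bytes.fromhex("29C0"),      # 29 /r, ModRM = C0
    "XOR EAX, EAX": bytes.fromhex("31C0"),      # 31 /r, ModRM = C0
}

MASK32 = 0xFFFFFFFF

def clear_with_xor(eax):
    """Model XOR EAX, EAX on a 32-bit value."""
    return (eax ^ eax) & MASK32

def clear_with_sub(eax):
    """Model SUB EAX, EAX on a 32-bit value."""
    return (eax - eax) & MASK32

for name, code in ENCODINGS.items():
    print(f"{name:14s} {len(code)} bytes")
```

Whatever EAX held before, both short forms leave it at 0.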

2. Testing whether a register is 0

I don't want to see the following code:

1) CMP EAX, 00000000H 5 bytes

JE _Label_ 2/6 bytes (short / near)

[* Note: many instructions have a shorter encoding for EAX, so use EAX wherever you can. For example, CMP EAX, 12345678H is 5 bytes, but with any other register it is 6 bytes. *]

So a simple comparison costs 7/11 bytes. No, no, no; try the following:

2) OR EAX, EAX 2 bytes

JE _Label_ 2/6 bytes (short / near)

3) TEST EAX, EAX 2 bytes

JE _Label_ 2/6 bytes (short / near)

Only 4/8 bytes: we save 3 bytes either way. The next question is whether you prefer OR or TEST. I personally prefer TEST, because TEST does not write its result to any register (it only sets the flags), which is usually faster on the Pentium.

Don't celebrate too early, because there is something even better. If you want to test EAX and can afford to swap it into ECX, look at this; isn't it even neater?

4) XCHG EAX, ECX 1 byte

JECXZ _Label_ 2 bytes

With a short jump, that saves 1 more byte.
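To see why OR, TEST and the XCHG/JECXZ pair all detect zero, here is a small Python model of what each does to the register and to ZF. It is only a sketch; the function names are invented for this example.

```python
MASK32 = 0xFFFFFFFF

def zf_after_test(eax):
    """Model TEST EAX, EAX: returns (eax, zf). EAX itself is not written."""
    result = eax & eax
    return eax, result == 0

def zf_after_or(eax):
    """Model OR EAX, EAX: returns (eax, zf). EAX is rewritten with the same value."""
    result = (eax | eax) & MASK32
    return result, result == 0

def xchg_jecxz(eax, ecx):
    """Model XCHG EAX, ECX then JECXZ: returns (eax, ecx, jump_taken)."""
    eax, ecx = ecx, eax          # the two registers trade values
    return eax, ecx, ecx == 0    # JECXZ jumps when ECX is 0; flags are not used
```

Note that variant 4 leaves the old ECX value in EAX, so it only fits when that swap is acceptable.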

3. Testing whether a register is 0FFFFFFFFH

Some APIs return -1, so how do you test for that value? You might write this:

1) CMP EAX, 0FFFFFFFFH 5 bytes

JE _Label_ 2/6 bytes

Hey, don't do that; think while you write code. Here is another way:

2) INC EAX 1 byte

JE _Label_ 2/6 bytes

DEC EAX 1 byte

This saves 3 bytes and runs faster. (INC wraps 0FFFFFFFFH around to 0 and sets ZF; the DEC on the fall-through path restores EAX.)
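The wrap-around that makes this trick work can be modeled in a few lines of Python. This is a sketch with invented helper names, not code from the original article.

```python
MASK32 = 0xFFFFFFFF

def inc32(value):
    """Model INC r32: returns (result, zf)."""
    result = (value + 1) & MASK32
    return result, result == 0

def dec32(value):
    """Model DEC r32: returns (result, zf)."""
    result = (value - 1) & MASK32
    return result, result == 0

def is_minus_one(eax):
    """INC EAX / JE _Label_ / DEC EAX. Returns (jump_taken, eax_afterwards)."""
    eax, zf = inc32(eax)
    if zf:
        return True, eax     # jump taken: EAX is now 0, not -1
    eax, _ = dec32(eax)      # fall through: DEC restores the old value
    return False, eax
```

One thing the model makes visible: on the taken branch EAX holds 0, so code at `_Label_` must not expect to still find -1 there.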

4. Setting a register to 0FFFFFFFFH

Suppose you are the author of an API: how do you return -1? Like this?

1) MOV EAX, 0FFFFFFFFH 5 bytes

See how unoptimized that is? Take a look at:

2) XOR EAX, EAX / SUB EAX, EAX 2 bytes

DEC EAX 1 byte

That saves a word (2 bytes)! Another way to write it:

3) STC 1 byte

SBB EAX, EAX 2 bytes

Sometimes this can be optimized by 1 more byte. After a JNC that falls through, CF is known to be set, so the STC can be dropped:

JNC _Label_

SBB EAX, EAX 2 bytes only!

_Label_: ...

Why do we use ASM? This is the reason.
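SBB computes destination minus source minus CF, so subtracting a register from itself leaves just minus CF. A minimal Python model (function names are assumptions made for this sketch):

```python
MASK32 = 0xFFFFFFFF

def sbb_same(reg, cf):
    """Model SBB reg, reg: reg - reg - CF, truncated to 32 bits."""
    return (reg - reg - int(cf)) & MASK32

def xor_dec(reg):
    """Model XOR reg, reg followed by DEC reg."""
    return ((reg ^ reg) - 1) & MASK32
```

With CF set the result is 0FFFFFFFFH regardless of the old register value; with CF clear it is 0, which is why the 2-byte form is only safe where CF is known to be 1.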

5. Clearing a register and loading a value into its low word

1) XOR EAX, EAX 2 bytes

MOV AX, WORD PTR [ESI+XX] 4 bytes

No! This is probably how most beginners write it (I certainly did), and I decided to rewrite my code after reading Benny's article. Use this instead:

2) MOVZX EAX, WORD PTR [ESI+XX] 4 bytes

That gains 2 bytes!

Likewise, the following

3) XOR EAX, EAX 2 bytes

MOV AL, BYTE PTR [ESI+XX] 3 bytes

should be changed to:

4) MOVZX EAX, BYTE PTR [ESI+XX] 4 bytes

And this

5) XOR EAX, EAX 2 bytes

MOV AX, BX 3 bytes

becomes:

6) MOVZX EAX, BX 3 bytes

We should use MOVZX as much as possible, because it is not slow and it usually saves bytes ...
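The reason the XOR is needed before a plain MOV AX, and not before MOVZX, is that a 16-bit MOV leaves the upper half of EAX alone. A short Python model of the three cases (names invented for illustration):

```python
def mov_ax(eax, value16):
    """Model MOV AX, value: only the low 16 bits of EAX change."""
    return (eax & 0xFFFF0000) | (value16 & 0xFFFF)

def xor_then_mov_ax(eax, value16):
    """Model XOR EAX, EAX followed by MOV AX, value."""
    return mov_ax(0, value16)

def movzx_word(value16):
    """Model MOVZX EAX, word: zero-extend 16 bits to 32 bits."""
    return value16 & 0xFFFF
```

For any starting EAX, `xor_then_mov_ax` and `movzx_word` agree, while `mov_ax` alone keeps stale upper bits.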

6. About PUSH. The following optimizes for code size only, because register operations are always faster than memory operations.

1) MOV EAX, 50h 5 bytes

This version is one word (2 bytes) smaller:

2) PUSH 50h 2 bytes

POP EAX 1 byte

When the operand fits in 1 byte (a sign-extended 8-bit immediate), PUSH takes only 2 bytes; otherwise it takes 5 bytes. Remember that!

Next problem: push seven zeros onto the stack.

3) PUSH 0 2 bytes

PUSH 0 2 bytes

PUSH 0 2 bytes

PUSH 0 2 bytes

PUSH 0 2 bytes

PUSH 0 2 bytes

PUSH 0 2 bytes

This takes 14 bytes, which is obviously unsatisfactory. Optimized:

4) XOR EAX, EAX 2 bytes

PUSH EAX 1 byte

PUSH EAX 1 byte

PUSH EAX 1 byte

PUSH EAX 1 byte

PUSH EAX 1 byte

PUSH EAX 1 byte

PUSH EAX 1 byte

It can be made even more compact, though slower, as follows:

5) PUSH 7 2 bytes

POP ECX 1 byte

_Label_: PUSH 0 2 bytes

LOOP _Label_ 2 bytes

That saves 7 bytes compared with the first version ...
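Adding up the per-instruction sizes quoted in the listings gives the totals for the three variants. The Python sketch below does that bookkeeping and also simulates the LOOP variant to confirm that it pushes exactly seven zeros. The table and function names are hypothetical.

```python
# Instruction sizes in bytes, as quoted in the listings.
SIZES = {"PUSH imm8": 2, "PUSH reg": 1, "POP reg": 1, "XOR reg, reg": 2, "LOOP": 2}

def size(listing):
    """Total size in bytes of a list of instruction kinds."""
    return sum(SIZES[insn] for insn in listing)

unrolled = ["PUSH imm8"] * 7
via_register = ["XOR reg, reg"] + ["PUSH reg"] * 7
via_loop = ["PUSH imm8", "POP reg", "PUSH imm8", "LOOP"]

def run_loop_variant():
    """Simulate PUSH 7 / POP ECX / PUSH 0 / LOOP and return the pushed values."""
    stack = []
    ecx = 7                  # PUSH 7 / POP ECX
    while True:
        stack.append(0)      # PUSH 0
        ecx -= 1             # LOOP decrements ECX ...
        if ecx == 0:         # ... and falls through once it reaches 0
            break
    return stack
```

The three variants come to 14, 9 and 7 bytes; the last one is smallest but clobbers ECX and executes fifteen instructions instead of seven.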

Sometimes you need to move a value from one memory location to another while preserving all registers:

6) PUSH EAX 1 byte

MOV EAX, [EBP+XXXX] 6 bytes

MOV [EBP+XXXX], EAX 6 bytes

POP EAX 1 byte

Try PUSH and POP instead:

7) PUSH DWORD PTR [EBP+XXXX] 6 bytes

POP DWORD PTR [EBP+XXXX] 6 bytes

7. Multiplication

EAX already holds a value and you want to multiply it by 28h. How do you write it?

1) MOV ECX, 28h 5 bytes

MUL ECX 2 bytes

A better way:

2) PUSH 28h 2 bytes

POP ECX 1 byte

MUL ECX 2 bytes

Wow, this is even better:

3) IMUL EAX, EAX, 28h 3 bytes

Intel does not add new instructions to new CPUs as decoration; they are there for you to use.
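The variants are interchangeable only as far as the low 32 bits of the product go. A small Python model (the function names are assumptions for this sketch) shows what differs: MUL also writes the high half to EDX, while the three-operand IMUL keeps only the low half.

```python
MASK32 = 0xFFFFFFFF

def mul_ecx(eax, ecx):
    """Model MUL ECX: returns (eax, edx), the low and high halves of the unsigned product."""
    product = eax * ecx
    return product & MASK32, (product >> 32) & MASK32

def imul_imm(eax, imm):
    """Model IMUL EAX, EAX, imm: only the low 32 bits are kept; EDX is untouched."""
    return (eax * imm) & MASK32
```

So the 3-byte IMUL is a drop-in replacement whenever the code does not rely on the high half in EDX.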

8. String operations

How do you fetch a byte from memory?

The fast way:

1) MOV AL/AX/EAX, [ESI] 2/3/2 bytes

INC ESI 1 byte

The small way:

2) LODSB/W/D 1 byte

I prefer LODS because it is smaller, although it is slower.

How do you reach the end of a string?

JQwerty's method:

3) LEA ESI, [EBP+ASCIZ] 6 bytes

S_CHECK: LODSB 1 byte

TEST AL, AL 2 bytes

JNE S_CHECK 2 bytes

Super's method:

4) LEA EDI, [EBP+ASCIZ] 6 bytes

XOR AL, AL 2 bytes

S_CHECK: SCASB 1 byte

JNE S_CHECK 2 bytes

Which one should you choose? Super's is faster on the 386, JQwerty's is faster on the Pentium; the choice is yours.
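Both loops stop with the pointer one byte past the terminating zero. The Python sketch below models them over a `bytes` object standing in for memory, assuming the direction flag is clear so the pointers move upward; the function names are invented.

```python
def end_by_lodsb(memory, esi):
    """JQwerty's loop: LODSB / TEST AL, AL / JNE. Returns ESI afterwards."""
    while True:
        al = memory[esi]     # LODSB loads AL from [ESI] ...
        esi += 1             # ... and increments ESI
        if al == 0:          # TEST AL, AL / JNE exits on the zero byte
            return esi

def end_by_scasb(memory, edi):
    """Super's loop: XOR AL, AL / SCASB / JNE. Returns EDI afterwards."""
    al = 0
    while True:
        equal = memory[edi] == al   # SCASB compares AL with [EDI] ...
        edi += 1                    # ... and increments EDI
        if equal:
            return edi
```

The practical difference is which registers are tied up: the LODSB version walks ESI and overwrites AL on every step, while the SCASB version walks EDI and needs AL preset to 0.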

9. Complex ...

Suppose you have a DWORD table, EBX points to the start of the table, ECX is an index into it, and you want to add 1 to the DWORD at that index. Here is one way to do it:

1) PUSHAD 1 byte

IMUL ECX, ECX, 4 3 bytes

ADD EBX, ECX 2 bytes

INC DWORD PTR [EBX] 2 bytes

POPAD 1 byte

It can be optimized, though nobody seems to use this:

2) INC DWORD PTR [EBX+4*ECX] 3 bytes

One instruction saves 6 bytes, and it is faster and easier to read, yet hardly anyone seems to use it ... why?

You can also add an immediate displacement:

3) PUSHAD 1 byte

IMUL ECX, ECX, 4 3 bytes

ADD EBX, ECX 2 bytes

ADD EBX, 1000h 6 bytes

INC DWORD PTR [EBX] 2 bytes

POPAD 1 byte

Optimized:

4) INC DWORD PTR [EBX+4*ECX+1000h] 7 bytes

That saves 8 bytes!

Now look at the LEA instruction. What can we do with it?

LEA EAX, [12345678H]

What does EAX end up holding? The correct answer is 12345678H.

Suppose EBP = 1:

LEA EAX, [EBP+12345678H]

The result is 12345679H ... and there is more:

LEA EAX, [EBP+12345678H] 6 bytes

==========================

MOV EAX, 12345678H 5 bytes

ADD EAX, EBP 2 bytes

5) Take a look:

MOV EAX, 12345678H 5 bytes

ADD EAX, EBP 2 bytes

IMUL ECX, 4 3 bytes

ADD EAX, ECX 2 bytes

6) Let LEA do the calculation and we gain on size:

LEA EAX, [EBP+ECX*4+12345678H] 7 bytes

LEA is also faster, and it does not affect the flags ... Remember the following format; it saves time and space in many places.

OPCODE [BASE + INDEX*SCALE + DISPLACEMENT]
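The addressing format is just arithmetic: base plus index times scale plus displacement, truncated to 32 bits. A minimal Python model of what LEA computes (the function `lea` and its keyword arguments are made up for this sketch):

```python
MASK32 = 0xFFFFFFFF

def lea(base=0, index=0, scale=1, disp=0):
    """Model LEA: base + index*scale + disp, truncated to 32 bits."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    return (base + index * scale + disp) & MASK32

# The examples from the text, with EBP = 1 and (as an assumed value) ECX = 2:
print(hex(lea(disp=0x12345678)))
print(hex(lea(base=1, disp=0x12345678)))
print(hex(lea(base=1, index=2, scale=4, disp=0x12345678)))
```

Because LEA only computes the address and never touches memory, it works as a compact add-and-multiply instruction that leaves the flags and the source registers unchanged.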

10. The following is about optimizing virus relocation (delta offset) code; if viruses bother you, please skip ahead ...

The code below should look familiar:

1) CALL GDELTA

GDELTA: POP EBP

SUB EBP, OFFSET GDELTA

In the code that follows, we use the delta in EBP to avoid relocation problems:

2) LEA EAX, [EBP+VARIABLE]

Instructions like this are unavoidable whenever memory data is referenced, so any saving here pays off many times over. Open SoftICE, TRW, or OllyDbg and look:

3) LEA EAX, [EBP+401000H] 6 bytes

If it were the following instead:

4) LEA EAX, [EBP+10h] 3 bytes

In other words, if the displacement after EBP fits in 1 byte, the whole instruction is only 3 bytes.

So change the initial code to:

5) CALL GDELTA

GDELTA: POP EBP

and reference variables like this, which in some cases takes only 3 bytes:

6) LEA EAX, [EBP+VARIABLE-GDELTA] 3 bytes

This is equivalent to the version above, but each reference can save 3 bytes (when VARIABLE lies within a signed-byte distance of GDELTA); see CIH ...
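Both forms compute the same run-time address; only the size of the displacement stored in the instruction changes. The Python sketch below works through one example. All addresses in it are hypothetical values chosen for illustration, and `lea_ebp_disp_size` is an invented helper.

```python
def lea_ebp_disp_size(disp):
    """Size of LEA EAX, [EBP+disp]: opcode + ModRM + a 1- or 4-byte displacement."""
    fits_in_byte = -128 <= disp <= 127
    return 2 + (1 if fits_in_byte else 4)

# Hypothetical link-time offsets and load address, for illustration only.
GDELTA = 0x401005
VARIABLE = 0x401015
runtime_gdelta = 0x7F0005     # where GDELTA actually ended up in memory

# Classic form: EBP = run-time address minus link-time address (the delta).
ebp_classic = runtime_gdelta - GDELTA
addr_classic = ebp_classic + VARIABLE

# Short form: EBP = run-time address of GDELTA, displacement = VARIABLE - GDELTA.
ebp_short = runtime_gdelta
addr_short = ebp_short + (VARIABLE - GDELTA)

print(hex(addr_classic), lea_ebp_disp_size(VARIABLE))
print(hex(addr_short), lea_ebp_disp_size(VARIABLE - GDELTA))
```

The saving only applies to data placed within -128..+127 bytes of the GDELTA label; anything farther away falls back to the 6-byte form.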

11. Other tips:

Clearing EDX when EAX is less than 80000000H:

--------------------------------------------------

1) XOR EDX, EDX 2 bytes, but faster

2) CDQ 1 byte, but slower

I always use CDQ. Why not? It is smaller ...
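CDQ sign-extends EAX into EDX, which is why it only clears EDX when bit 31 of EAX is 0. A one-function Python model (the name `cdq` is used here purely as a label for the sketch):

```python
def cdq(eax):
    """Model CDQ: EDX is filled with copies of EAX's sign bit (bit 31)."""
    return 0xFFFFFFFF if eax & 0x80000000 else 0
```

For any EAX from 0 to 7FFFFFFFH the result is 0; from 80000000H upward EDX becomes 0FFFFFFFFH instead.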

Below: avoid ESP and EBP as the base register; use other registers.

--------------------------------------------------

1) MOV EAX, [EBP] 3 bytes

2) MOV EAX, [ESP] 3 bytes

3) MOV EAX, [EBX] 2 bytes

Need to reverse the order of the 4 bytes in a register? Use BSWAP

--------------------------------------------------

MOV EAX, 12345678H 5 bytes

BSWAP EAX 2 bytes

EAX = 78563412H now
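The effect of BSWAP is easy to reproduce in Python by serializing the value in one byte order and reading it back in the other. This is only a model of the instruction; the helper name is arbitrary.

```python
def bswap(value):
    """Model BSWAP r32: reverse the order of the four bytes."""
    return int.from_bytes(value.to_bytes(4, "little"), "big")

print(hex(bswap(0x12345678)))
```

Applying it twice returns the original value, which is what makes it handy for converting between little-endian and big-endian data.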

Wanna save some bytes by replacing CALL?

---------------------------------------

1) CALL _Label_ 5 bytes

RET 1 byte

2) JMP _Label_ 2/5 bytes (short / near)

If you only care about size and no parameters need to be passed, try JMP in place of CALL followed by RET.

How to save time when comparing REG / MEM:

------------------------------------------

1) CMP REG, [MEM] slower

2) CMP [MEM], REG 1 cycle faster

Multiplying or dividing by a power of 2: how to save time and space?

--------------------------------------------------

1) MOV EAX, 1000H

MOV ECX, 4 5 bytes

XOR EDX, EDX 2 bytes

DIV ECX 2 bytes

2) SHR EAX, 2 3 bytes

3) MOV ECX, 4 5 bytes

MUL ECX 2 bytes

4) SHL EAX, 2 3 bytes

(The shift count is the exponent: 4 = 2^2, so dividing or multiplying by 4 is a shift by 2 bits.)
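When replacing DIV or MUL with a shift, the shift count is the exponent of the power of two, not the divisor itself: dividing by 4 is a shift by 2 bits, while a shift by 4 bits divides by 16. A short Python check on 32-bit values (function names invented for this sketch):

```python
MASK32 = 0xFFFFFFFF

def div_by_pow2(eax, n):
    """Model SHR EAX, n: unsigned divide by 2**n (the remainder is discarded)."""
    return (eax & MASK32) >> n

def mul_by_pow2(eax, n):
    """Model SHL EAX, n: multiply by 2**n, keeping the low 32 bits."""
    return (eax << n) & MASK32

print(hex(div_by_pow2(0x1000, 2)), hex(0x1000 // 4))
print(hex(mul_by_pow2(0x1000, 2)), hex(0x1000 * 4))
```

Unlike DIV and MUL, the shifts leave EDX alone, so they are not a replacement where the remainder or the high half of the product is needed.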

The LOOP instruction

--------------------------------------------------

1) DEC ECX 1 byte

JNE _Label_ 2/6 bytes (short / near)

2) LOOP _Label_ 2 bytes

And look at this:

3) JE $+5 2 bytes

DEC ECX 1 byte

JNE _Label_ 2 bytes

4) LOOPxx _Label_ (xx = E, NE, Z or NZ) 2 bytes

LOOP is small, but on the 486 and above it executes more slowly ...

Comparison:

--------------------------------------------------

1) Push Eax 1 Byte

Push ebx 1 byte

POP EAX 1 byte

POP EBX 1 BYTE

2) XCHG EAX, EBX 1 BYTE

3) XCHG ECX, EDX 2 bytes

If you only want to move a value, use MOV; it executes better on the Pentium:

4) MOV ECX, EDX 2 bytes

Comparison:

------------------------------------------

1) Not optimized:

LBL1: MOV Al, 5 2 bytes

Stosb 1 byte

Mov Eax, [EBX] 2 bytes

Stosb 1 byte

Ret 1 byte

LBL2: MOV Al, 6 2 bytes

Stosb 1 byte

Mov Eax, [EBX] 2 bytes

Stosb 1 byte

Ret 1 byte

---------

14 BYTES

2) Optimized:

LBL1: MOV Al, 5 2 bytes

LBL: Stosb 1 byte

Mov Eax, [EBX] 2 bytes

Stosb 1 byte

Ret 1 byte

LBL2: MOV Al, 6 2 bytes

JMP LBL 2 bytes

---------

11 bytes

For a variable initialized with a constant, try defining it directly inside the instruction:

-----------------------------

1) MOV EAX, [EBP+VARIABLE] 6 bytes

...

MOV [EBP+VARIABLE], EAX 6 bytes

...

...

VARIABLE DD 12345678H 4 bytes

2) Optimized:

MOV EAX, 12345678H 5 bytes

VARIABLE = DWORD PTR $ - 4

...

...

MOV [EBP+VARIABLE], EAX 6 bytes

Oh, I haven't seen such interesting code in a long time. The precondition is that the code section must be writable, so set the write attribute on it when you build.

Finally, let me introduce the undocumented instruction SALC, which debuggers now support ... What it means: if CF is 1, AL is set to 0FFH; otherwise AL is set to 0.

--------------------------------------------------

1) JC _LBL 2 bytes

MOV AL, 0 2 bytes

JMP _END 2 bytes

_LBL: MOV AL, 0FFH 2 bytes

_END: ...

2) SALC (DB 0D6H) 1 byte
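The behaviour of the two listings can be compared with a tiny Python model. It also includes SBB AL, AL, which is mentioned here only as an assumption-free point of comparison: it is a documented 2-byte instruction that produces the same AL value. The function names are invented for this sketch.

```python
def salc(cf):
    """Model SALC: AL = 0FFh if CF is set, otherwise 0."""
    return 0xFF if cf else 0x00

def branchy(cf):
    """Model listing 1: JC _LBL / MOV AL, 0 / JMP _END / _LBL: MOV AL, 0FFh."""
    if cf:
        return 0xFF   # _LBL: MOV AL, 0FFh
    return 0x00       # MOV AL, 0

def sbb_al_al(al, cf):
    """Model SBB AL, AL: AL - AL - CF, truncated to 8 bits."""
    return (al - al - int(cf)) & 0xFF
```

All three give the same AL for either state of CF; SALC is simply the shortest, at the price of relying on an undocumented opcode.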

When reposting, please credit the original address: https://www.9cbs.com/read-108836.html
