The second revision 2002.4.16 The first revision of 2002.3.12 is published in the first time. 2002.1.17
Speaking of optimization, many people will directly think of compilation. Can I optimize only in the compilation level? Of course, the C hierarchy can be used as code optimization, and some are often unexpected. Optimization at the C level, better transplantability than the assembly hierarchy, should be the preferred approach to optimization.
Determining floating-point variables and expressions are float types
In order to make the compiler produce better code (such as generating 3DNOW! Or SSE instructions), you must determine that floating-point variables and expressions are Float type. It is important to note that floating point constants for "f" or "f" (such as: 3.14F) is a float type, otherwise the default is a Double type. To avoid the FLOAT type parameters automatically convert to Double, use float when the function declaration.
Use 32-bit data types
There are many compilers, but they all contain typical 32-bit types: int, signed, signed int, unsigned, unsigned int, long, signed long, long int, signed long int, unsigned long, unsigned long int. Try to use 32-bit data types because they are more efficient than data or even 8 digits of data.
Waice to use symbol integer variables
In many cases, you need to consider the integer variable is a symbol or no symbol type. For example, it is impossible to save a person's weight data, so it is not necessary to use a symbol type. However, if you want to save temperature data, you must use a symbolic variable.
In many places, it is necessary to consider whether a variable with symbols is necessary. In some cases, the symbol is faster; but in some cases it is reversed.
For example, when integer to floating point transformation, a symbolic integer for more than 16 digits is faster. Since the X86 architecture provides instructions that are converted from the symbolic integer to floating point, it is not provided with instructions that are converted from unsigned interstitial to floating point. Look at the assembly code generated by the compiler:
Wild code:
Double X; MOV [Foo 4], 0Unsigned Int i; MOV EAX, IX = I; MOV EAX, IX = I; MOV [foo], EAXFLID QWORD PTR [FSTP QWORD PTR [x] The above code is relatively slow. Not only because of the number of instructions, but also because the FLID instruction caused by the instructions cannot be paired. It is best to use the following code to replace:
Recommended code:
Double X; Fild DWORD PTR [I] INT I; FSTP Qword PTR [x] x = i;
When the computers and remaining numbers are used in an integer operation, the unsigned type is faster. The following typical code is the 32-bit integer number of compilers divided by 4 code:
Wild code recommended code
Post before compilation int I; MOV Eax, II = I / 4; CDQAND EDX, 3ADD EAX, EdxSar Eax, 2MOV I, EAX
After compilation, Unsigned Int i; SHR I, 2I = I / 4;
to sum up:
No symbol type for:
Swanting and remainder
Cycle count
Array subscript
Has a symbol type for:
Integer to floating point transformation
While vs. for
In programming, we often need to use infinite loops, and two common methods are While (1) and for (;;). These two methods are the same, but that is better? However, let's take a look at the code after compilation: WHILE (1); MOV Eax, 1Test Eax, Eaxje Foo 23HJMP Foo 18h
Before compiling for (;;); jmp foo 23h
At a glance, the for (;;;) instructions are small, not occupying the register, and there is no judgment jump, better than while (1).
Use an array type instead of pointer type
Using a pointer will make the compiler to optimize it. Because of the lack of effective pointer code optimization, the compiler always assumes that the pointer can access anywhere of memory, including storage space assigned to other variables. So for the compiler to produce better code, avoid using pointers in unnecessary places. A typical example is to access data stored in an array. C allows an operator [] or pointer to access an array, using an array code that allows the optimizer to reduce the possibility of unsafe code. For example, x [0] and x [2] cannot be the same memory address, but * p and * q may. It is highly recommended that the array type is used because this may be intentionally improved.
Wild code recommended code TypeDef struct
{
Float X, Y, Z, W;
} Vertex;
Typedef struct
{
Float M [4] [4];
Matrix;
Void XForm (Float * Res, Const Float * v, const float * m, int nnumverts)
{
Float DP;
INT I;
Const Vertex * vv = (Vertex *) V;
For (i = 0; i { DP = VV-> x * * m ; DP = VV-> Y * * m ; DP = VV-> Z * * m ; DP = VV-> w * * m ; * res = dp; // write converted X DP = VV-> x * * m ; DP = VV-> Y * * m ; DP = VV-> Z * * m ; DP = VV-> w * * m ; * res = dp; // write conversion Y DP = VV-> x * * m ; DP = VV-> Y * * m ; DP = VV-> Z * * m ; DP = VV-> w * * m ; * Res = dp; // write converted Z DP = VV-> x * * m ; DP = VV-> Y * * m ; DP = VV-> Z * * m ; DP = VV-> w * * m ; * Res = dp; // write conversion W VV ; // Next vector m - = 16; } } TypeDef Struct { Float X, Y, Z, W; } Vertex; Typedef struct { Float M [4] [4]; Matrix; Void XForm (Float * Res, Const Float * v, const float * m, int nnumverts) { INT I; Const Vertex * vv = (Vertex *) V; Const Matrix * mm = (matrix *) m; Vertex * rr = (vertex *) res; For (i = 0; i { RR-> x = vv-> x * mm-> m [0] [0] vv-> y * mm-> m [0] [1] VV-> z * mm-> m [0] [2] vv-> w * mm-> m [0] [3]; RR-> y = vv-> x * mm-> m [1] [0] vv-> y * mm-> m [1] [1] vv-> z * mm-> m [1] [2] vv-> w * mm-> m [1] [3]; RR-> z = vv-> x * mm-> m [2] [0] vv-> y * mm-> m [2] [1] vv-> z * mm-> m [2] [2] vv-> w * mm-> m [2] [3]; RR-> w = vv-> x * mm-> m [3] [0] vv-> y * mm-> m [3] [1] vv-> z * mm-> m [3] [2] VV-> w * mm-> m [3] [3]; } } Note: The conversion of the source code is combined with the code generator of the compiler. It is difficult to control the generated machine code from the source code hierarchy. Relying on the compiler and special source code, it is possible that the machine code compiled by the pointer code is faster than the array code in the same conditions. Wise practice is to check whether the performance is really improved after the source code transformation, and then select the pointer type or the array type. Make full decomposition of small cycles To make full use of the CPU's instruction cache, you should fully decompose a small loop. Especially when the cyclic body itself is small, the decomposition cycle can improve performance. BTW: Many compilers do not automatically break down the loop. The code recommended by bad code // 3D conversion: multiplies the vector V and 4x4 matrix M For (i = 0; i <4; i ) { R [I] = 0; For (j = 0; j <4; j ) { R [i] = m [j] [i] * v [j]; } } R [0] = M [0] [0] * V [0] m [1] [0] * V [1] M [2] [0] * V [2] m [3] [ 0] * v [3]; R [1] = m [0] [1] * v [0] m [1] [1] * v [1] M [2] [1] * v [2] m [3] [1 ] * V [3]; R [2] = m [0] [2] * v [0] m [1] [2] * v [1] m [2] [2] * v [2] m [3] [2 ] * V [3]; R [3] = m [0] [3] * v [0] m [1] [3] * v [1] m [2] [3] * v [2] m [3] [3 ] * v [3]; Avoid unnecessary read and write dependence When data is saved to memory, it is necessary to read it again after proper written. Although the CPUs such as AMD Athlon have a hardware that accelerates reading and writing, it allows the data to be saved to read before the memory is written, but if reading-write dependence is avoided, the data is saved in the internal register, the speed will be faster. . Avoid reading and writing dependence in a very long and interdependent code chain. If readback depends on an array of operation, many compilers cannot automatically optimize code to avoid reading and writing. Therefore, the recommended programmer manually eliminates reading and writing dependence, for example, introducing a temporary variable that can be saved in the register. This can have a lot of performance improvement. The following code is an example: bad code recommended code float x [veclen], y [veclen], z [veclen]; ....... For (unsigned int k = 1; k { X [k] = x [k-1] y [k]; } For (k = 1; k { X [k] = z [k] * (y [k] - x [k-1]); } float x [veclen], y [veclen], z [veclen]; ....... Float t (x [0]); For (unsigned int k = 1; k { T = T Y [K]; X [K] = T; } T = x [0]; For (k = 1; k { T = z [k] * (y [k] - t); X [K] = T; } Switch's usage Switch may translate into a variety of different algorithms. The most common is the jump table and comparative chain / tree. It is recommended to sort the possibility of CASE, and put the most likely placed in the first one, which can improve performance when Switch transforms in a comparison chain. In addition, small continuous integers are recommended in CASE, because in this case, all compilers can convert Switch into a jump table. The code recommended by bad code INT days_IN_MONTH, SHORT_MONTHS, NORMAL_MONTHS, Long_MONTHS; ....... Switch (days_in_month) { Case 28: Case 29: Short_months ; Break; Case 30: NORMAL_MONTHS ; Break; Case 31: Long_months ; Break; DEFAULT: COUT << "Month Has Fewer Tan 28 Or More Than 31 Days" << Endl; Break; } INT DAYS_IN_MONTH, SHORT_MONTHS, NORMAL_MONTHS, LONG_MONTHS; ....... Switch (days_in_month) { Case 31: Long_months ; Break; Case 30: NORMAL_MONTHS ; Break; Case 28: Case 29: Short_months ; Break; DEFAULT: COUT << "Month Has Fewer Tan 28 Or More Than 31 Days" << Endl; Break; All functions should have prototype definitions In general, all functions should have prototype definitions. Prototype definitions can communicate more information that may be used to optimize. Use constant as possible (const) Use constants as much as possible. The C standard specifies that if the address of a const declared object is not acquired, the compiler is allowed to assign a storage space to it. This makes the code more efficient and generate better code. Enhance the performance of cycles To enhance the performance of the loop, reducing excess constant calculations is very useful (for example, without calculating cyclic variation). Well code (in for () contains unchanged IF ()) recommended code for (i ...) { IF (constant0) { DOWORK0 (i); // Assume that the value of Constant0 does not change here } Else { DOWORK1 (i); // Assume that this does not change the value of Constant0 } } IF (constant0) { For (i ...) { DOWORK0 (i); } } Else { For (i ...) { DOWORK1 (i); } } If IF () is already known, this can avoid repeated calculations. Although the branch in the bad code can be simply predicted, since the recommended code has been determined before entering the cycle, dependence on branch prediction can be reduced. State the local function as static (STIC) If a function is not used outside of its file, declare it as a static (STICEC) to force the internal connection. Otherwise, the function is defined as an external connection with the default. This may affect the optimization of certain compilers - such as automatic inline. Consider dynamic memory allocation Dynamic Memory Allocation ("new" in C ) may always return a genirable pointer for a long base type (four-word alignment). However, if you do not guarantee alignment, use the following code to implement four words alignment. This code assumes that the pointer can be mapped to the LONG type. Example Double * p = (double *) New byte [sizeof (double) * number_of_doubles 7L]; Double * np = (double *) ((long (p) 7L) & -8L); Now you can use NP instead of P to access data. Note: You should still use Delete P when you release the storage space. Use explicit parallel code To solve the long-dependent code chain as much as possible into a few code chains that can be performed in parallel in the pipeline execution unit. Because the floating point operation has a long latency, this is important regardless of whether it is mapped into X87 or 3DNOW! Directive. Many advanced languages, including C , and do not reorder the generated floating point expressions because it is a fairly complex process. It should be noted that the reordering code and the original code are not equal to the calculation results, because the floating point operation lacks accuracy. In some cases, these optimization may result in unexpected results. Fortunately, in most cases, the last result may only be the least important bit (ie, the lowest bit) is wrong. The code recommended by bad code Double A [100], SUM; INT I; SUM = 0.0F; For (i = 0; i <100; i ) SUM = a [i]; Double A [100], SUM1, SUM2, SUM3, SUM4, SUM INT I; SUM1 = SUM2 = SUM3 = SUM4 = 0.0; For (i = 0; i <100; i = 4) { SUM1 = a [i]; SUM2 = a [i 1]; SUM3 = a [i 2]; SUM4 = a [i 3]; } SUM = (SUM4 SUM3) (SUM1 SUM2); It should be noted that using 4-way decomposition is because this uses a 4-stage pipeline floating point addition, each stage of floating point adds occupies a clock cycle, ensuring maximum resource utilization. Propose public sub-expression In some cases, the C compiler cannot propose a common sub-expression from a floating point expression because it means that it is equivalent to reordering the expression. It is important to point out that the compiler will rearrange the expression before extracting the public sub-expression before the equivalent relationship. At this time, the programmer should manually propose public sub-expression (there is a "global optimization" option in VC.NET, but the effect is unknown). The code recommended by bad code float a, b, c, d, e, f; .... E = B * C / D; F = B / D * a; float a, b, c, d, e, f; .... Const float t (b / d); E = C * t; f = a * t; The code recommended by bad code FLOAT A, B, C, E, F; .... E = A / C; F = B / C; Float A, B, C, E, F; .... Const float t (1.0f / c); E = a * t; f = b * t; Structural membership layout Many compilers have the option to "make structural characters, double words or four-word alignment". However, it is still necessary to improve the alignment of structural members, and some compilers may allocate the order of the structural member space and their statements. However, some compilers do not provide these features, or if the effect is not good. Therefore, to achieve the best structural and structural members in the case of paying the minimum cost, it is recommended to take these methods: Sort by type length The members of the structure are sorted according to their type, and the length of the length is placed in a short front. Pack the structure into an integer of the longest type length Fill the structure into an integer of the longest type length. As such, if the first member of the structure is aligned, all the entire structures are naturally aligned. The following example demonstrates how to reorder structural members: Wild code, the code recommended by ordinary order, new order and manually fill a few bytes Struct { Char a [5]; Long K; Double X; } baz; struct { Double X; Long K; Char a [5]; Char Pad [7]; } baz; This rule is equally applicable to the layout of classes. Sort Local Variables by the length of the data type When the compiler is assigned to the local variable space, their order is the same as the order declared in the source code, as in the previous rule, the long variable should be placed in front of the short variable. If the first variable is aligned, the other variables will be continuously stored, and it will be aligned without the padding byte. Some compilers do not automatically change the variable order when allocating variables, and some compilers cannot generate 4-byte aligned stacks, so 4 bytes may not be aligned. The following example demonstrates the reordering of the local variable declaration: Wild code, general order recommended code, improved order Short Ga, GU, GI; Long foo, bar; Double X, Y, Z [3]; CHAR A, B; Float Baz; Double Z [3]; Double X, Y; Long foo, bar; Float baz; Short Ga, GU, GI; Avoid unnecessary integer division Integer division is the slowest in an integer operation, so it should be avoided as much as possible. One place that may reduce integer division is even, where the division can be replaced by multiplication. This replacement side effect is possible to overflow when the product is calculated, so it can only be used in a range of division. The bad code recommended code INT i, J, K, M; M = I / J / K; INT I, J, K, M; M = I / (j * k); Copy frequently used pointer parameters to local variables Avoid frequently using the value to point to the pointer type parameter in the function. Because the compiler does not know if there is a conflict between the pointers, the pointer type parameters often cannot be optimized by the compiler. This is that the data cannot be stored in the register, and the memory bandwidth is obviously occupied. Note that many compilers have "assumes no conflict" optimization switch (which must be manually addup compiler command line / OA or / OW), which allows the compiler to assume that two different pointers have different contents, so No need to save the pointer parameters to local variables. Otherwise, save the data points to the pointer to the local variable in the function. If necessary, copy it back before the function ends. The bad code recommended code / / hypothesis Q! = R Void IsQRT (unsigned long * q, unsigned long * r) { * q = a; IF (a> 0) { While (* q> (* r = a / * q))) { * Q = (* q * r) >> 1; } } * r = a - * q * * q; } // Suppose Q! = R Void IsQRT (unsigned long * q, unsigned long * r) { Unsigned long QQ, RR; QQ = a; IF (a> 0) { While (qq> (rr = a / qq)) { QQ = (QQ RR) >> 1; } } RR = a - QQ * QQ; * Q = QQ; * r = rr; } Assignment and initialization Let's take a look at the following code: Class Cint { INT M_I; PUBLIC: CINT (int A = 0): m_i (a) {cout << "cint" << endl;} ~ Cint () {cout << "~ cint" << endl;} Cint Operator (const cint & a) {return cint (m_i a.get ());} Void setint (const INT i) {m_i = i;} INT getint () const {return m_i;} } Wild code recommended code void main () { CINT A, B, C; a.setint (1); B.setint (2); C = a b; } void main () { CINT A (1), B (2); CINT C (A B); } The things made by these two codes are the same, but that is better? Looking at the output will find that the bad code outputs four "CINT" and four "~ cint", and the recommended code only outputs three. That is, the second example generates a temporary object than the first example. Why? Please note that the first in the first cent is the method of declaring the reputation, the second method is the initialization method, and the difference between them. The first example of "C = A B" is used as a temporary object to save the value of the A B, and the time to copy the temporary object is assigned to C, and the temporary object is destroyed. This temporary object is the more items. The second example is initialized to C with the method of copy constructor, and does not generate a temporary object. So, try to declare when an object is needed, and assign initial value with initialization. Try to use member initialization list When initializing a member, try to use a member initialization list rather than a traditional assignment method. Well-code recommended code Class CMYCLASS { String strname; PUBLIC: CMYCLASS (Const String & Str); } CMYCLASS :: CMYCLASS (Const string & STR) { Strname = STR; Class CMYCLASS { String strname; INT I; PUBLIC: CMYCLASS (Const String & Str); } CMYCLASS :: CMYCLASS (Const string & STR) : Strname (STR) { } A bad example is used to assign a value. In this way, Strname will be established first (call the default constructor of String), and then assign a value by the parameter STR. The recommended example is a list of member initialization. Strname is directly constructed as STR, less adjustment of the default constructor, but also some security hazards.