C++ Source-Level Optimization
Author: System Date: 2003-04-22
Second revision: 2002.4.16
First revision: 2002.3.12
First published: 2002.1.17

When optimization is mentioned, many people immediately think of assembly language. Can code only be optimized at the assembly level? Of course not. Code can also be optimized at the C++ level, and some of the results are surprising. Optimization at the C++ level is more portable than optimization at the assembly level, and it should be the preferred approach.

Make sure floating-point variables and expressions are of type float

To let the compiler generate better code (for example, code that uses 3DNow! or SSE instructions), make sure that floating-point variables and expressions are of type float. Note in particular that a floating-point constant with an "f" or "F" suffix (such as 3.14f) is of type float; without the suffix it defaults to double. To keep float arguments from being converted to double automatically, use float in the function declaration.

Use 32-bit data types

Compilers offer many data types, but they all include the typical 32-bit types: int, signed, signed int, unsigned, unsigned int, long, signed long, unsigned long, unsigned long int. Prefer 32-bit data types, because they are more efficient than 16-bit or even 8-bit data.

Use signed integer variables wisely

In many cases you need to consider whether an integer variable should be signed or unsigned. For example, a person's weight can never be negative, so a signed type is unnecessary. Temperature data, however, can be negative, so a signed variable is required. In many places it is worth asking whether a signed variable is really needed. In some cases a signed type is faster; in others the reverse is true. For example, when converting an integer to floating point, a signed integer wider than 16 bits is faster.
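The literal-suffix rule from the floating-point section above can be checked at compile time. The following is a minimal sketch, not part of the original article; the two helper functions are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <type_traits>

// An unsuffixed floating-point literal is a double; the f/F suffix makes it a float.
static_assert(std::is_same<decltype(3.14), double>::value, "3.14 is a double");
static_assert(std::is_same<decltype(3.14f), float>::value, "3.14f is a float");

// Mixing a float with a double literal promotes the whole expression to double.
static_assert(std::is_same<decltype(1.0f * 0.5), double>::value, "float * double is double");
static_assert(std::is_same<decltype(1.0f * 0.5f), float>::value, "float * float is float");

// Hypothetical helper: the multiplication is done in double, then narrowed to float.
inline float scale_with_double_literal(float x) { return static_cast<float>(x * 0.5); }

// Hypothetical helper: the multiplication stays in float throughout.
inline float scale_with_float_literal(float x) { return x * 0.5f; }

// Usage sketch: both helpers agree for this exactly representable input.
inline void check_float_literals() {
    assert(scale_with_double_literal(4.0f) == 2.0f);
    assert(scale_with_float_literal(4.0f) == 2.0f);
}
```

Both helpers return the same value here; the difference is only in the type of the intermediate expression, which is what decides whether the compiler can stay in single precision.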
The x86 architecture provides an instruction that converts a signed integer to floating point, but no instruction that converts an unsigned integer to floating point. Look at the assembly code the compiler generates:

Bad code:

    Before compilation      After compilation
    double x;               MOV [foo + 4], 0
    unsigned int i;         MOV EAX, i
    x = i;                  MOV [foo], EAX
                            FILD QWORD PTR [foo]
                            FSTP QWORD PTR [x]

The code above is relatively slow, not only because of the number of instructions, but also because the FILD instruction cannot be paired with the instructions that precede it. The following code is better:

Recommended code:

    Before compilation      After compilation
    double x;               FILD DWORD PTR [i]
    int i;                  FSTP QWORD PTR [x]
    x = i;

When computing quotients and remainders in integer arithmetic, unsigned types are faster. The following is typical code generated by a compiler for dividing a 32-bit integer by 4:

Bad code:

    Before compilation      After compilation
    int i;                  MOV EAX, i
    i = i / 4;              CDQ
                            AND EDX, 3
                            ADD EAX, EDX
                            SAR EAX, 2
                            MOV i, EAX

Recommended code:

    Before compilation      After compilation
    unsigned int i;         SHR i, 2
    i = i / 4;

Summary:
Use unsigned types for:
Division and remainder
Loop counters
Array subscripts
Use signed types for:
Integer to floating-point conversion
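The division case in the summary can be sketched in source form. This is an illustrative sketch, not code from the original article, and the function names are hypothetical. Signed division truncates toward zero, so for a negative operand a plain arithmetic shift would give the wrong answer (shifting -7 right by two gives -2, while -7 / 4 is -1); that is why the compiler emits the extra correction instructions for the signed version.

```cpp
#include <cassert>

// Signed division by 4: rounds toward zero, so the compiler cannot use a bare shift.
inline int div4_signed(int x) { return x / 4; }

// Unsigned division by 4: always identical to a logical right shift by 2.
inline unsigned div4_unsigned(unsigned x) { return x / 4u; }
inline unsigned div4_shift(unsigned x) { return x >> 2; }

// Usage sketch: the unsigned forms agree, and signed division truncates toward zero.
inline void check_div4() {
    assert(div4_signed(-7) == -1);
    assert(div4_signed(7) == 1);
    assert(div4_unsigned(7u) == div4_shift(7u));
    assert(div4_unsigned(4000000000u) == div4_shift(4000000000u));
}
```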
while vs. for

In programming we often need an infinite loop, and the two common ways to write one are while (1) and for (;;). Both have the same effect, but which is better? Let's look at the code after compilation:

    Before compilation      After compilation
    while (1);              MOV EAX, 1
                            TEST EAX, EAX
                            JE foo + 23h
                            JMP foo + 18h

    for (;;);               JMP foo + 23h

The result is clear at a glance: for (;;) needs fewer instructions, uses no register, and performs no test or conditional jump, so it is better than while (1).

Use array notation instead of pointers

Using pointers makes it difficult for the compiler to optimize. Lacking effective pointer analysis, the compiler always assumes that a pointer can access any location in memory, including storage assigned to other variables. So, for the compiler to produce better code, avoid pointers where they are not needed. A typical example is accessing data stored in an array. C++ allows an array to be accessed either with the [] operator or through a pointer; array notation lets the optimizer rule out possible aliasing. For example, x[0] and x[2] cannot be the same memory address, but *p and *q may be. Array notation is strongly recommended, because it may bring an unexpected performance improvement.
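The aliasing problem described above can be made concrete with a small sketch. It is not from the original article, and both function names are hypothetical. In the pointer version the result depends on whether p and q point to the same object, so the compiler has to reload *q after every store through p; in the indexed version x[0] and x[2] are known to be distinct.

```cpp
#include <cassert>

// Pointer form: p and q may refer to the same int.
inline int add_twice(int* p, const int* q) {
    assert(p != nullptr && q != nullptr);
    *p += *q;  // this store changes *q when p == q
    *p += *q;  // so *q must be loaded from memory again here
    return *p;
}

// Array form: x[0] and x[2] are different elements, so x[2] can stay in a register.
inline int add_twice_indexed(int (&x)[3]) {
    x[0] += x[2];
    x[0] += x[2];
    return x[0];
}
```

With distinct objects holding 1 and 2, add_twice returns 5; when both pointers refer to one object holding 1, it returns 4. The compiler has to generate code that is correct for both cases.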
Bad code:

    typedef struct
    {
        float x, y, z, w;
    } Vertex;

    typedef struct
    {
        float m[4][4];
    } Matrix;

    void XForm(float *res, const float *v, const float *m, int nnumverts)
    {
        float dp;
        int i;
        const Vertex *vv = (const Vertex *)v;
        for (i = 0; i < nnumverts; i++)
        {
            dp  = vv->x * *m++;
            dp += vv->y * *m++;
            dp += vv->z * *m++;
            dp += vv->w * *m++;
            *res++ = dp;    // write transformed x
            dp  = vv->x * *m++;
            dp += vv->y * *m++;
            dp += vv->z * *m++;
            dp += vv->w * *m++;
            *res++ = dp;    // write transformed y
            dp  = vv->x * *m++;
            dp += vv->y * *m++;
            dp += vv->z * *m++;
            dp += vv->w * *m++;
            *res++ = dp;    // write transformed z
            dp  = vv->x * *m++;
            dp += vv->y * *m++;
            dp += vv->z * *m++;
            dp += vv->w * *m++;
            *res++ = dp;    // write transformed w
            ++vv;           // next vector
            m -= 16;        // back to the start of the matrix
        }
    }

Recommended code:

    typedef struct
    {
        float x, y, z, w;
    } Vertex;

    typedef struct
    {
        float m[4][4];
    } Matrix;

    void XForm(float *res, const float *v, const float *m, int nnumverts)
    {
        int i;
        const Vertex *vv = (const Vertex *)v;
        const Matrix *mm = (const Matrix *)m;
        Vertex *rr = (Vertex *)res;
        for (i = 0; i < nnumverts; i++)
        {
            rr->x = vv->x * mm->m[0][0] + vv->y * mm->m[0][1]
                  + vv->z * mm->m[0][2] + vv->w * mm->m[0][3];
            rr->y = vv->x * mm->m[1][0] + vv->y * mm->m[1][1]
                  + vv->z * mm->m[1][2] + vv->w * mm->m[1][3];
            rr->z = vv->x * mm->m[2][0] + vv->y * mm->m[2][1]
                  + vv->z * mm->m[2][2] + vv->w * mm->m[2][3];
            rr->w = vv->x * mm->m[3][0] + vv->y * mm->m[3][1]
                  + vv->z * mm->m[3][2] + vv->w * mm->m[3][3];
            ++vv;           // next input vector
            ++rr;           // next output vector
        }
    }

Note: The effect of a source code transformation depends on the compiler's code generator. It is difficult to control the generated machine code from the source level. Depending on the compiler and the particular source code, pointer code may compile to faster machine code than array code under the same conditions. The wise practice is to check whether performance really improves after the source transformation, and then choose between pointer and array notation.

Fully unroll small loops

To take full advantage of the CPU's instruction cache, fully unroll small loops. Unrolling improves performance especially when the loop body itself is small. BTW: many compilers do not unroll loops automatically.
Bad code:

    // 3D transform: multiply vector v by the 4x4 matrix m
    for (i = 0; i < 4; i++)
    {
        r[i] = 0;
        for (j = 0; j < 4; j++)
        {
            r[i] += m[j][i] * v[j];
        }
    }

Recommended code:

    r[0] = m[0][0] * v[0] + m[1][0] * v[1] + m[2][0] * v[2] + m[3][0] * v[3];
    r[1] = m[0][1] * v[0] + m[1][1] * v[1] + m[2][1] * v[2] + m[3][1] * v[3];
    r[2] = m[0][2] * v[0] + m[1][2] * v[1] + m[2][2] * v[2] + m[3][2] * v[3];
    r[3] = m[0][3] * v[0] + m[1][3] * v[1] + m[2][3] * v[2] + m[3][3] * v[3];

Avoid unnecessary read-write dependencies

When data is stored to memory and then needed again, it can only be read back after the write has completed; this is a read-write dependency. CPUs such as the AMD Athlon have hardware that accelerates this case by letting the data being stored be read before it has been written to memory, but it is still faster to avoid the dependency altogether and keep the data in an internal register. Avoiding read-write dependencies matters most in long chains of interdependent code. If the dependency arises while operating on an array, many compilers cannot optimize the code automatically to remove it. Programmers are therefore advised to remove read-write dependencies by hand, for example by introducing a temporary variable that can be kept in a register. This can improve performance considerably. The following code is an example:

Bad code:

    float x[veclen], y[veclen], z[veclen];
    ......
    for (unsigned int k = 1; k < veclen; k++)
    {
        x[k] = x[k-1] + y[k];
    }
    for (unsigned int k = 1; k < veclen; k++)
    {
        x[k] = z[k] * (y[k] - x[k-1]);
    }

Recommended code:

    float x[veclen], y[veclen], z[veclen];
    ......
    float t(x[0]);
    for (unsigned int k = 1; k < veclen; k++)
    {
        t = t + y[k];
        x[k] = t;
    }
    t = x[0];
    for (unsigned int k = 1; k < veclen; k++)
    {
        t = z[k] * (y[k] - t);
        x[k] = t;
    }

Using switch

A switch statement may be translated into code that uses one of several different algorithms. The most common are jump tables and comparison chains or trees.
It is recommended to order the cases by likelihood and put the most likely ones first; this improves performance when the switch is translated into a comparison chain. In addition, use small consecutive integers as case values, because in that situation all compilers can translate the switch into a jump table.

Bad code:

    int days_in_month, short_months, normal_months, long_months;
    ......
    switch (days_in_month)
    {
    case 28:
    case 29:
        short_months++;
        break;
    case 30:
        normal_months++;
        break;
    case 31:
        long_months++;
        break;
    default:
        cout << "month has fewer than 28 or more than 31 days" << endl;
        break;
    }

Recommended code:

    int days_in_month, short_months, normal_months, long_months;
    ......
    switch (days_in_month)
    {
    case 31:
        long_months++;
        break;
    case 30:
        normal_months++;
        break;
    case 28:
    case 29:
        short_months++;
        break;
    default:
        cout << "month has fewer than 28 or more than 31 days" << endl;
        break;
    }

Prototype all functions

All functions should have prototypes. A prototype gives the compiler more information that it may be able to use for optimization.

Use const whenever possible

Use const wherever possible. The C++ standard specifies that if the address of an object declared const is never taken, the compiler is allowed not to allocate storage for it. This makes the code more efficient and lets the compiler generate better code.

Improve loop performance

To improve the performance of a loop, it is very useful to remove redundant invariant computations (that is, not to recompute values that do not change with the loop).

Bad code (the for() contains an invariant if()):

    for (i ...)
    {
        if (CONSTANT0)
        {
            DoWork0(i);    // assume the value of CONSTANT0 is not changed here
        }
        else
        {
            DoWork1(i);    // assume the value of CONSTANT0 is not changed here
        }
    }

Recommended code:

    if (CONSTANT0)
    {
        for (i ...)
        {
            DoWork0(i);
        }
    }
    else
    {
        for (i ...)
        {
            DoWork1(i);
        }
    }

If the value of the if() condition is already known, evaluating it repeatedly can be avoided.
Although the branch in the bad code is easy to predict, the recommended code makes the decision before entering the loop, which reduces the reliance on branch prediction.

Declare local functions static

If a function is not used outside the file in which it is defined, declare it static to force internal linkage. Otherwise the function has external linkage by default, which may inhibit some compiler optimizations, such as automatic inlining.

Consider the alignment of dynamically allocated memory

Dynamic memory allocation ("new" in C++) may always return a pointer that is suitably aligned for long base types (quadword aligned). If alignment cannot be guaranteed, however, use the following code to obtain quadword alignment. This code assumes that a pointer can be mapped to the long type.

Example:

    double *p = (double *)new BYTE[sizeof(double) * number_of_doubles + 7L];
    double *np = (double *)((long(p) + 7L) & -8L);

Now use np instead of p to access the data. Note: the storage must still be released through p (with delete[] p), not through np.

Use explicitly parallel code

Wherever possible, break long dependency chains into several independent chains that can execute in parallel in the pipelined execution units. Because floating-point operations have long latencies, this is important regardless of whether they are mapped to x87 or 3DNow! instructions. Many high-level languages, including C++, do not reorder floating-point expressions in the generated code, because doing so is a rather complex process. Note that the reordered code and the original code are algebraically equivalent but may not produce identical results, because floating-point arithmetic has limited precision. In some cases these optimizations can lead to unexpected results. Fortunately, in most cases only the least significant bits of the final result may differ.
Bad code:

    double a[100], sum;
    int i;
    sum = 0.0f;
    for (i = 0; i < 100; i++)
        sum += a[i];

Recommended code:

    double a[100], sum1, sum2, sum3, sum4, sum;
    int i;
    sum1 = sum2 = sum3 = sum4 = 0.0;
    for (i = 0; i < 100; i += 4)
    {
        sum1 += a[i];
        sum2 += a[i+1];
        sum3 += a[i+2];
        sum4 += a[i+3];
    }
    sum = (sum4 + sum3) + (sum1 + sum2);

Note: Four-way unrolling is used because the floating-point adder is assumed to have a four-stage pipeline, with each stage taking one clock cycle; this guarantees maximum resource utilization.

Extract common subexpressions

In some cases the C++ compiler cannot extract a common subexpression from a floating-point expression, because doing so would amount to reordering the expression. More precisely, the compiler cannot rearrange the expression according to algebraic equivalence before extracting the common subexpression. In such cases the programmer should extract the common subexpression by hand (VC.NET has a "global optimization" option, but its effect is unknown).

Bad code:

    float a, b, c, d, e, f;
    ....
    e = b * c / d;
    f = b / d * a;

Recommended code:

    float a, b, c, d, e, f;
    ....
    const float t(b / d);
    e = c * t;
    f = a * t;

Bad code:

    float a, b, c, e, f;
    ....
    e = a / c;
    f = b / c;

Recommended code:

    float a, b, c, e, f;
    ....
    const float t(1.0f / c);
    e = a * t;
    f = b * t;

Layout of structure members

Many compilers have an option to align structures on word, doubleword, or quadword boundaries. It is still necessary to improve the alignment of the structure members themselves, and some compilers may allocate space for structure members in an order different from their declaration. However, some compilers do not provide these features, or do not implement them well.
Therefore, to get the best alignment of structures and structure members at minimal cost, the following methods are recommended:

Sort by type size
Sort the members of the structure by the size of their types, declaring members of larger types before members of smaller types.

Pad the structure to a multiple of its largest type size
Pad the structure so that its size is a multiple of the size of its largest member type. Then, if the first member of the structure is aligned, the whole structure is naturally aligned.

The following example shows how to reorder structure members:

Bad code (ordinary order):

    struct
    {
        char a[5];
        long k;
        double x;
    } baz;

Recommended code (new order, with a few bytes of manual padding):

    struct
    {
        double x;
        long k;
        char a[5];
        char pad[7];
    } baz;

This rule applies equally to the layout of class members.

Sort local variables by type size

When the compiler allocates space for local variables, it places them in the order in which they are declared in the source code. As with the previous rule, long variables should be declared before short ones. If the first variable is aligned, the other variables are stored contiguously and are naturally aligned without padding bytes. Some compilers do not reorder variables automatically when allocating them, and some compilers cannot generate a 4-byte-aligned stack, so 4-byte alignment may not be achieved. The following example shows how to reorder local variable declarations:

Bad code (ordinary order):

    short ga, gu, gi;
    long foo, bar;
    double x, y, z[3];
    char a, b;
    float baz;

Recommended code (improved order):

    double z[3];
    double x, y;
    long foo, bar;
    float baz;
    short ga, gu, gi;
    char a, b;

Avoid unnecessary integer division

Integer division is the slowest integer operation, so avoid it wherever possible. One place where integer division can be reduced is chained division, where one of the divisions can be replaced by a multiplication.
A side effect of this replacement is that the product may overflow, so it can only be used when the divisors are within a range where the product cannot overflow.

Bad code:

    int i, j, k, m;
    m = i / j / k;

Recommended code:

    int i, j, k, m;
    m = i / (j * k);

Copy frequently used pointer parameters to local variables

Avoid repeatedly dereferencing pointer parameters inside a function. Because the compiler does not know whether the pointers alias each other, code that uses pointer parameters often cannot be optimized. The data cannot be kept in registers, and memory bandwidth is visibly consumed. Note that many compilers have an "assume no aliasing" optimization switch (in VC it must be added to the compiler command line by hand: /Oa or /Ow), which lets the compiler assume that two different pointers always refer to different contents, so that pointer parameters do not need to be copied to local variables. Otherwise, copy the data the pointers refer to into local variables at the start of the function and, if necessary, copy it back before the function ends.

Bad code:

    // assume q != r
    void isqrt(unsigned long a, unsigned long *q, unsigned long *r)
    {
        *q = a;
        if (a > 0)
        {
            while (*q > (*r = a / *q))
            {
                *q = (*q + *r) >> 1;
            }
        }
        *r = a - *q * *q;
    }

Recommended code:

    // assume q != r
    void isqrt(unsigned long a, unsigned long *q, unsigned long *r)
    {
        unsigned long qq, rr;
        qq = a;
        if (a > 0)
        {
            while (qq > (rr = a / qq))
            {
                qq = (qq + rr) >> 1;
            }
        }
        rr = a - qq * qq;
        *q = qq;
        *r = rr;
    }

Assignment versus initialization

First look at the following code:

    class CInt
    {
        int m_i;
    public:
        CInt(int a = 0) : m_i(a) { cout << "CInt" << endl; }
        ~CInt() { cout << "~CInt" << endl; }
        CInt operator+(const CInt &a) { return CInt(m_i + a.GetInt()); }
        void SetInt(const int i) { m_i = i; }
        int GetInt() const { return m_i; }
    };

Bad code:

    int main()
    {
        CInt a, b, c;
        a.SetInt(1);
        b.SetInt(2);
        c = a + b;
    }

Recommended code:

    int main()
    {
        CInt a(1), b(2);
        CInt c(a + b);
    }

Both pieces of code do the same thing, but which is better?
Looking at the output, the bad code prints "CInt" four times and "~CInt" four times, while the recommended code prints each only three times. That is, the second example creates one fewer temporary object than the first. Why? Note that the first example declares c and then assigns to it, whereas the second example initializes it, and the two are different. In the first example, "c = a + b" first creates a temporary object to hold the value of a + b, then copies the temporary into c by assignment, and then destroys the temporary. That temporary is the extra object. The second example initializes c with the copy constructor and creates no temporary object. So declare an object only at the point where it is needed, and give it its initial value by initialization.

Prefer member initialization lists

When initializing the members of a class, prefer the member initialization list to traditional assignment.

Bad code:

    class CMyClass
    {
        string strName;
    public:
        CMyClass(const string &str);
    };

    CMyClass::CMyClass(const string &str)
    {
        strName = str;
    }

Recommended code:

    class CMyClass
    {
        string strName;
        int i;
    public:
        CMyClass(const string &str);
    };

    CMyClass::CMyClass(const string &str)
        : strName(str)
    {
    }

The bad example uses assignment: strName is first constructed (by calling the default constructor of string) and is then assigned from the parameter str. The recommended example uses the member initialization list: strName is constructed directly from str, which saves one call to the default constructor and also avoids some safety hazards.
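The difference between the two constructors can be observed by counting calls. The sketch below is not from the original article; Tracked, AssignsInBody, and UsesInitList are hypothetical names introduced only to make the extra default construction visible.

```cpp
#include <cassert>
#include <string>

// Hypothetical instrumented member type that counts how it is created and assigned.
struct Tracked {
    static int default_ctor_calls;
    static int copy_ctor_calls;
    static int assign_calls;
    std::string value;

    Tracked() { ++default_ctor_calls; }
    Tracked(const Tracked& other) : value(other.value) { ++copy_ctor_calls; }
    Tracked& operator=(const Tracked& other) {
        value = other.value;
        ++assign_calls;
        return *this;
    }
    static void reset() { default_ctor_calls = copy_ctor_calls = assign_calls = 0; }
};
int Tracked::default_ctor_calls = 0;
int Tracked::copy_ctor_calls = 0;
int Tracked::assign_calls = 0;

// Assignment in the constructor body: member is default-constructed, then assigned.
struct AssignsInBody {
    Tracked member;
    explicit AssignsInBody(const Tracked& t) { member = t; }
};

// Member initialization list: member is copy-constructed directly from t.
struct UsesInitList {
    Tracked member;
    explicit UsesInitList(const Tracked& t) : member(t) {}
};

// Usage sketch: count the calls made by each style of constructor.
inline void check_member_init() {
    Tracked src;
    Tracked::reset();
    AssignsInBody a(src);
    assert(Tracked::default_ctor_calls == 1 && Tracked::assign_calls == 1);
    Tracked::reset();
    UsesInitList b(src);
    assert(Tracked::default_ctor_calls == 0 && Tracked::copy_ctor_calls == 1);
}
```

For a cheap member type the saving is small, but for members such as strings or containers the avoided default construction and assignment can be noticeable.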