C++ Code Optimization

xiaoxiao  2021-03-06

Speaking of optimization, many people immediately think of assembly language. But can code be optimized only at the assembly level? Of course not. Code can also be optimized at the C++ level, and the results are often surprising. Optimization at the C++ level is more portable than optimization at the assembly level, so it should be the preferred approach.

Make sure floating-point variables and expressions are of type float

To let the compiler generate better code (for example, code that uses 3DNow! or SSE instructions), make sure that floating-point variables and expressions are of type float. Pay special attention to constants: a floating-point constant with an "f" or "F" suffix (such as 3.14f) is a float; without the suffix it is a double by default. To keep float arguments from being converted to double automatically, use float in the function declaration.

Use 32-bit data types

Compilers offer many data types, but on the 32-bit x86 compilers this article targets they all include the typical 32-bit types: int, signed, signed int, unsigned, unsigned int, long, signed long, unsigned long, and unsigned long int. Prefer 32-bit data types, because they are handled more efficiently than 16-bit or even 8-bit data.

Use signed and unsigned integers wisely

In many cases you need to decide whether an integer variable should be signed or unsigned. A person's weight, for example, can never be negative, so a signed type is unnecessary. A temperature, on the other hand, must be stored in a signed variable. Beyond correctness, the choice also affects speed: in some cases signed types are faster, in others unsigned types are. For example, when converting an integer to floating point, signed integers wider than 16 bits convert faster, because the x86 architecture provides an instruction that converts a signed integer to floating point but no instruction that converts an unsigned integer to floating point.
Look at the assembly code generated by the compiler.

Bad code:

    Source                   Compiled
    double x;                mov [foo + 4], 0
    unsigned int i;          mov eax, i
    x = i;                   mov [foo], eax
                             fild qword ptr [foo]
                             fstp qword ptr [x]

The code above is slow, not only because of the number of instructions but also because the resulting FILD instruction cannot be paired. It is better to use the following code instead.

Recommended code:

    Source                   Compiled
    double x;                fild dword ptr [i]
    int i;                   fstp qword ptr [x]
    x = i;

When computing quotients and remainders in integer arithmetic, on the other hand, unsigned types are faster.
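One way to see why the signed case needs extra work is that signed division rounds toward zero, while a plain right shift on the two's-complement value would round downward, so the compiler has to add a correction for negative inputs. The following is a minimal sketch of that difference; the helper names (signed_div4, floor_div4, and so on) are hypothetical and are not part of the original article.

```cpp
// Signed division by 4: C++11 and later guarantee truncation toward zero.
int signed_div4(int i) { return i / 4; }

// Floor division by 4, written without shifting a negative value.
// This is the result an arithmetic right shift by 2 would produce.
// Assumption: i is not the most negative int (negating it would overflow).
int floor_div4(int i) { return (i >= 0) ? i / 4 : -((-i + 3) / 4); }

// Unsigned division by 4 and the equivalent logical shift.
unsigned unsigned_div4(unsigned u) { return u / 4; }
unsigned unsigned_shift2(unsigned u) { return u >> 2; }
```

For i = -7, signed_div4 yields -1 but floor_div4 yields -2; adding a bias of 3 to negative inputs before the floor division brings the two into agreement, which is the kind of fix-up a compiler has to emit for the signed case. For unsigned values no correction is needed, so a single shift is enough.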

The following is typical code generated by a compiler for dividing a 32-bit integer by 4.

Bad code:

    Source                   Compiled
    int i;                   mov eax, i
    i = i / 4;               cdq
                             and edx, 3
                             add eax, edx
                             sar eax, 2
                             mov i, eax

Recommended code:

    Source                   Compiled
    unsigned int i;          shr i, 2
    i = i / 4;

Summary: use unsigned types for division and remainders and for loop counters; use signed types for conversion from integer to floating point.

while vs. for

When programming, we often need an infinite loop. The two common ways to write one are while (1) and for (;;). Both have the same effect, but which is better? Let us look at the compiled code:

    Source                   Compiled
    while (1);               mov eax, 1
                             test eax, eax
                             je foo+23h
                             jmp foo+18h

    for (;;);                jmp foo+23h

In this output, for (;;) needs fewer instructions, uses no register, and performs no test or conditional jump, so it is better than while (1).

Use array notation instead of pointers

Pointers make it hard for the compiler to optimize. Lacking effective alias analysis for pointer code, the compiler always assumes that a pointer can access any location in memory, including storage allocated to other variables. So, to let the compiler produce better code, avoid pointers where they are not needed. A typical example is access to data stored in an array. C++ allows an array to be accessed either with operator [] or through a pointer; the array form lets the optimizer rule out more cases of possible aliasing. For example, x[0] and x[2] cannot be the same memory address, but *p and *q may be. Array notation is strongly recommended, because it may bring an unexpected performance improvement.
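The aliasing problem described above can be made visible with a small sketch. The function add_twice and the two demo helpers are hypothetical names introduced only for illustration. Because p and q might point at the same object, the compiler cannot keep *q in a register across the first store and has to load it again.

```cpp
// Adds *q to *p twice. If p and q alias, the first store changes *q,
// so the second addition sees a different value than the first.
void add_twice(int* p, const int* q) {
    *p += *q;
    *p += *q;
}

// p and q refer to different objects: 1 + 1 + 1.
int demo_distinct() {
    int a = 1, b = 1;
    add_twice(&a, &b);
    return a;
}

// p and q refer to the same object: 1 + 1, then 2 + 2.
int demo_aliased() {
    int a = 1;
    add_twice(&a, &a);
    return a;
}
```

demo_distinct returns 3 and demo_aliased returns 4. Since both calls are legal, the generated code for add_twice must handle both, which is the conservative assumption the text refers to.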

Bad code:

    typedef struct { float x, y, z, w; } Vertex;
    typedef struct { float m[4][4]; } Matrix;

    void XForm(float *res, const float *v, const float *m, int nNumVerts)
    {
        float dp;
        int i;
        const Vertex *vv = (const Vertex *)v;
        for (i = 0; i < nNumVerts; i++) {
            dp  = vv->x * *m++;
            dp += vv->y * *m++;
            dp += vv->z * *m++;
            dp += vv->w * *m++;
            *res++ = dp;        // write transformed x
            dp  = vv->x * *m++;
            dp += vv->y * *m++;
            dp += vv->z * *m++;
            dp += vv->w * *m++;
            *res++ = dp;        // write transformed y
            dp  = vv->x * *m++;
            dp += vv->y * *m++;
            dp += vv->z * *m++;
            dp += vv->w * *m++;
            *res++ = dp;        // write transformed z
            dp  = vv->x * *m++;
            dp += vv->y * *m++;
            dp += vv->z * *m++;
            dp += vv->w * *m++;
            *res++ = dp;        // write transformed w
            ++vv;               // next vector
            m -= 16;            // rewind to the start of the matrix
        }
    }

Recommended code:

    typedef struct { float x, y, z, w; } Vertex;
    typedef struct { float m[4][4]; } Matrix;

    void XForm(float *res, const float *v, const float *m, int nNumVerts)
    {
        int i;
        const Vertex *vv = (const Vertex *)v;
        const Matrix *mm = (const Matrix *)m;
        Vertex *rr = (Vertex *)res;
        for (i = 0; i < nNumVerts; i++) {
            rr->x = vv->x * mm->m[0][0] + vv->y * mm->m[0][1]
                  + vv->z * mm->m[0][2] + vv->w * mm->m[0][3];
            rr->y = vv->x * mm->m[1][0] + vv->y * mm->m[1][1]
                  + vv->z * mm->m[1][2] + vv->w * mm->m[1][3];
            rr->z = vv->x * mm->m[2][0] + vv->y * mm->m[2][1]
                  + vv->z * mm->m[2][2] + vv->w * mm->m[2][3];
            rr->w = vv->x * mm->m[3][0] + vv->y * mm->m[3][1]
                  + vv->z * mm->m[3][2] + vv->w * mm->m[3][3];
            ++rr;               // next output vector
            ++vv;               // next input vector
        }
    }

Note: how source code is translated depends on the compiler's code generator, and it is difficult to control the generated machine code from the source level. Depending on the compiler and the particular source code, pointer code may even compile to faster machine code than the equivalent array code. The wise practice is to check whether performance has really improved after the source transformation, and only then choose between the pointer form and the array form.

Fully unroll small loops

To make full use of the CPU's instruction cache, small loops should be fully unrolled. Unrolling improves performance especially when the loop body itself is small. BTW: many compilers do not unroll loops automatically.

Bad code:

    // 3D transform: multiply vector V by the 4x4 matrix M
    for (i = 0; i < 4; i++) {
        r[i] = 0;
        for (j = 0; j < 4; j++) {
            r[i] += M[j][i] * V[j];
        }
    }

Recommended code:

    r[0] = M[0][0] * V[0] + M[1][0] * V[1] + M[2][0] * V[2] + M[3][0] * V[3];
    r[1] = M[0][1] * V[0] + M[1][1] * V[1] + M[2][1] * V[2] + M[3][1] * V[3];
    r[2] = M[0][2] * V[0] + M[1][2] * V[1] + M[2][2] * V[2] + M[3][2] * V[3];
    r[3] = M[0][3] * V[0] + M[1][3] * V[1] + M[2][3] * V[2] + M[3][3] * V[3];

Avoid unnecessary read-write dependencies

When data is stored to memory and then read back, a read-write dependency arises: the data must have been written correctly before it can be read again. CPUs such as the AMD Athlon have hardware that accelerates this case, allowing data that is being stored to be read before it has actually been written to memory, but it is still faster to avoid the dependency and keep the data in an internal register. Avoiding read-write dependencies matters most in long chains of interdependent code. When the dependency occurs while operating on an array, many compilers cannot remove it automatically. The programmer should therefore eliminate it by hand, for example by introducing a temporary variable that can be kept in a register. This can improve performance considerably. The following code is an example.

Bad code:

    float x[VECLEN], y[VECLEN], z[VECLEN];
    ...
    for (unsigned int k = 1; k < VECLEN; k++) {
        x[k] = x[k-1] + y[k];
    }
    for (unsigned int k = 1; k < VECLEN; k++) {
        x[k] = z[k] * (y[k] - x[k-1]);
    }

Recommended code:

    float x[VECLEN], y[VECLEN], z[VECLEN];
    ...
    float t(x[0]);
    for (unsigned int k = 1; k < VECLEN; k++) {
        t = t + y[k];
        x[k] = t;
    }
    t = x[0];
    for (unsigned int k = 1; k < VECLEN; k++) {
        t = z[k] * (y[k] - t);
        x[k] = t;
    }

Using switch

A switch statement may be translated into code that uses one of several different algorithms. The most common are a jump table and a comparison chain or tree.
It is recommended to order the case labels by how likely they are, with the most likely case first. This improves performance when the switch is translated into a comparison chain.

In addition, small consecutive integers are recommended as case values, because in that situation all compilers can translate the switch into a jump table.

Bad code:

    int days_in_month, short_months, normal_months, long_months;
    ...
    switch (days_in_month) {
    case 28:
    case 29:
        short_months++;
        break;
    case 30:
        normal_months++;
        break;
    case 31:
        long_months++;
        break;
    default:
        cout << "month has fewer than 28 or more than 31 days" << endl;
        break;
    }

Recommended code:

    int days_in_month, short_months, normal_months, long_months;
    ...
    switch (days_in_month) {
    case 31:
        long_months++;
        break;
    case 30:
        normal_months++;
        break;
    case 28:
    case 29:
        short_months++;
        break;
    default:
        cout << "month has fewer than 28 or more than 31 days" << endl;
        break;
    }

All functions should have prototypes

Every function should have a prototype. A prototype gives the compiler more information that it may be able to use for optimization.

Use const whenever possible

Use const wherever you can. The C++ standard specifies that if the address of an object declared const is never taken, the compiler is allowed not to allocate storage for it. This makes the code more efficient and lets the compiler generate better code.

Improve loop performance

To improve the performance of a loop, it is very useful to remove redundant constant calculations, that is, calculations whose result does not change from one iteration to the next.

Bad code (the for() contains an if() that never changes):

    for (i ...) {
        if (CONSTANT0) {
            DoWork0(i);    // assume this does not change the value of CONSTANT0
        } else {
            DoWork1(i);    // assume this does not change the value of CONSTANT0
        }
    }

Recommended code:

    if (CONSTANT0) {
        for (i ...) {
            DoWork0(i);
        }
    } else {
        for (i ...) {
            DoWork1(i);
        }
    }

If the value of the if() condition is already known, evaluating it repeatedly can be avoided. Although the branch in the bad code is easy to predict, the recommended code decides the branch before entering the loop, which reduces the dependence on branch prediction.
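The same idea applies to invariant calculations, not only to invariant conditions. The sketch below is an illustration under assumptions: the function names scale_naive and scale_hoisted and the use of std::cos are hypothetical and not taken from the original article. A compiler may or may not perform this hoisting on its own, so it is worth measuring.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Evaluates the loop-invariant factor on every iteration.
void scale_naive(std::vector<double>& data, double angle) {
    for (std::size_t i = 0; i < data.size(); ++i) {
        data[i] *= std::cos(angle) * 2.0;
    }
}

// Evaluates the invariant factor once, before the loop starts.
void scale_hoisted(std::vector<double>& data, double angle) {
    const double factor = std::cos(angle) * 2.0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        data[i] *= factor;
    }
}
```

Both functions compute the same values; the second simply makes it explicit that the factor does not depend on the loop variable, so the result does not rely on the optimizer recognizing that.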

Declare local functions static

If a function is not used outside the file in which it is defined, declare it static to force internal linkage. Otherwise the function has external linkage by default, which may inhibit certain compiler optimizations such as automatic inlining.

Consider dynamic memory allocation

Dynamic memory allocation (new in C++) usually returns a pointer that is already suitably aligned for the longest basic type (quadword alignment). If that alignment cannot be guaranteed, the following code can be used to obtain a quadword-aligned pointer. The code assumes that a pointer can be converted to the type long.

Example:

    double *p = (double *)new BYTE[sizeof(double) * number_of_doubles + 7L];
    double *np = (double *)((long(p) + 7L) & -8L);

Now use np instead of p to access the data. Note: the storage must still be released through the original pointer p, not through np, and because it was allocated with new[] it must be released with delete[].

Use explicitly parallel code

Where possible, break a long chain of dependent code into several independent chains that can execute in parallel in the pipelined execution units. Because floating-point operations have a long latency, this is important whether the code is mapped to x87 or to 3DNow! instructions. Many high-level languages, including C++, do not reorder the floating-point expressions they are given, because doing so is a fairly complex process. Note that reordered code and the original code do not necessarily produce identical results, because floating-point arithmetic has limited precision. In some cases these optimizations can lead to unexpected results. Fortunately, in most cases only the least significant bit of the final result differs.
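The remark that reordering can change a floating-point result can be checked with a deliberately extreme example. The sketch below uses hypothetical helper names and assumes ordinary IEEE 754 double arithmetic without options such as fast-math. It adds the same three values in two different groupings.

```cpp
// Adds three values, grouping the first two together.
double sum_left_first(double a, double b, double c) {
    return (a + b) + c;
}

// Adds the same three values, grouping the last two together.
double sum_right_first(double a, double b, double c) {
    return a + (b + c);
}
```

With a = 1e20, b = -1e20 and c = 1.0, the first grouping yields 1.0, because the two large values cancel before c is added. The second yields 0.0, because c is far too small to change -1e20 when the two are added first. Typical data differs far less than this, usually in the last bit only, as the text notes, but the example shows why a compiler is not free to regroup such sums on its own.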
Bad code:

    double a[100], sum;
    int i;
    sum = 0.0f;
    for (i = 0; i < 100; i++)
        sum += a[i];

Recommended code:

    double a[100], sum1, sum2, sum3, sum4, sum;
    int i;
    sum1 = sum2 = sum3 = sum4 = 0.0;
    for (i = 0; i < 100; i += 4) {
        sum1 += a[i];
        sum2 += a[i+1];
        sum3 += a[i+2];
        sum4 += a[i+3];
    }
    sum = (sum4 + sum3) + (sum1 + sum2);

Note: four-way unrolling is used here because the floating-point adder is a 4-stage pipeline in which each stage takes one clock cycle; four independent chains keep it fully utilized.

Extract common subexpressions

In some cases the C++ compiler cannot extract a common subexpression from floating-point expressions, because doing so would mean reordering the expression. Note in particular that the compiler cannot rearrange an expression according to algebraic equivalence before extracting common subexpressions. In such cases the programmer should extract the common subexpression by hand (VC.NET has a "global optimization" option that is meant to do this, but how effective it is is unknown).

Bad code:

    float a, b, c, d, e, f;
    ...
    e = b * c / d;
    f = b / d * a;

Recommended code:

    float a, b, c, d, e, f;
    ...
    const float t(b / d);
    e = c * t;
    f = a * t;

Bad code:

    float a, b, c, e, f;
    ...
    e = a / c;
    f = b / c;

Recommended code:

    float a, b, c, e, f;
    ...
    const float t(1.0f / c);
    e = a * t;
    f = b * t;

Layout of structure members

Many compilers have an option to align structures on word, doubleword, or quadword boundaries. It is still worth improving the alignment of the structure members yourself. Some compilers may allocate space for the members in an order different from their declaration order, but others do not offer this feature, or do it poorly. To get the best alignment of structures and their members at the lowest cost, the following two methods are recommended.

Sort by type length: order the members of the structure by the size of their types, declaring members with longer types before members with shorter types.

Pad the structure to a multiple of the longest type length: if the structure's size is a multiple of its longest member type, then as long as the first member is aligned, the whole structure is naturally aligned.

The following example shows how to reorder the members of a structure.

Bad code (ordinary order):

    struct {
        char a[5];
        long k;
        double x;
    } baz;

Recommended code (new order, with a few bytes of manual padding):

    struct {
        double x;
        long k;
        char a[5];
        char pad[7];
    } baz;

The same rule applies to the layout of class members.

Sort local variables by data type length

When the compiler allocates space for local variables, it places them in the order in which they are declared in the source code. As with the previous rule, long variables should be declared before short ones. If the first variable is aligned, the remaining variables are stored contiguously and are naturally aligned without any padding bytes.
Some compilers do not reorder variables automatically when allocating them, and some compilers cannot produce a 4-byte-aligned stack, so 4-byte values may end up unaligned. The following example shows how to reorder local variable declarations.

Bad code (ordinary order):

    short ga, gu, gi;
    long foo, bar;
    double x, y, z[3];
    char a, b;
    float baz;

Recommended code (improved order):

    double z[3];
    double x, y;
    long foo, bar;
    float baz;
    short ga, gu, gi;
    char a, b;

Avoid unnecessary integer division

Integer division is the slowest of the integer operations, so avoid it where possible. One place where the number of integer divisions can be reduced is chained division, where one of the divisions can be replaced by a multiplication. The side effect of this replacement is that the product may overflow, so it can only be used when the divisors lie within a limited range.

Bad code:

    int i, j, k, m;
    m = i / j / k;

Recommended code:

    int i, j, k, m;
    m = i / (j * k);

Copy frequently used pointer-type parameters to local variables

Avoid repeatedly using the values that pointer-type parameters point to inside a function. Because the compiler does not know whether the pointers alias each other, code that uses pointer parameters often cannot be optimized: the data cannot be kept in registers, and memory bandwidth is visibly consumed. Note that many compilers have an "assume no aliasing" optimization switch (in VC it must be added to the compiler command line by hand: /Oa or /Ow). This switch lets the compiler assume that two different pointers always refer to different data, so that pointer parameters need not be copied to local variables. Otherwise, copy the data that the pointers refer to into local variables at the start of the function and, if necessary, copy it back before the function ends.

Bad code:

    // assume q != r
    void isqrt(unsigned long a, unsigned long *q, unsigned long *r)
    {
        *q = a;
        if (a > 0) {
            while (*q > (*r = a / *q)) {
                *q = (*q + *r) >> 1;
            }
        }
        *r = a - *q * *q;
    }

Recommended code:

    // assume q != r
    void isqrt(unsigned long a, unsigned long *q, unsigned long *r)
    {
        unsigned long qq, rr;
        qq = a;
        if (a > 0) {
            while (qq > (rr = a / qq)) {
                qq = (qq + rr) >> 1;
            }
        }
        rr = a - qq * qq;
        *q = qq;
        *r = rr;
    }

Assignment and initialization

First look at the following code:

    class CInt
    {
        int m_i;
    public:
        CInt(int a = 0) : m_i(a) { cout << "CInt" << endl; }
        ~CInt() { cout << "~CInt" << endl; }
        CInt operator+(const CInt &a) { return CInt(m_i + a.GetInt()); }
        void SetInt(const int i) { m_i = i; }
        int GetInt() const { return m_i; }
    };

Bad code:

    int main()
    {
        CInt a, b, c;
        a.SetInt(1);
        b.SetInt(2);
        c = a + b;
    }

Recommended code:

    int main()
    {
        CInt a(1), b(2);
        CInt c(a + b);
    }

Both pieces of code do the same thing, but which is better? The output shows that the bad code prints "CInt" four times and "~CInt" four times, while the recommended code prints each only three times. In other words, the first example creates one more temporary object than the second. Why? Note that c in the first example is declared first and assigned afterwards, while in the second example it is initialized directly, and the two are different. In the first example, "c = a + b" first creates a temporary object to hold the value of a + b, then assigns that temporary to c, and finally destroys the temporary. That temporary object is the extra one.
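The difference in object counts can also be measured rather than read off the output. The following sketch uses a hypothetical stand-in class, Counted, which counts calls to its int constructor instead of printing; implicitly generated copies are not counted. The class and function names are assumptions made for illustration and are not the author's code.

```cpp
// Hypothetical stand-in for CInt that counts constructor calls.
struct Counted {
    static int constructed;   // number of calls to the int constructor
    int value;

    Counted(int v = 0) : value(v) { ++constructed; }

    // Builds the sum in a new object, as CInt::operator+ does.
    Counted operator+(const Counted& other) const {
        return Counted(value + other.value);
    }
};

int Counted::constructed = 0;

// Declare first, assign later: a, b, c, plus the temporary from a + b.
int constructions_with_assignment() {
    Counted::constructed = 0;
    Counted a, b, c;
    a.value = 1;
    b.value = 2;
    c = a + b;
    return Counted::constructed;
}

// Initialize directly: a, b, and the object built by a + b.
int constructions_with_initialization() {
    Counted::constructed = 0;
    Counted a(1), b(2);
    Counted c(a + b);
    return Counted::constructed;
}
```

Under these assumptions the first function reports 4 constructions and the second reports 3, matching the counts described in the text.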

When reposting, please credit the original address: https://www.9cbs.com/read-74472.html
