Optimizing performance (3): vectorizing with the Intel compiler
This article is translated from the Intel compiler documentation.
1. Vectorizable data types in loops
In integer loops, the MMX and SSE instruction sets provide SIMD instructions for most arithmetic and logical operations on 8/16/32-bit data. An integer algorithm can be vectorized only if its results are saved in variables with sufficient precision; for example, if the result of an operation is a 32-bit integer but it is saved in a 16-bit integer, that operation cannot be vectorized. Not all integer operations can be vectorized.
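As a rough illustration of this precision rule (the array names and sizes below are invented for the example), the first loop keeps the 32-bit product in a 32-bit array and is a candidate for vectorization, while the second narrows it into a 16-bit array and, by the rule just described, is not:

short a[1024], b[1024];   /* 16-bit inputs (hypothetical) */
int   wide[1024];         /* 32-bit results: precision is preserved */
short narrow[1024];       /* 16-bit results: the 32-bit product would be truncated */

for (int i = 0; i < 1024; i++) {
    wide[i] = a[i] * b[i];      /* candidate for vectorization */
}

for (int i = 0; i < 1024; i++) {
    narrow[i] = a[i] * b[i];    /* result narrowed to 16 bits: not vectorized, per the rule above */
}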
In loops over 32-bit or 64-bit floating-point data, the SSE instruction sets not only provide the corresponding SIMD arithmetic instructions but also SIMD instructions such as max/min/sqrt. Other mathematical functions, for example SIMD versions of the trigonometric functions sin/cos/tan, are also supported through the vector math library shipped with the compiler.
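A minimal sketch of such a floating-point loop (the function and array names are placeholders); the square root and the max-style selection below are among the operations that have direct SSE SIMD counterparts:

#include <math.h>

/* Hypothetical helper: raise y[i] up to sqrt(x[i]), element by element. */
void sqrt_max(const float *x, float *y, int n) {
    for (int i = 0; i < n; i++) {
        float r = sqrtf(x[i]);          /* square root has an SSE SIMD form */
        y[i] = (y[i] > r) ? y[i] : r;   /* max-style selection has an SSE SIMD form */
    }
}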
2. Loop unrolling
The compiler automatically analyzes loops and generates unrolled code. This means you do not have to rewrite loops by hand to unroll them, and in many cases it lets you get more out of vectorization.
Consider the loop below:
int i = 0;
while (i < k) {
    a[i] = b[i] + c[i];
    i++;
}

After vectorization the compiler generates two loops like these:

while (i < (k - k % 4)) {
    a[i]     = b[i]     + c[i];
    a[i + 1] = b[i + 1] + c[i + 1];
    a[i + 2] = b[i + 2] + c[i + 2];
    a[i + 3] = b[i + 3] + c[i + 3];
    i += 4;
}
while (i < k) {
    a[i] = b[i] + c[i];
    i++;        /* <- remainder loop for the left-over iterations */
}

3. Operations in loops

The operations that can be vectorized differ between integer and floating-point data.

a) Floating-point array operations
The supported operations include add, subtract, multiply, divide, negate, square root, maximum, and minimum. The P4 processor can also use double-precision operations, provided the optimization is enabled with -QxW or -QaxW when compiling.

b) Integer array operations
In loops over 8/16/32-bit integer data, square root (sqrt) and floating-point absolute value (fabs) can also be supported, and add, subtract, multiply (16-bit), divide (16-bit), and, or, xor, max, and min are supported under certain conditions. You can mix several data types in one operation as long as no precision is lost; the result must not overflow 32 bits.

c) Other operations
Any statement whose values exceed the integer or floating-point numeric range cannot be vectorized, and types such as __m64 and __m128 cannot be vectorized. The loop must not contain function calls, and you cannot use the Streaming SIMD Extensions intrinsics inside it either (see the "Use of the Streaming SIMD Extensions Intrinsics" section of the compiler documentation).

4. Language support and control directives

Several control directives help the compiler vectorize your code:

__declspec(align(n)) - asks the compiler to align the variable's address on n bytes, that is, address mod n == 0.
__declspec(align(n, off)) - asks the compiler to align the variable's address on n bytes with an offset, that is, address mod n == off.
restrict - tells the compiler that there is no pointer aliasing, which makes it easier to vectorize your code.
__assume_aligned(a, n) - if the compiler has no alignment information, tells it to assume that array a is aligned on n bytes.
#pragma ivdep - tells the compiler to assume that the loop has no data dependences.
#pragma vector {aligned | unaligned | always} - specifies how to vectorize the loop and ignores the efficiency heuristics.
#pragma novector - do not vectorize this loop.

5. Some examples of vectorizable loops

a) A loop with some problems
The following example is a vector copy operation:

void vec_copy(int *p1, int *p2) {
    for (int i = 0; i < 100; i++) {
        p1[i] = p2[i];
    }
}

Here the compiler cannot tell whether the two pointers overlap, so it can still vectorize the loop, but it has to generate more than one version of the code. Here is another example:

void vec_copy(int * restrict p1, int * restrict p2) {
    for (int i = 0; i < 100; i++) {
        p1[i] = p2[i];
    }
}

Like the previous example this can be vectorized, but because restrict is used the compiler does not generate multiple versions of the code. Note that you must add the /Qrestrict option when compiling code that uses restrict.

6. Data alignment

A data structure or array of 16 bytes or more should be aligned, so the base address of such structures and arrays should be a multiple of 16. (A figure in the original shows how a DCU (Data Cache Unit) line is split when 16 bytes of unaligned data are accessed.) Accessing such unaligned data costs an extra 6-12 clock cycles.
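For instance, a short sketch (the array name and size are placeholders) that uses the __declspec(align(n)) directive from section 4 to give an array a base address that is a multiple of 16:

/* Hypothetical buffer whose base address satisfies address mod 16 == 0. */
__declspec(align(16)) float buf[256];

With the base address aligned like this, loops over the array can safely use #pragma vector aligned, as in the example that follows.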
If you align the data, you can avoid this extra performance overhead. For example:

float *a, *b;
for (int i = 0; i < 10; i++) {
    a[i] = b[i];
}

If a[0] and b[0] are both 16-byte aligned, this loop can use the #pragma vector aligned directive and is executed vectorized: two vectorized iterations (i = 0-3 and i = 4-7) copy the data in blocks, and a remainder loop handles i = 8-9.

Note that if you vectorize with an incorrect alignment directive, the compiler produces unpredictable behavior; for example, using #pragma vector aligned on unaligned data will cause an exception.

Here are a few examples of data alignment:

void f(int lb) {
    float z2[N], a2[N], y2[N], x2;
    for (int i = lb; i < N; i++) {
        a2[i] = a2[i] * x2 + y2[i];
    }
}

Because lb cannot be determined at compile time, we can only assume that the accesses here are not aligned. If you can guarantee that lb is a multiple of 4, you can use #pragma vector aligned:

void f(int lb) {
    float z2[N], a2[N], y2[N], x2;
    assert(lb % 4 == 0);
    #pragma vector aligned
    for (int i = lb; i < N; i++) {
        a2[i] = a2[i] * x2 + y2[i];
    }
}

This is the end of the article; thank you for your support.