Optimizing performance with the Intel vectorizing compiler (1)
This article is translated from the Intel website
When I started using the Intel C compiler I could not find much material to refer to, so I turned to some tutorials on the Intel website. My C is only average and I have only just started using this compiler, so corrections are welcome. This is the first part; I will finish the rest over the next few days.
Learn how to improve the performance of C/C++ and Fortran code on Windows and Linux by using Intel's vectorizing compiler to automatically generate SIMD code for specific loops, including MMX, Streaming SIMD Extensions, and Streaming SIMD Extensions 2.
1. What is the vectorizer
The vectorizer is a feature of the Intel C/C++ and Fortran compilers. It speeds up specific loops by automatically generating SIMD code for them. The SIMD instructions it can use include MMX, Streaming SIMD Extensions, and Streaming SIMD Extensions 2.
2. What does the vectorizer do?
a) Mainly, it improves performance
When a function spends most of its time in loops, using SIMD instructions can give a large performance improvement. Code dominated by vectorizable loops can hold its own against a general-purpose routine or class library that performs the same operation. Compared with a scalar loop doing the same work, a vectorized loop delivers performance equivalent to hand-coding the loop in low-level assembly (usually with Streaming SIMD Extensions). The vectorizer may also unroll these loops and insert prefetches and streaming stores into the loop code, which can yield some additional performance.
b) Using the Intel compiler frees you from tedious manual optimization. The compiler saves you a lot of time by doing more of the work for you.
c) The Intel compiler has supported vectorization since version 4.5. Version 5.0 enhances the vectorizer, adding automatic reordering of simple statements and support for Streaming SIMD Extensions 2.
3. Why use the vectorizer
a) To improve performance. For example, vectorizing a frequently called loop of floating-point operations can greatly improve the performance of a program.
b) To write a single version of the code. Using less assembly simplifies coding, and less assembly also means far less system-specific programming work: your program can easily be upgraded to take advantage of the latest mainstream systems without rewriting any assembly code.
4. What kind of loop can be vectorized
a) A loop is vectorizable if the compiler determines that no statement in the loop depends on another statement and that there is no loop-carried dependence. In other words, each statement must be able to execute independently, and the data it reads and writes must be confined to the current iteration of the loop.
Consider this loop:
    for (int i = 0; i < 1000; i++) {
        S1: a[i] = b[i] * t + d[i];
        S2: b[i] = (a[i] + b[i]) / 2;
        S3: c = c + b[i];
    }
If it is equivalent to the following sequence of loops:
    for (int i = 0; i < 1000; i++) a[i] = b[i] * t + d[i];
    for (int i = 0; i < 1000; i++) b[i] = (a[i] + b[i]) / 2;
    for (int i = 0; i < 1000; i++) c = c + b[i];
then the loop can be vectorized.
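The equivalence described here can be checked directly. The following is a minimal sketch in C, assuming float arrays and an arbitrary length n; the function names fused and distributed are hypothetical and are not taken from the Intel article. It only compares the two forms and does not claim anything about what a particular compiler version generates.

```c
#include <stddef.h>

/* Fused form: S1, S2 and S3 run together in every iteration. */
float fused(float *a, float *b, const float *d, float t, size_t n)
{
    float c = 0.0f;
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] * t + d[i];       /* S1 */
        b[i] = (a[i] + b[i]) / 2;     /* S2 */
        c = c + b[i];                 /* S3 */
    }
    return c;
}

/* Distributed form: each statement runs over all iterations in turn. */
float distributed(float *a, float *b, const float *d, float t, size_t n)
{
    float c = 0.0f;
    for (size_t i = 0; i < n; i++) a[i] = b[i] * t + d[i];
    for (size_t i = 0; i < n; i++) b[i] = (a[i] + b[i]) / 2;
    for (size_t i = 0; i < n; i++) c = c + b[i];
    return c;
}
```

Calling both functions on identical copies of the input arrays should leave a, b, and the returned sum the same, which is the property the vectorizer relies on.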
Now consider another example:
    for (int i = 1; i < 1000; i++) {
        S1: a[i] = a[i-1] * b[i];
    }
This loop cannot be vectorized under any circumstances, because each iteration reads a[i-1], which was written by the previous iteration. This is called a cross-iteration data dependence, or "flow dependence", and the compiler cannot vectorize such a loop. Assuming the first example works on float data, the vectorizer can process 4 floats at a time, so for that loop to be vectorized its iteration count must be greater than 4.
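To see why a flow dependence rules out vectorization, it helps to emulate what a 4-wide SIMD version would do. The sketch below is an illustration in plain C, not compiler output: recurrence_naive_vec4 is a hypothetical function that loads four a[i-1] values before storing four results, the way a vector load followed by a vector store would.

```c
#include <stddef.h>

/* Scalar reference: iteration i reads a[i-1] written by iteration i-1. */
void recurrence_scalar(float *a, const float *b, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] * b[i];
}

/* Naive 4-wide emulation: all four loads happen before any store,
   so lanes 1..3 read stale values of a[] and the result is wrong. */
void recurrence_naive_vec4(float *a, const float *b, size_t n)
{
    size_t i = 1;
    for (; i + 4 <= n; i += 4) {
        float prev[4];
        for (size_t l = 0; l < 4; l++)
            prev[l] = a[i + l - 1];           /* emulated vector load  */
        for (size_t l = 0; l < 4; l++)
            a[i + l] = prev[l] * b[i + l];    /* emulated vector store */
    }
    for (; i < n; i++)                        /* scalar remainder */
        a[i] = a[i - 1] * b[i];
}
```

With a filled with 1 and b filled with 2 for five elements, the scalar loop produces 1, 2, 4, 8, 16, while the emulated vector version produces 1, 2, 2, 2, 2.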
b) The vectorizer works only on the innermost loop
In a nested loop, the vectorizer attempts to vectorize only the innermost loop. The vectorizer's output messages tell you whether a loop was vectorized and, if not, why. If a performance-critical loop is not vectorized, you may need to dig deeper, for example by helping the vectorizer make the right decision with the techniques described below.
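Because only the innermost loop is a candidate, the order of a loop nest matters. The sketch below is an assumption-laden illustration rather than something taken from the Intel article: it shows the same computation written twice, once with the inner loop striding down a column and once with the inner loop walking along a row. The second form gives the vectorizer an innermost loop over adjacent elements, which is generally the friendlier shape for SIMD loads and stores. The names ROWS, COLS, scale_column_inner, and scale_row_inner are hypothetical.

```c
#define ROWS 8
#define COLS 16

/* Inner loop walks down a column: consecutive accesses are COLS floats apart. */
void scale_column_inner(float m[ROWS][COLS], float s)
{
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            m[i][j] *= s;
}

/* Interchanged nest: the inner loop walks along a row of adjacent elements. */
void scale_row_inner(float m[ROWS][COLS], float s)
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            m[i][j] *= s;
}
```

Both functions compute the same result; only the loop that the vectorizer gets to look at is different.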
c) Vectorization is not parallelization
Consider this example:
    for (int i = 0; i < 1000; i++) {
        S1: a[i] = b[i] * t + d[i];
        S2: b[i] = (a[i] + b[i]) / 2;
        S3: c = c + b[i];
    }
The vectorized code still executes S1, S2, and S3 in program order, each statement applied across the loop iterations.
Parallelization means that different iterations of the loop may be executed at the same time by different processing units, so statements from different iterations can interleave. If those statements use shared resources (for example, reading and writing the same data), the output cannot be guaranteed to be correct, so such a loop cannot be parallelized correctly.
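The execution order of vectorized code can be emulated in plain C. The following sketch assumes a vector length of four floats (VL is a hypothetical constant) and processes the loop in strips: within each strip, S1 is applied to four elements, then S2, then S3, all on a single thread. It is an illustration of the ordering only; in particular, a real vectorizer would typically keep several partial sums for S3, which this sketch does not do.

```c
#include <stddef.h>

#define VL 4   /* assumed vector length: four floats per SIMD register */

float vector_order(float *a, float *b, const float *d, float t, size_t n)
{
    float c = 0.0f;
    size_t i = 0;

    /* One strip of VL elements at a time, statements kept in program order. */
    for (; i + VL <= n; i += VL) {
        for (size_t l = 0; l < VL; l++)
            a[i + l] = b[i + l] * t + d[i + l];        /* S1 on VL lanes */
        for (size_t l = 0; l < VL; l++)
            b[i + l] = (a[i + l] + b[i + l]) / 2;      /* S2 on VL lanes */
        for (size_t l = 0; l < VL; l++)
            c = c + b[i + l];                          /* S3, summed in order */
    }

    /* Scalar remainder for the last n % VL elements. */
    for (; i < n; i++) {
        a[i] = b[i] * t + d[i];
        b[i] = (a[i] + b[i]) / 2;
        c = c + b[i];
    }
    return c;
}
```

Nothing here runs concurrently, so there is no race on c or on the arrays; that is the difference from parallelization.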
5. How to turn on the vectorizer
Turn on the vectorizer for a specified CPU type:
    Windows*: /Qx[M,K,W]
    Linux*:   -x[M,K,W]
Turn on the vectorizer with automatic CPU detection. The compiler generates vectorized code for the latest IA-32 processors and also non-vectorized code for older processors, so the same executable can run on a variety of processors:
    Windows*: /Qax[M,K,W]
    Linux*:   -ax[M,K,W]
W enables support for the Pentium 4 processor's Streaming SIMD Extensions 2, K enables support for the Pentium® III processor, and M enables support for MMX technology.
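As a rough illustration of how these switches might be used, here is a small C file with example command lines in its header comment. The switches are the ones listed in this section; the driver names icl and icc, the file name saxpy.c, and the identifiers SAXPY_N, sx, sy, and saxpy are assumptions made for this sketch. Global arrays are used instead of pointer parameters so that the compiler does not have to worry about aliasing.

```c
/* saxpy.c: a simple loop that is a natural vectorization candidate.
 *
 * Possible command lines (driver names icl and icc are assumptions):
 *   Windows:  icl /QxW  saxpy.c    vectorize for Streaming SIMD Extensions 2
 *             icl /QaxW saxpy.c    as above, plus a non-vectorized fallback path
 *   Linux:    icc -xW   saxpy.c
 *             icc -axW  saxpy.c
 */
#define SAXPY_N 1000

float sx[SAXPY_N];
float sy[SAXPY_N];

/* y = alpha * x + y over fixed-size global arrays. */
void saxpy(float alpha)
{
    for (int i = 0; i < SAXPY_N; i++)
        sy[i] = alpha * sx[i] + sy[i];
}
```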
6. Using #pragma ivdep and restrict
a) To vectorize a loop that contains, or might contain, dependences, add #pragma ivdep (ignore vector dependences) where appropriate.
    /* a, b, and c are float arrays declared elsewhere */
    void foo(int k)
    {
    #pragma ivdep
        for (int j = 0; j < 1000; j++)
        {
            a[j] = b[j + k] * c[j];
            b[j] = (a[j] + b[j]) / 2;
            b[j] = b[j] * c[j];
        }
    }
When it tries to vectorize this loop, the compiler assumes that array b carries a cross-iteration dependence, because the index uses the variable k. If you know that k does not cause conflicting data accesses, add #pragma ivdep so that the vectorizer ignores the assumed dependence and attempts vectorization. You must understand exactly what the dependence is and be confident that it will not become a problem.
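The promise behind the pragma can be written down explicitly. In the sketch below, the loop reads b[j + k] and writes b[j]; as long as k is not negative, b[j + k] is never an element that an earlier iteration has already written, so ignoring the assumed dependence is safe. The array names arr_a, arr_b, and arr_c, the bound MAX_K, and the runtime assert are all hypothetical additions for illustration. The pragma is guarded by the __INTEL_COMPILER macro so that other compilers do not see it.

```c
#include <assert.h>

#define N 1000
#define MAX_K 8                     /* hypothetical upper bound on the offset */

float arr_a[N];
float arr_b[N + MAX_K];             /* extra room so that b[j + k] stays in bounds */
float arr_c[N];

void foo(int k)
{
    /* The programmer's promise: a negative k would make iteration j read
       b[j + k], which an earlier iteration has already overwritten. */
    assert(k >= 0 && k <= MAX_K);

#if defined(__INTEL_COMPILER)
#pragma ivdep                       /* ignore the assumed dependence on b */
#endif
    for (int j = 0; j < N; j++) {
        arr_a[j] = arr_b[j + k] * arr_c[j];
        arr_b[j] = (arr_a[j] + arr_b[j]) / 2;
        arr_b[j] = arr_b[j] * arr_c[j];
    }
}
```

If k could be negative, the pragma would be hiding a real flow dependence and the vectorized result could differ from the scalar one.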
b) Loops that use pointers
A loop that accesses its data through pointers may appear to carry dependences. To get such a loop vectorized, you can use the restrict keyword where appropriate.
    void foo(float *restrict a, float *restrict b, float *restrict c)
    {
        for (int j = 0; j < 1000; j++)
        {
            a[j] = b[j] * c[j];
            b[j] = (a[j] + b[j]) / 2;
            b[j] = b[j] * c[j];
        }
    }
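A small experiment shows why the compiler has to be cautious with pointers. In this sketch (the function names add_one and add_one_restrict are hypothetical), the same loop gives different results depending on whether the two pointers overlap. Without restrict the compiler must preserve the overlapping behaviour; with restrict the caller promises that the overlap never happens.

```c
#include <stddef.h>

/* No restrict: dst and src may overlap, and the result depends on it. */
void add_one(float *dst, const float *src, size_t n)
{
    for (size_t j = 0; j < n; j++)
        dst[j] = src[j] + 1.0f;
}

/* restrict: the caller promises that dst and src never overlap,
   so every iteration is independent of the others. */
void add_one_restrict(float *restrict dst, const float *restrict src, size_t n)
{
    for (size_t j = 0; j < n; j++)
        dst[j] = src[j] + 1.0f;
}
```

Calling add_one(buf + 1, buf, 4) on a zero-filled buffer of five floats makes each iteration read the value the previous one wrote, giving 0, 1, 2, 3, 4. With separate source and destination arrays every result is simply 1. Passing overlapping pointers to add_one_restrict would break the promise and is not allowed.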
Note that using the restrict keyword requires the /Qrestrict switch. If you do not use the restrict keyword, the compiler assumes that the array references may carry a cross-iteration dependence.
This is because the loop accesses its data through pointers, and the compiler cannot tell whether two pointers refer to the same address (known as aliasing), which prevents vectorization. The restrict keyword tells the compiler that the memory a pointer points to is accessed only through that pointer. In other words, there is no aliasing.
7. Verbose mode parameters
On Windows, use the /Qvec_report[0,1,2,3] option to generate a detailed vectorization report; on Linux, use -vec_report[0,1,2,3]. The number sets the level of detail of the output. At level 3, the most detailed, the compiler reports specific information about the vectorizer's code generation decisions.
Here is some sample output from a compilation with the vectorizer:
C:/TEMP/Text1.cpp(16): (col. 1) remark: loop was vectorized.
Below is an example of vectorizer output using the level 3 switch.
    const char *p;
    p = "aeiou";
    while (*p++) {
        cout << *p;
    }
C:/Temp/text1.cpp(20): (col. 1) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
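A loop that runs until it meets a terminator has no trip count that is known before it starts, which is one reason it can be rejected as a candidate. A common way around this is to compute the count first and then use an ordinary counted loop. The sketch below is a hedged illustration in C with hypothetical function names (copy_while and copy_counted); it does not claim that any particular compiler version vectorizes the second form, only that it has the standard shape.

```c
#include <stddef.h>
#include <string.h>

/* Terminator-driven form: the trip count is only discovered while looping. */
void copy_while(char *dst, const char *src)
{
    while (*src)
        *dst++ = *src++;
    *dst = '\0';
}

/* Counted form: the trip count n is fixed before the loop begins. */
void copy_counted(char *dst, const char *src)
{
    size_t n = strlen(src);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
    dst[n] = '\0';
}
```

Both functions produce the same copy; dst must have room for the string and its terminator.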
    float x[100], y[100];
    int z[100];
    /* stop at 99 so that z[j + 1] stays in bounds */
    for (int j = 0; j < 99; j++)
    {
        x[j] = y[j] + 3;
        z[j] = z[j + 1];
    }
C:/TEMP/Text1.cpp(29): (col. 5) remark: loop was not vectorized: mixed data types.
    float x[100], y[100];
    int z[100];
    /* stop at 99 so that y[j + 1] and z[j + 1] stay in bounds */
    for (int j = 0; j < 99; j++)
    {
        x[j] = y[j] + 3;
        y[j + 1] = z[j + 1];
    }
C:/TEMP/Text1.cpp(26): (col. 1) remark: loop was not vectorized: existence of vector dependence.
Note that this last loop has two problems: it carries a vector dependence and it also mixes data types. The vectorizer reports only one of them, chosen according to the priority order of its tests.
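One possible way to remove the dependence reported for the last loop is to split it into two loops and run the store to y first. Each x[j] for j >= 1 then reads a y[j] that already holds its final value, and x[0] still reads the untouched y[0], so the result matches the original loop. The sketch below assumes that x, y, and z do not overlap, and it stops at N - 1 so that the j + 1 accesses stay in bounds; the function names original and restructured are hypothetical. Whether the int-to-float conversion loop itself gets vectorized depends on the compiler version and is not claimed here.

```c
#define N 100

/* Original shape: y[j + 1] is written in iteration j and read in iteration j + 1. */
void original(float *x, float *y, const int *z)
{
    for (int j = 0; j < N - 1; j++) {
        x[j] = y[j] + 3;
        y[j + 1] = (float)z[j + 1];
    }
}

/* Restructured: all stores to y happen first, so the second loop
   carries no dependence from one iteration to the next. */
void restructured(float *x, float *y, const int *z)
{
    for (int j = 0; j < N - 1; j++)
        y[j + 1] = (float)z[j + 1];   /* the type conversion is isolated here */
    for (int j = 0; j < N - 1; j++)
        x[j] = y[j] + 3;              /* float-only loop */
}
```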
To be continued.