Optimizing performance (2): using the Intel vectorizing compiler (2)
This article is translated from the Intel website, with parts selected from the Intel compiler documentation and some material from the computing institute of the Chinese Academy of Sciences.
First, I owe everyone an apology. Through my own negligence, I only now noticed that this Intel tutorial may have been written for version 5.0 of the compiler, while the current version is already 7.1. The examples in the text may therefore behave differently on the current compiler. I will test each example and draw on documentation about the Intel compiler from other sources to introduce vectorizing compilation.
1. How to write vectorizable programs
Vectorizing code is a process of trial and error: a loop may need several rounds of adjustment before it vectorizes. The following guidelines can help you get the work done.
a) Compile once first; some loops will vectorize directly.
Some existing code may already meet the compiler's requirements and can be vectorized without any modification. Whatever the case, try it first.
b) Find the loops in your code that affect performance the most
Use a code analysis tool to find the loops that are executed most often; these are the ones most worth vectorizing. Even a small loop can have a considerable impact on performance if it is called frequently, so vectorizing it may bring a large gain. The Intel VTune Performance Analyzer can help you find these bottlenecks. A trial version can be downloaded from Intel's website at http://www.intel.com/software/products/vTune. I will cover it in detail another time if I get the chance; for now, in brief, it works like this:
1. Attach to an executable file and collect processor data while the program runs.
2. Trace your code and record its performance, such as how many clock cycles it ran for and how many instructions it executed, so that you can easily find the loops with the greatest impact.
3. Generate a graphical report from the run data.
4. Summarize the statistics to guide your optimization of the code.
c) Use #pragma ivdep
This is explained at the end.
d) Rewrite the loop
Sometimes a loop simply cannot be vectorized as written, and the only option is to send it back for rework ...
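As a preview of step c), here is a minimal sketch of how #pragma ivdep is typically used. The function name scale_shifted and its parameters are hypothetical and chosen only for illustration. The compiler cannot know the sign of k, so on its own it has to assume that iterations may depend on each other; the pragma tells it to ignore that assumed dependence. The sketch relies on the caller keeping k >= 0, which is what makes the hint safe here. Compilers that do not recognize the pragma will usually just ignore it, possibly with a warning.

```c
/* Hypothetical helper, not taken from the Intel tutorial.
 * Stores a[i + k] * c into a[i] for the first m elements. */
void scale_shifted(float *a, int m, int k, float c)
{
    int i;

    /* Assumption: the caller guarantees k >= 0 and that a holds at
     * least m + k elements. Under that assumption each iteration only
     * reads elements that no earlier iteration has written, so it is
     * safe to ask the compiler to ignore the assumed dependence. */
#pragma ivdep
    for (i = 0; i < m; i++) {
        a[i] = a[i + k] * c;
    }
}
```

If k could be negative, the loop would read values written by earlier iterations, and the pragma would then be asserting something false.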
2. Data dependencies in loops
To use the Intel compiler well, you need to understand what causes data dependencies and how to deal with them.
a) The example on the Intel website may not behave as described on the current compiler, so I have replaced it with an example of my own. It has no practical use; it is only there to illustrate the problem.
The following program cannot be vectorized.
float data[100];
int i = 0;
/* start at 1 so that data[i-1] stays inside the array */
for (i = 1; i < 100 - 1; i++)
{
S1: data[i] = data[i-1] * 0.25 + data[i] * 0.5 + data[i+1] * 0.25;
}
Here statement S1 creates a data dependence on data[i]: the value data[i-1] that it reads was modified by the previous loop iteration. If that is not clear, look at it this way:
The index of data[i-1] is one less than that of data[i], so the value written to data[i] in one iteration always feeds into the next iteration as data[i-1]. Each iteration therefore depends on the result of the one before it. Vectorization has to respect such interactions between iterations, so it cannot be called complete parallelization, even though SIMD instructions do operate on several data elements in parallel. This is my own understanding and it may contain mistakes; please point them out if you find any. The description below may be clearer.
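The effect of this dependence can be reproduced in plain C. The sketch below is an illustration only: the names N, smooth_scalar and smooth_all_reads_first are hypothetical, and the second function is a simplified model of SIMD execution (all inputs read before any result is stored), not what a real compiler emits.

```c
#include <string.h>

#define N 8 /* small array size chosen for illustration */

/* Scalar form of loop S1: iteration i reads data[i-1], which
 * iteration i-1 has just overwritten. */
void smooth_scalar(float data[N])
{
    int i;
    for (i = 1; i < N - 1; i++) {
        data[i] = data[i - 1] * 0.25f + data[i] * 0.5f + data[i + 1] * 0.25f;
    }
}

/* Simplified model of naive vector execution: every input is read
 * from a snapshot taken before the loop, so no iteration ever sees
 * a value written by another iteration. */
void smooth_all_reads_first(float data[N])
{
    float old[N];
    int i;
    memcpy(old, data, sizeof old);
    for (i = 1; i < N - 1; i++) {
        data[i] = old[i - 1] * 0.25f + old[i] * 0.5f + old[i + 1] * 0.25f;
    }
}
```

Starting from the array {0, 4, 0, 0, 0, 0, 0, 0}, the scalar loop leaves 0.5 in data[2], because it reads the already updated data[1] = 2. The snapshot version leaves 1.0 there, because it reads the old data[1] = 4. Since the two results differ, the compiler may not replace the first form with the second on its own.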
In the example above, iterations i = 1 and i = 2 access memory as follows:
i = 1            i = 2
Read data[0]     Read data[1]
Read data[1]     Read data[2]
Read data[2]     Read data[3]
Write data[1]    Write data[2]
With a scalar loop there is no problem: the elements are rewritten one after another, in order, so iteration i = 2 reads the data[1] that iteration i = 1 has already written. In a SIMD parallel operation, however, iterations i = 1 and i = 2 execute together, so iteration i = 2 may read data[1] before the write from iteration i = 1 has taken place. That changes the result of the loop, which is absolutely not allowed.
b) The above was a concrete example of a data dependence. Here is the theory behind it.
Data dependence analysis means finding the conditions under which two memory accesses may overlap. Given two references in the code, those conditions are:
1. Whether the two references may be aliases, that is, whether they point to the same memory or to overlapping regions of memory.
2. For array references, whether their subscripts can address the same memory.
For arrays, the compiler's data dependence analyzer applies a series of progressively more powerful tests, trading off the overhead in time and space. First, the compiler runs some simple tests on each dimension of the array, until one of them shows that there is no data dependence. Before testing, references to multi-dimensional arrays that may cross their declared bounds can be converted to a linearized form. Some of the simple tests are the fast greatest common divisor (GCD) test and a test of the extreme values that the subscripts can reach, which makes sure that the accesses do not overlap.
If the tests above fail to prove independence, the compiler falls back on Fourier-Motzkin elimination to solve the data dependence problem in all dimensions.
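To give a feeling for what one of these simple tests looks like, here is a minimal sketch of a GCD test for two one-dimensional references a[c1*i + k1] and a[c2*j + k2]. The function names are hypothetical and this is not the compiler's actual implementation. The two subscripts can only be equal for integer i and j if gcd(c1, c2) divides k2 - k1, so when it does not, the references are proven independent. When it does, a dependence is merely possible, because the test ignores the loop bounds.

```c
#include <stdlib.h>

/* Greatest common divisor of |x| and |y| (Euclid's algorithm). */
static int gcd(int x, int y)
{
    x = abs(x);
    y = abs(y);
    while (y != 0) {
        int t = x % y;
        x = y;
        y = t;
    }
    return x;
}

/* Returns 0 if a[c1*i + k1] and a[c2*j + k2] can never touch the
 * same element, 1 if a dependence cannot be ruled out. */
int gcd_test_may_depend(int c1, int k1, int c2, int k2)
{
    int g = gcd(c1, c2);
    if (g == 0) {
        /* both subscripts are constants */
        return k1 == k2;
    }
    return (k2 - k1) % g == 0;
}
```

For example, a[2*i] and a[2*i + 1] give gcd 2 and difference 1, so they are independent, while a[i] and a[i - 1] give gcd 1, so a dependence cannot be excluded.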
I notice that the tutorial on the Intel website does not mention this process. Perhaps newer versions of the compiler handle the problem better, so that people no longer need to modify their code. In fact, the example I used to illustrate the problem vectorizes directly when compiled, without any modification.
Now look at the following loops.
1. Loop structure
You can build a loop with for, while and so on, but the loop may have only one entry and one exit.
while (i < 100)
{
    a[i] = b[i] * c[i];
    if (a[i] < 0.0)
    {
        a[i] = 0.0;
    }
    i++;
}
This loop can be vectorized.
while (i < 100)
{
    a[i] = b[i] * c[i];
    if (a[i] < 0.0)
    {
        a[i] = 0.0;
        break;
    }
    i++;
}
This loop cannot: the break gives it a second exit.
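One possible way to rework a loop like the second one is to split it in two: a short scalar search loop that finds where to stop, followed by a single-exit loop with a known iteration count. The sketch below is only an illustration under that assumption. The function names and the parameter n are hypothetical, and whether a given compiler actually vectorizes the second loop still depends on the compiler and on what it knows about aliasing between a, b and c.

```c
/* Original form: the break gives the loop a second exit.
 * Returns the index at which the loop stopped. */
int clamp_with_break(float *a, const float *b, const float *c, int n)
{
    int i = 0;
    while (i < n) {
        a[i] = b[i] * c[i];
        if (a[i] < 0.0f) {
            a[i] = 0.0f;
            break;
        }
        i++;
    }
    return i;
}

/* Split form with the same results. */
int clamp_split(float *a, const float *b, const float *c, int n)
{
    int stop = 0;
    int i;

    /* Search loop: still exits early, so it is expected to stay scalar. */
    while (stop < n && !(b[stop] * c[stop] < 0.0f)) {
        stop++;
    }

    /* Single entry, single exit, iteration count known up front. */
    for (i = 0; i < stop; i++) {
        a[i] = b[i] * c[i];
    }

    /* The first negative product is replaced by zero, as before. */
    if (stop < n) {
        a[stop] = 0.0f;
    }
    return stop;
}
```

The split version computes each product before the stop point twice, so it only pays off if the gain from vectorizing the main loop outweighs the extra scalar pass.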
2. Loop exit conditions
The exit condition determines how many iterations the loop performs, for example a fixed number. In short, it must be an expression from which the number of iterations can be determined:
a) A constant, such as 100
b) A variable that does not change inside the loop, such as i = 100
c) A linear function of the loop index, such as i = 100 - 1
Here are a few examples.
int k = 100; int i = 0;
while (i < k)
{
    a[i] = b[i] * c[i];
    if (a[i] < 0.0)
    {
        a[i] = 0.0;
    }
    i++;
}
The condition is determinate.
int k = 100; int i = 0;
while (i ...)
{
    a[i] = b[i] * c[i];
    if (a[i] < 0.0)
    {
        a[i] = 0.0;
    }
    i++;
}
The condition is indeterminate: the number of iterations depends on i.
for (i = m; i < n; i++)
{
    a[i] = i + 1;
}
The condition is determinate: the number of iterations is n - m.
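The difference between a countable and a non-countable loop can be sketched as two small functions. The names fill_range and scan_until_negative are hypothetical and serve only as illustration. In the first, the iteration count n - m is fixed before the loop starts. In the second, the loop stops at the first negative element, so the count depends on the data and cannot be known in advance.

```c
/* Countable: runs exactly n - m times when n > m.
 * Returns the number of iterations actually performed. */
int fill_range(float *a, int m, int n)
{
    int trips = 0;
    int i;
    for (i = m; i < n; i++) {
        a[i] = (float)(i + 1);
        trips++;
    }
    return trips;
}

/* Not countable: the exit depends on the contents of a.
 * Returns the index of the first negative element, or n if none. */
int scan_until_negative(const float *a, int n)
{
    int i = 0;
    while (i < n && a[i] >= 0.0f) {
        i++;
    }
    return i;
}
```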