Introduction to Programming with the SSE Instruction Set

zhaozj2021-02-16 72

Author: Alex Farber
Source: http://www.codeproject.com/cpp/sseintro.asp

SSE technology introduction

Intel's Streaming SIMD Extensions (SSE) technology can significantly improve the CPU's floating-point performance. Visual Studio .NET 2003 supports the SSE instruction set through intrinsic functions, so SSE instructions can be used directly from C++ code without writing assembly. The MSDN SSE topics [1] may confuse readers who are not familiar with the SSE assembly instructions, but reading the Intel Software Manuals [2] alongside the MSDN documentation makes the key points of using SSE much clearer.

SIMD (single instruction, multiple data) is a CPU execution model in which one instruction operates on several data items at once. Consider the following task: compute the square root of every element of a long floating-point array. The algorithm can be written like this:

    for each f in array         // for each element of the array
        f = sqrt(f)             // compute its square root

To see the implementation details, we can expand the code above:

    for each f in array
    {
        load f from memory into a floating-point register
        compute the square root
        store the result from the register back to memory
    }

Processors that support Intel SSE have eight 128-bit registers, each of which can hold four 32-bit single-precision floating-point numbers. SSE is a set of instructions that load floating-point numbers into these 128-bit registers, perform arithmetic and logical operations on them, and write the results back to memory. Using SSE, the algorithm can be written as follows:

    for each 4 members in array
    {
        load the 4 array members into a 128-bit SSE register
        compute the 4 square roots with a single instruction
        store the 4 results from the register back to memory
    }

C++ programmers do not have to manage these 128-bit registers when they use the SSE intrinsic functions. They work with the 128-bit data type __m128 and a set of C++ functions that implement the arithmetic and logical operations, while the choice of SSE registers and the code optimization are left to the C++ compiler. SSE is a very efficient technique when the same operation has to be applied to every element of a long floating-point array.
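The four-at-a-time loop described above could look roughly like the following sketch. The function name sqrt_in_place is hypothetical and not part of the sample projects. The sketch assumes the element count is a multiple of 4 and uses unaligned loads and stores so that it works with any float pointer.

```cpp
#include <xmmintrin.h>  // SSE intrinsics and the __m128 type
#include <cassert>
#include <cstddef>

// Hypothetical helper: replaces every element of `data` with its square root.
// Assumption: `count` is a multiple of 4.
void sqrt_in_place(float* data, std::size_t count)
{
    assert(count % 4 == 0);

    for (std::size_t i = 0; i < count; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);  // load 4 floats into a 128-bit value
        v = _mm_sqrt_ps(v);                 // compute 4 square roots at once
        _mm_storeu_ps(data + i, v);         // store the 4 results back to memory
    }
}
```

Calling sqrt_in_place(values, 8) on an array of eight floats runs the loop body twice.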

SSE program design details

Include files

All SSE intrinsic functions and the __m128 data type are defined in the xmmintrin.h header:

    #include <xmmintrin.h>

Because the SSE instructions are generated directly by the compiler, no .lib library file is needed.

Data Alignment

Every float array processed by SSE instructions must be aligned on a 16-byte (128-bit) boundary. A static array can be declared with the __declspec(align(16)) keyword:

    __declspec(align(16)) float m_fArray[ARRAY_SIZE];

A dynamic array can be allocated with the _aligned_malloc function:

    m_fArray = (float*) _aligned_malloc(ARRAY_SIZE * sizeof(float), 16);

Memory allocated by _aligned_malloc is released with the _aligned_free function:

    _aligned_free(m_fArray);
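For compilers other than Visual C++, the same alignment can be sketched with different tools. The example below is an assumption-based alternative, not the article's code: it swaps in the C++11 alignas specifier for the static array and the _mm_malloc / _mm_free pair (available through xmmintrin.h on the common x86 compilers) for the dynamic one. The names kArraySize, is_aligned16, alloc_floats and free_floats are hypothetical.

```cpp
#include <xmmintrin.h>  // _mm_malloc, _mm_free
#include <cstddef>
#include <cstdint>

const std::size_t kArraySize = 1024;  // hypothetical size, for illustration only

// Static array aligned on a 16-byte boundary using standard C++11 syntax.
alignas(16) float g_staticArray[kArraySize];

// Returns true if the pointer lies on a 16-byte boundary.
bool is_aligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Allocates a 16-byte-aligned float buffer; release it with free_floats().
float* alloc_floats(std::size_t count)
{
    return static_cast<float*>(_mm_malloc(count * sizeof(float), 16));
}

// Releases a buffer obtained from alloc_floats().
void free_floats(float* p)
{
    _mm_free(p);
}
```

A helper such as is_aligned16 is handy in debug builds, because an aligned load from a misaligned address faults at run time instead of failing at compile time.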

__m128 data type

Variables of this type are used as operands of the SSE intrinsic functions. Their contents should not be accessed directly by user code. Variables of type __m128 are automatically aligned on 16-byte boundaries.
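Since the contents of an __m128 should not be touched directly, values are normally moved in and out through intrinsics. The following sketch shows one way to do that; the helper names pack and unpack are hypothetical.

```cpp
#include <xmmintrin.h>

// The compiler aligns __m128 on a 16-byte boundary.
static_assert(alignof(__m128) == 16, "__m128 is expected to be 16-byte aligned");

// Hypothetical helper: builds a __m128 whose element 0 is a and element 3 is d.
__m128 pack(float a, float b, float c, float d)
{
    return _mm_setr_ps(a, b, c, d);  // "r" = arguments given in memory order
}

// Hypothetical helper: copies the four floats of v into out[0..3].
void unpack(__m128 v, float out[4])
{
    _mm_storeu_ps(out, v);  // out[0] receives the lowest element
}
```

Note that _mm_set_ps takes its arguments in the opposite order (highest element first), which is an easy source of mistakes.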

CPU support for SSE instruction set

If your CPU supports the SSE instruction set, you can use the SSE intrinsic functions provided by Visual Studio .NET 2003. The Visual C++ CPUID sample in MSDN [4] shows how to detect whether the CPU supports SSE, MMX, or other CPU features.
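As a rough illustration of what such a check involves (this is a minimal sketch, not the MSDN sample), the CPUID instruction reports SSE support in bit 25 of EDX for leaf 1. The function name cpu_has_sse is hypothetical. The sketch uses __cpuid from intrin.h on Visual C++ and __get_cpuid from cpuid.h on GCC and Clang.

```cpp
#if defined(_MSC_VER)
#include <intrin.h>   // __cpuid
#else
#include <cpuid.h>    // __get_cpuid (GCC and Clang)
#endif

// Hypothetical helper: returns true if CPUID leaf 1 reports SSE (EDX bit 25).
bool cpu_has_sse()
{
#if defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};   // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    return (regs[3] & (1 << 25)) != 0;
#else
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return false;             // leaf 1 is not available
    }
    return (edx & (1u << 25)) != 0;
#endif
}
```

An application would typically call this once at startup and fall back to the plain C++ code path when it returns false.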

Programming examples

The following sections describe sample applications that use SSE under Visual Studio .NET 2003. The sample source can be downloaded from http://www.codeproject.com/cpp/sseintro/sse_src.zip. The archive contains two Visual C++ .NET projects based on the Microsoft Foundation Classes (MFC). You can also create the two projects yourself as described below.

SSETest sample project

The SSETest project is a dialog-based application that performs the following calculation on three float arrays:

    fResult[i] = sqrt(fSource1[i] * fSource1[i] + fSource2[i] * fSource2[i]) + 0.5

where i = 0, 1, 2, ..., ARRAY_SIZE - 1

ARRAY_SIZE is defined as 30000. The source arrays are filled using the sin and cos functions. The Waterfall chart control developed by Kris Jearakul [3] displays the source arrays and the result array. The time needed for the calculation (in milliseconds) is shown in the dialog box. The calculation is done in three different ways:

    pure C++ code;
    C++ code using the SSE intrinsic functions;
    C++ code containing SSE assembly instructions.

Pure C++ code:

    void CSSETestDlg::ComputeArrayCPlusPlus(
        float* pArray1,     // [in] source array 1
        float* pArray2,     // [in] source array 2
        float* pResult,     // [out] result array
        int nSize)          // [in] size of the arrays
    {
        int i;

        float* pSource1 = pArray1;
        float* pSource2 = pArray2;
        float* pDest = pResult;

        // (loop body restored from the formula above)
        for (i = 0; i < nSize; i++)
        {
            *pDest = (float)sqrt((*pSource1) * (*pSource1) +
                                 (*pSource2) * (*pSource2)) + 0.5f;

            pSource1++;
            pSource2++;
            pDest++;
        }
    }

Next we rewrite the function above in C++ using the SSE intrinsic functions. To find the intrinsics I needed, I consulted the descriptions of the SSE assembly instructions in the Intel Software Manuals [2]. First I found the relevant SSE instructions in Volume 1, Chapter 9. Then I read their detailed descriptions in Volume 2, some of which mention the corresponding C++ intrinsic functions. Finally I looked up those intrinsics in MSDN. The results of the search are shown below:

    SSE instruction     Visual C++ .NET intrinsic   Required function
    movss and shufps    _mm_set_ps1                 Copy one 32-bit float into all four elements of a 128-bit value
    mulps               _mm_mul_ps                  Multiply four pairs of 32-bit floats taken from two 128-bit values and store the four products in a 128-bit value
    addps               _mm_add_ps                  Add four pairs of 32-bit floats taken from two 128-bit values and store the four sums in a 128-bit value
    sqrtps              _mm_sqrt_ps                 Compute the square roots of the four 32-bit floats in a 128-bit value
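The pairing of movss and shufps with _mm_set_ps1 can be checked with a small sketch. Here broadcast_by_shuffle spells out the two-instruction sequence using the _mm_set_ss and _mm_shuffle_ps intrinsics, and all_equal compares two 128-bit values element by element. Both helper names are hypothetical.

```cpp
#include <xmmintrin.h>

// Broadcast written the way the assembly does it: movss, then shufps with 0.
__m128 broadcast_by_shuffle(float x)
{
    __m128 v = _mm_set_ss(x);        // element 0 = x, elements 1..3 = 0
    return _mm_shuffle_ps(v, v, 0);  // copy element 0 into all four elements
}

// Returns true if all four elements of a and b compare equal.
bool all_equal(__m128 a, __m128 b)
{
    // cmpeq sets all bits of an element on equality; movemask collects
    // the four sign bits into the low four bits of an int.
    return _mm_movemask_ps(_mm_cmpeq_ps(a, b)) == 0xF;
}
```

With these helpers, all_equal(broadcast_by_shuffle(0.5f), _mm_set_ps1(0.5f)) evaluates to true.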

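C++ code using the SSE intrinsic functions:

    void CSSETestDlg::ComputeArrayCPlusPlusSSE(
        float* pArray1,     // [in] source array 1
        float* pArray2,     // [in] source array 2
        float* pResult,     // [out] result array
        int nSize)          // [in] size of the arrays
    {
        int nLoop = nSize / 4;

        __m128 m1, m2, m3, m4;

        __m128* pSrc1 = (__m128*) pArray1;
        __m128* pSrc2 = (__m128*) pArray2;
        __m128* pDest = (__m128*) pResult;

        __m128 m0_5 = _mm_set_ps1(0.5f);        // m0_5[0, 1, 2, 3] = 0.5

        // (loop body restored from the formula and the intrinsics listed above)
        for (int i = 0; i < nLoop; i++)
        {
            m1 = _mm_mul_ps(*pSrc1, *pSrc1);    // m1 = *pSrc1 * *pSrc1
            m2 = _mm_mul_ps(*pSrc2, *pSrc2);    // m2 = *pSrc2 * *pSrc2
            m3 = _mm_add_ps(m1, m2);            // m3 = m1 + m2
            m4 = _mm_sqrt_ps(m3);               // m4 = sqrt(m3)
            *pDest = _mm_add_ps(m4, m0_5);      // *pDest = m4 + 0.5

            pSrc1++;
            pSrc2++;
            pDest++;
        }
    }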
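The intrinsic version processes nSize / 4 groups and casts the array pointers to __m128*, so it relies on 16-byte-aligned arrays whose size is a multiple of 4. The sketch below shows one way to relax both assumptions: unaligned loads and stores plus a scalar loop for the last 0 to 3 elements. The free function hypot_plus_half is a hypothetical name and is not part of the sample project.

```cpp
#include <xmmintrin.h>
#include <cmath>
#include <cstddef>

// Hypothetical free-function version of the calculation:
// result[i] = sqrt(a[i] * a[i] + b[i] * b[i]) + 0.5
void hypot_plus_half(const float* a, const float* b, float* result, std::size_t n)
{
    const __m128 half = _mm_set_ps1(0.5f);
    std::size_t i = 0;

    // Main loop: four elements per iteration, no alignment requirement.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 sum = _mm_add_ps(_mm_mul_ps(va, va), _mm_mul_ps(vb, vb));
        _mm_storeu_ps(result + i, _mm_add_ps(_mm_sqrt_ps(sum), half));
    }

    // Tail loop: the remaining 0..3 elements, one at a time.
    for (; i < n; ++i) {
        result[i] = std::sqrt(a[i] * a[i] + b[i] * b[i]) + 0.5f;
    }
}
```

Unaligned loads can be slower than the aligned access used in the article, so this trades some speed for generality.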

C++ code containing SSE assembly instructions:

    void CSSETestDlg::ComputeArrayAssemblySSE(
        float* pArray1,     // [in] source array 1
        float* pArray2,     // [in] source array 2
        float* pResult,     // [out] result array
        int nSize)          // [in] size of the arrays
    {
        int nLoop = nSize / 4;
        float f = 0.5f;

        _asm
        {
            movss   xmm2, f             // xmm2[0] = 0.5
            shufps  xmm2, xmm2, 0       // xmm2[1, 2, 3] = xmm2[0]

            mov     esi, pArray1        // address of source array 1 to esi
            mov     edx, pArray2        // address of source array 2 to edx

            mov     edi, pResult        // address of the result array to edi
            mov     ecx, nLoop          // loop counter to ecx

    start_loop:
            movaps  xmm0, [esi]         // xmm0 = [esi]
            mulps   xmm0, xmm0          // xmm0 = xmm0 * xmm0

            movaps  xmm1, [edx]         // xmm1 = [edx]
            mulps   xmm1, xmm1          // xmm1 = xmm1 * xmm1

            addps   xmm0, xmm1          // xmm0 = xmm0 + xmm1
            sqrtps  xmm0, xmm0          // xmm0 = sqrt(xmm0)

            addps   xmm0, xmm2          // xmm0 = xmm0 + xmm2
            movaps  [edi], xmm0         // [edi] = xmm0

            add     esi, 16             // esi += 16
            add     edx, 16             // edx += 16
            add     edi, 16             // edi += 16

            dec     ecx                 // ecx--
            jnz     start_loop          // if ecx is not 0, go to start_loop
        }
    }

Finally, here are the test results on my computer:

    Pure C++ code: 26 ms
    C++ code with SSE intrinsic functions: 9 ms
    C++ code with SSE assembly instructions: 9 ms

These times were measured with an optimized Release build.

SSESample sample project

The SSESample project is a dialog-based application that performs the following calculation on a float array:

    fResult[i] = sqrt(fSource[i] * 2.8)

where i = 0, 1, 2, ..., ARRAY_SIZE - 1

The program also computes the minimum and maximum values of the result array. ARRAY_SIZE is defined as 100000, and the result array is displayed in a list box. The time needed by each of the three methods is:

    Pure C++ code: 6 ms
    C++ code with SSE intrinsic functions: 3 ms
    C++ code with SSE assembly instructions: 2 ms

As you can see, the assembly version is faster in this case because it makes intensive use of the SSE (XMM) registers. In general, however, C++ code with SSE intrinsic functions performs as well as or better than hand-written assembly, because it is usually very difficult to write assembly code that runs faster than the optimized code generated by the C++ compiler.

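Pure C++ code:

    // Input: m_fInitialArray
    // Output: m_fResultArray, m_fMin, m_fMax
    void CSSESampleDlg::OnBnClickedButtonCplusplus()
    {
        m_fMin = FLT_MAX;
        m_fMax = FLT_MIN;   // smallest positive float; sufficient here because
                            // square roots are never negative

        int i;

        // (loop header and first statements restored from the formula above)
        for (i = 0; i < ARRAY_SIZE; i++)
        {
            m_fResultArray[i] = sqrt(m_fInitialArray[i] * 2.8f);

            if (m_fResultArray[i] < m_fMin)
                m_fMin = m_fResultArray[i];

            if (m_fResultArray[i] > m_fMax)
                m_fMax = m_fResultArray[i];
        }
    }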

C++ code using the SSE intrinsic functions:

    // Input: m_fInitialArray
    // Output: m_fResultArray, m_fMin, m_fMax
    void CSSESampleDlg::OnBnClickedButtonSseC()
    {
        __m128 coeff = _mm_set_ps1(2.8f);       // coeff[0, 1, 2, 3] = 2.8
        __m128 tmp;

        __m128 min128 = _mm_set_ps1(FLT_MAX);   // min128[0, 1, 2, 3] = FLT_MAX
        __m128 max128 = _mm_set_ps1(FLT_MIN);   // max128[0, 1, 2, 3] = FLT_MIN

        __m128* pSource = (__m128*) m_fInitialArray;
        __m128* pDest = (__m128*) m_fResultArray;

        // (loop header and first statements restored from the formula above)
        for (int i = 0; i < ARRAY_SIZE / 4; i++)
        {
            tmp = _mm_mul_ps(*pSource, coeff);  // tmp = *pSource * coeff
            *pDest = _mm_sqrt_ps(tmp);          // *pDest = sqrt(tmp)

            min128 = _mm_min_ps(*pDest, min128);
            max128 = _mm_max_ps(*pDest, max128);

            pSource++;
            pDest++;
        }

        // Extract the minimum of min128 and the maximum of max128
        union u
        {
            __m128 m;
            float f[4];
        } x;

        x.m = min128;
        m_fMin = min(x.f[0], min(x.f[1], min(x.f[2], x.f[3])));

        x.m = max128;
        m_fMax = max(x.f[0], max(x.f[1], max(x.f[2], x.f[3])));
    }
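The min/max part of this technique can be isolated in a small standalone sketch. It departs from the listing above in two labeled ways: it writes the four lanes out with _mm_storeu_ps instead of a union, and it starts the maximum at -FLT_MAX so that arrays containing negative values are handled too. The names MinMax and find_min_max are hypothetical, and the element count is assumed to be a positive multiple of 4.

```cpp
#include <xmmintrin.h>
#include <algorithm>
#include <cassert>
#include <cfloat>
#include <cstddef>

struct MinMax
{
    float min;
    float max;
};

// Hypothetical helper: finds the smallest and largest element of `data`.
// Assumption: `count` is a positive multiple of 4.
MinMax find_min_max(const float* data, std::size_t count)
{
    assert(count >= 4 && count % 4 == 0);

    __m128 min128 = _mm_set_ps1(FLT_MAX);    // every lane starts at +FLT_MAX
    __m128 max128 = _mm_set_ps1(-FLT_MAX);   // every lane starts at -FLT_MAX

    // Each lane tracks the min/max of every fourth element.
    for (std::size_t i = 0; i < count; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);
        min128 = _mm_min_ps(v, min128);
        max128 = _mm_max_ps(v, max128);
    }

    // Reduce the four per-lane results to single values.
    float lo[4];
    float hi[4];
    _mm_storeu_ps(lo, min128);
    _mm_storeu_ps(hi, max128);

    MinMax r;
    r.min = std::min(std::min(lo[0], lo[1]), std::min(lo[2], lo[3]));
    r.max = std::max(std::max(hi[0], hi[1]), std::max(hi[2], hi[3]));
    return r;
}
```

The final reduction over four floats is scalar, but it runs only once, so its cost is negligible next to the main loop.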

C++ code containing SSE assembly instructions:

    // Input: m_fInitialArray
    // Output: m_fResultArray, m_fMin, m_fMax
    void CSSESampleDlg::OnBnClickedButtonSseAssembly()
    {
        float* pIn = m_fInitialArray;
        float* pOut = m_fResultArray;

        float f = 2.8f;
        float flt_min = FLT_MIN;
        float flt_max = FLT_MAX;

        __m128 min128;
        __m128 max128;

        // Additional registers used: xmm2, xmm3, xmm4
        // xmm2 - multiplication coefficient
        // xmm3 - minimum
        // xmm4 - maximum

        _asm
        {
            movss   xmm2, f             // xmm2[0] = 2.8
            shufps  xmm2, xmm2, 0       // xmm2[1, 2, 3] = xmm2[0]

            movss   xmm3, flt_max       // xmm3[0] = FLT_MAX
            shufps  xmm3, xmm3, 0       // xmm3[1, 2, 3] = xmm3[0]

            movss   xmm4, flt_min       // xmm4[0] = FLT_MIN
            shufps  xmm4, xmm4, 0       // xmm4[1, 2, 3] = xmm4[0]

            mov     esi, pIn            // address of the input array to esi
            mov     edi, pOut           // address of the output array to edi
            mov     ecx, ARRAY_SIZE/4   // initialize the loop counter

    start_loop:
            movaps  xmm1, [esi]         // xmm1 = [esi]
            mulps   xmm1, xmm2          // xmm1 = xmm1 * xmm2
            sqrtps  xmm1, xmm1          // xmm1 = sqrt(xmm1)
            movaps  [edi], xmm1         // [edi] = xmm1

            minps   xmm3, xmm1          // xmm3 = min(xmm3, xmm1)
            maxps   xmm4, xmm1          // xmm4 = max(xmm4, xmm1)

            add     esi, 16
            add     edi, 16

            dec     ecx
            jnz     start_loop

            movaps  min128, xmm3
            movaps  max128, xmm4
        }

        union u
        {
            __m128 m;
            float f[4];
        } x;

        x.m = min128;
        m_fMin = min(x.f[0], min(x.f[1], min(x.f[2], x.f[3])));

        x.m = max128;
        m_fMax = max(x.f[0], max(x.f[1], max(x.f[2], x.f[3])));
    }

References:

[1] MSDN, Streaming SIMD Extensions (SSE) topics: http://msdn.microsoft.com/library/default.asp?URL=/library/en-us/vclang/html/vcrefstreamingsimdextensions.asp

[2] Intel Software Manuals: http://developer.intel.com/design/archives/Processors/MMX/index.htm

[3] Kris Jearakul's Waterfall chart control: http://www.codeguru.com/controls/waterfall.shtml

[4] Microsoft Visual C++ CPUID sample: http://msdn.microsoft.com/library/default.asp?URL=/library/en-us/vcsample/html/vcsamcpuiddeterminecpuCapabilities.asp

When reposting, please cite the original address: https://www.9cbs.com/read-22380.html
