SSE Instructions and Their Use in C and C++

zhaozj, 2021-02-08

SSE is the CPU instruction set that followed Intel's MMX (admittedly, that was already some years ago). It first appeared on the Pentium III series and is supported by the Intel PIII, P4, Celeron, and Xeon, as well as the AMD Athlon, Duron, and others. The newer SSE2 instruction set is supported only by the P4 series, which is one reason this article covers SSE rather than SSE2. The other reason is that the two instruction sets are very similar: compared with SSE, SSE2 adds only a small number of extra floating-point functions, plus support for 64-bit floating-point arithmetic and 64-bit integer operations.

Why is SSE faster than traditional floating-point operations? Because it works on 128-bit storage units, each of which holds four 32-bit floating-point numbers, so every SSE calculation is carried out on 4 floats at once. Batching the work this way naturally improves efficiency. Recall the full name of SSE: Streaming SIMD Extensions. SIMD stands for Single Instruction, Multiple Data, that is, one instruction applied to several data items at once. The name itself tells us how SSE works.
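To make the "4 floats at once" idea concrete, here is a minimal sketch. The helper name add4 is made up for this illustration and is not from the article. It uses the unaligned load and store intrinsics so that it works on any float array.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Adds two groups of four floats with a single SSE addition.
// add4 is a hypothetical helper name used only for this illustration.
void add4(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);       // load 4 floats (no alignment required)
    __m128 vb = _mm_loadu_ps(b);       // load 4 more floats
    __m128 sum = _mm_add_ps(va, vb);   // one intrinsic, four additions
    _mm_storeu_ps(out, sum);           // write the 4 results back to memory
}
```

Calling add4 with {1, 2, 3, 4} and {10, 20, 30, 40} fills out with {11, 22, 33, 44}. A scalar version would need a loop with four separate additions.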

Although in theory SSE should be much faster than traditional floating-point code, it has limitations. First, although each instruction does the work of four, the actual speedup is not as large as you might imagine. To see the benefit there has to be a stream: a large amount of streaming data to process, which is where SIMD shows its strength. Second, the data type SSE supports is a set of four 32-bit floats (128 bits in total), which is float[4] in C and C++, and it must be aligned on a 16-byte boundary (the code later in this article illustrates this; for the concept of boundary alignment itself, see the other articles on the forum, which explain it in detail, so I will not repeat it here). This makes input and output more troublesome. What actually hurts SSE performance is the constant copying of data needed to fit it into this format.
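The alignment requirement can be sketched as follows. The aligned intrinsics _mm_load_ps and _mm_store_ps expect a 16-byte-aligned address, while _mm_loadu_ps and _mm_storeu_ps accept any address. The helper names is_aligned16 and double4 are assumptions made for this example only.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstdint>      // std::uintptr_t

// Returns true if p lies on a 16-byte boundary (hypothetical helper).
bool is_aligned16(const void *p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Doubles four floats in place, picking the load/store pair that
// matches the alignment of the pointer (hypothetical helper).
void double4(float *p)
{
    const __m128 two = _mm_set_ps1(2.0f);
    if (is_aligned16(p)) {
        // Aligned path: only valid for 16-byte-aligned addresses.
        _mm_store_ps(p, _mm_mul_ps(_mm_load_ps(p), two));
    } else {
        // Unaligned path: works for any address.
        _mm_storeu_ps(p, _mm_mul_ps(_mm_loadu_ps(p), two));
    }
}
```

With a buffer declared as alignas(16) float buf[12], double4(buf) takes the aligned path and double4(buf + 5) takes the unaligned one. Checking at run time like this is only for illustration. In practice you would normally arrange for the data to be aligned up front.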

I am a C++ programmer and not very familiar with assembly, but I want to use SSE to optimize my program. What should I do? Fortunately, VC++ .NET wraps the instructions in C-level functions and provides C-style data types, so we only need to declare variables as in ordinary C code and call the functions to apply the SSE instructions.

Of course, we need to include a header file, which includes the declaration of data types and functions we need:

#include <xmmintrin.h>

SSE operations use only one standard data type:

__m128

It is defined as follows:

typedef struct __declspec(intrin_type) __declspec(align(16)) __m128 {
    float m128_f32[4];
} __m128;

Simplified, it is:

struct __m128
{
    float m128_f32[4];
};

For example, to define a __m128 variable and assign four float values to it, you can write:

__m128 s1 = {1.0f, 2.0f, 3.0f, 4.0f};

To change element 2 (counting from 0), you can write:

s1.m128_f32[2] = 6.0f;

There are also several assignment intrinsics that make this data structure more convenient to use:

s1 = _mm_set_ps1(2.0f);

This sets all four elements of s1.m128_f32 to 2.0f, which is faster than assigning them one at a time.

s1 = _mm_setzero_ps();

This sets all 4 floats in s1 to zero.
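The m128_f32 member is specific to the Microsoft compiler. As a sketch that should also build elsewhere, the example below reads the elements back through _mm_storeu_ps. It also shows _mm_set_ps, which takes its arguments from the highest element down to element 0. All function names here are invented for the illustration.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Copies the four elements of v into out[0..3], element 0 first.
// (hypothetical helper; avoids the MSVC-only m128_f32 member)
void lanes(__m128 v, float out[4])
{
    _mm_storeu_ps(out, v);
}

// All four elements set to 2.0f.
__m128 all_twos()
{
    return _mm_set_ps1(2.0f);
}

// All four elements set to 0.0f.
__m128 all_zeros()
{
    return _mm_setzero_ps();
}

// Elements 0..3 become 1, 2, 3, 4. Note the reversed argument order:
// _mm_set_ps lists the highest element first.
__m128 one_to_four()
{
    return _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
}
```

After lanes(one_to_four(), out), out holds {1, 2, 3, 4}. If the reversed order is confusing, _mm_setr_ps takes the same four values in memory order.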

There are other assignment intrinsics as well, but they are not fast ways to assign values and are meant for special purposes. For more information, see MSDN -> Visual C++ Reference -> C/C++ Language -> C++ Language Reference -> Compiler Intrinsics -> MMX, SSE, and SSE2 Intrinsics -> Streaming SIMD Extensions (SSE).

Generally speaking, the name of every SSE intrinsic has three parts, separated by underscores:

_mm_set_ps1

mm indicates a multimedia extension instruction.

set is an abbreviation of what the function does.

ps1 describes how the function affects the result variable. It is built on two letters. The first letter describes how the result variable is affected: p (packed) means the result is treated as a set of data and every element takes part in the operation; s (scalar) means only the first element of the result variable takes part. The second letter gives the data type involved: s is a 32-bit floating-point number, d is a 64-bit floating-point number, i32 is a 32-bit integer, and i64 is a 64-bit integer. Because SSE itself supports only 32-bit floating-point numbers, you will not find type modifiers other than s among these intrinsics, but you will see them in the MMX and SSE2 instructions.
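The difference between the packed and scalar suffixes is easiest to see side by side. In this small sketch the wrapper names add_packed and add_scalar are made up for the example. _mm_add_ps adds all four elements, while _mm_add_ss adds only element 0 and copies the other three from its first operand.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Packed add (suffix ps): all four elements are added.
void add_packed(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

// Scalar add (suffix ss): only element 0 is added; elements 1..3
// of the result are copied unchanged from the first operand.
void add_scalar(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ss(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

For a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, add_packed produces {11, 22, 33, 44} and add_scalar produces {11, 2, 3, 4}.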

Next, I will show how the SSE intrinsics are used. Note that the code below was written with VC7.1; I cannot guarantee that it is fully compatible with other development platforms such as Dev-C++ or Borland C++.

For ease of comparison, I will implement the same algorithm twice, once in the conventional way and once with SSE, and time both with a small timing class, CTimer.

The algorithm scales a set of float values. The function ScaleValue1 is optimized with SSE instructions; ScaleValue2 is not. We test the two with a float array of 10,000 elements, running each algorithm 100,000 times (the iteration count used by the loops in the test program). The test program and its results follow:

#include <xmmintrin.h>   // SSE intrinsics
#include <windows.h>     // QueryPerformanceCounter, LARGE_INTEGER, DWORD
#include <iostream>      // cout
#include <cstring>       // memset
#include <cstdlib>       // system

using namespace std;

// Simple high-resolution timer built on the Windows performance counter.
class CTimer
{
public:
    __forceinline CTimer(void)
    {
        QueryPerformanceFrequency(&m_frequency);
        QueryPerformanceCounter(&m_startCount);
    }

    // Restart timing from now.
    __forceinline void Reset(void)
    {
        QueryPerformanceCounter(&m_startCount);
    }

    // Seconds elapsed since construction or the last Reset().
    __forceinline double End(void)
    {
        LARGE_INTEGER curCount;
        QueryPerformanceCounter(&curCount);
        return double(curCount.QuadPart - m_startCount.QuadPart) / double(m_frequency.QuadPart);
    }

private:
    LARGE_INTEGER m_frequency;
    LARGE_INTEGER m_startCount;
};

// SSE version: scales four floats per iteration.
// pArray must be 16-byte aligned; elements beyond a multiple of 4 are not processed.
void ScaleValue1(float *pArray, DWORD dwCount, float fScale)
{
    DWORD dwGroupCount = dwCount / 4;
    __m128 e_Scale = _mm_set_ps1(fScale);
    for (DWORD i = 0; i < dwGroupCount; i++)
    {
        *(__m128 *)(pArray + i * 4) = _mm_mul_ps(*(__m128 *)(pArray + i * 4), e_Scale);
    }
}

// Conventional version: scales one float per iteration.
void ScaleValue2(float *pArray, DWORD dwCount, float fScale)
{
    for (DWORD i = 0; i < dwCount; i++)
    {
        pArray[i] *= fScale;
    }
}

#define ARRAYCOUNT 10000

int __cdecl main()
{
    float __declspec(align(16)) Array[ARRAYCOUNT];
    memset(Array, 0, sizeof(float) * ARRAYCOUNT);

    CTimer t;
    double dTime;

    t.Reset();
    for (int i = 0; i < 100000; i++)
    {
        ScaleValue1(Array, ARRAYCOUNT, 1000.0f);
    }
    dTime = t.End();
    cout << "Use SSE: " << dTime << " seconds" << endl;

    t.Reset();
    for (int i = 0; i < 100000; i++)
    {
        ScaleValue2(Array, ARRAYCOUNT, 1000.0f);
    }
    dTime = t.End();
    cout << "Not use SSE: " << dTime << " seconds" << endl;

    system("pause");
    return 0;
}

Use SSE: 0.997817 seconds

Not use SSE: 2.84963 seconds

Note that I use __declspec(align(16)) as a qualifier on the definition of the array, which aligns it on a 16-byte boundary, because the SSE instructions used here require the data in memory to be aligned this way.
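The SSE scaling function in the test program processes the array in groups of four and relies on 16-byte alignment, so it skips any leftover elements when the count is not a multiple of 4. As a hedged sketch of one way around both restrictions (not the method used in the test program), the function below uses unaligned loads and finishes the remainder with a scalar loop. The name scale_any is an assumption for this example.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>      // std::size_t

// Scales count floats in place. Works for any alignment and any count.
// scale_any is a hypothetical name used only for this illustration.
void scale_any(float *data, std::size_t count, float scale)
{
    const __m128 vscale = _mm_set_ps1(scale);
    std::size_t i = 0;

    // Main loop: four floats per iteration, unaligned load/store.
    for (; i + 4 <= count; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);
        _mm_storeu_ps(data + i, _mm_mul_ps(v, vscale));
    }

    // Tail loop: the remaining 0 to 3 elements, one at a time.
    for (; i < count; ++i) {
        data[i] *= scale;
    }
}
```

Unaligned loads can be slower than aligned ones, especially on older processors, so for large buffers it is still worth aligning the data where possible and keeping this form for the cases where you cannot.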

We have now seen the power of SSE. I believe it will become a trusted tool for multimedia programmers faced with endless media data. I will also write more articles on more complex applications of SSE, so stay tuned, and thank you for reading!

When reposting, please cite the original address: https://www.9cbs.com/read-511.html
