Fast initialization of memory (1)

zhaozj, 2021-02-16

Quickly initialize memory

Many compute-intensive applications need large amounts of memory, and initializing that memory is a routine operation. Because of the speed bottleneck in exchanging data between the CPU and memory, initialization can take considerable time. Yet because applications usually initialize memory by calling the CRT's memset or the corresponding Windows API function, few people optimize this step.

On the other hand, typical hardware is now fairly capable, and most applications run on a PII or better. When building with an environment such as VC, we usually select speed optimization and the appropriate target processor, hoping the compiler will give us an optimized result, but the result often falls short of that hope.

One of our image-processing projects performs a large number of memory operations with several threads running at once. Memory access became a resource that every module competes for, so optimizing it became key. Having already worked to reduce the number of memory operations, we turned to speeding up memory initialization as the next improvement.

After trying the various options VC offers without much improvement, we turned our attention to processor features. Since the Pentium series, Intel has not only kept raising CPU clock frequency but has also introduced MMX, SSE, and SSE2 for multimedia and similar workloads, adding many instructions that process wide, multi-byte data quickly. For high-level languages, Intel's C compiler can generate code optimized for different processors. Switching compilers in a mature project carries risk, however, so we extracted memset from the Intel environment, repackaged it as a separate lib, changed the memory-initialization calls in our project to use it, and linked against the extracted library. Memory initialization became considerably faster.
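To make the idea of wide, multi-byte instructions concrete, here is a minimal sketch of an SSE2-based fill. It is not the routine extracted from the Intel library; the name sse2_fill and the choice of non-temporal (cache-bypassing) stores are assumptions made purely for illustration. It writes 16 bytes per store instruction, which is one common way to speed up filling large buffers.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not Intel's routine): set n bytes at dst to value.
void *sse2_fill(void *dst, int value, std::size_t n)
{
    assert(dst != nullptr || n == 0);
    unsigned char *p = static_cast<unsigned char *>(dst);
    const unsigned char v = static_cast<unsigned char>(value);

    // Byte stores until p is 16-byte aligned, as _mm_stream_si128 requires.
    while (n > 0 && (reinterpret_cast<std::uintptr_t>(p) & 15) != 0) {
        *p++ = v;
        --n;
    }

    // Main loop: one 16-byte non-temporal store per iteration.
    const __m128i pattern = _mm_set1_epi8(static_cast<char>(v));
    for (; n >= 16; n -= 16, p += 16) {
        _mm_stream_si128(reinterpret_cast<__m128i *>(p), pattern);
    }
    _mm_sfence();  // order the streaming stores before later memory accesses

    // Tail: fewer than 16 bytes remain.
    while (n > 0) {
        *p++ = v;
        --n;
    }
    return dst;
}
```

The function has the same signature as memset, so it can stand in for it in a timing test. Whether it is actually faster depends on the processor and on the buffer size, so it should be measured rather than assumed.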

Below we use the test example to explain the process.

1. Example

The test program calls the Microsoft C library memset and the Intel version in turn, each initializing a large block of memory repeatedly (the listing uses 60 passes over a 100 MB buffer). To simulate a multithreaded environment, two threads run at the same time. The test uses a Release build; debug information is included to make inspection easier and has no effect on the result. Test results:

MSC version: 12.453 ~ 12.547 seconds

Intel C version: 4.375 ~ 4.531 seconds

The difference is clearly large when a lot of memory is being processed. Because memory access is often the bottleneck, speeding it up should also improve overall processing performance.
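The size of the gap is easier to judge as a ratio. The sketch below shows the arithmetic; the helper names speedup and throughput_mb_per_s are hypothetical, and the throughput figure additionally assumes that each reported time covers 60 passes over a 100 MB buffer in one thread, as in the example source.

```cpp
#include <cassert>

// Hypothetical helper: how many times faster the optimized run is.
double speedup(double baseline_seconds, double optimized_seconds)
{
    assert(optimized_seconds > 0.0);
    return baseline_seconds / optimized_seconds;
}

// Hypothetical helper: megabytes written per second over all passes.
double throughput_mb_per_s(int passes, double mb_per_pass, double seconds)
{
    assert(seconds > 0.0);
    return passes * mb_per_pass / seconds;
}
```

With the reported times, speedup(12.453, 4.375) and speedup(12.547, 4.531) both come out at roughly 2.8, so the Intel routine is a little under three times faster in this test. Under the stated assumption, throughput_mb_per_s(60, 100, 12.5) gives 480 MB/s per thread for the Microsoft version.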

The source of the example follows:

// This example shows the speed difference between initializing memory
// with the Microsoft CRT memset and with the Intel-optimized memset.
// lihw.

#include <windows.h>
#include <process.h>
#include <stdio.h>

extern "C"
void * __cdecl __intel_new_memset(void *, int, size_t);

#pragma comment(lib, "IntelMem.lib")

#define SIZE (1024 * 1024 * 100)

void ThreadFunc(void *dummy)
{
    LPBYTE lpByte = (LPBYTE)dummy; // new BYTE[SIZE];
    int j;
#define LOOPTIMES 60
    DWORD dwStart, dwTime1, dwTime2;

    //
    // Intel version
    dwStart = GetTickCount();
    for (j = 0; j < LOOPTIMES; j++)
    {
        __intel_new_memset(lpByte, 1, SIZE);
    }
    dwTime1 = GetTickCount() - dwStart;

    // MS CRT version
    dwStart = GetTickCount();
    for (j = 0; j < LOOPTIMES; j++)
    {
        memset(lpByte, 1, SIZE);
        // ZeroMemory(lpByte, SIZE);
    }
    dwTime2 = GetTickCount() - dwStart;

    // delete [] lpByte;
    printf("Intel = %dms  MSC = %dms\n", dwTime1, dwTime2);
}

int main(int argc, char *argv[])
{
#define THREADS 2
    HANDLE hThread[THREADS]; // array to hold thread handles
    LPBYTE lpByte[THREADS];  // array to hold thread-specific memory
    int i;

    // Count memory allocation time. The Debug version takes very long.
    DWORD dwStart = GetTickCount();
    for (i = 0; i < THREADS; i++)
    {
        lpByte[i] = new BYTE[SIZE];
    }
    printf("Alloc spend = %d\n", GetTickCount() - dwStart);

    // Start the threads
    for (i = 0; i < THREADS; i++)
        hThread[i] = (HANDLE)_beginthread(ThreadFunc, 0, lpByte[i]);
    // ThreadFunc(lpByte[i]);

    WaitForMultipleObjects(THREADS, hThread, TRUE, INFINITE);

    for (i = 0; i < THREADS; i++)
        delete [] lpByte[i];

    printf("Process exec time = %dms\n", GetTickCount() - dwStart);
    return 0;
}
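The measurement loop can also be written portably. The sketch below is a variant under stated assumptions, not the author's code: it uses std::chrono::steady_clock in place of GetTickCount and runs in a single thread, and the names FillFn, crt_fill, and time_fill_ms are hypothetical.

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>

// Signature shared by memset and any replacement fill routine.
typedef void *(*FillFn)(void *, int, std::size_t);

// Thin wrapper so the CRT memset can be passed as a FillFn.
void *crt_fill(void *dst, int value, std::size_t n)
{
    return std::memset(dst, value, n);
}

// Hypothetical helper: run `passes` fills of `size` bytes and
// return the elapsed wall-clock time in milliseconds.
long long time_fill_ms(FillFn fill, unsigned char *buf, std::size_t size,
                       int passes, int value)
{
    volatile unsigned char sink = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int j = 0; j < passes; ++j) {
        fill(buf, value, size);
        // Read one byte back through a volatile to discourage the
        // optimizer from discarding passes whose result is overwritten.
        if (size > 0) sink = buf[size - 1];
    }
    const auto stop = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration_cast<std::chrono::milliseconds>(
               stop - start).count();
}
```

For example, time_fill_ms(crt_fill, buffer, size, 60, 1) would time 60 passes of the CRT memset. Any routine with the same signature, such as the __intel_new_memset declared in the listing, can be passed in its place.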

Let's look at what causes such a large difference. Set breakpoints at the calls to __intel_new_memset and memset in the Release build and open the disassembly window:

Intel version:

31:       for (j = 0; j < LOOPTIMES; j++)
32:       {
33:           __intel_new_memset(lpByte, 1, SIZE);
00401017   push        6400000h
0040101C   push        1
0040101E   push        ebx
0040101F   call        ___intel_new_memset (00401110)
00401024   add         esp,0Ch
00401027   dec         esi
00401028   jne         ThreadFunc+17h (00401017)
34:       }
