Fast memory initialization
Many computationally intensive applications need large amounts of memory. Initializing that memory is a routine operation, and because data transfer between the CPU and memory is a speed bottleneck, initialization can take considerable time. However, because applications usually initialize memory by calling the CRT's memset or a Windows API call such as ZeroMemory, few people try to optimize initialization.
On the other hand, typical hardware is now fairly capable, and most applications run on a PII or better. When building with an environment such as VC, we usually select speed optimization, choose the appropriate target processor, and then hope that the compiler gives us an optimized result. The result often falls short of that hope.
In one of our image processing projects, a large number of memory operations are required and multiple threads run simultaneously, so memory access became a resource that the modules compete for, and optimizing memory access became key. Besides working to reduce the number of memory operations, speeding up memory initialization became one of our improvements.
After trying the various options VC offers without much improvement, we turned our attention to processor features. Starting with the Pentium series, Intel has kept raising CPU clock frequencies and has also introduced MMX / SSE / SSE2 for applications such as multimedia, adding many instructions that process multiple bytes at once. For high-level languages, Intel's C compiler provides optimizations for specific processors. However, switching compilers in a mature project is risky, so we extracted memset from the Intel environment, repackaged it as a separate lib, changed the memory initialization calls in our project to use it, and linked against the extracted library. Memory initialization became much faster.
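To make the idea of multi-byte instructions concrete, here is a minimal sketch of a fill routine that writes 16 bytes per store with SSE2 intrinsics. It is only an illustration: the name sse2_fill is hypothetical, and this is not the routine extracted from the Intel environment, whose implementation is not shown here. The sketch assumes an x86 compiler with SSE2 support and uses non-temporal stores, which is one common choice for large buffers.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (also provides _mm_sfence)
#include <cstddef>
#include <cstdint>

// Hypothetical helper with the same shape as memset.
// Assumption: illustrative only, not the Intel library implementation.
void *sse2_fill(void *dest, int value, std::size_t count)
{
    unsigned char *p = static_cast<unsigned char *>(dest);
    const unsigned char byte = static_cast<unsigned char>(value);

    // Store single bytes until p is 16-byte aligned.
    while (count > 0 && (reinterpret_cast<std::uintptr_t>(p) & 15) != 0) {
        *p++ = byte;
        --count;
    }

    // Store 16 bytes at a time with non-temporal (cache-bypassing) stores.
    const __m128i pattern = _mm_set1_epi8(static_cast<char>(byte));
    while (count >= 16) {
        _mm_stream_si128(reinterpret_cast<__m128i *>(p), pattern);
        p += 16;
        count -= 16;
    }
    _mm_sfence();  // order the streaming stores before later stores

    // Store the remaining tail bytes one at a time.
    while (count > 0) {
        *p++ = byte;
        --count;
    }
    return dest;
}
```

A call such as sse2_fill(buffer, 1, bufferSize) would be used exactly like memset. Whether it beats a given CRT memset depends on the processor, the buffer size, and the compiler, so it should be measured rather than assumed.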
Below we use a test example to explain the process.
1. Example
The test program calls the memset from the Microsoft C library and the Intel version separately, each initializing a 100 MB block of memory 60 times (the block size and loop count used in the listing). To simulate a multithreaded environment, two threads are started at the same time. The test uses a Release build that includes debugging information so that the disassembly can be viewed (the debugging information does not affect the result). Test results:
MSC version: 12.453 ~ 12.547 seconds
Intel C version: 4.375 ~ 4.531 seconds
The difference is large when a large amount of memory is processed: in this test the Intel version is roughly 2.8 times faster. Because memory access is often the bottleneck, this speedup should also improve overall processing performance.
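The size of the gap can be checked with a little arithmetic on the reported times. The sketch below divides the MSC times by the Intel times to get the speedup at both ends of the measured ranges. The function names speedup and print_speedup_range are hypothetical and used only for illustration.

```cpp
#include <cstdio>

// Hypothetical helper: how many times faster the optimized run is.
double speedup(double baseline_seconds, double optimized_seconds)
{
    return baseline_seconds / optimized_seconds;
}

// Prints the least and most favorable ratios from the reported ranges.
void print_speedup_range()
{
    const double msc_min = 12.453, msc_max = 12.547;    // MSC version, seconds
    const double intel_min = 4.375, intel_max = 4.531;  // Intel C version, seconds
    std::printf("speedup: %.2fx to %.2fx\n",
                speedup(msc_min, intel_max),   // least favorable for Intel
                speedup(msc_max, intel_min));  // most favorable for Intel
}
```

With the figures above, the ratio falls between about 2.7 and 2.9.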
The source code of the example follows:
// This example shows the speed difference between initializing memory with
// the Microsoft CRT memset and with the Intel-optimized memset.
// lihw.
// The header names are missing from the source listing; these three are
// assumed from the calls used below.
#include <windows.h>
#include <process.h>
#include <stdio.h>
extern "C"
void * __cdecl __intel_new_memset(void *, int, size_t);
#pragma comment(lib, "IntelMem.lib")
#define SIZE (1024 * 1024 * 100)
void ThreadFunc(void *dummy)
{
    LPBYTE lpByte = (LPBYTE)dummy; // new BYTE[SIZE];
    int j;
#define LOOPTIMES 60
    DWORD dwStart, dwTime1, dwTime2;
    //
    // Intel version
    dwStart = GetTickCount();
    for (j = 0; j < LOOPTIMES; j++)
    {
        __intel_new_memset(lpByte, 1, SIZE);
    }
    dwTime1 = GetTickCount() - dwStart;
    // MS CRT version
    dwStart = GetTickCount();
    for (j = 0; j < LOOPTIMES; j++)
    {
        memset(lpByte, 1, SIZE);
        // ZeroMemory(lpByte, SIZE);
    }
    dwTime2 = GetTickCount() - dwStart;
    // delete [] lpByte;
    printf("Intel = %lums MSC = %lums\n", dwTime1, dwTime2);
}
int main(int argc, char *argv[])
{
#define THREADS 2
    HANDLE hThread[THREADS]; // array to hold thread handles
    LPBYTE lpByte[THREADS];  // array to hold thread-specific memory
    int i;
    // Count memory allocation time. In the debug version it is very long.
    DWORD dwStart = GetTickCount();
    for (i = 0; i < THREADS; i++)
    {
        lpByte[i] = new BYTE[SIZE];
    }
    printf("Alloc spend = %lu\n", GetTickCount() - dwStart);
    // Start the threads
    for (i = 0; i < THREADS; i++)
        hThread[i] = (HANDLE)_beginthread(ThreadFunc, 0, lpByte[i]);
    // ThreadFunc(lpByte[i]);
    WaitForMultipleObjects(THREADS, hThread, TRUE, INFINITE);
    for (i = 0; i < THREADS; i++)
        delete [] lpByte[i];
    printf("Process exec time = %lums\n", GetTickCount() - dwStart);
    return 0;
}
Let's look at what causes such a large difference. In the Release build, set breakpoints at the calls to __intel_new_memset and memset, then open the disassembly window:
Intel version:
31: for (j = 0; j < LOOPTIMES; j++)
32: {
33: __intel_new_memset(lpByte, 1, SIZE);
00401017 push 6400000h
0040101C push 1
0040101E push ebx
0040101F call ___intel_new_memset (00401110)
00401024 add esp,0Ch
00401027 dec esi
00401028 jne ThreadFunc+17h (00401017)
34: }