http://www.blogcn.com/blog/cool/main.asp?uid=flier_lu&id=1577430
http://www.blogcn.com/blog/cool/main.asp?uid=flier_lu&id=1577440
In complex low-level network programs, memory copying, string comparison, and search operations easily become performance bottlenecks. The versions that ship with the compiler have received some general-purpose optimization, but because compatibility constrains which instruction sets they can use, they fall far short of exploiting the hardware's full capability. By optimizing for a specific hardware platform, the performance of such operations can be improved dramatically. Below, taking the memory copy operation on the P4 platform as an example and following a sample from AMD's optimization document "Optimizing Memory Bandwidth", I will briefly introduce how to use a specific instruction set to optimize memory bandwidth. Although hardware limitations kept me from matching the memcpy improvement claimed in the AMD document, I still measured a gain of 175%-200% on my machine (this figure may vary from machine to machine). By the well-known "Moore's Law", CPU speed doubles roughly every 18 months, but memory and external storage (hard disk) speeds have not kept pace, so the fast CPU ends up waiting on relatively slow memory and peripherals, which is the bottleneck of many programs. Making the fullest use of the existing hardware is therefore the main optimization route below the level of algorithm design. For memory copy operations, understanding and making sensible use of the cache is the key. Since we are pursuing performance at the expense of compatibility, the discussion and code below mainly target P4 and later CPUs; AMD chips differ in places, but are similar in instruction set and overall structure. First let's look at the simplest assembly implementation of memcpy:
The following is quoted:
;
; Flier Lu (Flier@nsfocus.com)
;
; nasmw.exe -f win32 fastmemcpy.asm -o fastmemcpy.obj
;
; extern "C" {
;     extern void fast_memcpy1(void *dst, const void *src, size_t size);
; }
;
    cpu p4

    segment .text use32

    global _fast_memcpy1

%define param   esp+8+4     ; first parameter: two pushes + return address
%define dst     param+0     ; parameter order matches the C prototype above
%define src     param+4
%define len     param+8

_fast_memcpy1:
    push esi
    push edi

    mov esi, [src]          ; source array
    mov edi, [dst]          ; destination array
    mov ecx, [len]

    rep movsb

    pop edi
    pop esi

    ret
Here I write the code in NASM-format assembly. NASM is an excellent open-source assembler that supports a wide range of platforms and object formats and is widely used by open-source projects; it spares us from mixing VC's inline assembly with gcc's troublesome Unix-style AT&T syntax :P. At the top of the code, cpu p4 declares use of the P4 instruction set, since much of the optimization work below relies on P4 instructions and features; next, segment .text use32 places the code in a 32-bit code segment; then global declares the label _fast_memcpy1 as a global symbol, so that C code can link against the .obj and call it; and the final %define lines set up macros for addressing the function parameters. On the C side you only need to declare the fast_memcpy1 prototype and link against the .obj file produced by NASM. At compile time, NASM's -f switch selects MS 32-bit COFF as the object format, and the -o switch names the output file. The code above is very simple and well suited to fast copies of small memory blocks. In fact, the VC compiler will, where it sees fit, automatically replace memcpy calls with inline REP MOVSB, skipping the function call and stack setup to optimize both size and speed. A sketch of the calling side follows.
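For completeness, here is a minimal sketch of the C++ caller (assuming fastmemcpy.obj, produced by the nasmw command in the listing above, is linked into the project; the declaration mirrors the prototype in the assembly comments):

#include <cstddef>
#include <cstring>

extern "C" void fast_memcpy1(void *dst, const void *src, size_t size);

int main()
{
    char src[64] = "hello, fast_memcpy1";
    char dst[64] = {0};

    fast_memcpy1(dst, src, sizeof(src));                     // 64 bytes via rep movsb
    return std::memcmp(dst, src, sizeof(src)) == 0 ? 0 : 1;  // 0 if the copy round-tripped
}

However, under a 32-bit x86 architecture there is no need to move memory one byte at a time; replacing MOVSB with MOVSD is the natural next step.
The following is quoted: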
    global _fast_memcpy2

%define param   esp+8+4
%define dst     param+0
%define src     param+4
%define len     param+8

_fast_memcpy2:
    push esi
    push edi

    mov esi, [src]          ; source array
    mov edi, [dst]          ; destination array
    mov ecx, [len]
    shr ecx, 2              ; convert to DWORD count

    rep movsd

    pop edi
    pop esi

    ret
To keep the presentation simple, it is assumed here that the size of both the source and destination blocks is an integer multiple of 64 bytes, and that both are aligned to 4KB pages. The former guarantees that no single instruction ever straddles a cache line; the latter keeps cross-page operations from distorting the timing results. Why these assumptions matter will be explained in detail when we get to the cache; a small allocation sketch follows.
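As an aside, here is a minimal sketch of allocating buffers that satisfy these assumptions. It uses MSVC's _aligned_malloc; on another toolchain, posix_memalign or a manual round-up would play the same role:

#include <cstddef>
#include <malloc.h> // _aligned_malloc / _aligned_free (MSVC-specific)

int main()
{
    const size_t SIZE = 1024 * 1024;                 // an integer multiple of 64 bytes

    char *src = (char *)_aligned_malloc(SIZE, 4096); // 4KB page alignment
    char *dst = (char *)_aligned_malloc(SIZE, 4096);

    // ... run the fast_memcpy variants on dst/src here ...

    _aligned_free(dst);
    _aligned_free(src);
    return 0;
}

With the buffers in place, back to the copy itself. Because modern CPUs use long instruction pipelines, several instructions working in parallel can often beat a single string instruction, so the AMD document gives the following version: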
The following is quoted:
    global _fast_memcpy3

%define param   esp+8+4
%define dst     param+0
%define src     param+4
%define len     param+8

_fast_memcpy3:
    push esi
    push edi

    mov esi, [src]          ; source array
    mov edi, [dst]          ; destination array
    mov ecx, [len]
    shr ecx, 2              ; convert to DWORD count

.copyloop:
    mov eax, dword [esi]
    mov dword [edi], eax

    add esi, 4
    add edi, 4

    dec ecx
    jnz .copyloop

    pop edi
    pop esi

    ret
The loop at the .copyloop label does exactly what a single REP MOVSD does, but because the work is spelled out as multiple instructions, the CPU pipeline can in theory process them in parallel. The AMD document accordingly reports a 1.5% improvement, though in my own measurements the effect was not noticeable. By contrast, the difference between the two styles was very obvious when migrating from the 486 to the Pentium architecture; as I recall, Delphi 3 or 4 greatly improved its string-handling performance with exactly this optimization. Mainstream CPU vendors today use microcode techniques, simulating the CISC instruction set with RISC micro-ops, so the effect is no longer pronounced. Next, the loop can be unrolled, increasing the work done per iteration and reducing the loop overhead. The following is quoted:
    global _fast_memcpy4

%define param   esp+12+4
%define dst     param+0
%define src     param+4
%define len     param+8

_fast_memcpy4:
    push esi
    push edi
    push ebx                ; ebx is callee-saved, so preserve it

    mov esi, [src]          ; source array
    mov edi, [dst]          ; destination array
    mov ecx, [len]
    shr ecx, 4              ; convert to 16-byte block count

.copyloop:
    mov eax, dword [esi]
    mov dword [edi], eax

    mov ebx, dword [esi+4]
    mov dword [edi+4], ebx

    mov eax, dword [esi+8]
    mov dword [edi+8], eax

    mov ebx, dword [esi+12]
    mov dword [edi+12], ebx

    add esi, 16
    add edi, 16

    dec ecx
    jnz .copyloop

    pop ebx
    pop edi
    pop esi

    ret
Yet this change is rated in the AMD document as a 1.5% performance loss, heh. Its explanation is that the memory reads and the memory writes should be grouped, so the CPU can issue each batch at once; it claims the following grouped version gains 3% over _fast_memcpy3 -_-b
The following is quoted:
    global _fast_memcpy5

%define param   esp+12+4
%define dst     param+0
%define src     param+4
%define len     param+8

_fast_memcpy5:
    push esi
    push edi
    push ebx                ; ebx is callee-saved, so preserve it

    mov esi, [src]          ; source array
    mov edi, [dst]          ; destination array
    mov ecx, [len]
    shr ecx, 4              ; convert to 16-byte block count

.copyloop:
    mov eax, dword [esi]
    mov ebx, dword [esi+4]
    mov dword [edi], eax
    mov dword [edi+4], ebx

    mov eax, dword [esi+8]
    mov ebx, dword [esi+12]
    mov dword [edi+8], eax
    mov dword [edi+12], ebx

    add esi, 16
    add edi, 16

    dec ecx
    jnz .copyloop

    pop ebx
    pop edi
    pop esi

    ret
Unfortunately, I couldn't reproduce any of that on my P4. Presumably P4 and AMD implement their pipelines with subtly different ideas. Since the loop is being unrolled anyway, why not unroll it further? Although x86 has only a handful of general-purpose registers, there is MMX now, heh, with registers to spare :D. Using the MMX registers, one batch of loads and stores can move 64 bytes of data, which per AMD is worth another 7% over _fast_memcpy5. The following is quoted:
    global _fast_memcpy6

%define param   esp+8+4
%define dst     param+0
%define src     param+4
%define len     param+8

_fast_memcpy6:
    push esi
    push edi

    mov esi, [src]          ; source array
    mov edi, [dst]          ; destination array
    mov ecx, [len]
    shr ecx, 3              ; convert to QWORD (8-byte) count

    lea esi, [esi+ecx*8]    ; end of source
    lea edi, [edi+ecx*8]    ; end of destination
    neg ecx                 ; use a negative offset as a combo pointer-and-loop-counter

.copyloop:
    movq mm0, qword [esi+ecx*8]
    movq mm1, qword [esi+ecx*8+8]
    movq mm2, qword [esi+ecx*8+16]
    movq mm3, qword [esi+ecx*8+24]
    movq mm4, qword [esi+ecx*8+32]
    movq mm5, qword [esi+ecx*8+40]
    movq mm6, qword [esi+ecx*8+48]
    movq mm7, qword [esi+ecx*8+56]

    movq qword [edi+ecx*8], mm0
    movq qword [edi+ecx*8+8], mm1
    movq qword [edi+ecx*8+16], mm2
    movq qword [edi+ecx*8+24], mm3
    movq qword [edi+ecx*8+32], mm4
    movq qword [edi+ecx*8+40], mm5
    movq qword [edi+ecx*8+48], mm6
    movq qword [edi+ecx*8+56], mm7

    add ecx, 8
    jnz .copyloop

    emms

    pop edi
    pop esi

    ret
Having optimized this far, the conventional tricks are basically exhausted; extraordinary means are called for, heh. Let's step back and look at the cache structure of the P4 architecture. The IA-32 Intel Architecture Software Developer's Manual, Volume 3: System Programming Guide describes memory caching under the IA32 architecture systematically. Because of the huge gap between CPU speed and memory speed, CPU vendors add multiple levels of cache inside and outside the CPU to speed up access to frequently used data. Generally there are up to three levels, L1, L2 and L3, between the CPU and memory (plus several TLB caches for address translation, not covered here); each level is roughly an order of magnitude slower than the one above it, and the capacities also differ greatly (in practice it comes down to $, heh). The L1 cache is further divided by purpose into an instruction cache and a data cache. On P4 and Xeon processors, the L1 instruction cache is replaced by a trace cache built into the NetBurst microarchitecture; the L1 data cache and the L2 cache are packaged on the CPU and, depending on the CPU grade, run between 8 and 16KB and between 256 and 512KB respectively; an L3 cache is implemented only on Xeon processors, likewise packaged on the CPU, at around 512KB to 1MB. A tool such as CPUInfo will show the current machine's cache configuration; my P4 1.7G, for example, has a 12K-uop L1 trace cache, an 8KB L1 data cache, and a 256KB L2 cache. The cache is organized as slots, or lines: each line maps a contiguous, aligned stretch of memory, and the cache manager controls how data is loaded, read, and written. I won't ramble on about the mechanics; interested readers can consult the Intel manuals themselves. What we do need to know is this: each line was 32 bytes long before the P4 and grew to 64 bytes starting with the P4, and lines are always operated on as a whole, so touching even a single byte loads the entire 64-byte line. The later optimizations largely build on these facts. As for the cache's working modes, the P4 has as many as six, which I won't enumerate here; what affects our optimization is the cache's behavior when writing memory. The most common mode, WT (Write-Through), updates the cache while also writing the data through to memory; WB (Write-Back) mode writes to the cache only, deferring the slower memory write. The two modes show a sizable performance difference when a program updates memory variables at very high frequency (on the order of 100 million operations per second); for instance, forcing WB mode through the MTRRs in a driver module has given good results in Linux network card drivers. For memory copying, however, neither mode helps much, because what we want is to skip the cache's involvement entirely: no cache lookup, no line load, no write-back. Fortunately, the P4 provides the MOVNTQ instruction, which uses Write-Combining mode to bypass the cache and write directly to memory. Since our stores are pure writes whose data will not be touched again any time soon, both WT and WB modes would merely add redundant cache operations here.
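As a quick aside, the 64-byte line size can be checked programmatically. Below is a minimal sketch using MSVC inline assembly; it assumes a CPU that supports CPUID leaf 1 and reports the CLFLUSH line size there (true for the P4), where EBX bits 15:8 give the line size in 8-byte units:

#include <cstdio>

int main()
{
    unsigned int ebx_val = 0;

    __asm
    {
        mov eax, 1      // CPUID leaf 1: feature information
        cpuid
        mov ebx_val, ebx
    }

    // Bits 15:8 of EBX hold the CLFLUSH line size in units of 8 bytes.
    unsigned int line_size = ((ebx_val >> 8) & 0xFF) * 8;
    printf("cache line size: %u bytes\n", line_size); // prints 64 on a P4

    return 0;
}

Back to the copy: with MOVNTQ, the store half of the loop now writes around the cache entirely. The optimized code is as follows.
The following is quoted: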
    global _fast_memcpy7

%define param   esp+8+4
%define dst     param+0
%define src     param+4
%define len     param+8

_fast_memcpy7:
    push esi
    push edi

    mov esi, [src]          ; source array
    mov edi, [dst]          ; destination array
    mov ecx, [len]
    shr ecx, 3              ; convert to QWORD (8-byte) count

    lea esi, [esi+ecx*8]    ; end of source
    lea edi, [edi+ecx*8]    ; end of destination
    neg ecx                 ; use a negative offset as a combo pointer-and-loop-counter

.copyloop:
    movq mm0, qword [esi+ecx*8]
    movq mm1, qword [esi+ecx*8+8]
    movq mm2, qword [esi+ecx*8+16]
    movq mm3, qword [esi+ecx*8+24]
    movq mm4, qword [esi+ecx*8+32]
    movq mm5, qword [esi+ecx*8+40]
    movq mm6, qword [esi+ecx*8+48]
    movq mm7, qword [esi+ecx*8+56]

    movntq qword [edi+ecx*8], mm0
    movntq qword [edi+ecx*8+8], mm1
    movntq qword [edi+ecx*8+16], mm2
    movntq qword [edi+ecx*8+24], mm3
    movntq qword [edi+ecx*8+32], mm4
    movntq qword [edi+ecx*8+40], mm5
    movntq qword [edi+ecx*8+48], mm6
    movntq qword [edi+ecx*8+56], mm7

    add ecx, 8
    jnz .copyloop

    sfence                  ; flush write buffer

    emms

    pop edi
    pop esi

    ret
All the MOVQ instructions that store to memory have been changed to MOVNTQ, and after the copy completes, SFENCE is called to flush the write buffer, since data held in the write-combining buffers is not guaranteed to have reached memory yet (and the corresponding cache contents may have become stale). This way every store goes straight to memory in one step, the cache operations themselves are skipped, and redundant memory traffic drops sharply. According to AMD this is worth up to a 60% improvement; I measured an obvious gain of around 50%. For the MOVNTQ and SFENCE instructions, see Intel's instruction set references: the IA-32 Intel Architecture Software Developer's Manual, Volume 2A: Instruction Set Reference, A-M, and Volume 2B: Instruction Set Reference, N-Z.
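As an aside, the same non-temporal store idea can also be written without hand-rolled assembly. Here is a rough C++ intrinsics sketch (an illustration only, not the routine above; _mm_stream_pi compiles to MOVNTQ, _mm_sfence to SFENCE, and _mm_empty to EMMS):

#include <cstddef>
#include <xmmintrin.h> // SSE intrinsics: _mm_stream_pi, _mm_sfence; MMX types come along

// Non-temporal copy sketch; assumes len is a multiple of 8 bytes.
void stream_copy(void *dst, const void *src, size_t len)
{
    const __m64 *s = (const __m64 *)src;
    __m64       *d = (__m64 *)dst;

    for (size_t i = 0; i < len / 8; ++i)
        _mm_stream_pi(d + i, s[i]); // MOVNTQ: write around the cache

    _mm_sfence();                   // drain the write-combining buffers
    _mm_empty();                    // EMMS: release the MMX register state
}

With the writes taken care of, the read side can be optimized as well. Although the CPU prefetches automatically when reading, explicitly asking it to prefetch while walking a contiguous memory region still yields a noticeable improvement: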
The following is quoted:
    global _fast_memcpy8

%define param   esp+8+4
%define dst     param+0
%define src     param+4
%define len     param+8

_fast_memcpy8:
    push esi
    push edi

    mov esi, [src]          ; source array
    mov edi, [dst]          ; destination array
    mov ecx, [len]
    shr ecx, 3              ; convert to QWORD (8-byte) count

    lea esi, [esi+ecx*8]    ; end of source
    lea edi, [edi+ecx*8]    ; end of destination
    neg ecx                 ; use a negative offset as a combo pointer-and-loop-counter

.writeloop:
    prefetchnta [esi+ecx*8+512] ; fetch ahead by 512 bytes

    movq mm0, qword [esi+ecx*8]
    movq mm1, qword [esi+ecx*8+8]
    movq mm2, qword [esi+ecx*8+16]
    movq mm3, qword [esi+ecx*8+24]
    movq mm4, qword [esi+ecx*8+32]
    movq mm5, qword [esi+ecx*8+40]
    movq mm6, qword [esi+ecx*8+48]
    movq mm7, qword [esi+ecx*8+56]

    movntq qword [edi+ecx*8], mm0
    movntq qword [edi+ecx*8+8], mm1
    movntq qword [edi+ecx*8+16], mm2
    movntq qword [edi+ecx*8+24], mm3
    movntq qword [edi+ecx*8+32], mm4
    movntq qword [edi+ecx*8+40], mm5
    movntq qword [edi+ecx*8+48], mm6
    movntq qword [edi+ecx*8+56], mm7

    add ecx, 8
    jnz .writeloop

    sfence                  ; flush write buffer

    emms

    pop edi
    pop esi

    ret
The only addition is a single PREFETCHNTA instruction, which hints that, while the current reads are being processed, the CPU should pull the 64-byte cache line 512 bytes ahead into the cache. This is worth roughly another 10%. Finally, the block of memory about to be processed can be forced into the cache ahead of time with explicit memory reads, because PREFETCHNTA is only a hint that the CPU is free to ignore. This block-prefetch technique reportedly yields about another 60%; I haven't measured it rigorously, but the improvement is quite noticeable.
The following is quoted:
    global _fast_memcpy9

%define param       esp+12+4
%define dst         param+0
%define src         param+4
%define len         param+8

%define CACHEBLOCK  400h

_fast_memcpy9:
    push esi
    push edi
    push ebx

    mov esi, [src]          ; source array
    mov edi, [dst]          ; destination array
    mov ecx, [len]
    shr ecx, 3              ; convert to QWORD count (assumes len / CACHEBLOCK is an integer)

    lea esi, [esi+ecx*8]    ; end of source
    lea edi, [edi+ecx*8]    ; end of destination
    neg ecx                 ; use a negative offset as a combo pointer-and-loop-counter

.mainloop:
    mov eax, CACHEBLOCK / 16    ; note: .prefetchloop is unrolled 2x
    add ecx, CACHEBLOCK         ; move up to end of block

.prefetchloop:
    mov ebx, [esi+ecx*8-64]     ; read one address in this cache line...
    mov ebx, [esi+ecx*8-128]    ; ... and one in the previous line
    sub ecx, 16                 ; 16 QWORDs = 2 64-byte cache lines
    dec eax
    jnz .prefetchloop

    mov eax, CACHEBLOCK / 8

.writeloop:
    prefetchnta [esi+ecx*8+512] ; fetch ahead by 512 bytes

    movq mm0, qword [esi+ecx*8]
    movq mm1, qword [esi+ecx*8+8]
    movq mm2, qword [esi+ecx*8+16]
    movq mm3, qword [esi+ecx*8+24]
    movq mm4, qword [esi+ecx*8+32]
    movq mm5, qword [esi+ecx*8+40]
    movq mm6, qword [esi+ecx*8+48]
    movq mm7, qword [esi+ecx*8+56]

    movntq qword [edi+ecx*8], mm0
    movntq qword [edi+ecx*8+8], mm1
    movntq qword [edi+ecx*8+16], mm2
    movntq qword [edi+ecx*8+24], mm3
    movntq qword [edi+ecx*8+32], mm4
    movntq qword [edi+ecx*8+40], mm5
    movntq qword [edi+ecx*8+48], mm6
    movntq qword [edi+ecx*8+56], mm7

    add ecx, 8
    dec eax
    jnz .writeloop

    or ecx, ecx             ; assumes an integer number of cacheblocks
    jnz .mainloop

    sfence                  ; flush write buffer

    emms

    pop ebx
    pop edi
    pop esi

    ret
At this point, the optimization pass over a complete memory copy function is finished; through understanding and use of the cache we have overtaken the compiler's version again and again and finally arrived at a fairly satisfying result (a claimed 300% improvement, 175%-200% in my measurements, which is still quite decent). Two points need attention when writing the benchmarking code. First, timing precision: a high-resolution hardware counter is needed to avoid measurement error. I recommend the RDTSC instruction, converting cycle counts into time using the CPU frequency. The CPU clock rate can be calibrated dynamically against a high-precision timer; being lazy, I simply read it from the registry :P. The code is as follows:
The following is quoted:
#ifdef WIN32
typedef __int64 uint64_t;
#else
#include <stdint.h>
#endif

const uint64_t MEGA = 1000 * 1000; // Hz per MHz

bool getPentiumClockEstimateFromRegistry(uint64_t &frequency)
{
    HKEY hKey;

    frequency = 0;

    long rc = ::RegOpenKeyEx(HKEY_LOCAL_MACHINE,
        "HARDWARE\\DESCRIPTION\\System\\CentralProcessor\\0",
        0, KEY_READ, &hKey);

    if (rc == ERROR_SUCCESS)
    {
        DWORD cbBuffer = sizeof(DWORD);
        DWORD freq_mhz;

        rc = ::RegQueryValueEx(hKey, "~MHz", NULL, NULL, (LPBYTE)&freq_mhz, &cbBuffer);

        if (rc == ERROR_SUCCESS)
            frequency = freq_mhz * MEGA;

        ::RegCloseKey(hKey);
    }

    return frequency > 0;
}
void getTimeStamp(uint64_t &timeStamp)
{
#ifdef WIN32
    __asm
    {
        push edx
        push ecx

        mov ecx, timeStamp

        //_emit 0Fh // RDTSC opcode bytes, for assemblers lacking the mnemonic
        //_emit 31h
        rdtsc

        mov [ecx], eax
        mov [ecx+4], edx

        pop ecx
        pop edx
    }
#else
    __asm__ __volatile__ ("rdtsc" : "=A" (timeStamp));
#endif
}
Second, the size of the buffers used to test the copy matters. If the buffers are too small, the first copy pass will pull both of them entirely into the L2 cache, and every pass after that runs at cache speed rather than memory speed, giving figures far above what normal memory operations can reach. For example, my L2 cache is 256K: copying between two 128K buffers, no matter how many loop iterations I run, measures ten times faster or more than an ordinary memory copy. So it is necessary to pick buffers substantially larger than the L2 cache; a hypothetical harness follows.
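To tie the pieces together, here is a hypothetical benchmark harness built from the helpers above. It is a sketch only: fast_memcpy7 stands for whichever variant is being timed, _aligned_malloc is MSVC-specific, and windows.h supplies the registry API used by getPentiumClockEstimateFromRegistry:

#include <windows.h> // registry API
#include <malloc.h>  // _aligned_malloc / _aligned_free
#include <cstdio>

extern "C" void fast_memcpy7(void *dst, const void *src, size_t size);

int main()
{
    const size_t SIZE = 4 * 1024 * 1024;     // well above L2 capacity, a multiple of 64

    void *src = _aligned_malloc(SIZE, 4096); // 4KB-aligned, per the earlier assumptions
    void *dst = _aligned_malloc(SIZE, 4096);

    uint64_t freq, t0, t1;
    if (!getPentiumClockEstimateFromRegistry(freq))
        return 1;

    getTimeStamp(t0);
    fast_memcpy7(dst, src, SIZE);
    getTimeStamp(t1);

    double seconds = (double)(t1 - t0) / (double)freq;
    printf("%.1f MB/s\n", (SIZE / seconds) / (1024.0 * 1024.0));

    _aligned_free(dst);
    _aligned_free(src);
    return 0;
}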