Because we link against the static library, the Intel version shows up in the code as an ordinary function call. Tracing into it shows that Intel's implementation first detects the CPU type and then jumps to a different implementation for each type. On a P4 machine, its main loop is as follows:
00401A40 SUB ECX, 80H
00401A46 MOVDQA XMMWORD PTR [EDX], XMM0
00401A4A MOVDQA XMMWORD PTR [EDX+10H], XMM0
00401A4F MOVDQA XMMWORD PTR [EDX+20H], XMM0
00401A54 MOVDQA XMMWORD PTR [EDX+30H], XMM0
00401A59 MOVDQA XMMWORD PTR [EDX+40H], XMM0
00401A5E MOVDQA XMMWORD PTR [EDX+50H], XMM0
00401A63 MOVDQA XMMWORD PTR [EDX+60H], XMM0
00401A68 MOVDQA XMMWORD PTR [EDX+70H], XMM0
00401A6D ADD EDX, 80H
00401A73 CMP ECX, 80H
00401A79 JGE ___intel_new_memset+750h (00401A40)
As the listing shows, Intel's implementation uses the 128-bit XMM registers of SSE2 and places eight store instructions back to back so that they can execute in parallel. Each iteration therefore writes 128 × 8 = 1024 bits, that is, 128 bytes (the 80H step in the listing).
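The unrolled loop can be approximated in C with SSE2 intrinsics. This is only a rough sketch and not Intel's actual routine: the function name fill_sse2_sketch is hypothetical, and the byte loops for the unaligned head and the tail are assumptions added so that the function works on any pointer (MOVDQA itself requires 16-byte alignment).

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_set1_epi8, _mm_store_si128 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fills n bytes at dst with the low byte of value. */
static void fill_sse2_sketch(void *dst, int value, size_t n)
{
    unsigned char *p = (unsigned char *)dst;
    const unsigned char b = (unsigned char)value;
    /* Broadcast the byte into all 16 lanes, like XMM0 in the listing. */
    const __m128i v = _mm_set1_epi8((char)b);

    assert(dst != NULL || n == 0);

    /* Assumption: byte stores until p is 16-byte aligned. */
    while (n > 0 && ((uintptr_t)p & 15u) != 0) {
        *p++ = b;
        n--;
    }

    /* Main loop: eight aligned 16-byte stores, 80H bytes per iteration. */
    while (n >= 128) {
        __m128i *q = (__m128i *)p;
        _mm_store_si128(q + 0, v);
        _mm_store_si128(q + 1, v);
        _mm_store_si128(q + 2, v);
        _mm_store_si128(q + 3, v);
        _mm_store_si128(q + 4, v);
        _mm_store_si128(q + 5, v);
        _mm_store_si128(q + 6, v);
        _mm_store_si128(q + 7, v);
        p += 128;
        n -= 128;
    }

    /* Assumption: remaining tail bytes are written one at a time. */
    while (n > 0) {
        *p++ = b;
        n--;
    }
}
```

Calling fill_sse2_sketch(buffer, 1, size) should leave the buffer in the same state as memset(buffer, 1, size). Whether it approaches the speed of Intel's routine depends on the compiler and the CPU.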
MSC version:
for (j = 0; j ...
{
memset(lpByte, 1, SIZE);
0040103B MOV ECX, 1900000H
00401040 MOV EAX, 1010101H
00401045 MOV EDI, EBX
00401047 DEC EDX
00401048 REP STOS DWORD PTR [EDI]
0040104A JNE Threadfunc+3bh (0040103B)
In the Debug build, you can step into the assembly implementation of memset, because Microsoft ships the CRT source code. In the Release build, the optimizer expands the function call inline, as shown above. But because that implementation uses only plain 386 instructions and stores the data one DWORD at a time, there is a large difference in performance.
In addition, if the size in the test code is defined as a smaller value, such as 1024 * 128, then on a P4 with a 512K L2 cache the two methods perform almost the same. This shows how much locality of access improves performance.
Note: The example above gives the same result in VC6 and VC7. If you compile with the Intel compiler, you can simply call memset: the Intel compiler turns each memset call into a call to __vec_memset and then links against Intel's runtime library.
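The MSC sequence (MOV EAX, 1010101H followed by REP STOS DWORD PTR [EDI]) can be mirrored in C as below. This is a sketch for illustration, not the CRT source: the names replicate_byte and fill_dwords_sketch are hypothetical. The count loaded into ECX, 1900000H DWORDs, corresponds to 1900000H × 4 = 104,857,600 bytes, that is, 100 MB per call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Replicate one byte into all four bytes of a DWORD: 1 -> 0x01010101. */
static uint32_t replicate_byte(unsigned char b)
{
    return (uint32_t)b * 0x01010101u;
}

/* Hypothetical helper: stores n_dwords copies of the replicated pattern,
   one 32-bit store per step, like REP STOS DWORD with ECX = n_dwords. */
static void fill_dwords_sketch(uint32_t *dst, unsigned char b, size_t n_dwords)
{
    const uint32_t pattern = replicate_byte(b);
    size_t i;

    assert(dst != NULL || n_dwords == 0);

    for (i = 0; i < n_dwords; i++) {
        dst[i] = pattern;
    }
}
```

Each step here moves 4 bytes, against 16 bytes per MOVDQA store in the Intel loop, which is consistent with the gap the article reports on buffers much larger than the cache.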