To test and optimize program performance in earnest, especially for multi-threaded programs developed for Windows servers, the standard clock provided by the operating system is not enough; a high-resolution clock must be used. This article describes how to access the processor's billionth-of-a-second clock, which greatly improves the speed and accuracy of code performance testing.

First, some background on timing data. As with other Windows servers, Windows 2003 Server delivers its greatest performance advantage with multi-threaded programs. It supports a variety of multiprocessor systems and can also run on single-processor Pentium 4 systems. On a single-processor Pentium 4 system, Windows 2003 Server takes advantage of the multiple hardware thread execution engines provided by Intel Hyper-Threading.

People who develop server applications know that only one thing matters: performance. However, it is well known that performance improvement is a fuzzy goal, because the performance of multi-threaded code can usually only be estimated from experience. In a single-threaded program, a performance improvement can generally be predicted, for example by counting instructions and high-latency operations. Multi-threaded code is different, because thread scheduling on the Windows platform is non-deterministic: an application can ask the scheduler to run a thread, but when (or whether) the scheduler runs that thread is well outside the control of the application code.

When testing performance, developers soon run into a problem: the standard clock built into Windows is not precise enough. Its resolution is too coarse to measure the duration of a short event, which makes it very hard to tell whether a code snippet has actually been optimized. If you must test with the Windows standard clock, you have to run the code in a loop thousands or millions of times to get valid timing data. In most cases, adding such a loop means modifying the application.
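The loop-based workaround described above can be sketched as follows. This is only an illustration, assuming the standard C clock() function stands in for the coarse standard clock; work_under_test and average_seconds_per_call are hypothetical names, not part of the article's code.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for the code being measured. */
volatile unsigned long work_counter = 0;

void work_under_test(void)
{
    work_counter += 1;
}

/* Calls fn() `iterations` times and returns the average time per call
   in seconds, as reported by the coarse standard clock(). */
double average_seconds_per_call(void (*fn)(void), unsigned long iterations)
{
    clock_t start, end;
    unsigned long i;

    assert(fn != NULL && iterations > 0);

    start = clock();
    for (i = 0; i < iterations; ++i) {
        fn();                       /* repeat so the total exceeds one clock tick */
    }
    end = clock();

    return ((double)(end - start) / CLOCKS_PER_SEC) / (double)iterations;
}
```

A call such as average_seconds_per_call(work_under_test, 1000000UL) returns an average over a million runs. The drawback is the one the text names: the code under test has to be wrapped in a loop, and only an average is obtained, never the cost of a single run.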
In fact, there is a better way: the Win32 high-resolution clock, which involves two functions, QueryPerformanceCounter() and QueryPerformanceFrequency(). On Intel systems, starting with the Pentium II, these functions depend on a counter built into the Pentium chip. When an Intel system starts, a 64-bit register begins tracking elapsed clock cycles, which makes it a highly desirable timing device.

All 64 bits of the register are needed. A 32-bit integer can count to only about 2 billion. On a processor that runs 2 to 3 billion cycles per second, a 32-bit counter overflows in a second or less. A 64-bit counter can hold at least 2 billion times as many counts, which corresponds to at least 2 billion seconds, or about 63 years. That is far more than any program requires.

To time an event, read the clock count before the event and again after it. The following code does not depend on Win32 (that is, it reads the counter directly from C/C++); the functions provided by the operating system are covered later. We first define a data structure, followed by the code that fills it in:
typedef struct _bigint32 { __int32 i32[2]; } bigint32;
typedef struct _bigint64 { __int64 i64; } bigint64;
typedef union _bigint {
    bigint32 int32val;   /* the counter as two 32-bit halves */
    bigint64 int64val;   /* the same 8 bytes as one 64-bit value */
} bigint;
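How the union lets two 32-bit halves be read back as one 64-bit value can be sketched in portable C. This is an illustration under assumptions: the fixed-width types from stdint.h stand in for __int32 and __int64, and tick_union, combine_by_union, combine_by_shift and is_little_endian are hypothetical names. The union approach relies on little-endian byte order, which holds on x86.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for the article's union (names are illustrative). */
typedef union {
    uint32_t half[2];  /* half[0] is the low word on little-endian CPUs such as x86 */
    uint64_t whole;
} tick_union;

/* Returns 1 when the least significant byte is stored first. */
int is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/* Combines the halves through the union, the way the structure above is used. */
uint64_t combine_by_union(uint32_t low, uint32_t high)
{
    tick_union t;
    t.half[0] = low;
    t.half[1] = high;
    return t.whole;
}

/* Endian-independent alternative that builds the value with a shift. */
uint64_t combine_by_shift(uint32_t low, uint32_t high)
{
    return ((uint64_t)high << 32) | low;
}
```

On a little-endian machine both functions return the same value for the same inputs. The shift version is the safer choice when the code may be compiled for other architectures.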
The following code reads the high and low halves of the clock counter from the processor and stores them in the two halves of the __int64 value:
bigint start_ticks, end_ticks;
_asm {
    rdtsc
    mov start_ticks.int32val.i32[0], eax   ; low 32 bits
    mov start_ticks.int32val.i32[4], edx   ; high 32 bits (the index is a byte offset in inline assembly)
}

This code runs without problems in Visual Studio .NET 2003, and there should be no problem with earlier C/C++ compilers either. RDTSC (Read Time Stamp Counter) is an assembly instruction that loads the contents of the time-stamp counter into the EAX and EDX registers, with the low half in EAX and the high half in EDX. After the code above executes, start_ticks holds a complete clock count. Repeat the code with start_ticks replaced by end_ticks, then subtract start_ticks from end_ticks to get the number of clock cycles that elapsed between the two reads. To print this __int64 value, you can use the following printf() format:
printf("Function used %I64d ticks\n", end_ticks.int64val.i64 - start_ticks.int64val.i64);
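The subtraction above, and the earlier estimate of how long a counter lasts before it overflows, can be checked with a small portable sketch. The function names are hypothetical, and the tick rate is an assumption supplied by the caller (for QueryPerformanceCounter() it is the value reported by QueryPerformanceFrequency()).

```c
#include <assert.h>
#include <stdint.h>

/* Ticks elapsed between two reads of a 64-bit counter. Unsigned
   subtraction stays correct even if the counter wrapped once. */
uint64_t elapsed_ticks(uint64_t start, uint64_t end)
{
    return end - start;
}

/* Converts a tick count to seconds. ticks_per_second is an assumed,
   caller-supplied counter frequency. */
double ticks_to_seconds(uint64_t ticks, double ticks_per_second)
{
    assert(ticks_per_second > 0.0);
    return (double)ticks / ticks_per_second;
}

/* Seconds until a counter that is `bits` wide (1 to 64) wraps around
   when it advances at ticks_per_second. */
double seconds_until_wrap(unsigned bits, double ticks_per_second)
{
    double counts = 1.0;
    unsigned i;

    assert(bits >= 1 && bits <= 64 && ticks_per_second > 0.0);
    for (i = 0; i < bits; ++i) {
        counts *= 2.0;              /* counts = 2^bits */
    }
    return counts / ticks_per_second;
}
```

At an assumed rate of 3 billion ticks per second, seconds_until_wrap(31, 3.0e9) is about 0.72 seconds, in line with the "a second or less" figure for a 32-bit signed count, while seconds_until_wrap(64, 3.0e9) is about 6.1 billion seconds, comfortably beyond the conservative 63-year estimate.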