Win32 multi-threaded performance (2)

zhaozj2021-02-11  135


Author: Microsoft company Feeds

Ruediger R. Asche
Microsoft Developer Network Technology Group

The Inner Workings of ConcurrentExecution

Please note: The discussion in this section is very technical and assumes that you know a lot about the Win32 thread API. If you are more interested in how the ConcurrentExecution class is used to collect test data than in how ConcurrentExecution::DoForAllObjects is implemented, you can skip ahead to the section "Using ConcurrentExecution to Test Thread Performance" below.

Let's start with DoSerial, mostly because it is a no-brainer:

    BOOL ConcurrentExecution::DoSerial(int iNoOfObjects, long *ObjectArray,
                                       CONCURRENT_EXECUTION_ROUTINE pProcessor,
                                       CONCURRENT_FINISHING_ROUTINE pTerminator)
    {
        for (int iLoop = 0; iLoop ...
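Because the DoSerial listing breaks off above, here is a minimal sketch of what a serial "process, then terminate" loop can look like. It is written in portable C++ rather than against the Win32 types, and the names doSerialSketch, Processor, and Terminator are hypothetical stand-ins, not identifiers from Thrdlib.cpp.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-ins for CONCURRENT_EXECUTION_ROUTINE and
// CONCURRENT_FINISHING_ROUTINE; these aliases are not from Thrdlib.
using Processor  = std::function<long(long)>;        // object -> result
using Terminator = std::function<void(long, long)>;  // (object, result)

// Process every object in array order on the calling thread and hand each
// result to the terminator before the next object is started.
bool doSerialSketch(const std::vector<long>& objects,
                    const Processor& processor,
                    const Terminator& terminator)
{
    assert(processor && terminator);  // both callbacks are required
    for (long object : objects) {
        terminator(object, processor(object));
    }
    return true;
}
```

In this arrangement the terminator for an object always runs before the processor for the next object starts, which is why the order of the array matters so much for serial response times.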

The following code is taken from the file Thrdlib.cpp, but it has been streamlined for clarity:

    int ConcurrentExecution::DoForAllObjects(int iNoOfObjects, long *ObjectArray,
                                             CONCURRENT_EXECUTION_ROUTINE pObjectProcessor,
                                             CONCURRENT_FINISHING_ROUTINE pObjectTerminated)
    {
        int iLoop, iEndLoop;
        DWORD iThread;
        DWORD iArrayIndex;
        DWORD dwReturnCode;
        DWORD iCurrentArrayLength = 0;
        BOOL bWeFreedSomething;
        char szBuf[70];

        m_iCurrentNumberOfThreads = iNoOfObjects;
        HANDLE *hPnt = (HANDLE *)VirtualAlloc(NULL,
                                              m_iCurrentNumberOfThreads * sizeof(HANDLE),
                                              MEM_COMMIT, PAGE_READWRITE);
        for (iLoop = 0; iLoop ...

        ...
        // We want to do this before looking for a new slot, so that we can
        // call the terminator of the thread right away...
        iArrayIndex = WaitForMultipleObjects(iCurrentArrayLength, m_hThreadArray, FALSE, 0);
        if (iArrayIndex == WAIT_TIMEOUT)   // no slot free...
        {
            if (iCurrentArrayLength >= m_iMaxArraySize)
            {
                iArrayIndex = WaitForMultipleObjects(iCurrentArrayLength, m_hThreadArray,
                                                     FALSE, INFINITE);
                bWeFreedSomething = TRUE;
            }
            else   // we can still open up a slot somewhere, so do that now...
            {
                iCurrentArrayLength++;
                iArrayIndex = iCurrentArrayLength - 1;
            };   // else iArrayIndex points to a thread that has been nuked
        }
        else bWeFreedSomething = TRUE;
        };   // at this point, iArrayIndex contains a valid index at which to store the new thread
        hNewThread = hPnt[iLoop];
        ResumeThread(hNewThread);
        if (bWeFreedSomething)
        {
            GetExitCodeThread(m_hThreadArray[iArrayIndex], &dwReturnCode);   // error?
            CloseHandle(m_hThreadArray[iArrayIndex]);
            pObjectTerminated((void *)m_hObjectArray[iArrayIndex], (void *)dwReturnCode);
        };
        m_hThreadArray[iArrayIndex] = hNewThread;
        m_hObjectArray[iArrayIndex] = ObjectArray[iLoop];
        };   // end of loop

The heart of DoForAllObjects is hPnt, an array of objects that is allocated when the ConcurrentExecution object is constructed. The array can hold as many threads as the maximum degree of concurrency specified in the constructor; each element of the array is therefore a "slot" in which one computation resides. The algorithm that decides how slots are filled and freed works as follows. The object array is traversed from beginning to end, and for each object we do the following:

If no slot has been filled yet, we fill the first slot in the array with the current object and resume the thread that will process the current object.

If at least one slot is in use, we call WaitForMultipleObjects to determine whether any of the running computations has finished; if so, we call the terminator on that object and "reclaim" the slot for the new object. Note that we could instead fill every free slot first, until no slots remain, and only then start refilling vacated slots. If we did that, however, the terminator for a vacated slot would not be called until all slots had been filled, which would violate our requirement that the terminator be called as soon as the processor has finished with an object.
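The slot scheme described here can be sketched in portable C++. This is an illustration under stated assumptions, not the Thrdlib implementation: std::thread, a mutex, and a condition variable stand in for suspended Win32 threads and WaitForMultipleObjects, and the names doForAllObjectsSketch, Processor, and Terminator are hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

using Processor  = std::function<long(long)>;        // hypothetical alias
using Terminator = std::function<void(long, long)>;  // hypothetical alias

// Run processor on every object with at most maxSlots threads at a time.
// Terminators run on the calling thread as soon as a finished slot is noticed.
// Returns the number of slots that were ever opened.
int doForAllObjectsSketch(const std::vector<long>& objects, int maxSlots,
                          const Processor& processor, const Terminator& terminator)
{
    assert(maxSlots > 0);
    struct Slot { std::thread worker; long object = 0; long result = 0; };
    std::vector<Slot> slots(maxSlots);
    std::vector<int> freeSlots;   // reclaimed slot indices
    std::deque<int> finished;     // slots whose processor has returned
    std::mutex m;
    std::condition_variable cv;
    int slotsOpened = 0, running = 0;

    // Collect finished slots; with block == true, sleep until at least one exists.
    auto reapFinished = [&](bool block) {
        std::deque<int> done;
        {
            std::unique_lock<std::mutex> lock(m);
            if (block) cv.wait(lock, [&] { return !finished.empty(); });
            done.swap(finished);
        }
        for (int i : done) {
            slots[i].worker.join();
            terminator(slots[i].object, slots[i].result);  // called right away
            freeSlots.push_back(i);
            --running;
        }
    };

    for (long object : objects) {
        reapFinished(false);                        // poll first (zero timeout)
        if (freeSlots.empty() && slotsOpened < maxSlots)
            freeSlots.push_back(slotsOpened++);     // open a fresh slot
        if (freeSlots.empty())
            reapFinished(true);                     // all slots busy: wait for one
        int i = freeSlots.back();
        freeSlots.pop_back();
        slots[i].object = object;
        ++running;
        slots[i].worker = std::thread([&, i, object] {
            long r = processor(object);
            std::lock_guard<std::mutex> lock(m);
            slots[i].result = r;
            finished.push_back(i);
            cv.notify_one();
        });
    }
    while (running > 0) reapFinished(true);         // drain the remaining threads
    return slotsOpened;
}
```

Polling before opening a fresh slot is the design choice the text argues for: a finished computation gets its terminator immediately instead of waiting until every slot has been used once.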

Finally, there may be no slot left at all (that is, the number of currently active threads equals the maximum degree of concurrency allowed by the ConcurrentExecution object). In this case, WaitForMultipleObjects is called again to put DoForAllObjects to sleep until a slot is vacated; as soon as that happens, the terminator is called on the object in the vacated slot, and the thread that works on the current object is resumed.

Eventually, all computations have either finished or occupy a slot in the object array. The following code handles all the remaining threads:

        iEndLoop = iCurrentArrayLength;
        for (iLoop = iEndLoop; iLoop > 0; iLoop--)
        {
            iArrayIndex = WaitForMultipleObjects(iLoop, m_hThreadArray, FALSE, INFINITE);
            if (iArrayIndex == WAIT_FAILED)
            {
                GetLastError();
                _asm int 3;   // do something intelligent here...
            };
            GetExitCodeThread(m_hThreadArray[iArrayIndex], &dwReturnCode);   // error?
            if (!CloseHandle(m_hThreadArray[iArrayIndex]))
                MessageBox(GetFocus(), "Can't delete thread!", "", MB_OK);   // make this better...
            pObjectTerminated((void *)m_hObjectArray[iArrayIndex], (void *)dwReturnCode);
            if (iArrayIndex == iLoop - 1) continue;   // fine here, no need to backfill
            m_hThreadArray[iArrayIndex] = m_hThreadArray[iLoop - 1];
            m_hObjectArray[iArrayIndex] = m_hObjectArray[iLoop - 1];
        };

Finally, we clean up:

        if (hPnt) VirtualFree(hPnt, m_iCurrentNumberOfThreads * sizeof(HANDLE), MEM_DECOMMIT);
        return iCurrentArrayLength;
    };

Using ConcurrentExecution to Test Thread Performance

The scope of the performance test is as follows: the user of the test application Threadlibtest.exe can specify whether to test CPU-bound or I/O-bound computations, how many computations to perform, how long the computations are, how the computations are ordered (to test worst-case versus random delays), and whether the computations are executed concurrently or serially. To even out unexpected results, each test can be executed ten times, and the ten results are then averaged to produce a more reliable value.

By choosing the menu option "Run Entire Test Set", the user can request that all variations of the test variables be run. The computation lengths used in the tests vary between a base value of 10 ms and 3,500 ms (I discuss this later), and the number of computations varies between 2 and 20. If Microsoft Excel is installed on the computer that runs the test, Threadlibtest.exe dumps the results into a Microsoft Excel worksheet located at C:\Temp\Values.xls.

In any case, the result values are also saved to a plain text file located at C:\Temp\Results.fil. Note that I hard-coded the locations of the protocol files. That is pure laziness; if you need to regenerate the test results on your computer and want a different location, change the TEXTFILELOC and SHEETFILELOC identifiers near the beginning of the file Threadlibtestview.cpp and rebuild the project.

Keep in mind that running the entire test set always orders the computations for the worst case (that is, under serial execution the longest computation is executed first, followed by the second longest, and so on). This scheme puts serial execution at a disadvantage, because the response time of concurrent execution does not change in a non-worst-case scenario, whereas the response time of serial execution is likely to improve. As I mentioned earlier, in a real-world application you should analyze whether the duration of each computation is predictable.

The code that collects the performance data using the ConcurrentExecution class is located in Threadlibtestview.cpp. The sample application itself is a plain single-document interface (SDI) MFC application. All of the code relevant to the test resides in the implementation of the view class CThreadlibtestView, which is derived from CEasyOutputView. (For a discussion of that class, see "Windows NT Security in Theory and Practice.") I do not reproduce all of the code of this class here; most of it deals with statistics and user interface handling. The "meat" of the test is in CThreadlibtestView::ExecuteTest, which executes one test run cycle. The following is a sketch of the relevant code in CThreadlibtestView::ExecuteTest:

    void CThreadlibtestView::ExecuteTest()
    {
        ConcurrentExecution *ce;
        bCPUBound = ((m_iCompType & CT_IOBOUND) == 0);   // global...
        ce = new ConcurrentExecution(25);
        if (!QueryPerformanceCounter(&m_liOldVal)) return;   // get the current time
        if (!(m_iCompType & CT_IOBOUND)) timeBeginPeriod(1);
        if (m_iCompType & CT_CONCURRENT)
            m_iThreadsUsed = ce->DoForAllObjects(m_iNumberOfThreads,
                                                 (long *)m_iNumbers,
                                                 (CONCURRENT_EXECUTION_ROUTINE)pProcessor,
                                                 (CONCURRENT_FINISHING_ROUTINE)pTerminator);
        else
            ce->DoSerial(m_iNumberOfThreads,
                         (long *)m_iNumbers,
                         (CONCURRENT_EXECUTION_ROUTINE)pProcessor,
                         (CONCURRENT_FINISHING_ROUTINE)pTerminator);
        if (!(m_iCompType & CT_IOBOUND)) timeEndPeriod(1);
        delete (ce);
    }

The code first creates an object of the ConcurrentExecution class, then samples the current time (for the statistics on elapsed time and response time), and then calls either the DoSerial or the DoForAllObjects member. Note that for these runs I request a maximum degree of concurrency of 25; if you want to run tests with more than 25 computations, you should increase this value so that it is greater than or equal to the largest number of computations in your test.

Let's take a look at the processor and the terminator to see how the measurements are obtained:

    extern "C"
    {
    long WINAPI pProcessor(long iArg)
    {
        PTHREADBLOCKSTRUCT ptArg = (PTHREADBLOCKSTRUCT)iArg;
        BOOL bResult = TRUE;
        int iDelay = (ptArg->iDelay);
        if (bCPUBound)
        {
            int iLoopCount;
            iLoopCount = (int)(((float)iDelay / 1000.0) * ptArg->tbOutputTarget->m_iBiasFactor);
            QueryPerformanceCounter(&ptArg->liStart);
            for (int iCounter = 0; iCounter < iLoopCount; iCounter++);   // empty loop simulates CPU work
        }
        else
        {
            QueryPerformanceCounter(&ptArg->liStart);
            Sleep(ptArg->iDelay);
        };
        return bResult;
    }

    long WINAPI pTerminator(long iArg, long iReturnCode)
    {
        PTHREADBLOCKSTRUCT ptArg = (PTHREADBLOCKSTRUCT)iArg;
        QueryPerformanceCounter(&ptArg->liFinish);
        ptArg->iEndOrder = iEndIndex++;
        return (0);
    }
    }

The processor simulates a computation whose parameters have been placed in a THREADBLOCKSTRUCT data structure.
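The division of labor between processor and terminator (one stamps the start time, the other stamps the finish time and the completion order) can be illustrated with a portable sketch. Here std::chrono::steady_clock stands in for QueryPerformanceCounter and sleep_for stands in for Sleep; ComputationBlock, processorSketch, terminatorSketch, and elapsedMs are hypothetical names, not the article's THREADBLOCKSTRUCT or its callbacks.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;

// Hypothetical analog of the per-computation block used by the test.
struct ComputationBlock {
    int delayMs = 0;            // requested length of the computation
    Clock::time_point start;    // stamped by the processor
    Clock::time_point finish;   // stamped by the terminator
    int endOrder = -1;          // completion rank, assigned by the terminator
};

// Simulated I/O-bound processor: stamp the start time, then sleep.
long processorSketch(ComputationBlock& block)
{
    assert(block.delayMs >= 0);
    block.start = Clock::now();
    std::this_thread::sleep_for(std::chrono::milliseconds(block.delayMs));
    return 1;
}

// Terminator: stamp the finish time and record the completion order.
// The caller is assumed to invoke terminators from one thread only, so the
// plain increment of endIndex needs no lock.
void terminatorSketch(ComputationBlock& block, int& endIndex)
{
    block.finish = Clock::now();
    block.endOrder = endIndex++;
}

// Milliseconds between the start stamp and the finish stamp.
double elapsedMs(const ComputationBlock& block)
{
    return std::chrono::duration<double, std::milli>(block.finish - block.start).count();
}
```

Stamping the finish time in the terminator rather than at the end of the processor means the measurement includes any delay before the result is actually handed back, which is what a response-time figure should capture.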

THREADBLOCKSTRUCT holds the data relevant to a computation, such as its delay and its termination time (in performance counter "ticks"), as well as a back pointer to the view that instantiated the structure. An I/O-bound computation is simulated by sleeping for the specified time. A CPU-bound computation enters an empty for loop.

A few comments may help you understand what the code does. The computation is CPU-bound and is supposed to execute for the specified number of milliseconds. In earlier versions of the test program, I simply had the loop execute often enough to satisfy the specified delay, without regard for what the number actually meant. (In that code, the number really meant milliseconds for I/O-bound computations and iterations for CPU-bound computations.) However, to be able to compare CPU-bound and I/O-bound computations in absolute time, I decided to rewrite the code so that the delay associated with a computation is measured in milliseconds for both kinds.

I found that it is not trivial to write code that simulates a CPU-bound computation of a specified, predefined length. The reason is that such code cannot query the system time itself, because the call involved would yield the CPU sooner or later, which violates the requirement that the computation be CPU-bound. Trying to use asynchronous multimedia timer events was equally unsatisfactory because of the way the timer service works under Windows NT: a thread that sets a multimedia timer is effectively suspended until the timer callback is invoked, so the CPU-bound computation suddenly turns into an I/O-bound operation.

So in the end I used a cheap little trick: the code in CThreadlibtestView::OnCreate executes a loop from 1 to 100,000 one hundred times and samples the average time required to run through the loop. The result is saved in the member variable m_iBiasFactor, a floating-point number that the processor function uses to determine how milliseconds "translate" into iterations. Unfortunately, because of the highly dynamic nature of the operating system, it is hard to determine how many iterations of a given loop will actually run for a specified length of time. However, I found that this strategy does a very credible job of determining the computation time for CPU-bound computations.

Note: If you rebuild the test application, be careful with the optimization options. If you specify the "Minimize Execution Time" optimization, the compiler will detect the for loop with an empty body and remove it.

The terminator is very simple: the current time is sampled and saved in the computation's THREADBLOCKSTRUCT. After the test is over, the code computes the difference between the time ExecuteTest started and the time the terminator was called for each computation. The elapsed time for all computations is then determined by the time at which the last computation finished, and the response time is the average of the response times of the individual computations, where each response time is defined as the time elapsed from the start of the test until the thread finished, divided by the delay factor of the thread. Note that the terminators are serialized in the context of the main thread, so the increment of the shared iEndIndex variable is safe.

That is really all there is to the test; the remainder of the code mostly sets up parameters for the test runs and performs some arithmetic on the results. The logic that fills the results into a Microsoft Excel worksheet is discussed in "Interacting with Microsoft Excel: A Case Study in OLE Automation."
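The calibration trick can be sketched as follows. This is an illustration in portable C++, not the OnCreate code: the function names are hypothetical, std::chrono::steady_clock replaces the performance counter, and the unit of the bias factor (iterations per second) is an inference from the expression (iDelay / 1000.0) * m_iBiasFactor in the processor. The loop counter is declared volatile so that an optimizing compiler does not delete the empty loop.

```cpp
#include <cassert>
#include <chrono>

// Busy loop with an empty body; `volatile` keeps the optimizer from removing it.
void spin(long iterations)
{
    for (volatile long i = 0; i < iterations; i = i + 1) { }
}

// Time `samples` runs of a loop of `loopSize` iterations and return the
// measured speed in iterations per second (0 if the clock saw no time pass).
double calibrateBiasFactor(int samples = 100, long loopSize = 100000)
{
    assert(samples > 0 && loopSize > 0);
    using Clock = std::chrono::steady_clock;
    double totalSeconds = 0.0;
    for (int s = 0; s < samples; ++s) {
        auto t0 = Clock::now();
        spin(loopSize);
        totalSeconds += std::chrono::duration<double>(Clock::now() - t0).count();
    }
    double averageSeconds = totalSeconds / samples;
    if (averageSeconds <= 0.0) return 0.0;
    return loopSize / averageSeconds;
}

// Translate a delay in milliseconds into a loop count for the busy loop.
long iterationsForDelay(int delayMs, double biasFactor)
{
    return static_cast<long>((delayMs / 1000.0) * biasFactor);
}
```

A caller would calibrate once at startup and then run spin(iterationsForDelay(delayMs, bias)) for each simulated CPU-bound computation. As the text warns, the measured factor can drift with system load, so the resulting durations are approximate.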

Results

If you want to re-create the test results on your computer, do the following:

1. If you need to change test parameters, such as the maximum number of computations or the locations of the protocol files, edit Threadlibtestview.cpp in the THRDPERF sample project and rebuild the application. (Note that your computer must support long filenames for you to build the application.)

2. Make sure that the file Thrdlib.dll is in a location where Threadlibtest.exe can link to it.

3. If you want to use Microsoft Excel to view the test results, make sure that Microsoft Excel is properly installed on the computer that runs the test.

4. Run Threadlibtest.exe from Windows 95 or Windows NT and choose "Run Entire Test Set" from the "Run Performance Tests" menu. A test run normally takes several hours to complete.

5. After the test is over, inspect the results either in the plain text protocol file C:\Temp\Results.fil or in the worksheet file C:\Temp\Values.xls. Note that the Microsoft Excel automation logic does not generate charts from the raw data automatically; I used a few macros to rearrange the results and generate the charts for you. I hate number crunching, but I have to praise Microsoft Excel: even for a spreadsheet-paranoid person like me, its user interface is nice enough that it took only a few minutes to turn a few columns of data into useful charts.

The test results I present were collected on a 486/33 MHz system with 16 MB of RAM. Both Windows NT (version 3.51) and Windows 95 are installed on that computer, so the test results for the two operating systems are comparable because they were obtained on the same hardware.

So, let us interpret the values. The charts below summarize the results, and here is how to read them. Each chart contains six values (except for the long-computation elapsed-time charts, which contain only five values because the timer overflowed for the very long computations during my test run). Each value represents a number of computations; I ran each test with 2, 5, 8, 11, 14, and 17 computations. In the Microsoft Excel results worksheet you will find the results for CPU-bound and I/O-bound threads with delay biases of 10 ms, 30 ms, 90 ms, 270 ms, 810 ms, and 2430 ms, but in the charts I include only the results for 10 ms and 2430 ms, so that the numbers stay simple and easy to follow.

I need to explain what "delay bias" means: if a test is run with a delay bias of n, each computation has a multiple of n as its computation time.

For example, if the test consists of 5 computations with a delay bias of 10, one of the computations executes for 50 ms, the second for 40 ms, the third for 30 ms, the fourth for 20 ms, and the fifth for 10 ms. Also, when the computations are executed serially, the worst case is assumed, so the computation with the longest delay is executed first and the other computations follow in descending order. Thus, in the "ideal" case (that is, with no overlap between the computations), the total time required for the CPU-bound computations would be 50 ms + 40 ms + 30 ms + 20 ms + 10 ms = 150 ms.

In the elapsed-time charts, the values on the y-axis are milliseconds; in the response-time charts, the values on the y-axis are relative turnaround lengths (that is, the number of milliseconds actually taken divided by the number of milliseconds expected).

Figure 1. Short computation elapsed time comparison, under Windows NT
Figure 2. Long computation elapsed time comparison, under Windows NT
Figure 3. Short computation response time comparison, under Windows NT
Figure 4. Long computation response time comparison, under Windows NT
Figure 5. Short computation elapsed time comparison, under Windows 95
Figure 6. Long computation elapsed time comparison, under Windows 95
Figure 7. Short computation response time comparison, under Windows 95
Figure 8. Long computation response time comparison, under Windows 95

I/O-bound tasks

Measured both in elapsed time and in turnaround time, concurrent execution of I/O-bound threads is far better than serial execution. As a function of the number of computations, the elapsed time grows linearly for concurrent execution and exponentially for serial execution (see Figures 1 and 2 for Windows NT and Figures 5 and 6 for Windows 95).

Note that this conclusion is consistent with the earlier analysis: I/O-bound computations are excellent candidates for multithreading because a thread is suspended while it waits for an I/O request to complete, and during that time it does not take up any CPU time, which other threads can use instead.

For concurrent computations, the average response time is constant. For serial computations, the average response time grows with the number of computations (see Figures 3, 4, 7, and 8).

Note that when only a few computations are executed, there is no noticeable difference between serial and concurrent execution, regardless of how the test parameters are set.

CPU-bound tasks

As mentioned earlier, on a single-processor computer CPU-bound tasks cannot execute faster concurrently than serially, but we can see that under Windows NT the overhead of thread creation and thread switching is very small; for very short computations, concurrent execution is only about 10% slower than serial execution, and as the computation length grows, the two times become very close. Looking at response time, we find that for long computations the gain of concurrent over serial execution can reach 50%, but for short computations serial execution is actually better than concurrent execution.
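The delay bias arithmetic and the difference between serial and concurrent elapsed time can be worked through with a small model. This is an idealized sketch, not part of the test program: it assumes zero thread overhead and, for I/O-bound work, complete overlap of the waits. All function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Delays for `count` computations with delay bias `biasMs`, worst case first:
// count * bias, (count - 1) * bias, ..., bias.
std::vector<int> worstCaseDelays(int count, int biasMs)
{
    assert(count > 0 && biasMs > 0);
    std::vector<int> delays;
    for (int k = count; k >= 1; --k) delays.push_back(k * biasMs);
    return delays;
}

// Ideal serial elapsed time: the computations run back to back.
int serialElapsedMs(const std::vector<int>& delays)
{
    return std::accumulate(delays.begin(), delays.end(), 0);
}

// Ideal concurrent elapsed time for I/O-bound work: the waits overlap
// completely, so the longest delay determines the total.
int concurrentIoElapsedMs(const std::vector<int>& delays)
{
    return delays.empty() ? 0 : *std::max_element(delays.begin(), delays.end());
}
```

For 5 computations with a delay bias of 10, the model yields delays of 50, 40, 30, 20, and 10 ms, an ideal serial elapsed time of 150 ms, and an ideal concurrent I/O-bound elapsed time of 50 ms. The gap widens as computations are added, which matches the shape of the I/O-bound charts.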

Comparison between Windows 95 and Windows NT

If we look at the charts for long computations (that is, Figures 2, 4, 6, and 8), we find that the behavior under Windows 95 and Windows NT is remarkably similar. Do not be misled by the fact that, unlike Windows NT, Windows 95 appears to treat I/O-bound and CPU-bound computations differently. I attribute that result to the algorithm I use to determine how many test loop iterations correspond to 1 millisecond (as described earlier); I found that the results of this algorithm can differ by as much as 20% between runs in the same environment. Comparing CPU-bound computations with I/O-bound computations is therefore not really fair.

One point where Windows 95 and Windows NT differ is in short computations. As Figures 1 and 5 show, Windows NT does much better on concurrent I/O-bound short computations. I attribute this result to a more efficient thread creation scheme. Note that the difference between serial and concurrent I/O-bound operation disappears for long computations, so we are dealing with a fixed and relatively small overhead.

For short computations measured in response time (Figures 3 and 7), note that under Windows NT there is a break-even point at around 10 threads, beyond which concurrent execution of more computations gives better results, whereas under Windows 95 serial execution performs better.

Note that these comparisons are based on the current versions of the operating systems (Windows NT 3.51 and Windows 95). As the operating systems evolve, their threading engines are very likely to be enhanced, so the differences in behavior between the two may well disappear. It is interesting to note, however, that short computations generally do not benefit from multithreading, especially under Windows 95.

Recommendations

These results lead to the following recommendations: the most important factor in multithreading performance is the ratio of I/O-bound to CPU-bound computations, and the main criterion for deciding whether to use multithreading at all is foreground user responsiveness.

Let us assume that you have several subcomputations that could potentially be executed in different threads. To decide whether multithreading makes sense for these computations, consider the following points.

If an analysis of user interface responsiveness determines that something should be done in a second thread, determine whether the tasks to be executed are I/O-bound or CPU-bound. I/O-bound computations are good candidates for being moved to background threads. (Note, however, that asynchronous single-threaded I/O may do better than multithreaded synchronous I/O, depending on the problem.)

Very long CPU-bound computations may benefit from being executed in separate threads; however, unless the response time of each thread is very important, it may make more sense to execute all the CPU-bound tasks in the same background thread than in different threads. Remember that in any case, short computations generally incur a very high overhead when they are executed concurrently.

If response time is most critical for the CPU-bound computations (that is, if the result of each computation can be used as soon as it is available), you should try to determine whether the computations can be sorted in ascending order, in which case the overall performance of serial execution will still be better than that of parallel execution.
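The effect of ordering on serial response time can be checked with a short model. This sketch uses the relative response measure described in the Results section (time from the start of the test until a computation finishes, divided by that computation's own delay) and assumes ideal back-to-back serial execution with no overhead. The function name averageSerialResponse is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Average relative response time of an ideal serial run. For each
// computation: (time from test start until it finishes) / (its own delay).
double averageSerialResponse(const std::vector<int>& delaysMs)
{
    assert(!delaysMs.empty());
    double finishedAt = 0.0;   // running completion time in ms
    double sum = 0.0;          // sum of the relative response times
    for (int d : delaysMs) {
        assert(d > 0);
        finishedAt += d;
        sum += finishedAt / d;
    }
    return sum / delaysMs.size();
}
```

For the five computations with a delay bias of 10, the worst-case (descending) order gives an average relative response of 5.85, while the ascending order gives 2.0. The total elapsed time is 150 ms in both cases; only the order in which results become available changes.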

