(Reproduced) Win32 multi-thread performance

zhaozj  2021-02-16

Win32 multi-threaded performance

Ruediger R. Asche

Microsoft Developer Network Technology Group

Abstract

This article discusses strategies for rewriting a single-threaded application as a multithreaded application. Using Microsoft Windows 95 and Windows NT as examples, it analyzes how multithreaded computations perform compared with equivalent single-threaded computations.

Introduction

Most of the available material on multithreading deals with synchronization concepts, for example, how to serialize threads that share common data. That focus is understandable, because synchronization is an indispensable part of multithreaded programming. This article takes a step back and looks at an aspect of multithreading that is rarely discussed: deciding how a computation can meaningfully be split into multiple threads. The sample application used here, THRDPERF, implements a test suite on both Microsoft Windows 95 and Windows NT that executes the same computations serially and concurrently and compares the two in terms of throughput and performance.

The first part of this article establishes some vocabulary for multithreaded applications, discusses the scope of the test suite, and describes how the sample application suite is designed. The second part discusses the test results and includes recommendations for multithreaded application design. The article "Interacting with Microsoft Excel: A Case Study in OLE Automation" covers an interesting side issue of the sample application suite: how the data obtained from the test set is passed to Microsoft Excel using OLE Automation.

If you are an experienced multithreaded application programmer, you can skip the introductory sections and go directly to the "Results" section below.

Multithreading Vocabulary

Your application has been around for a long time. It runs well, it is reliable, the whole bit, but it is awfully slow, and you have an idea about how multithreading could help.

However, wait a while before you start, because there are many pitfalls that can make you believe a particular multithreaded design is perfect when in fact it is not. Before you jump to conclusions, let us clarify what this article will not discuss:

The libraries that provide access to multithreading on top of the Microsoft Win32 Application Programming Interface (API) differ, but we do not care about that here. The sample application, Threadlibtest.exe, is written as a Microsoft Foundation Class Library (MFC) application, but whether you use the Microsoft C run-time (CRT) library, the MFC library, or the barebones Win32 API to create and maintain threads makes no difference to us. In fact, every library eventually calls the Win32 system service CreateThread to create a worker thread, and the multithreading itself is always carried out by the operating system. Which wrapper mechanism you use does not affect the topics of this article. Of course, using one wrapper library or another may cause performance differences, but here we discuss the essence of multithreading and do not care about its packaging.

This article discusses multithreaded applications running on single-processor machines. Multiprocessor computers are a completely different topic, and almost none of the conclusions drawn here can be applied to multiprocessor machines. I have not yet had the opportunity to run the sample on a scalable symmetric multiprocessing (SMP) machine running Windows NT. If you have that opportunity, I would be very happy to hear about your results.

In this article, I will refer generically to "computations." A computation is defined as a subtask of your application that can be executed in whole or in part before, after, or at the same time as other computations. For example, let us assume that an application requires data from the user and needs to save that data to disk. We can assume that entering the data is one computation and saving the data is another.

Depending on the design of the application, two cases are possible: either the saving of data is interleaved with the input of new data, or the data is not saved to disk until the user has entered all of it. The first case can generally be implemented using some form of multithreading; we call this way of organizing computations concurrent or interleaved. The second case can generally be implemented with a single-threaded application; in this article, it is called serial execution.

Designing an application for concurrency is a very complex process. People who are good at it tend to be very well paid (they make a ton of money), because working out how far a given task can be executed concurrently usually takes years of study. This article does not try to teach you how to design multithreaded applications. Instead, I want to point out some of the problems in multithreaded application design, and I use real-life performance measurements to discuss my examples. After reading this article, you should be able to look at a given design and determine whether it improves the overall performance of the application.

Part of the design process for a multithreaded application is deciding where conflicting data access from multiple threads could potentially corrupt data, and how to use thread synchronization to avoid those conflicts.

This task (which this article will call thread serialization from now on) is the topic of many articles about multithreading (for example, "Synchronization on the Fly" or "Compound Win32 Synchronization Objects" in the MSDN Library) and will not be discussed here. Throughout this article, we assume that the computations to be executed concurrently do not share any data and therefore do not require any thread serialization. This convention may seem a bit harsh, but keep in mind that it is impossible to discuss the performance of synchronized multithreaded applications "in general," because each serialization imposes a unique waiting-and-waking pattern on the threads involved, and that pattern directly affects performance.

Most input/output (I/O) operations under Win32 come in two forms: synchronous and asynchronous. It has been shown that in many cases a multithreaded design that uses synchronous I/O can be emulated by a single-threaded design that uses asynchronous I/O. This article does not discuss asynchronous single-threaded I/O as an alternative to multithreading; however, I recommend that you consider both designs. Note that the Win32 I/O system provides some mechanisms that favor asynchronous I/O over synchronous I/O (for example, I/O completion ports). I plan to discuss synchronous versus asynchronous I/O in a later article.

As pointed out in the article "Multiple Threads in the User Interface," multithreading and graphical user interfaces (GUIs) do not work together well. In this article, I assume that background threads can do their work without using the Windows GUI at all; the only type of thread I deal with is the "worker thread," which only computes in the background and does not need to interact with the user directly.

There are bounded computations, and there are also unbounded computations. A "listening" thread in a server-side application is an example of an unbounded computation: it serves no purpose other than waiting for a client to connect to the server. After a client has connected, the thread sends a notification to the main thread and returns to the "listening" state until the next client connects. Naturally, such a computation cannot reside in the same thread as the application's user interface (UI) unless asynchronous I/O is used. (Note that this particular problem can also be solved with asynchronous I/O and completion ports rather than multithreading; I use the example here only for demonstration.) In this article, I consider only bounded computations, that is, subtasks of the application that end after a limited period of time.

CPU-Bound and I/O-Bound Computations

One of the most important factors in deciding whether executing a given computation in a separate thread is a good solution is whether the computation is CPU-bound or I/O-bound. A CPU-bound computation is one in which the CPU is "busy" most of the time. Typical CPU-bound computations are the following:

Complex mathematical computations, such as complex-number calculations, graphics processing, or off-screen graphics computations.

Operations on file images that reside in memory, such as searching for a given string in the memory image of a text file.

In contrast, an I/O-bound computation is one that spends most of its time waiting for I/O requests to complete. In most operating systems, device I/O in progress is handled asynchronously, either by a dedicated I/O processor or by an efficient interrupt handler, and an I/O request from an application suspends the calling thread until the I/O completes. In general, a thread that spends most of its time waiting for I/O requests does not compete with other threads for CPU time; therefore, I/O-bound computations may degrade the performance of other threads less than CPU-bound threads do. (I will explain this argument later.) Note, however, that this comparison is theoretical. Most computations are neither purely I/O-bound nor purely CPU-bound, but contain both I/O-bound and CPU-bound parts. The same set of computations may behave one way in one scenario and differently in another, depending on the relative split between its CPU-bound and I/O-bound parts.

Goals of Multithreaded Design

Before you multithread your application, you should ask yourself what the goal of the conversion is. Multithreading has several potential benefits:

Better performance
Better throughput
Better responsiveness to the user

Let us discuss each of these benefits in turn.

Performance

For the time being, let us simply define "performance" as the total elapsed time consumed by a given computation or set of computations. By this definition, performance comparisons make sense only for bounded computations. Believe it or not, there are only limited cases in which multithreading improves the performance of an application. The reason is not very obvious, but it makes perfect sense:

Unless the application runs on a multiprocessor machine (in which case the subcomputations really do execute in parallel), a CPU-bound computation cannot execute faster in multiple threads than in a single thread. Whether the computations are broken into small pieces (the multithreaded case) or large chunks (executed one after another in the same thread), there is only one CPU, and it must execute all of them. As a result, a given set of computations generally takes longer when executed in multiple threads than when executed serially, because of the added overhead of creating the threads and switching the CPU between them.

In general, there is some point at which the results of the computations must be synchronized, regardless of which computation finishes first. For example, if multiple threads are used to read multiple files into memory, the order in which the files are processed does not matter, but the application must wait until all the data has been read into memory before it starts processing. We will discuss this idea in the "Throughput" section.
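To make the first point concrete, here is a minimal sketch of a toy cost model in portable C++. It is not part of THRDPERF: the function names, the round-robin policy, the time-slice length, and the per-switch cost are all assumptions chosen for illustration. The model charges one fixed cost every time the single CPU moves from one runnable computation to another.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical model: total elapsed time (ms) when CPU-bound computations
// run one after another on a single CPU. Entries are assumed non-negative.
long SerialTotalMs(const std::vector<long>& workMs) {
    return std::accumulate(workMs.begin(), workMs.end(), 0L);
}

// Hypothetical model: the same computations time-sliced round-robin on one CPU.
// quantumMs is the assumed slice length; switchMs is the assumed cost of
// switching the CPU to another runnable computation.
long TimeSlicedTotalMs(std::vector<long> remainingMs, long quantumMs, long switchMs) {
    assert(quantumMs > 0 && switchMs >= 0);
    std::size_t unfinished = 0;
    for (long r : remainingMs) {
        if (r > 0) ++unfinished;
    }
    long elapsed = 0;
    while (unfinished > 0) {
        for (long& r : remainingMs) {
            if (r <= 0) continue;                       // already finished
            const long slice = (r < quantumMs) ? r : quantumMs;
            r -= slice;
            elapsed += slice;
            if (r == 0) --unfinished;
            // Charge a switch only if some other computation can still run.
            const bool otherRunnable = (r > 0) ? (unfinished > 1) : (unfinished > 0);
            if (otherRunnable) elapsed += switchMs;
        }
    }
    return elapsed;
}
```

With a switch cost of zero the two totals are identical, and with any positive switch cost the time-sliced total is larger, which is the point of the argument above. Real schedulers are more complicated, so the numbers only illustrate the direction of the effect.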

In this article, we measure performance as the total elapsed time consumed by all computations.

Throughput

Throughput (or response) refers to the average turnaround time of each computation. To illustrate throughput, let us use the example of a supermarket (always a great demonstration tool for operating systems): suppose each computation is a customer being served at a checkout counter. The supermarket can open a checkout counter for each customer, or it can funnel all customers through a single counter. For the sake of the analysis, assume that there are multiple checkout counters but only one cashier (poor guy!) to serve all customers, regardless of whether the customers line up at one counter or at several. This super cashier jumps from one counter to the next at high speed, processes only one item of one customer, and then moves on to the next customer. The super cashier is like a CPU that is time-sliced among multiple computations.

As we saw in the "Performance" section, the total time needed to serve all customers does not decrease because multiple counters are open: whether customers are served at one counter or at several, it is always the same cashier who does all the work. However, things being what they are, customers still prefer the super cashier to a single checkout counter. This is because the number of items in the customers' carts varies widely: some customers have a lot of items, and others want to buy only a few. If you have ever wanted to buy just a box of granola bars and a quart of milk and were stuck behind somebody shopping for a family of 24, you know what I mean. If you are served by Clark Kent instead of waiting in line, you will not mind the time it takes to check out, because no matter what, your two items will be processed quickly. The cart full of groceries for 24 people is processed at another counter, so you can leave the checkout soon.

Throughput, then, is a measure of how many computations can be performed in a given time. For each computation, turnaround is measured by comparing two times: how long the computation actually takes to complete, and how long it would take if it were processed first. In other words, if you go to the supermarket hoping to be out in two minutes but actually spend two hours checking out your two items because you are stuck behind somebody who is buying Betty Crocker's entire 1997 production, your turnaround has been dismal.

In this article, we define the response time of a computation as the elapsed time the computation takes divided by its expected time. If a computation that should take 10 milliseconds (ms) actually takes 20 ms, its response is 2; if the same computation takes 200 ms (perhaps because another, long computation competed with it and finished first), its response is 20. Obviously, the lower the response, the better. As we will see later, when multithreading is introduced into an application, throughput may still be a practical factor even if overall performance degrades; however, for throughput to be a meaningful factor, some conditions must be met:

Each computation must be independent of the others, in that the result of any computation can be processed or used as soon as that computation finishes. If you are a member of a college football team and each player buys his own travel food in the same supermarket, it does not matter whether your items are processed first or last, how long you spend checking out your two items, or how long you wait for that to happen, because your bus will not leave until all the players have bought their food. The only difference is how you spend your waiting time: either in line waiting to check out or, if the super cashier has already served you, waiting for the others to finish. This point is important, but it is often ignored. As I mentioned earlier, most applications explicitly or implicitly synchronize their computations sooner or later. For example, if your application collects data from different files concurrently, you may want to display the results on the screen or save them to another file. In the former case (displaying the results on the screen), you should realize that most graphics systems perform some internal batching or serialization, so the display may not be updated until all output data has been collected. In the latter case (saving the results to another file), your application (or any other application) generally cannot process the file completely until the entire file has been written. So if someone or something serializes the results in some form, whether it is the application, the operating system, or even the user, the benefit you gained by processing the files concurrently may disappear.

The computations must differ significantly in length. If every customer in the supermarket takes the same time to check out, the super cashier approach has no advantage: if the cashier has to jump between three checkout counters and every customer has only 2 (or 3, 4, or n) items, every customer has to wait several times as long to check out, which is worse than having all the customers wait in a single line. Think of multithreading here as a shock absorber: short computations do not run the risk of being stuck behind long computations. Instead, they are spread over threads, where each of them takes longer than it would on its own but can still finish sooner than it would behind a long computation.

If the lengths of the computations can be determined in advance, serial processing may be better than multithreading: simply sort the computations in ascending order of length. In the supermarket example, this corresponds to ordering the customers by the number of items they have (a variation of the express-lane scheme). The idea is that customers with very few items like this scheme because they are not delayed for long, and customers with a lot of items do not mind, because their checkout takes a long time anyway and everyone in front of them has fewer items than they do.
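The effect of ordering on response can be worked through with a small sketch in portable C++. The helper names are hypothetical and not part of the sample suite; the only thing taken from the article is the definition of response as elapsed time divided by expected time. The sketch assumes the computations run strictly one after another with no other overhead.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Response as defined in this article: elapsed time divided by expected time.
double ResponseRatio(double elapsedMs, double expectedMs) {
    assert(expectedMs > 0.0);
    return elapsedMs / expectedMs;
}

// Hypothetical helper: average response when the computations run serially
// in exactly the order given.
double AverageSerialResponse(const std::vector<double>& durationsMs) {
    assert(!durationsMs.empty());
    double clockMs = 0.0;
    double sum = 0.0;
    for (double d : durationsMs) {
        clockMs += d;                     // this computation finishes at clockMs
        sum += ResponseRatio(clockMs, d);
    }
    return sum / static_cast<double>(durationsMs.size());
}

// Hypothetical helper: sort first (ascending or descending), then run serially.
double AverageSerialResponseSorted(std::vector<double> durationsMs, bool ascending) {
    if (ascending) {
        std::sort(durationsMs.begin(), durationsMs.end());
    } else {
        std::sort(durationsMs.begin(), durationsMs.end(), std::greater<double>());
    }
    return AverageSerialResponse(durationsMs);
}
```

For a 10 ms computation and a 200 ms computation, running the short one first gives responses of 1 and 1.05, while running the long one first gives responses of 1 and 21. That is why ascending order favors serial execution and descending order is its least favorable ordering.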

If you know only a range of computation times and your application cannot sort the computations, you should spend some time on a worst-case analysis. In such an analysis, assume that the computations are not sorted in ascending order of length but, on the contrary, in descending order. From the point of view of response, this is the worst case, because each computation has the highest possible response as defined above.

Responsiveness

The last criterion for multithreading an application that I discuss here is responsiveness (which is close enough to "response" linguistically to confuse you). In this article, we simply define an application as responsive if its design ensures that the user can always interact with the application within a very short time (short enough that the user never gets the feeling that the application is hung). For a Win32 application with a GUI, responsiveness can be achieved very simply by delegating long computations to background threads, but the architecture needed to achieve it may require more skill. As I mentioned earlier, somebody may be waiting for a computation to return at some point, so executing a long computation in the background may require changes to the user interface (for example, a Cancel button needs to be added, and menu items that depend on the result of the computation need to be disabled).

Other Reasons for Multithreading

Besides performance, throughput, and responsiveness, other reasons may influence a multithreaded design. For example, some architectures require computations to run in a pseudo-random fashion (the example that comes to mind is a neural network of the Boltzmann machine type, in which the network shows its intended behavior only if the nodes compute asynchronously). In this article, however, I limit the discussion to the three factors mentioned above: performance, throughput, and responsiveness.

I have heard a lot of talk about abstraction mechanisms that encapsulate all the nasty aspects of multithreading in a C++ object and thus give an application all the advantages of multithreading without the drawbacks. For this article, I initially designed such an abstraction. I defined a prototype for a C++ class, ConcurrentExecution, that would contain member functions such as DoConcurrent and DoSerial. The parameters of these two member functions would be an array of generic objects and a callback function to be called on each object, concurrently or serially. The C++ class would encapsulate all the gory details of maintaining the threads and the internal data structures.

However, it was clear to me from the beginning that such an abstraction is of very limited use, because the bulk of the work in designing a multithreaded application is a task that cannot be automated: deciding how to implement the multithreading. The first restriction of ConcurrentExecution is that the callback function must not share data explicitly or implicitly and must not require any other form of synchronization. Any such synchronization would immediately sacrifice the whole abstraction and open up the "wonderful" world of synchronization traps and pitfalls, such as deadlocks, race conditions, or very complex compound synchronization objects. Likewise, calls to the UI are not allowed because, as I mentioned, the Win32 API imposes a lot of implicit synchronization on threads that call the UI.

Note that many other API subsets and libraries impose implicit synchronization on the threads that share them. These restrictions leave ConcurrentExecution with extremely limited functionality; specifically, it is an abstraction of pure worker threads (completely independent computations, limited to mathematical computations on non-overlapping memory regions). Nevertheless, implementing the ConcurrentExecution class and using it in the performance tests proved very useful, because many hidden details of multithreading surfaced while I implemented the class and designed and ran the tests. Please be aware that although the ConcurrentExecution class makes multithreading easier to handle, its implementation would need more work before it could be used in a commercial product. In particular, I omitted all error handling, which is inexcusable; but for the purposes of the tests (the only place where I use ConcurrentExecution), I assumed that errors would not occur.

The ConcurrentExecution Class

This is the prototype of the ConcurrentExecution class:

    class ConcurrentExecution
    {
    public:
        ConcurrentExecution(int iMaxNumberOfThreads);
        ~ConcurrentExecution();
        int DoForAllObjects(int iNoOfObjects, long *ObjectArray,
                            CONCURRENT_EXECUTION_ROUTINE pObjectProcessor,
                            CONCURRENT_FINISHING_ROUTINE pObjectTerminated);
        BOOL DoSerial(int iNoOfObjects, long *ObjectArray,
                      CONCURRENT_EXECUTION_ROUTINE pObjectProcessor,
                      CONCURRENT_FINISHING_ROUTINE pObjectTerminated);
    };

The class is exported from the library Thrdlib.dll, which is one of the projects in the THRDPERF sample test suite. Before discussing the internals of the class, let us first discuss the semantics of its member functions:

    ConcurrentExecution::ConcurrentExecution(int iMaxNumberOfThreads)
    {
        m_iMaxArraySize = min(iMaxNumberOfThreads, MAXIMUM_WAIT_OBJECTS);
        m_hThreadArray = (HANDLE *)VirtualAlloc(NULL, m_iMaxArraySize * sizeof(HANDLE),
                                                MEM_COMMIT, PAGE_READWRITE);
        m_hObjectArray = (DWORD *)VirtualAlloc(NULL, m_iMaxArraySize * sizeof(DWORD),
                                               MEM_COMMIT, PAGE_READWRITE);
        // Of course, a real implementation would need to provide error handling here...
    };

You may notice that the ConcurrentExecution constructor takes a numeric parameter. This parameter specifies the "maximum degree of concurrency" supported by an instance of the class; in other words, if a ConcurrentExecution object is created with n as the parameter, no more than n computations execute at any given time.

In terms of our earlier analogy, this parameter means "no matter how many customers are waiting, no more than n checkout counters are open."

    int DoForAllObjects(int iNoOfObjects, long *ObjectArray,
                        CONCURRENT_EXECUTION_ROUTINE pObjectProcessor,
                        CONCURRENT_FINISHING_ROUTINE pObjectTerminated);

This is the only interesting member function implemented so far. The main parameters of DoForAllObjects are an array of objects, a processor function, and a terminator function. There is no mandatory format for the objects: each time the processor is called, one object is passed to it, and the interpretation of the object is entirely up to the processor. The first parameter, iNoOfObjects, simply tells DoForAllObjects the number of elements in the object array. Note that calling DoForAllObjects with an object array of length 1 is very similar to calling CreateThread (with the difference that CreateThread does not accept a terminator parameter).

The semantics of DoForAllObjects are as follows: The processor is called on every object. The order in which the objects are processed is not specified; all that is guaranteed is that each object is passed to the processor at some point. The maximum degree of concurrency is determined by the parameter passed to the constructor of the ConcurrentExecution object. The processor function must not access shared data and must not call the UI or do anything else that requires explicit or implicit serialization. Currently, only one processor function works on all objects; however, it would be simple to replace the processor parameter with an array of processors. The prototype of the processor is as follows:

    typedef DWORD (WINAPI *CONCURRENT_EXECUTION_ROUTINE)(LPVOID lpParameterBlock);

When the processor has finished its work on an object, the terminator function is called immediately. Unlike the processor, the terminator function is called serially in the context of the calling function, and it may invoke any routine and access any data that is accessible to the calling program. Note, however, that the terminator should be as optimized as possible, because long computations in the terminator affect the performance of DoForAllObjects. Note also that although the terminator is called as soon as the processor has finished with each object, DoForAllObjects itself does not return until the last object has been terminated.

Why go to so much trouble with a terminator? Could each computation not simply execute the termination code at the end of the processor function? Basically, yes; however, it must be emphasized that the terminator is called in the context of the thread that called DoForAllObjects. This design makes it much easier to process the results as each computation comes in, without having to worry about synchronization issues.

The prototype of the terminator function is as follows:

    typedef DWORD (WINAPI *CONCURRENT_FINISHING_ROUTINE)(LPVOID lpParameterBlock, LPVOID lpResultCode);

The first parameter is the object that was processed, and the second parameter is the result of the processor function on that object.

The sibling of DoForAllObjects is DoSerial. DoSerial has the same parameter list as DoForAllObjects, but the computations are processed serially, beginning with the first object in the list.

The Internal Workings of ConcurrentExecution

Please note that this section is very technical and assumes that you know a good deal about the Win32 thread API.
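For readers who want the semantics of DoForAllObjects without the Win32 details, the following is a minimal sketch in portable C++. It is an assumption-laden stand-in, not the Thrdlib.dll implementation: the function name DoForAllObjectsSketch and its parameter types are hypothetical, and it uses std::thread, a mutex, and a condition variable in place of CreateThread and WaitForMultipleObjects. What it keeps from the description above is that at most a fixed number of processors run at once, that the terminator runs only on the calling thread, and that a finished computation is terminated before a new one is started. Like the original, it ignores error handling.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical portable stand-in for ConcurrentExecution::DoForAllObjects.
// processor runs on worker threads; terminator runs only on the calling thread.
int DoForAllObjectsSketch(const std::vector<long>& objects,
                          std::size_t maxThreads,
                          const std::function<long(long)>& processor,
                          const std::function<void(long, long)>& terminator) {
    assert(maxThreads > 0);
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::pair<std::size_t, long>> finished;  // (object index, result)
    std::vector<std::thread> workers(objects.size());
    std::size_t nextIndex = 0, running = 0, done = 0;

    while (done < objects.size()) {
        std::pair<std::size_t, long> item;
        bool haveItem = false;
        {
            std::unique_lock<std::mutex> lock(m);
            const bool canStart = nextIndex < objects.size() && running < maxThreads;
            // Sleep only when no new computation can be started.
            if (!canStart) cv.wait(lock, [&] { return !finished.empty(); });
            if (!finished.empty()) {
                item = finished.front();
                finished.pop();
                haveItem = true;
            }
        }
        if (haveItem) {
            // Terminate a finished computation before starting a new one.
            workers[item.first].join();
            terminator(objects[item.first], item.second);
            --running;
            ++done;
            continue;
        }
        const std::size_t i = nextIndex++;
        ++running;
        workers[i] = std::thread([&, i] {
            const long result = processor(objects[i]);
            std::lock_guard<std::mutex> lock(m);
            finished.push({i, result});
            cv.notify_one();
        });
    }
    return static_cast<int>(done);
}
```

Because the terminator is only ever invoked by the caller, it can update the caller's data without locks, which is the design point made above. The processor, by contrast, must not touch shared data.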

If you are more interested in how the ConcurrentExecution class is used to collect test data than in how ConcurrentExecution::DoForAllObjects is implemented, you can skip ahead to the section "Using ConcurrentExecution to Test Thread Performance" below. Let us start with DoSerial, largely because it is a no-brainer:

    BOOL ConcurrentExecution::DoSerial(int iNoOfObjects, long *ObjectArray,
                                       CONCURRENT_EXECUTION_ROUTINE pProcessor,
                                       CONCURRENT_FINISHING_ROUTINE pTerminator)
    {
        for (int iLoop = 0; iLoop

This code is taken from the file Thrdlib.cpp but has been streamlined for clarity:

    int ConcurrentExecution::DoForAllObjects(int iNoOfObjects, long *ObjectArray,
                                             CONCURRENT_EXECUTION_ROUTINE pObjectProcessor,
                                             CONCURRENT_FINISHING_ROUTINE pObjectTerminated)
    {
        int iLoop, iEndLoop;
        DWORD iThread;
        DWORD iArrayIndex;
        DWORD dwReturnCode;
        DWORD iCurrentArrayLength = 0;
        BOOL bWeFreedSomething;
        char szBuf[70];
        m_iCurrentNumberOfThreads = iNoOfObjects;
        HANDLE *hPnt = (HANDLE *)VirtualAlloc(NULL, m_iCurrentNumberOfThreads * sizeof(HANDLE),
                                              MEM_COMMIT, PAGE_READWRITE);
        for (iLoop = 0; iLoop

                // We want to do this before looking for a new slot, so that we can
                // call the terminator of the thread right away...
                iArrayIndex = WaitForMultipleObjects(iCurrentArrayLength, m_hThreadArray, FALSE, 0);
                if (iArrayIndex == WAIT_TIMEOUT) // no slot free...
                {
                    {
                        if (iCurrentArrayLength >= m_iMaxArraySize)
                        {
                            iArrayIndex = WaitForMultipleObjects(iCurrentArrayLength, m_hThreadArray,
                                                                 FALSE, INFINITE);
                            bWeFreedSomething = TRUE;
                        }
                        else // we can free up a slot somewhere, so do so...
                        {
                            iCurrentArrayLength++;
                            iArrayIndex = iCurrentArrayLength - 1;
                        }; // else iArrayIndex points to a thread that has been nuked
                    };
                }
                else bWeFreedSomething = TRUE;
            }; // at this point, iArrayIndex contains a valid index to store the new thread in
            hNewThread = hPnt[iLoop];
            ResumeThread(hNewThread);
            if (bWeFreedSomething)
            {
                GetExitCodeThread(m_hThreadArray[iArrayIndex], &dwReturnCode); // error?
                CloseHandle(m_hThreadArray[iArrayIndex]);
                pObjectTerminated((void *)m_hObjectArray[iArrayIndex], (void *)dwReturnCode);
            };
            m_hThreadArray[iArrayIndex] = hNewThread;
            m_hObjectArray[iArrayIndex] = ObjectArray[iLoop];
        }; // end of loop

The heart of DoForAllObjects is m_hThreadArray, an array that is allocated when the ConcurrentExecution object is constructed. The array can hold the maximum number of threads, which corresponds to the maximum degree of concurrency specified in the constructor; therefore, each element in the array is a "slot" in which one computation resides. The algorithm that decides how slots are filled and freed works as follows. The object array is traversed from beginning to end, and for every object we do the following:

If no slot has been filled yet, we fill the first slot in the array with the current object and resume the thread that will process the current object.

If at least one slot is in use, we use the WaitForMultipleObjects function to determine whether any of the running computations has finished; if so, we call the terminator on that object and "recycle" its slot for the new object. Note that we could also fill every free slot first until no slots remain, and only then start recycling vacated slots. However, if we did that, the terminator for a vacated slot would not be called until all slots had been filled, which would violate our requirement that the terminator be called as soon as the processor has finished with an object.

Finally, there is still no SLOT (that is, the current activated thread is equal to the maximum and release number allowed by the ConcURRENTEXECUTION object). In this case, WaitFormultiPleObjects will be called again to make DOFORALLOBJECTS in the "sleep" state until there is a SLOT flight; as long as this happens, the terminator is called on an empty Slot, and working The thread of the current object is continued. Finally, all the calculations are either end, or the Slot in an object array will be occupied. The following code will handle all the remaining threads: iEndLoop = iCurrentArrayLength; for (iLoop = iEndLoop; iLoop> 0; iLoop -) {iArrayIndex = WaitForMultipleObjects (iLoop, m_hThreadArray, FALSE, INFINITE); if (iArrayIndex == WAIT_FAILED) {GetLastError (); _ asm int 3; // here to do some smart things ...}; getExitcodethread (m_hthreadArray [IARRAYX], & DWRETURNCODE); / / Error? If (! CLOSEHANDEX])) MessageBox (GetFocus (), "Can't delete thread!", "", mb_ok); // make it better ... POBJECTTERMINATED ((void *) M_HObjectArray [arayindex ], (void *) dwReturnCode); if (iArrayIndex == iLoop-1) continue; // good here, there is no need to fill rearwardly m_hThreadArray [iArrayIndex] = m_hThreadArray [iLoop-1]; m_hObjectArray [iArrayIndex] = m_hObjectArray [ iLoop-1];}; Finally, clear: if (hPnt) VirtualFree (hPnt, m_iCurrentNumberOfThreads * sizeof (HANDLE), MEM_DECOMMIT); return iCurrentArrayLength;}; use ConcurrentExecution to test range thread performance performance test is as follows: test application Threadlibtest .exe's user can specify whether to test the CPU-based or I / O-based calculation, how much calculation is performed, how long is calculated, how is the calculation is sorted (in order to test the worst case and random delay), and calculate Is it concurrent execution or serial execution. In order to eliminate unexpected results, each test can be executed ten times, then take the result of ten times to produce a more trusted result. 
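The idea of repeating a test and averaging the results can be sketched in a few lines of portable C++. This is only an illustration under stated assumptions: the helper name AverageElapsedMs is hypothetical, and std::chrono::steady_clock stands in for whatever timing facility the sample application actually uses, which this section does not show.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: run `test` `runs` times (ten by default, as in the
// article) and return the mean elapsed wall-clock time in milliseconds.
double AverageElapsedMs(const std::function<void()>& test, int runs = 10) {
    assert(runs > 0);
    using Clock = std::chrono::steady_clock;
    double totalMs = 0.0;
    for (int i = 0; i < runs; ++i) {
        const Clock::time_point start = Clock::now();
        test();
        const std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;
        totalMs += elapsed.count();
    }
    return totalMs / runs;
}
```

A caller would pass a callable that performs one complete serial or concurrent run. Averaging smooths out one-off disturbances but does not remove systematic effects such as other load on the machine.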
By selecting the "Run Entire Test Set" menu option, the user can request that all variations of the test variables be run. The computation lengths used in the tests vary between base values of 10 and 3,500 ms (I will discuss this shortly), and the number of computations varies between 2 and 20. If Microsoft Excel is installed on the computer that runs the test, Threadlibtest.exe dumps the results into a Microsoft Excel worksheet located at C:\Temp\Values.xls.

In any case, the results are also saved to a plain text file located at C:\TEMP\Results.fil. Note that I hard-coded the locations of the protocol files out of pure laziness; if you need to regenerate the test results on your computer and want a different location, change the TextFileLoc and SheetFileLoc identifiers near the beginning of ThreadlibTestView.cpp and rebuild the project.

Keep in mind that running the entire test set always orders the computations in worst-case fashion (that is, for serial execution, the longest computation is executed first, followed by the second longest, and so on). This scenario penalizes serial execution, because the response time of concurrent execution does not change in a non-worst-case scenario, whereas the response time of serial execution may well improve. As I mentioned earlier, in a real-world application you should analyze whether the duration of each computation is predictable.

The code that collects the performance data using the ConcurrentExecution class is located in ThreadlibTestView.cpp. The sample application itself is a plain single-document interface (SDI) MFC application. All of the test-related code resides in the implementation of the view class CThreadlibtestView, which is derived from CEasyOutputView. (For a discussion of that class, see "Windows NT Security in Theory and Practice.") Not all the code in this class is interesting; most of it deals with statistics and user interface handling. The "meat" of the test is in CThreadlibtestView::ExecuteTest, which executes one test run cycle. Here is an abridged version of the code for CThreadlibtestView::ExecuteTest:

    void CThreadlibtestView::ExecuteTest()
    {
        ConcurrentExecution *ce;
        bCPUBound = ((m_iCompType & CT_IOBOUND) == 0);    // global...
        ce = new ConcurrentExecution(25);
        if (!QueryPerformanceCounter(&m_liOldVal)) return;    // get the current time
        // The source reads "!m_iCompType & CT_IOBOUND"; since ! binds tighter
        // than &, the test is parenthesized here on the assumption that the
        // flag itself was meant to be negated.
        if (!(m_iCompType & CT_IOBOUND)) timeBeginPeriod(1);
        if (m_iCompType & CT_CONCURRENT)
            m_iThreadsUsed = ce->DoForAllObjects(m_iNumberOfThreads,
                                                 (long *)m_iNumbers,
                                                 (CONCURRENT_EXECUTION_ROUTINE)pProcessor,
                                                 (CONCURRENT_FINISHING_ROUTINE)pTerminator);
        else
            ce->DoSerial(m_iNumberOfThreads,
                         (long *)m_iNumbers,
                         (CONCURRENT_EXECUTION_ROUTINE)pProcessor,
                         (CONCURRENT_FINISHING_ROUTINE)pTerminator);
        if (!(m_iCompType & CT_IOBOUND)) timeEndPeriod(1);
        delete(ce);
    }

This code first creates an object of the ConcurrentExecution class, then samples the current time (used to compute elapsed and response times), and, depending on whether serial or concurrent execution was requested, calls the DoSerial or DoForAllObjects member of the ConcurrentExecution object. Note that for this run I request a maximum degree of concurrency of 25; if you want to run tests with more than 25 computations, you should raise this value so that it is greater than or equal to the largest number of computations in your test.

Let's look at the processor and the terminator to see how accurate measurements are obtained:

    extern "C"
    {
    long WINAPI pProcessor(long iArg)
    {
        PTHREADBLOCKSTRUCT ptArg = (PTHREADBLOCKSTRUCT)iArg;
        BOOL bResult = TRUE;
        int iDelay = (ptArg->iDelay);
        if (bCPUBound)
        {
            int iLoopCount;
            // Translate the delay in milliseconds into loop iterations.
            iLoopCount = (int)(((float)iDelay / 1000.0) * ptArg->tbOutputTarget->m_iBiasFactor);
            QueryPerformanceCounter(&ptArg->liStart);
            // The loop bound and the else branch were lost in the source text;
            // they are reconstructed here from the surrounding description.
            for (int iCounter = 0; iCounter < iLoopCount; iCounter++);
        }
        else
        {
            QueryPerformanceCounter(&ptArg->liStart);
            Sleep(ptArg->iDelay);
        };
        return bResult;
    }

    long WINAPI pTerminator(long iArg, long iReturnCode)
    {
        PTHREADBLOCKSTRUCT ptArg = (PTHREADBLOCKSTRUCT)iArg;
        QueryPerformanceCounter(&ptArg->liFinish);
        ptArg->iEndOrder = iEndIndex++;    // safe: terminators run serialized
        return (0);
    }
    }

The processor simulates a computation whose parameters have been placed in a data structure, ThreadBlockStructure. ThreadBlockStructure holds the data relevant to a computation, such as its delay and its termination time (measured in performance counter ticks), and a back pointer to the view that instantiated the structure.

An I/O-bound computation is simulated by simply calling Sleep for the specified time. A CPU-bound computation enters an empty for loop. A few comments here should help you understand what the code does: the computation is CPU-bound and is supposed to execute for the specified number of milliseconds. In earlier versions of this test program, I simply let the loop execute as many times as the delay specified, without regard to what the number actually meant. (Depending on the code involved, the number meant milliseconds for I/O-bound computations and iterations for CPU-bound computations.) However, to be able to compare CPU-bound and I/O-bound computations in absolute time, I decided to rewrite the code so that the delay associated with a computation is measured in milliseconds in both cases.

I found that writing code that simulates a CPU-bound computation of a specified, predefined length is not trivial. The reason is that such code cannot query the system time itself, because the call involved would sooner or later yield the CPU, which violates the requirement of being CPU-bound. Trying to use asynchronous multimedia timer events did not work satisfactorily either, because of the way the timer service works under Windows NT. The thread that sets up a multimedia timer is effectively suspended until the timer callback is called; thus, the CPU-bound computation suddenly turns into an I/O-bound operation.

So I finally resorted to a slightly sleazy trick: the code in CThreadlibtestView::OnCreate executes 100 loops that each count from 1 to 100,000 and samples the average time required to go through one loop. The result is kept in the member variable m_iBiasFactor, a floating-point number that the processor function uses to determine how milliseconds "translate" into iterations. Unfortunately, because of the highly dynamic nature of the operating system, it is difficult to determine how many iterations of a given loop actually run for a specified length of time. However, I found that this strategy does a very credible job of determining the computation time for CPU-bound computations.

Note: If you rebuild the test application, be careful with the optimization options. If you specify the "Minimize Execution Time" optimization, the compiler detects the for loops with empty bodies and removes them.
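The calibration trick can be sketched in portable C++ as follows. This is a minimal sketch, not the code from CThreadlibtestView::OnCreate: std::chrono::steady_clock stands in for QueryPerformanceCounter, the function names are hypothetical, and the factor is assumed to be in iterations per second, which is what the expression (iDelay / 1000.0) * m_iBiasFactor in pProcessor suggests. The loop counter is declared volatile so that an optimizing compiler cannot remove the empty loop.

```cpp
#include <cassert>
#include <chrono>

// Empty counting loop. The volatile counter forces every iteration to be
// executed even when the compiler optimizes for speed.
void spin(long iterations)
{
    for (volatile long i = 0; i < iterations; i = i + 1) { }
}

// Estimates how many loop iterations run per second by timing `passes` runs
// of `perPass` iterations each (the defaults mirror the 100 loops of 100,000
// iterations mentioned in the text).
double calibrateIterationsPerSecond(int passes = 100, long perPass = 100000)
{
    assert(passes > 0 && perPass > 0);
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int p = 0; p < passes; ++p) spin(perPass);
    const std::chrono::duration<double> elapsed = clock::now() - start;
    return (static_cast<double>(passes) * perPass) / elapsed.count();
}

// Translates a delay in milliseconds into a loop count, in the same way the
// processor function scales the delay by the bias factor.
long iterationsForDelay(int delayMs, double iterationsPerSecond)
{
    return static_cast<long>((delayMs / 1000.0) * iterationsPerSecond);
}
```

A CPU-bound computation of roughly delayMs milliseconds would then be simulated with spin(iterationsForDelay(delayMs, rate)), where rate is measured once at startup. As the text points out, the estimate can vary noticeably from run to run.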

The terminator is very simple: the current time is sampled and stored in the computation's ThreadBlockStructure. After the test is over, the code computes the difference between the time at which ExecuteTest was called and the time at which each computation's terminator was called. The elapsed time for all computations is then determined by the time at which the last computation finished, and the response time is the average of the response times of all computations, where each response time is defined as the time that elapsed from the start of the test until the thread finished, divided by the thread's delay factor. Note that the terminators are serialized in the context of the main thread, so the increment of the shared iEndIndex variable is safe.

That is really all there is to the test; the rest of the code mostly sets up parameters for the test runs and does some math on the results. The logic that fills the results into a Microsoft Excel worksheet is discussed in "Interacting with Microsoft Excel: A Case Study in OLE Automation."

Results

If you want to re-create the test results on your computer, do the following:

1. If you need to change test parameters, such as the maximum number of computations or the location of the protocol files, edit ThreadlibTestView.cpp in the THRDPERF sample project and rebuild the application. (Note that to build the application, your computer needs to support long file names.)
2. Make sure that the file Thrdlib.dll is in a location where Threadlibtest.exe can link to it.
3. If you want to use Microsoft Excel to view the test results, make sure that Microsoft Excel is properly installed on the computer that runs the test.
4. Execute Threadlibtest.exe under Windows 95 or Windows NT and select "Run Entire Test Set" from the "Run Performance Tests" menu. One complete test run normally takes several hours.
After the test is over, the results are available in the plain text protocol file C:\TEMP\RESULTS.FIL and, if Microsoft Excel was used, in the worksheet file C:\TEMP\VALUES.XLS. Note that the Microsoft Excel automation logic does not automatically generate charts from the raw data; I used a few macros to rearrange the results and produce the charts for you. I hate number crunching, but I have to praise Microsoft Excel for providing such a good user interface that even a spreadsheet-paranoid person like me could turn a few columns of data into useful charts within minutes.

The test results I present were gathered on a 486/33 MHz system with 16 MB of RAM. The computer has both Windows NT (version 3.51) and Windows 95 installed, so the test results for the two operating systems are comparable because they are based on the same hardware.

So, let us interpret these values. The charts below summarize the results and should be read as follows: each chart contains six values (except for the long-computation elapsed-time charts, which contain only five values, because during my test runs the timer overflowed for very long computations).

Each value represents a number of computations; I ran each test with 2, 5, 8, 11, 14, and 17 computations. In the Microsoft Excel results worksheet, you will find results for CPU-bound and I/O-bound threads with delay biases of 10 ms, 30 ms, 90 ms, 270 ms, 810 ms, and 2430 ms, but the charts include only the results for 10 ms and 2430 ms, so that the numbers stay simple and easier to understand.

I need to explain what "delay bias" means. If a test is run with a delay bias of n, each computation has a multiple of n as its computation time. For example, if the test consists of five computations with a delay bias of 10, one computation executes for 50 ms, the second for 40 ms, the third for 30 ms, the fourth for 20 ms, and the fifth for 10 ms. Also, when the computations are executed serially, the worst case is assumed, so the computation with the longest delay is executed first and the others follow in descending order. Thus, in the "ideal" case (that is, with no overlap between computations), the total time required for CPU-bound computations is 50 ms + 40 ms + 30 ms + 20 ms + 10 ms = 150 ms.

For the elapsed-time charts, the values on the y-axis are in milliseconds; for the response-time charts, the values on the y-axis are relative turnaround (that is, the milliseconds actually taken divided by the milliseconds expected).

Figure 1. Elapsed-time comparison for short computations, under Windows NT
Figure 2. Elapsed-time comparison for long computations, under Windows NT
Figure 3. Response-time comparison for short computations, under Windows NT
Figure 4. Response-time comparison for long computations, under Windows NT
Figure 5. Elapsed-time comparison for short computations, under Windows 95
Figure 6. Elapsed-time comparison for long computations, under Windows 95
Figure 7. Response-time comparison for short computations, under Windows 95
Figure 8. Response-time comparison for long computations, under Windows 95

I/O-bound tasks

Measured in both elapsed time and turnaround time, I/O-bound threads perform far better when executed concurrently than when executed serially. As a function of the number of computations, the elapsed time increases linearly for concurrent execution and exponentially for serial execution (see Figures 1 and 2 for Windows NT, and Figures 5 and 6 for Windows 95).

Note that this conclusion is consistent with our earlier analysis that I/O-bound computations are excellent candidates for multithreading: a thread is suspended while it waits for an I/O request to complete and takes up no CPU time during that period, so that time can be used by other threads.

For concurrent computations, the average response time is constant; for serial computations, it increases with the number of computations (see Figures 3, 4, 7, and 8).
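The delay-bias arithmetic and the response-time definition can be worked through with a small idealized model. This sketch is an assumption-laden illustration rather than part of the test suite: it ignores all thread creation and switching overhead, it treats concurrent I/O-bound computations as overlapping perfectly, and the names IdealTimes, worstCaseDelays, idealSerial, and idealConcurrentIo are hypothetical.

```cpp
#include <cassert>
#include <vector>

struct IdealTimes {
    double elapsedMs;      // time at which the last computation finishes
    double meanResponse;   // average of (finish time / own delay)
};

// Delays for `count` computations with the given delay bias, longest first,
// which is the worst-case order for serial execution.
std::vector<double> worstCaseDelays(int count, double biasMs)
{
    assert(count > 0 && biasMs > 0);
    std::vector<double> delays;
    for (int k = count; k >= 1; --k) delays.push_back(k * biasMs);
    return delays;
}

// Serial execution: each computation starts when the previous one finishes.
IdealTimes idealSerial(const std::vector<double>& delays)
{
    double now = 0.0, responseSum = 0.0;
    for (double d : delays) {
        now += d;                  // finish time measured from test start
        responseSum += now / d;    // relative turnaround of this computation
    }
    return { now, responseSum / delays.size() };
}

// Concurrent I/O-bound execution with perfect overlap: every computation
// finishes after exactly its own delay, so each relative turnaround is 1.
IdealTimes idealConcurrentIo(const std::vector<double>& delays)
{
    double longest = 0.0;
    for (double d : delays) if (d > longest) longest = d;
    return { longest, 1.0 };
}
```

For five computations with a delay bias of 10, the model gives a serial elapsed time of 150 ms with a mean relative response of 5.85, against 50 ms and 1.0 for perfectly overlapping I/O-bound threads, which is the qualitative shape the charts are described as showing.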

Note that in every case, when only a few computations are executed, there is no noticeable difference between serial and concurrent execution, regardless of how the test parameters are set.

CPU-bound tasks

As mentioned earlier, on a single-processor computer CPU-bound tasks cannot possibly execute faster concurrently than serially, but we can see that under Windows NT the overhead of thread creation and switching is very small; for very short computations, concurrent execution is only about 10 percent slower than serial execution, and as the computation length increases, the two times become very close. Looking at response time, we find that the response gain of concurrent over serial execution can reach 50 percent for long computations, but for short computations serial execution actually does better than concurrent execution.

Comparison between Windows 95 and Windows NT

If we look at the charts for long computations (that is, Figures 2, 4, 6, and 8), we find that the behavior is strikingly similar under Windows 95 and Windows NT. Do not be confused by the fact that Windows 95 seems to handle I/O-bound versus CPU-bound computations differently from Windows NT. I attribute this result to the fact that the algorithm I use to determine how many test loops correspond to 1 millisecond (as described earlier) is rather inaccurate; I found that the algorithm can produce results that differ by as much as 20 percent when executed several times in the same environment. Therefore, comparing CPU-bound computations with I/O-bound computations is really unfair.

One thing that does differ between Windows 95 and Windows NT is the behavior for short computations. As we can see from Figures 1 and 5, Windows NT does much better for concurrent short I/O-bound computations. I attribute this result to a more efficient thread creation scheme.
Note that the difference between serial and concurrent I/O operations vanishes for long computations, so we are dealing with a fixed and relatively small overhead here.

For short computations measured in response time (as shown in Figures 3 and 7), note that under Windows NT there is a break-even point at about 10 threads, beyond which executing more computations concurrently gives better results. Under Windows 95, serial computations have the better turnaround.

Note that these comparisons are based on the current versions of the operating systems (Windows NT version 3.51 and Windows 95). As the operating systems evolve, the threading engines are very likely to be enhanced, so the differences in behavior between the two operating systems may well disappear. It is interesting to note, however, that short computations generally do not benefit from multithreading, especially under Windows 95.

Recommendations

From these results we can derive the following recommendations: the most important factor in determining multithreading performance is the ratio of I/O-bound to CPU-bound computations, and the main criterion for deciding whether to use multithreading is the responsiveness of the foreground user interface.

Let us assume that there are several subcomputations that could potentially be executed in different threads. To decide whether to use multithreading for these computations, consider the following points.

If a user interface response analysis determines that something should be done in a second thread, determine whether the task to be executed is I/O-bound or CPU-bound. I/O-bound computations are best relocated to background threads.
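As a rough illustration of the last recommendation, an I/O-bound task can be handed to a background thread while the foreground thread stays responsive. This is a sketch under stated assumptions, not code from the sample: std::async and std::future stand in for the Win32 thread calls, a sleep stands in for the I/O wait (as in the test suite), and simulatedIo and runInBackground are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

// Hypothetical I/O-bound task: the sleep stands in for waiting on a device
// or a network request, during which the thread uses no CPU time.
long simulatedIo(int delayMs)
{
    std::this_thread::sleep_for(std::chrono::milliseconds(delayMs));
    return delayMs;
}

// Runs the I/O-bound task on a background thread. The calling (foreground)
// thread checks for completion in short intervals and is free to do other
// work between checks. Returns the number of foreground steps taken.
int runInBackground(int delayMs, long& result)
{
    assert(delayMs >= 0);
    std::future<long> pending =
        std::async(std::launch::async, simulatedIo, delayMs);
    int foregroundSteps = 0;
    while (pending.wait_for(std::chrono::milliseconds(1)) !=
           std::future_status::ready) {
        ++foregroundSteps;   // a real application would service its UI here
    }
    result = pending.get();
    return foregroundSteps;
}
```

In a Windows application the foreground work would typically be the message loop, with the background thread posting a message on completion instead of being polled.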
