Win32 Multithreading Performance (1)

zhaozj2021-02-08  215

Author: Microsoft company Feeds

Ruediger R. Asche
Microsoft Developer Network Technology Group

Summary

This article discusses strategies for rewriting single-threaded applications as multithreaded applications. It analyzes the performance of multithreaded computations compared with compatible single-threaded computations on the Microsoft® Windows® 95 and Windows NT® platforms.

Introduction

Most of the available material on multithreading deals with synchronization concepts, for example, how to serialize threads that share common data. This focus on synchronization makes sense, because synchronization is an indispensable part of multithreaded programming. This article takes a step back and concentrates on an aspect of multithreading that is rarely covered: deciding how a computation can be meaningfully split into multiple threads. The accompanying sample application, THRDPERF, runs the same computations both serially and concurrently on Microsoft Windows 95 and Windows NT, and its test suite compares the two approaches in terms of throughput and performance.

The first part of this article establishes some vocabulary for multithreaded applications, discusses the scope of the test suite, and describes how the sample application suite was designed. The second part discusses the test results and includes recommendations for multithreaded application design. The related article "Interacting with Microsoft Excel: A Case Study in OLE Automation" discusses an interesting side aspect of the sample suite, namely how the data obtained from the test runs was fed into Microsoft Excel using OLE Automation.

If you are an experienced multithreaded application programmer, you can skip the introductory material and go directly to the "Results" section below.

Multithreaded Programs

Your application has been around for a long time. It works great, it is reliable, and the whole bit, but it is very slow, and you have the idea of making it multithreaded. Wait a moment before you start, though, because a number of pitfalls can lead you to believe that a multithreaded design is great when it actually is not.

Before you jump to conclusions about where this is going, let us clarify what this article does not discuss:

The libraries that provide access to multithreading under the Microsoft Win32® application programming interface (API) differ from one another, but that difference does not concern us here. The sample application suite, Threadlib.exe, is written as a Microsoft Foundation Class Library (MFC) application, but it does not matter whether you use the Microsoft C run-time (CRT) library, the MFC library, or the barebones Win32 API to create and maintain threads. In effect, each library eventually calls the Win32 system service CreateThread to create a worker thread, and the multithreading itself is always carried out by the operating system. The wrapping mechanism you choose does not affect the topics of this article. Some wrapper libraries may, of course, introduce performance differences, but here we discuss the nature of multithreading itself, not its packaging.

This article discusses multithreaded applications running on single-processor machines. Multiprocessor machines are a completely different topic, and almost none of the conclusions reached here apply to them. I have not yet had the opportunity to run the sample on a scalable symmetric multiprocessing (SMP) machine running Windows NT. If you have that opportunity, I would be happy to hear about your results.

In this article, I prefer to speak of "computations" in general. A computation is defined as a subtask of your application that can be executed, in whole or in part, before another computation or at the same time as other computations. For example, assume that an application requests data from the user and needs to save that data to disk. Entering the data is one computation, and saving it is another. Depending on the design of the application, two cases are possible: either saving the data is interleaved with the entry of new data, or the data is saved to disk only after the user has entered all of it. The first case is generally implemented with some form of multithreading; we call this way of organizing computations concurrent or interleaved. The second case can generally be implemented in a single-threaded application; in this article, it is called serial execution.

Designing an application for concurrency is a very complex process. Some people make a ton of money at it, because working out how concurrently a given task can best be executed usually takes considerable study. This article does not try to teach you how to design multithreaded applications. Instead, I want to point out some of the problems of multithreaded application design, and I use real-life performance measurements to discuss my examples. After reading this article, you should be able to look at a given design and determine whether it improves the overall performance of the application.

Part of designing a multithreaded application is deciding where multiple threads might access data in conflicting ways that could corrupt it, and how to use thread synchronization to avoid such conflicts. This task (which this article calls thread serialization from here on) is the subject of many articles on multithreading (for example, "Synchronization on the Fly" or "Compound Win32 Synchronization Objects" in the MSDN Library) and is not discussed here. Throughout this article, we assume that the computations to be made concurrent do not share any data and therefore require no thread serialization. This convention may seem rather strict, but keep in mind that a "generic" discussion of synchronized multithreaded applications is impossible, because each serialization imposes a unique waiting-and-waking pattern on the threads involved, which directly affects performance.

Most input/output (I/O) operations under Win32 come in two forms: synchronous and asynchronous. It has been shown that in many cases a multithreaded design that uses synchronous I/O can be emulated with asynchronous single-threaded I/O. This article does not discuss asynchronous single-threaded I/O as an alternative to multithreading; however, I recommend that you consider both designs. Note that the Win32 I/O system provides mechanisms that favor asynchronous I/O over synchronous I/O (for example, I/O completion ports). I plan to discuss synchronous versus asynchronous I/O in a later article.

As pointed out in "Multiple Threads in the User Interface", multithreading and graphical user interfaces (GUIs) do not work well together. In this article, I assume that background threads can do their work without using the Windows GUI at all; the only kind of thread I deal with is the "worker thread", which computes in the background without interacting directly with the user.

Computations can be finite or infinite. A "listening" thread in a server-side application is an example of an infinite computation: it has no purpose other than waiting for a client to connect to the server. Once a client has connected, the thread sends a notification to the main thread and returns to the "listening" state until the next client connects. Naturally, such a computation cannot reside in the same thread as the application's user interface (UI) unless asynchronous I/O is used. (Note that this particular problem can also be solved with asynchronous I/O and completion ports rather than multithreading; I use the example here purely for demonstration.) In this article, I consider only finite computations, that is, subtasks of the application that end after a limited period of time.

CPU-Bound and I/O-Bound Computations

The most important factor in deciding whether a given computation is a good candidate for a separate thread is whether the computation is CPU-bound or I/O-bound. A CPU-bound computation is one that keeps the CPU "busy" most of the time. Typical CPU-bound computations are:

Complex mathematical calculations, such as computations on complex numbers, graphics, or off-screen graphics computations.
Operations on file images that reside in memory, such as searching for a string in the in-memory image of a text file.

In contrast, an I/O-bound computation spends most of its time waiting for I/O requests to complete. In most operating systems, pending device I/O is handled asynchronously, either by a dedicated I/O processor or by an efficient interrupt handler, and an I/O request from an application suspends the calling thread until the I/O completes. In general, a thread that spends most of its time waiting for I/O requests does not compete with other threads for CPU time; an I/O-bound computation is therefore less likely to degrade the performance of other threads than a CPU-bound one. (I will come back to this argument later.)

Note, however, that this comparison is theoretical. Most computations are neither purely I/O-bound nor purely CPU-bound but contain both kinds of work. The same set of computations may behave well in one scenario and poorly in another, depending on the relative split between its CPU-bound and I/O-bound parts.

Multithreading Design Goals

Before you decide to use multithreading in your application, ask yourself what you expect to gain from the change. Multithreading has several potential advantages:

Enhanced performance
Enhanced throughput
Better responsiveness to the user

Let us discuss each of these in turn.

Performance

For the time being, let us simply define "performance" as the total time consumed by a given computation or set of computations. By this definition, performance can be compared only for finite computations. Believe it or not, there are only a limited number of scenarios in which multithreading improves the performance of an application.

The reason is not obvious, but it makes sense: unless the application runs on a multiprocessor machine (in which case subcomputations really do execute in parallel), a CPU-bound computation cannot run faster with multiple threads than with a single thread. Whether the computation is broken into small pieces (the multithreaded case) or into large blocks (executed one after another in the same thread), there is only one CPU, and it must perform all of the work. As a result, a given set of computations generally takes longer to complete in multiple threads than serially, because of the added overhead of creating the threads and switching the CPU between them.

In general, there are cases in which it does not matter which of several computations finishes first, but their results must be synchronized. For example, if multiple threads read multiple files into memory, the order in which the files are processed is irrelevant, but the application must wait until all of the data has been read into memory before it begins processing. We will come back to this idea in the "Throughput" section.

In this article, we measure performance as the total elapsed time consumed by all of the computations.

Throughput

Throughput (or response) refers to the average turnaround time of each computation. To demonstrate throughput, let us use the example of a supermarket (always a great tool for demonstrating operating systems): assume that each computation is a customer being served at a checkout counter. The supermarket can either open a separate checkout counter for each customer or funnel all customers through a single counter. For the purposes of this analysis, assume that even with multiple counters there is only one cashier (poor guy!) to serve all customers, regardless of whether they line up at one counter or at several. This super cashier jumps at high speed from one counter to the next, processes a single item of one customer, and then moves on to the next customer. The super cashier is like a CPU that is time-sliced among multiple computations. As we saw in the "Performance" section, the total time needed to serve all customers does not decrease when more counters are open, because the same cashier does all of the work whether customers are served at one counter or at several.

Even so, customers are likely to prefer the super cashier to a single checkout counter. That is because the number of items in customers' carts generally varies a great deal: some customers have a lot of items, and some want to buy only a few. If you have ever wanted to buy just a box of granola bars and a quart of milk and got stuck behind somebody shopping for a family of 24, you know what I mean. If you are served by the Clark Kent-style cashier instead of waiting in a single line, you do not care much about the total time needed to check everyone out, because your two items will be processed quickly. The cart loaded with goods for 24 people is processed at another counter, so you can leave the checkout soon.

Throughput, then, is a measure of how many computations can be carried out in a given time. For each computation, the measure of how it fared is a comparison of two times: how long it actually took to complete, and how long it would have taken if it had been processed first.
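The super cashier argument can be illustrated with a small simulation. The sketch below is not part of THRDPERF; the function names are hypothetical, each item is assumed to cost one time unit, and switching between counters is assumed to be free (a real CPU pays for every context switch, which only makes the multi-counter case slower).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Completion time of each customer when one cashier serves them one after
// another in the order given (a single checkout counter).
std::vector<long> SerialCompletionTimes(const std::vector<long>& items) {
    std::vector<long> done(items.size());
    long clock = 0;
    for (std::size_t i = 0; i < items.size(); ++i) {
        assert(items[i] >= 0);      // a cart cannot hold a negative item count
        clock += items[i];          // one time unit per item
        done[i] = clock;
    }
    return done;
}

// Completion time of each customer when the cashier rotates between the
// counters and scans exactly one item per visit (idealized: switching is free).
std::vector<long> RoundRobinCompletionTimes(const std::vector<long>& items) {
    std::vector<long> remaining(items);
    std::vector<long> done(items.size(), 0);
    long clock = 0;
    bool workLeft = true;
    while (workLeft) {
        workLeft = false;
        for (std::size_t i = 0; i < remaining.size(); ++i) {
            if (remaining[i] == 0) continue;   // this customer has already left
            ++clock;                            // scan one item
            if (--remaining[i] == 0) done[i] = clock;
            else workLeft = true;
        }
    }
    return done;
}
```

With a cart of 100 items queued ahead of a cart of 2 items, both schemes finish the last customer at time 102, but the small cart is done at time 102 in the serial case and at time 4 under the rotating cashier.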

In other words, if you go to the supermarket expecting to be out in two minutes but actually spend two hours checking out your two items because you were stuck behind Betty Crocker shopping for her 1997 production line, then your turnaround was a failure.

In this article, we define the response turnaround of a computation as its elapsed time divided by its expected time. If a computation that should take 10 milliseconds (ms) actually takes 20 ms, its response turnaround is 2; if the same computation takes 200 ms (perhaps because a long computation competed with it and finished first), its response turnaround is 20. Obviously, the lower the response turnaround, the better.

As we will see later, when multithreading is introduced into an application, throughput can still be a practical consideration even if overall performance drops. For throughput to be a meaningful factor, however, several conditions must be met:

Each computation must be independent of the others, so that the result of any computation can be processed or used as soon as that computation ends. If you are a member of a college football team and every player buys travel food at the same supermarket, it does not matter whether your items are processed first or last, or how long you wait for your two items, because the bus will not leave until all players have bought their food. The only difference is where you spend your waiting time: standing in line for the checkout or, if the super cashier has already served you, waiting for the others.

This point is important, but it is often overlooked. As I mentioned earlier, most applications synchronize their computations explicitly or implicitly sooner or later. For example, if your application collects data from different files, you may want to display the results on the screen or save them to another file. In the first case (displaying results on the screen), you should realize that most graphics systems perform some form of internal batching or serialization, so that nothing useful is displayed until all of the output has been collected. In the second case (saving results to another file), your application (or another application) generally cannot process the file completely until the entire file has been written. So if someone or something serializes the results in some form, whether it is the application, the operating system, or even the user, the benefit you gained from processing the files concurrently may evaporate.

There must be a significant difference in length between the computations. If every customer in the supermarket has about the same number of items to check out, the super cashier scheme offers no advantage: if he has to jump among three counters and each customer has only 2 (or 3, 4, or n) items, then every customer waits several times as long to finish checking out, which is worse than having all customers line up in one queue. Think of multithreading here as a shock absorber: short computations do not run the risk of being stuck behind long ones; instead, they are dispatched to threads, where they take longer than they would alone but still finish relatively soon.

If the length of the computations can be determined in advance, serial processing may be better than multithreading: you can simply arrange the computations in ascending order of length.
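The ratio of elapsed time to expected time described above, and the effect of the order in which serial computations run, can be sketched as follows. The function names are hypothetical and the durations are made-up illustration values, not measurements from the test suite.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Ratio of the time a computation actually took to the time it would have
// taken had it run first (the definition used in the text).
double ResponseRatio(double elapsedMs, double expectedMs) {
    assert(expectedMs > 0.0);
    return elapsedMs / expectedMs;
}

// Average ratio when the computations run serially in the given order.
// Each computation's elapsed time includes everything queued ahead of it.
double AverageSerialRatio(const std::vector<double>& durationsMs) {
    double clock = 0.0;
    double sum = 0.0;
    for (double d : durationsMs) {
        clock += d;
        sum += ResponseRatio(clock, d);
    }
    return durationsMs.empty() ? 0.0 : sum / static_cast<double>(durationsMs.size());
}
```

For a 10 ms and a 200 ms computation, running the short one first gives ratios of 1 and 1.05; running the long one first gives ratios of 1 and 21, so the average ratio climbs from about 1.03 to 11.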

In the supermarket example, this is equivalent to ordering customers by the number of items they have (a variant of the express lane scheme). The idea is that customers with only a few items like it, because they are not held up for long over very little, and customers with many items do not mind, because checking out takes them a long time anyway and everyone ahead of them has fewer items than they do.

If you know only a range for the computation times and your application cannot sort the computations, you should run a worst-case analysis. In such an analysis, assume that the computations are ordered not in ascending but in descending order of length. From the point of view of response, this is the worst case, because each computation will have the highest possible response turnaround as defined above.

Responsiveness

The last criterion for multithreading an application that I discuss here is responsiveness (a term linguistically close enough to response to be confusing). In this article, we simply define an application as responsive if it is designed so that the user can always interact with it within a short time (short enough that the user never feels the application is hung).

For a Win32 application with a GUI, responsiveness can be achieved quite simply by delegating long computations to background threads, but the architecture needed to achieve it may require more skill. As I mentioned earlier, someone may sooner or later be waiting for a computation to return, so executing a long computation in the background may require changes to the user interface (for example, you may need to add a "Cancel" button, and menu items that depend on the result of the computation may have to be disabled).

Besides performance, throughput, and responsiveness, other reasons may influence a multithreaded design. For example, certain architectures require computations to be executed in a pseudo-random fashion (the example that comes to mind is a neural network of the Boltzmann machine type, in which the network shows its intended behavior only when it is computed asynchronously). In this article, however, I limit the discussion to the three factors mentioned above: performance, throughput, and responsiveness.

I have heard a lot of discussion about abstraction mechanisms that encapsulate all of the nasty parts of multithreading in a C++ object and thereby give an application all of the advantages of multithreading and none of the drawbacks. For this article, I set out to design such an abstraction. I defined a prototype for a C++ class, ConcurrentExecution, with member functions such as DoConcurrent and DoSerial; both take an array of generic objects and a callback function that is called for each object, concurrently or serially. The class encapsulates all of the gory details of maintaining the threads and the internal data structures.

From the very beginning, however, it was clear to me that such an abstraction is of very limited use, because the bulk of the work in designing a multithreaded application is a task that cannot be automated: deciding how to multithread.

The first restriction of ConcurrentExecution is that the callback function must not share data explicitly or implicitly, nor require any other form of synchronization. Such synchronization would immediately sacrifice all of the advantages of the abstraction and open up all of the traps and pitfalls of the "wonderful" world of synchronization, such as deadlocks, race conditions, or terribly complex compound synchronization objects.

Similarly, calls to the UI are not allowed because, as I mentioned before, the Win32 API forces a lot of implicit synchronization on threads that call into the UI. Note that many other API subsets and libraries force implicit synchronization on the threads that share them.

These restrictions leave ConcurrentExecution with very limited functionality: specifically, it is an abstraction of pure worker threads (completely independent computations limited to mathematical calculations on disjoint memory areas).

Nevertheless, implementing the ConcurrentExecution class and using it in the performance tests proved very useful, because many hidden details about multithreading surfaced while I implemented the class and designed and ran the tests. Please be aware that although the ConcurrentExecution class makes multithreading easier to handle, the implementation needs additional work before it could be used in a commercial product. In particular, I have left out all error handling, which would be unacceptable in production code; I assume that the class is used only for the tests (which is how I use ConcurrentExecution), where errors do not occur.

The ConcurrentExecution Class

The prototype of the ConcurrentExecution class is as follows:

    class ConcurrentExecution
    {
    public:
        ConcurrentExecution(int iMaxNumberOfThreads);
        ~ConcurrentExecution();
        int DoForAllObjects(int iNoOfObjects, long *ObjectArray,
                            CONCURRENT_EXECUTION_ROUTINE pObjectProcessor,
                            CONCURRENT_FINISHING_ROUTINE pObjectTerminated);
        BOOL DoSerial(int iNoOfObjects, long *ObjectArray,
                      CONCURRENT_EXECUTION_ROUTINE pObjectProcessor,
                      CONCURRENT_FINISHING_ROUTINE pObjectTerminated);
    };

This class is exported from the Thrdlib.dll library, which is a project in the THRDPERF sample test suite.

Before discussing the internal structure of the class, let us first discuss the semantics of its member functions:

    ConcurrentExecution::ConcurrentExecution(int iMaxNumberOfThreads)
    {
        m_iMaxArraySize = min(iMaxNumberOfThreads, MAXIMUM_WAIT_OBJECTS);
        m_hThreadArray = (HANDLE *)VirtualAlloc(NULL, m_iMaxArraySize * sizeof(HANDLE),
                                                MEM_COMMIT, PAGE_READWRITE);
        m_hObjectArray = (DWORD *)VirtualAlloc(NULL, m_iMaxArraySize * sizeof(DWORD),
                                               MEM_COMMIT, PAGE_READWRITE);
        // A real implementation would, of course, provide error handling here...
    }

You may notice that the ConcurrentExecution constructor takes a numeric parameter. This parameter specifies the "maximum degree of concurrency" supported by the instance of the class; in other words, if a ConcurrentExecution object is created with n as the parameter, no more than n computations execute at any given time. In terms of our earlier analogy, this parameter means "no matter how many customers are waiting, no more than n checkout counters are open."

    int DoForAllObjects(int iNoOfObjects, long *ObjectArray,
                        CONCURRENT_EXECUTION_ROUTINE pObjectProcessor,
                        CONCURRENT_FINISHING_ROUTINE pObjectTerminated);

This is the only interesting member function implemented here. The main parameters of DoForAllObjects are an array of objects, a processor function, and a terminator function. No format whatsoever is imposed on the objects; each time the processor is called, one object is passed to it, and interpreting the object is entirely up to the processor. The first parameter, iNoOfObjects, simply gives the number of elements in the object array. Note that calling DoForAllObjects with an object array of length 1 is very similar to calling CreateThread (the one difference being that CreateThread does not accept a terminator parameter).

The semantics of DoForAllObjects are as follows:

The processor is called for every object.
The order in which the objects are processed is not specified; the only guarantee is that each object is passed to the processor at some point.
The maximum degree of concurrency is determined by the parameter passed to the constructor of the ConcurrentExecution object.

The processor function cannot access shared data, call the UI, or do anything else that requires explicit or implicit serialization. Currently, a single processor function works on all objects; replacing the processor parameter with an array of processors would be simple, however. The prototype of the processor is as follows:

    typedef DWORD (WINAPI *CONCURRENT_EXECUTION_ROUTINE)(LPVOID lpParameterBlock);

As soon as the processor has finished working on an object, the terminator function is called. Unlike the processor, the terminator is called serialized, in the context of the calling function, and may call any routine and access any data that is accessible to the calling program. Note, however, that the terminator should be as optimized as possible, because lengthy computations in the terminator affect the performance of DoForAllObjects.
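The semantics listed above can be approximated in portable C++. The sketch below is an illustration only and is not the Thrdlib.dll implementation: it swaps the Win32 thread handles for a fixed pool of std::thread workers, and the name DoForAllObjectsSketch, the Processor and Terminator types, and the meaning of the return value (the number of objects processed) are all assumptions. Like the original class, it omits error handling.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

using Processor  = std::function<unsigned long(long)>;        // runs on a worker thread
using Terminator = std::function<void(long, unsigned long)>;  // runs on the caller's thread

// Hypothetical stand-in for DoForAllObjects: at most maxThreads processors run
// at once, and every completion is reported serially on the calling thread.
int DoForAllObjectsSketch(int maxThreads, const std::vector<long>& objects,
                          const Processor& process, const Terminator& finished) {
    assert(maxThreads > 0);
    std::mutex lock;
    std::condition_variable resultReady;
    std::size_t next = 0;                                // next unclaimed object index
    std::deque<std::pair<long, unsigned long>> results;  // completed, not yet reported

    auto worker = [&] {
        for (;;) {
            std::size_t mine;
            {
                std::lock_guard<std::mutex> guard(lock);
                if (next == objects.size()) return;      // nothing left to claim
                mine = next++;
            }
            unsigned long rc = process(objects[mine]);   // concurrent part, no lock held
            {
                std::lock_guard<std::mutex> guard(lock);
                results.emplace_back(objects[mine], rc);
            }
            resultReady.notify_one();
        }
    };

    // Never start more workers than there are objects.
    std::size_t poolSize = static_cast<std::size_t>(maxThreads);
    if (poolSize > objects.size()) poolSize = objects.size();
    std::vector<std::thread> pool;
    for (std::size_t i = 0; i < poolSize; ++i) pool.emplace_back(worker);

    // The calling thread reports completions one at a time (serialized).
    for (std::size_t reported = 0; reported < objects.size(); ++reported) {
        std::unique_lock<std::mutex> guard(lock);
        resultReady.wait(guard, [&] { return !results.empty(); });
        std::pair<long, unsigned long> done = results.front();
        results.pop_front();
        guard.unlock();                                  // do not hold the lock in the terminator
        finished(done.first, done.second);
    }
    for (std::thread& t : pool) t.join();
    return static_cast<int>(objects.size());
}
```

Because the terminator runs only on the calling thread, it may touch the caller's data without locking, which mirrors the serialized terminator described above; a slow terminator delays the reporting loop in the same way a lengthy terminator slows DoForAllObjects.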

When reposting, please cite the original address: https://www.9cbs.com/read-1585.html
