Heap: Pleasures and Pains
Murali R. Krishnan
Microsoft Corporation
February 1999
Summary: Discusses common heap performance problems and how to protect against them. (9 printed pages)
Introduction
Are you a happy-go-lucky user of dynamically allocated C/C++ objects? Do you use Automation extensively to communicate back and forth between modules? Could your program be slow because of heap allocation? You are not alone. Almost all projects run into heap problems sooner or later. The common tendency is to say, "My code is really good; it is the heap that is slow." That is only partly right. It pays to understand more about the heap, how it is used, and what can happen.
What is a heap?
(If you already know what a heap is, you can jump ahead to "What is a common heap performance problem?")
A heap is used to allocate and free objects dynamically for use by the program. Heap operations are called for in the following cases:
The number and size of the objects the program needs are not known ahead of time.
An object is too large to fit into a stack allocator.
A heap uses parts of memory outside of what is allocated for the code and the stack at run time. The different layers of heap allocators are described below, from top to bottom.
GlobalAlloc/GlobalFree: Microsoft Win32 heap calls that talk directly to the per-process default heap.
LocalAlloc/LocalFree: Win32 heap calls (for compatibility with Microsoft Windows NT) that talk directly to the per-process default heap.
COM's IMalloc allocator (or CoTaskMemAlloc/CoTaskMemFree): These functions use the default per-process heap. Automation uses the Component Object Model (COM) allocator, so its requests also go to the per-process heap.
C/C++ run-time (CRT) allocator: Provides malloc() and free() as well as the new and delete operators. Languages such as Microsoft Visual Basic and Java also offer new operators, but use garbage collection instead of heaps. The CRT creates its own private heap, which resides on top of the Win32 heap.
In Windows NT, the Win32 heap is a thin layer around the Windows NT run-time allocator. All APIs forward their requests to NTDLL.
The Windows NT run-time allocator provides the core heap allocator within Windows NT. It consists of a front-end allocator with 128 free lists for sizes ranging from 8 to 1,024 bytes. The back-end allocator uses virtual memory to reserve and commit pages.
At the bottom layer is the Virtual Memory Allocator, which the operating system uses to reserve and commit pages. All allocators use the virtual memory facility to access data.
Shouldn't it be simple to allocate and free blocks? Why would this take a long time?
Notes on heap implementation
Traditionally, the operating system and the run-time libraries come with an implementation of the heap. At the beginning of a process, the operating system creates a default heap called the Process heap. The Process heap is used for allocating blocks if no other heap is used. Language run times can also create separate heaps within a process. (For example, the C run time creates a heap of its own.) Besides these dedicated heaps, the application or one of the many loaded dynamic-link libraries (DLLs) may create and use separate heaps. Win32 offers a rich set of APIs for creating and using private heaps. See the MSDN Library for a detailed tutorial on Heap Functions.
When applications or DLLs create private heaps, these heaps live in the process space and are accessible process-wide. Data allocated from a given heap should be freed to the same heap. (Allocating from one heap and freeing to another makes no sense.)
In all virtual memory systems, the heap sits on top of the operating system's Virtual Memory Manager. The language run-time heaps reside on top of virtual memory as well. In some cases these heaps are layered on the operating system heaps, but the language run-time heap performs its own memory management by allocating large blocks. Bypassing the operating system heap and using the virtual memory functions directly may enable a heap to do a better job of allocating and using blocks.
A typical heap implementation consists of a front-end allocator and a back-end allocator. The front-end allocator maintains a free list of fixed-size blocks. On an allocation call, the heap attempts to find a free block in the front-end list. If this fails, the heap is forced to allocate a large block from the back end (reserving and committing virtual memory) to satisfy the request. Common implementations have a per-block allocation overhead, which costs execution cycles and also reduces the available storage.
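The front-end and back-end split described above can be sketched in a few lines of C. This is a minimal illustration, not the Windows NT implementation: the names ToyHeap, toy_alloc, toy_free, BLOCK_SIZE, and BATCH are hypothetical, there is only one block size, and malloc() stands in for the back-end allocator that would normally reserve and commit virtual memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical toy heap: one front-end free list of fixed-size blocks,
   with malloc() standing in for the back-end allocator. */
#define BLOCK_SIZE 64   /* payload size served by the front end */
#define BATCH      8    /* blocks carved out of each back-end request */

typedef union Block {
    union Block *next;                  /* link used while the block is free */
    unsigned char payload[BLOCK_SIZE];  /* storage handed to the caller */
} Block;

typedef struct ToyHeap {
    Block *free_list;       /* front end: singly linked list of free blocks */
    size_t backend_calls;   /* how often the slow path was taken */
} ToyHeap;

/* Slow path: get one large block from the back end and split it.
   The large blocks are never returned; this is only a sketch. */
static int toy_refill(ToyHeap *h) {
    Block *slab = malloc(sizeof(Block) * BATCH);
    if (slab == NULL) return 0;
    h->backend_calls++;
    for (size_t i = 0; i < BATCH; i++) {
        slab[i].next = h->free_list;
        h->free_list = &slab[i];
    }
    return 1;
}

/* Fast path: pop the head of the free list; refill first if it is empty. */
void *toy_alloc(ToyHeap *h) {
    assert(h != NULL);
    if (h->free_list == NULL && !toy_refill(h)) return NULL;
    Block *b = h->free_list;
    h->free_list = b->next;
    return b;
}

/* Push the block back onto the front-end list. */
void toy_free(ToyHeap *h, void *p) {
    assert(h != NULL && p != NULL);
    Block *b = p;
    b->next = h->free_list;
    h->free_list = b;
}
```

With a zero-initialized ToyHeap, the first call to toy_alloc takes the slow path once and the next seven calls are served from the free list, which is the behavior the paragraph above describes.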
Knowledge Base article Q10758, "Managing Memory with calloc() and malloc()" (search on the article ID number), contains more background on these topics. A detailed discussion of heap implementations and designs can also be found in "Dynamic Storage Allocation: A Survey and Critical Review" by Paul R. Wilson, Mark S. Johnstone, Michael Neely, and David Boles, in International Workshop on Memory Management, Kinross, Scotland, UK, September 1995 (http://www.cs.utexas.edu/users/oops/papers.html).
The Windows NT implementation (Windows NT version 4.0 and later) uses 127 free lists of 8-byte aligned blocks ranging from 8 to 1,024 bytes, plus a "grab-bag" list for large blocks. The grab-bag list (free list[0]) holds blocks larger than 1,024 bytes. Each free list contains objects linked together in a doubly linked list. By default, the Process heap performs coalescing. (Coalescing is the act of combining adjacent free blocks to build a larger block.) Coalescing costs additional cycles but reduces internal fragmentation of heap blocks.
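As an illustration of what coalescing involves, here is a small sketch in C. It is not how Windows NT stores its free lists (those are doubly linked lists, as noted above); instead it assumes a hypothetical fixed-size table of free blocks, FreeTable, kept sorted by offset within a toy arena. The function free_and_coalesce is likewise a made-up name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical free-block table for a toy arena, kept sorted by offset. */
#define MAX_FREE 32

typedef struct FreeBlock { size_t offset, size; } FreeBlock;

typedef struct FreeTable {
    FreeBlock blocks[MAX_FREE];
    size_t count;
} FreeTable;

/* Record [offset, offset + size) as free and coalesce it with any free
   neighbor that touches it. Returns 0 if the table is full. */
int free_and_coalesce(FreeTable *t, size_t offset, size_t size) {
    assert(t != NULL && size > 0);
    size_t i = 0;
    while (i < t->count && t->blocks[i].offset < offset) i++;  /* find slot */

    /* Does the previous free block end exactly where this one starts? */
    int merge_prev = i > 0 &&
        t->blocks[i - 1].offset + t->blocks[i - 1].size == offset;
    /* Does the next free block start exactly where this one ends? */
    int merge_next = i < t->count && offset + size == t->blocks[i].offset;

    if (merge_prev && merge_next) {
        /* Three blocks become one; remove the entry for the next block. */
        t->blocks[i - 1].size += size + t->blocks[i].size;
        for (size_t j = i; j + 1 < t->count; j++) t->blocks[j] = t->blocks[j + 1];
        t->count--;
    } else if (merge_prev) {
        t->blocks[i - 1].size += size;
    } else if (merge_next) {
        t->blocks[i].offset = offset;
        t->blocks[i].size += size;
    } else {
        /* No free neighbor: insert a new entry at position i. */
        if (t->count == MAX_FREE) return 0;
        for (size_t j = t->count; j > i; j--) t->blocks[j] = t->blocks[j - 1];
        t->blocks[i].offset = offset;
        t->blocks[i].size = size;
        t->count++;
    }
    return 1;
}
```

Even in this simplified form, every free has to look at both neighbors and may have to move entries around, which is where the additional cycles go.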
A single global lock protects the heap against multithreaded use. (See "Server Performance and Scalability Killers" by George Reilly, in the MSDN Online Web Workshop, at http://msdn.microsoft.com/workshop/server/iis/Tencom.asp.) This lock is essential for protecting the heap data structures from random access across multiple threads. It can hurt performance when heap operations are too frequent.
What is a common heap performance problem?
Here are the most common problems you will encounter when working with the heap:
Slowdown as a result of allocation operations. Allocation simply takes a long time. The most likely cause is that the free lists have no suitable block, so the run-time allocator code spends cycles hunting for a larger free block or allocating a fresh block from the back-end allocator.
Slowdown as a result of free operations. Free operations consume more cycles, mainly when coalescing is enabled. During coalescing, each free operation must find its neighbors, pull them out to build a larger block, and reinsert that larger block into the free list. During the search, memory may be touched in random order, causing cache misses and a drop in performance.
Slowdown as a result of heap contention. Contention occurs when two or more threads try to access data at the same time and one must wait for the other to finish before proceeding. Contention always causes trouble; it is by far the biggest problem encountered on multiprocessor systems. An application or DLL that makes heavy use of memory blocks slows down when it runs with multiple threads (and on multiprocessor systems). The use of a single lock, the common solution, means that all operations on the heap are serialized. Serialization causes threads to switch context while waiting for the lock. Imagine the slowdown caused by stop-and-go traffic at a flashing red light. Contention usually leads to context switches of threads and processes. Context switches are costly, but even more costly is the loss of data from the processor cache and the rebuilding of that data when the thread is later brought back to life.
Slowdown as a result of heap corruption. Corruption occurs when the application does not use heap blocks properly. Common scenarios include freeing a block twice, using a block after it has been freed, and overwriting beyond the block boundaries. (Corruption is beyond the scope of this article. For details on memory overwrites and leaks, see the Microsoft Visual C++ debug documentation.)
Slowdown as a result of frequent allocations and reallocations. This is very common when you use scripting languages. Strings are repeatedly allocated, grown with reallocation, and freed. Don't do this. If possible, allocate large strings and use the buffer. An alternative is to minimize concatenation operations.
Contention is the problem that slows down both allocation and free operations. Ideally we would like a heap with no contention and fast alloc/free. Unfortunately, no such general-purpose heap exists yet, though one might in the future.
In all server systems (such as IIS, MSProxy, DatabaseStacks, web servers, Exchange, and others), the heap lock is a big bottleneck. The more processors there are, the worse the contention becomes.
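The advice about frequent allocations and reallocations can be made concrete with a small sketch. The StrBuf type and its functions below are hypothetical names, and doubling the capacity from an assumed starting size of 64 bytes is just one reasonable growth policy. The point is that the buffer is reallocated only when it runs out of room, not on every concatenation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical string builder that grows its buffer geometrically. */
typedef struct StrBuf {
    char  *data;
    size_t len;       /* bytes used, excluding the terminator */
    size_t cap;       /* bytes allocated */
    size_t reallocs;  /* number of realloc() calls, kept for measurement */
} StrBuf;

/* Append s to the buffer. The heap is touched only when capacity runs out.
   Returns 0 if the reallocation fails (the buffer is left unchanged). */
int strbuf_append(StrBuf *b, const char *s) {
    assert(b != NULL && s != NULL);
    size_t n = strlen(s);
    if (b->len + n + 1 > b->cap) {
        size_t new_cap = b->cap ? b->cap : 64;          /* assumed start size */
        while (new_cap < b->len + n + 1) new_cap *= 2;  /* double until it fits */
        char *p = realloc(b->data, new_cap);
        if (p == NULL) return 0;
        b->data = p;
        b->cap = new_cap;
        b->reallocs++;
    }
    memcpy(b->data + b->len, s, n + 1);   /* copies the terminator too */
    b->len += n;
    return 1;
}

/* Release the buffer and reset the builder to its empty state. */
void strbuf_free(StrBuf *b) {
    free(b->data);
    b->data = NULL;
    b->len = b->cap = b->reallocs = 0;
}
```

Starting from a zero-initialized StrBuf, appending a two-character string 1,000 times results in six calls to realloc() rather than one per append.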
Try to minimize the use of the heap
Now that you understand the problems with the heap, don't you want a magic wand that can make them go away? I wish there were one. But there is no magic that makes the heap go faster, so don't expect to make things much faster in the last week before the product ships. If you plan your heap strategy in advance, you will be far better off. Changing the way you use the heap and reducing the number of heap operations is a solid strategy for improving performance.
How do you reduce the number of heap operations? You can do so by exploiting locality inside data structures. Consider the following example:
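struct ObjectA {
    int dataA;                 // ObjectA's data (placeholder member)
};
struct ObjectB {
    int dataB;                 // ObjectB's data (placeholder member)
};
// Ways to use ObjectA and ObjectB together.
// The alternatives are given distinct names so that they can coexist.
//
// Use a pointer
//
struct ObjectB_Pointer {
    struct ObjectA *pObjA;     // child is allocated and freed separately
    int dataB;                 // ObjectB data
};
//
// Use embedding
//
struct ObjectB_Embedded {
    struct ObjectA objA;       // child lives inside the parent
    int dataB;                 // ObjectB data
};
//
// Aggregation: use ObjectA and ObjectB inside another object
//
struct ObjectX {
    struct ObjectA objA;
    struct ObjectB objB;
};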
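To see what the pointer and embedding alternatives cost, the following sketch counts heap calls over the lifetime of one object. The struct names ObjectBPtr and ObjectBEmbed, the counting wrappers, and the two lifecycle functions are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

struct ObjectA { int a_data; };

/* Pointer version: the child is a separate heap block. */
struct ObjectBPtr   { struct ObjectA *pObjA; int b_data; };
/* Embedded version: the child lives inside the parent. */
struct ObjectBEmbed { struct ObjectA objA;   int b_data; };

static size_t g_heap_calls;   /* counts the malloc/free calls made below */

static void *counted_malloc(size_t n) { g_heap_calls++; return malloc(n); }
static void  counted_free(void *p)    { g_heap_calls++; free(p); }

/* Create and destroy one pointer-linked object: two allocations, two frees. */
size_t lifecycle_with_pointer(void) {
    g_heap_calls = 0;
    struct ObjectBPtr *b = counted_malloc(sizeof *b);
    if (b == NULL) return g_heap_calls;
    b->pObjA = counted_malloc(sizeof *b->pObjA);
    counted_free(b->pObjA);   /* free(NULL) is harmless if the child failed */
    counted_free(b);
    return g_heap_calls;
}

/* Create and destroy one embedded object: one allocation, one free. */
size_t lifecycle_embedded(void) {
    /* The parent is at least as large as the child it contains. */
    assert(sizeof(struct ObjectBEmbed) >= sizeof(struct ObjectA));
    g_heap_calls = 0;
    struct ObjectBEmbed *b = counted_malloc(sizeof *b);
    counted_free(b);
    return g_heap_calls;
}
```

In this sketch the pointer version makes four heap calls per object and the embedded version makes two, and the embedded child is also adjacent to its parent in memory.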
Avoid using pointers to associate two data structures. If you use a pointer to associate two data structures, objects A and B in the preceding example are allocated and freed separately. This is an additional cost, and it is what we want to avoid.
Embed pointed-to child objects into the parent object. Any time there is a pointer in an object, it means there is a dynamic element (80 percent of the time) and a new location to dereference. Embedding increases locality and reduces the need for further allocation and free calls. This improves the performance of your application.
Combine smaller objects to form bigger objects (aggregation). Aggregation reduces the number of blocks allocated and freed. If several developers work on different parts of a design, you may end up with many small objects that can be combined. The challenge of this integration is to find the right aggregation boundaries.
Inline a buffer that can satisfy 80 percent of your needs (also known as the 80-20 rule). In several situations, a memory buffer is needed to store string or binary data, and the total number of bytes is not known ahead of time. Take measurements and inline a buffer of a size that satisfies 80 percent of your needs. For the remaining 20 percent, allocate a new buffer and keep a pointer to it. This reduces allocation and free calls and increases the spatial locality of the data, which ultimately improves the performance of your code.
Allocate objects in chunks (chunking). Chunking is a way of allocating objects in groups of more than one at a time. If you have to keep track of a list of items, for example a list of {name, value} pairs, you have two options. Option 1 is to allocate one node per name-value pair. Option 2 is to allocate a structure that can hold, say, five name-value pairs. If, for example, storing four pairs is the common case, you reduce the number of nodes and the extra space needed for additional linked-list pointers. Chunking is friendly to the processor cache, especially the L1 cache, because of the increased locality it offers; in addition, many of the data blocks of a chunked allocation are located in the same virtual page.
Use _amblksiz appropriately. The C run time (CRT) has its own custom front-end allocator, which allocates blocks of size _amblksiz from the back end (the Win32 heap). Setting _amblksiz to a higher value can potentially reduce the number of calls made to the back end. This applies only to programs that use the CRT extensively.
The savings from these techniques vary with object types, sizes, and workload. However, you will always gain in performance and scalability. On the downside, the code becomes a bit specialized, but it remains manageable if it is well thought out.
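The inline-buffer technique from the list above might look like the following in C. This is a sketch under stated assumptions: SmallStr and its functions are hypothetical names, and INLINE_CAP is set to 32 only for illustration; in practice the size should come from measuring what covers about 80 percent of your data.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define INLINE_CAP 32   /* assumed size that covers the common case */

/* Hypothetical string holder with an inline buffer and a heap fallback.
   Do not copy a SmallStr by value: data may point into inline_buf. */
typedef struct SmallStr {
    char  *data;                   /* points at inline_buf or a heap block */
    size_t len;
    char   inline_buf[INLINE_CAP];
} SmallStr;

/* Store a copy of s in a cleared (or zero-initialized) SmallStr.
   Returns 0 if the fallback allocation fails. */
int smallstr_set(SmallStr *ss, const char *s) {
    assert(ss != NULL && s != NULL);
    size_t n = strlen(s);
    if (n < INLINE_CAP) {
        ss->data = ss->inline_buf;   /* common case: no heap call at all */
    } else {
        ss->data = malloc(n + 1);    /* rare case: one heap block */
        if (ss->data == NULL) return 0;
    }
    memcpy(ss->data, s, n + 1);
    ss->len = n;
    return 1;
}

/* True when the string is stored in the inline buffer. */
int smallstr_is_inline(const SmallStr *ss) {
    return ss->data == ss->inline_buf;
}

/* Release the heap block, if any, and reset to the empty state. */
void smallstr_clear(SmallStr *ss) {
    if (ss->data != NULL && !smallstr_is_inline(ss)) free(ss->data);
    ss->data = NULL;
    ss->len = 0;
}
```

Short strings never touch the heap, and the data sits next to the rest of the object. Only strings of INLINE_CAP bytes or more pay for an allocation and a free.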
Other techniques to improve performance
Here are a few more techniques for improving speed:
Use the Windows NT5 heap
Thanks to the efforts and hard work of several colleagues, a few significant improvements went into Microsoft Windows 2000 in early 1998:
Improved locking inside the heap code. The heap code uses one lock per heap. This global lock protects the heap data structure against multithreaded use. Unfortunately, in high-traffic scenarios a heap can still get bogged down in this global lock, leading to high contention and poor performance. In Windows 2000, the critical region of code inside the lock is reduced to minimize the probability of contention, which improves scalability.
Use of lookaside lists. The heap data structure uses a fast cache for all free items of blocks sized between 8 and 1,024 bytes (in 8-byte increments). The fast cache was originally protected by the global lock. Now lookaside lists are used to access the fast-cache free lists. These lists do not require locking and instead use 64-bit interlocked operations, which improves performance.
Improved internal data structure algorithms.
These improvements eliminate the need for allocation caches, but do not preclude other optimizations. Evaluate your code with the Windows NT5 heap; it should be optimal for blocks smaller than 1,024 bytes (1 KB), that is, blocks served by the front-end allocator. GlobalAlloc() and LocalAlloc() are built on the same heap and are common mechanisms for accessing the per-process heap. If you want high localized performance, use the Heap* APIs to access the per-process heap, or create your own heaps for allocation operations. If you need to operate on large blocks, you can use VirtualAlloc() and VirtualFree() directly.
These improvements made their way into Windows 2000 beta 2 and Windows NT 4.0 SP4. After the changes, the rate of heap lock contention fell significantly. This benefits all direct users of Win32 heaps. The CRT heap is built on top of the Win32 heap, but it uses its own small-block heap and therefore does not benefit from the Windows NT changes. (Visual C++ version 6.0 also has an improved heap allocator.)
Use allocation caches
An allocation cache lets you cache allocated blocks for future reuse. This reduces the number of alloc/free calls to the process heap (or global heap) and allows maximum reuse of blocks once they have been allocated. In addition, allocation caches let you gather statistics to gain a better understanding of object usage at a higher level.
Typically, a custom heap allocator is implemented on top of the process heap. The custom heap allocator behaves much like the system heap. The major difference is that it provides a cache on top of the process heap for the allocated objects. The caches are designed for a fixed set of sizes (for example, 32 bytes, 64 bytes, 128 bytes, and so on). This is a good strategy, but this kind of custom heap allocator loses the semantic information about the objects that are allocated and freed.
In contrast to custom heap allocators, allocation caches are implemented as per-class caches. In addition to providing all the benefits of a custom heap allocator, they can retain a lot of semantic information. Each allocation cache handler is associated with one object in the target binary. It can be initialized with a set of parameters indicating the concurrency level, the size of the object, the number of elements to keep in the free list, and so on. The allocation cache handler object maintains its own private pool of freed entities (not exceeding the specified threshold) and uses a private lock for protection. Together, the allocation cache and the private lock reduce the traffic to the main system heap, providing increased concurrency, maximum reuse, and higher scalability.
A scavenger is required to periodically check the activity of all the allocation cache handlers and reclaim unused resources. If no activity is found, the pool of allocated objects can be freed, which improves performance.
Each alloc/free activity can be audited. The first level of information includes the total count of objects, allocations, and free calls. You can derive the semantic relationship between various objects by looking at their statistics. That relationship can be used to reduce memory allocation with one of the many techniques described above.
Allocation caches also act as a debugging aid, helping you track down the number of objects that were not properly cleaned up.
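A per-type allocation cache of the kind described in this section can be sketched as follows. All names here (AllocCache, cache_init, cache_alloc, cache_free, cache_scavenge) are hypothetical, malloc() and free() stand in for the process heap, and the private lock that a real allocation cache would take around each operation is left out to keep the sketch short.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-type allocation cache. A real one would also take a
   private lock around each operation; locking is omitted in this sketch. */
typedef struct CacheNode { struct CacheNode *next; } CacheNode;

typedef struct AllocCache {
    size_t object_size;   /* size of the one object type this cache serves */
    size_t max_free;      /* threshold for the private pool of freed objects */
    size_t free_count;    /* objects currently held in the pool */
    CacheNode *free_list;
    size_t allocs, frees, heap_calls;   /* statistics for auditing */
} AllocCache;

void cache_init(AllocCache *c, size_t object_size, size_t max_free) {
    assert(c != NULL);
    /* Each cached object must be able to hold the free-list link. */
    c->object_size = object_size < sizeof(CacheNode) ? sizeof(CacheNode)
                                                     : object_size;
    c->max_free = max_free;
    c->free_count = 0;
    c->free_list = NULL;
    c->allocs = c->frees = c->heap_calls = 0;
}

void *cache_alloc(AllocCache *c) {
    c->allocs++;
    if (c->free_list != NULL) {          /* hit: reuse a cached object */
        CacheNode *n = c->free_list;
        c->free_list = n->next;
        c->free_count--;
        return n;
    }
    c->heap_calls++;                     /* miss: go to the process heap */
    return malloc(c->object_size);
}

void cache_free(AllocCache *c, void *p) {
    if (p == NULL) return;
    c->frees++;
    if (c->free_count < c->max_free) {   /* keep the object for reuse */
        CacheNode *n = p;
        n->next = c->free_list;
        c->free_list = n;
        c->free_count++;
    } else {                             /* pool is full: release it */
        c->heap_calls++;
        free(p);
    }
}

/* Scavenger step: return every cached object to the heap. */
void cache_scavenge(AllocCache *c) {
    while (c->free_list != NULL) {
        CacheNode *n = c->free_list;
        c->free_list = n->next;
        c->heap_calls++;
        free(n);
    }
    c->free_count = 0;
}
```

Comparing allocs with frees gives the number of objects still outstanding, and comparing allocs with heap_calls shows how much traffic the cache kept away from the underlying heap.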
With an allocation cache, you can even find the exact callers at fault by looking at dynamic stack back traces and signatures, in addition to the objects that were not cleaned up.
MP heap
MP heap is a package for multiprocessor-friendly distributed allocation, available in the Win32 SDK (Windows NT 4.0 and later). Originally implemented by JVert, this heap abstraction is built on top of the Win32 heap package. MP heap creates multiple Win32 heaps and attempts to distribute allocation calls across them, with the goal of reducing contention on any single lock.
This package is a good step: a sort of improved MP-friendly custom heap allocator. However, it offers no semantic information and lacks statistics. A common way to use the MP heap is as an SDK library. If you create a reusable component with this SDK, you will benefit greatly. If you build this SDK library into every DLL, however, you will increase your working set.
Rethink algorithms and data structures
To scale on multiprocessor machines, the algorithms, the implementation, the data structures, and the hardware all have to scale dynamically. Look at the data structures that are allocated and freed most often. Ask yourself, "Can I get the job done with a different data structure?" For example, if you have a list of read-only items that is loaded when the application initializes, this list does not need to be a linear linked list. A dynamically allocated array works fine. A dynamically allocated array reduces the number of heap blocks in memory, reduces fragmentation, and therefore improves performance.
Reducing the number of small objects reduces the load on the heap allocator. For example, we used five different objects on our server's critical processing path, each allocated and freed separately. Caching the objects together reduced the heap calls from five to one and dramatically reduced the load on the heap, especially when we were handling more than 1,000 requests per second.
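The read-only list example can be illustrated by counting allocations for both layouts. The Item and ListNode types and the two loader functions are hypothetical and only sketch the idea: when the item count is known at load time, a single block can replace one block per item.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Item { int id; } Item;

/* Linked-list version: one heap block per item. */
typedef struct ListNode { Item item; struct ListNode *next; } ListNode;

/* Build a list of n items; *heap_calls receives the number of allocations. */
ListNode *load_as_list(size_t n, size_t *heap_calls) {
    assert(heap_calls != NULL);
    ListNode *head = NULL;
    *heap_calls = 0;
    for (size_t i = n; i > 0; i--) {
        ListNode *node = malloc(sizeof *node);
        if (node == NULL) break;         /* sketch: return the partial list */
        (*heap_calls)++;
        node->item.id = (int)(i - 1);
        node->next = head;               /* prepend, so ids end up ascending */
        head = node;
    }
    return head;
}

/* Free every node of the list, one heap call per item again. */
void free_list(ListNode *head) {
    while (head != NULL) {
        ListNode *next = head->next;
        free(head);
        head = next;
    }
}

/* Array version: the count is known at load time, so one block is enough.
   The caller releases it with a single free(). */
Item *load_as_array(size_t n, size_t *heap_calls) {
    assert(heap_calls != NULL);
    Item *items = malloc(n * sizeof *items);
    *heap_calls = 1;
    if (items == NULL) return NULL;
    for (size_t i = 0; i < n; i++) items[i].id = (int)i;
    return items;
}
```

For 100 items, the list costs 100 allocations and later 100 frees, while the array costs one of each and keeps the items contiguous in memory.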
If you make extensive use of Automation structures, consider factoring Automation BSTRs out of your mainline code, or at least avoid repeated operations on BSTRs. (BSTR concatenation leads to excessive reallocations and alloc/free operations.)
Summary
Heap implementations tend to stay general for all platforms, and therefore carry heavy overhead. Each individual's code has specific requirements, but a design can apply the principles discussed in this article to reduce heap interaction.
Evaluate the use of the heap in your code.
Streamline your code to use fewer heap calls: analyze the critical paths and fix the data structures.
Take measurements to quantify the cost of heap calls before implementing custom wrappers.
If you are unhappy with performance, ask the OS group to improve the heap. More requests of this kind mean more attention to improving the heap.
Ask the C run-time group to make the allocator a thin wrapper on the heaps provided by the OS. As the OS heap improves, the cost of C run-time heap calls goes down.
The operating system (Windows NT family) is continuously improving the heap. Stay tuned and take advantage of these improvements.
Murali Krishnan is a lead software design engineer with the Internet Information Server (IIS) team. He has worked on IIS since version 1.0 and has successfully shipped versions 1.0 through 4.0. Murali formed and led the IIS performance team for three years (1995-1998) and has influenced IIS performance from the beginning. He holds an M.S. and a degree from Anna University, India. Outside work, he likes to read, play volleyball, and cook at home.