Heap: Pleasures and Pains

Murali R. Krishnan
Microsoft Corporation

February 1999

Summary: Discusses common heap performance problems and how to protect against them. (9 printed pages)

Introduction

Are you a carefree user of dynamically allocated C/C++ objects? Do you use Automation liberally to communicate back and forth between modules? Is it possible that your program is slow because of heap allocations? You are not alone: almost all projects run into heap issues sooner or later. The common tendency is to say, "My code is really good; it's the heap that is slow." That is not entirely correct. This article will help you learn more about the heap, how it is used, and what can go wrong.

What is a heap?

(If you already know what a heap is, you can skip ahead to the section "What are the common heap performance problems?")

The heap is used for allocating and freeing objects dynamically for use by the program. Heap operations are called for when:

The number and size of the objects needed are not known in advance.

An object is too large to fit into a stack allocator.

During run time, the heap uses some memory outside of what is allocated for the code and the stack. The different layers of heap allocators are the following:

GlobalAlloc/GlobalFree: Win32 heap calls that interact directly with the default heap of each process.

LocalAlloc/LocalFree: Win32 heap calls that interact directly with the default heap of each process (kept for compatibility with Microsoft Windows NT).

COM's IMalloc allocator (or CoTaskMemAlloc/CoTaskMemFree): functions that use the default heap of each process. Automation uses the Component Object Model (COM) allocator, and its requests use the per-process heap.

C/C++ run-time (CRT) allocator: provides the malloc() and free() functions and the new and delete operators. Programming languages such as Microsoft Visual Basic and Java also provide new operators, but these use garbage collection instead of the heap. The CRT creates its own private heap, which resides on top of the Win32 heap.
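As a concrete illustration, here is a minimal sketch (not part of the original article) that makes one call at each of these layers; error handling is omitted, and CoTaskMemAlloc requires linking with ole32.lib:

#include <windows.h>   // GlobalAlloc, LocalAlloc, Heap* APIs
#include <objbase.h>   // CoTaskMemAlloc/CoTaskMemFree (link ole32.lib)
#include <cstdlib>     // malloc/free

int main()
{
    // Win32 calls that go straight to the per-process default heap.
    HGLOBAL hg = GlobalAlloc(GMEM_FIXED, 64);
    GlobalFree(hg);

    HLOCAL hl = LocalAlloc(LMEM_FIXED, 64);
    LocalFree(hl);

    // COM task allocator; Automation requests end up here too.
    void* pCom = CoTaskMemAlloc(64);
    CoTaskMemFree(pCom);

    // CRT allocator: a private heap that sits on the Win32 heap.
    void* pCrt = malloc(64);
    free(pCrt);
    int* pNew = new int[16];    // new/delete also use the CRT heap
    delete[] pNew;

    // Explicit Win32 heap API on the default process heap.
    void* pWin32 = HeapAlloc(GetProcessHeap(), 0, 64);
    HeapFree(GetProcessHeap(), 0, pWin32);
    return 0;
}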

In Windows NT, Win32 heaps are a thin layer of allocator around the Windows NT run-time allocator. All APIs forward their requests to NTDLL.

The Windows NT run-time allocator provides the core heap allocator. It consists of a front-end allocator with free lists for 128 sizes ranging from 8 to 1,024 bytes. The back-end allocator reserves and commits pages using virtual memory. [Reposter's note: compare the memory pool and free lists of SGI STL's alloc. Does this mean that on the Win32/NT platform you need not write a memory-pool allocator for small objects such as strings, because the OS already does it?]

At the bottom of the chart is the virtual memory allocator, which reserves and commits the pages used by the operating system. All allocators use the virtual memory facilities to access the data.

Shouldn't allocating and freeing blocks be simple? Why does it take so long?

Notes on the heap implementation

Traditionally, the operating system and the run-time libraries come with an implementation of the heap. At the beginning of a process, the operating system creates a default heap called the process heap. The process heap is used for allocating blocks if no other heap is used. Language run-time libraries can also create separate heaps within a process. (For example, the C run-time library creates a heap of its own.) Besides these dedicated heaps, the application program or one of the many loaded dynamic-link libraries (DLLs) may create and use separate heaps. Win32 offers a rich set of APIs for creating and using private heaps. For an excellent tutorial on the heap functions, see the MSDN Platform SDK node.
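As a hedged, minimal sketch (the sizes are arbitrary and error handling is omitted), a private heap is created and used with the Win32 heap API like this:

#include <windows.h>

int main()
{
    // Create a growable private heap: default options, 4 KB initial
    // size, no maximum (0 means the heap can grow).
    HANDLE hHeap = HeapCreate(0, 4096, 0);

    void* p = HeapAlloc(hHeap, HEAP_ZERO_MEMORY, 256);
    // ... use the block ...
    HeapFree(hHeap, 0, p);     // free it back to the SAME heap

    HeapDestroy(hHeap);        // releases the heap's virtual memory
    return 0;
}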

When an application or DLL creates a private heap, that heap resides in the process space and is accessible process-wide. Any data allocated from a given heap should be freed back to the same heap. (Allocating from one heap and freeing to another makes no sense.) In all virtual memory systems, heaps sit on top of the operating system's virtual memory manager. The language run-time heaps also sit on top of the virtual memory. In some cases these heaps are layered over the operating system heap, but the language run time performs its own memory management by allocating large blocks. Bypassing the operating system heap and using the virtual memory functions directly can enable a heap to do a better job of allocating and using blocks.

A typical heap implementation consists of a front-end allocator and a back-end allocator. The front-end allocator maintains free lists of fixed-size blocks. When the heap receives an allocation call, it tries to find a free block in the front-end lists. If this fails, the heap is forced to allocate a large block from the back end (which reserves and commits virtual memory) to satisfy the request. The usual implementations carry a per-block overhead, which costs execution cycles and also reduces the usable storage.
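The following is a simplified, single-threaded sketch of that front-end/back-end split. It illustrates the idea only and is not the actual Windows NT implementation: real heaps also keep per-block headers, coalesce free neighbors, handle blocks larger than 1,024 bytes, and take a lock (which is exactly where contention, discussed below, comes in).

#include <windows.h>
#include <cstddef>

struct FreeBlock { FreeBlock* next; };

const int BUCKETS = 128;               // sizes 8, 16, ..., 1,024 bytes
static FreeBlock* freeLists[BUCKETS];  // front-end free lists
static char*  backendCur  = 0;         // current back-end chunk
static size_t backendLeft = 0;         // bytes left in that chunk

// Allocate a small block (assumes 0 < size <= 1024; not thread-safe).
void* tinyAlloc(size_t size)
{
    size_t rounded = (size + 7) & ~(size_t)7;   // 8-byte granularity
    int bucket = (int)(rounded / 8) - 1;
    if (freeLists[bucket]) {                    // fast path: pop a block
        FreeBlock* b = freeLists[bucket];
        freeLists[bucket] = b->next;
        return b;
    }
    if (backendLeft < rounded) {                // slow path: grab a fresh
        backendCur = (char*)VirtualAlloc(       // 64 KB back-end chunk
            0, 64 * 1024, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
        if (!backendCur) return 0;
        backendLeft = 64 * 1024;
    }
    void* p = backendCur;                       // carve from the chunk
    backendCur  += rounded;
    backendLeft -= rounded;
    return p;
}

void tinyFree(void* p, size_t size)             // blocks go back onto the
{                                               // front-end free list
    size_t rounded = (size + 7) & ~(size_t)7;
    int bucket = (int)(rounded / 8) - 1;
    FreeBlock* b = (FreeBlock*)p;
    b->next = freeLists[bucket];
    freeLists[bucket] = b;
}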

Knowledge Base article Q10758, "Managing Memory with calloc() and malloc()" (search by article ID), contains more background on these topics. In addition, see Paul R. Wilson, Mark S. Johnstone, Michael Neely, and David Boles, "Dynamic Storage Allocation: A Survey and Critical Review," International Workshop on Memory Management, Kinross, Scotland, UK, September 1995 (http://www.cs.utexas.edu/users/oops/papers.html).

The Windows NT implementation (Windows NT version 4.0 and later) uses 127 free lists of 8-byte-aligned blocks ranging from 8 to 1,024 bytes, plus one mixed list. The mixed list (free list[0]) contains blocks larger than 1,024 bytes. The free lists are kept as doubly linked lists. By default, the process heap performs coalescing. (Coalescing is the operation of combining adjacent free blocks to build one larger block.) Coalescing costs extra cycles but reduces the internal fragmentation of heap blocks.

A single global lock protects the heap against multithreaded use. (See the first commandment in George Reilly's "Server Performance and Scalability Killers.") This lock is essential to protect the heap data structures from arbitrary access by multiple threads. It can have a negative impact on performance when heap operations are too frequent.

What are the common heap performance problems?

The following are the most common problems you will encounter when using the heap:

Slowdown as a result of allocation operations. Allocations simply take a long time. The most probable cause of the slowdown is that the free lists have no blocks, so the run-time allocator code spends cycles hunting for a larger free block, or allocating a fresh block from the back-end allocator.

Slowdown as a result of free operations. Free operations consume more cycles, mainly when coalescing is enabled. During coalescing, each free operation should "find" its neighboring blocks, pull them out to construct a larger block, and reinsert the larger block into the free list. During that find, memory may be touched in random order, causing cache misses and a drop in performance.

Slowdown as a result of heap contention. Contention occurs when two or more threads try to access data at the same time and one must wait for the other to complete before it can proceed. Contention always causes trouble; it is by far the biggest problem encountered on multiprocessor systems. Applications or DLLs that use a large number of memory blocks slow down when run with multiple threads (and on multiprocessor systems). The use of a single lock, the common solution, means that all operations that use the heap are serialized. The serialization causes threads to switch context while they wait on the lock. Imagine the slowdown caused by the stop-and-go at a blinking red traffic light. Contention usually causes context switching of threads and processes. Context switches are very costly, but costlier still is the loss of data from the processor cache and the rebuilding of that data when the thread comes back to life.

Slowdown as a result of heap corruption. Corruption occurs when the application does not use the heap properly. Common cases include double-freeing a block, using a block after it has been freed, and writing outside the boundaries of a block. (Corruption is beyond the scope of this article. See the Microsoft Visual C++ debug documentation for details on memory overwrites and memory leaks.)

Slowdown as a result of frequent allocations and reallocations. This is a very common phenomenon when scripting languages are used. Strings are repeatedly allocated, grown by reallocation, and then freed. Don't do this. If possible, allocate one big string and use a buffer; another option is to minimize concatenation operations. (A sketch of this anti-pattern and a fix appears after this list.)

Contention is the reason that both allocation and free operations slow down. Ideally, one would like a heap with no contention and fast allocation and free. Alas, no such general-purpose heap exists yet, although it may appear at some time in the future.
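The following hedged sketch (the helper names are invented for illustration) contrasts the per-append reallocation with a buffer that is sized once:

#include <cstdlib>
#include <cstring>

// Anti-pattern: one realloc per appended fragment; each call may
// allocate a new block, copy the old contents, and free the old block.
char* appendSlow(char* s, const char* frag)
{
    size_t oldLen = s ? strlen(s) : 0;
    char* t = (char*)realloc(s, oldLen + strlen(frag) + 1);
    if (t)
        strcpy(t + oldLen, frag);
    return t;
}

// Better: compute the total size first, allocate once, then copy the
// fragments in, touching the heap a single time.
char* buildOnce(const char** frags, int n)
{
    size_t total = 1;                      // room for the terminator
    for (int i = 0; i < n; ++i)
        total += strlen(frags[i]);

    char* s = (char*)malloc(total);
    if (!s) return 0;

    char* p = s;
    for (int i = 0; i < n; ++i) {
        size_t len = strlen(frags[i]);
        memcpy(p, frags[i], len);
        p += len;
    }
    *p = '\0';
    return s;
}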

In all server systems (such as IIS, MSProxy, DatabaseStacks, network servers, Exchange, and others), the heap lock is a BIG bottleneck. The larger the number of processors, the worse the contention gets.

Protecting yourself from the pains

Now that you know something about the problems with the heap, wouldn't you like a magic wand that makes them go away? I wish there were one. But there is no magic that will make the heap run faster, so don't expect a big speedup in the week before you ship. Plan your heap strategy as early as possible. Changing the way you use the heap, and reducing the number of heap operations, is a reliable way to improve performance.

How can you reduce the number of heap operations? You can reduce them by exploiting locality within your data structures. Consider the following example:

struct ObjectA {
    // data for objectA
};

struct ObjectB {
    // data for objectB
};

// Use of ObjectA and ObjectB together.

//
// Use pointers
//
struct ObjectB {
    struct ObjectA* pObjA;
    // data for objectB
};

//
// Use embedding
//
struct ObjectB {
    struct ObjectA pObjA;
    // data for objectB
};

//
// Aggregation - use ObjectA and ObjectB inside another object
//
struct ObjectX {
    struct ObjectA objA;
    struct ObjectB objB;
};

Avoid using pointers to associate two data structures. If a pointer is used to associate two data structures, objects A and B in the previous example are allocated and freed separately. This is additional overhead, and it is exactly what we want to avoid.

Embed child objects in parent objects. Whenever there is a pointer in an object, it implies a dynamic element (80 percent of the time) and a new location to dereference. Embedding adds locality and reduces the need for further allocation and free operations. This will improve the application's performance.

Combine smaller objects to form a larger object (aggregation). Aggregation reduces the number of blocks allocated and freed. If several developers are responsible for designing different parts, you may end up with a number of small objects that could be merged. The difficulty of this merging is to find the correct aggregation boundaries.

Inline a buffer that satisfies 80 percent of the need (also known as the 80-20 rule). In some cases a memory buffer is needed to hold string or binary data, and the total number of bytes is not known in advance. Take measurements, and inline a buffer that satisfies the need 80 percent of the time. For the remaining 20 percent, allocate a new buffer and point to it with a pointer. This reduces allocation and free calls and increases the spatial locality of the data, which ultimately improves the performance of the code. (See the sketch after this list.)

Allocate objects in chunks (chunking). Chunking is a way of allocating multiple objects as a group at one time. If you have to track a list of items, for example pairs of {name, value}, there are two options: option one is to allocate one node per name-value pair; option two is to allocate a structure that can hold, say, five pairs. If storing four pairs, for example, is the common case, chunking reduces the number of nodes and the amount of extra space needed for linked-list pointers. Chunking is also processor-cache friendly, particularly for the L1 cache, because it provides increased locality; moreover, some of the data blocks of a chunk end up in the same virtual page. (See the sketch after this list.)

Use _amblksiz appropriately. The C run-time library (CRT) has a custom front-end allocator that allocates blocks of size _amblksiz from the back end (the Win32 heap). Setting _amblksiz to a higher value can potentially reduce the number of calls made to the back end. This applies only to programs that use the CRT extensively.
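Below are hedged sketches of two of these techniques, the inline 80-20 buffer and chunking. The 128-byte buffer and the five pairs per chunk are illustrative assumptions; the right numbers come from measuring your own workload.

#include <cstdlib>
#include <cstring>

// Inline buffer (80-20 rule): the measured common case fits in the
// embedded buffer; only the remaining ~20% of cases touch the heap.
class SmallString {
public:
    SmallString() : m_p(m_inline) { m_inline[0] = '\0'; }
    ~SmallString() { if (m_p != m_inline) free(m_p); }

    void Set(const char* s)
    {
        size_t len = strlen(s) + 1;
        if (m_p != m_inline) { free(m_p); m_p = m_inline; }
        if (len > sizeof(m_inline))        // the ~20% case: go to the heap
            m_p = (char*)malloc(len);
        if (m_p) memcpy(m_p, s, len);
    }
    const char* Get() const { return m_p; }

private:
    char  m_inline[128];   // sized so that ~80% of strings fit
    char* m_p;             // points at m_inline or at a heap block
};

// Chunking: five {name, value} pairs per allocation instead of one
// node (plus a link pointer) per pair.
struct PairChunk {
    struct Pair { const char* name; const char* value; };
    Pair       pairs[5];   // five pairs per heap block
    int        used;       // how many pairs are filled in
    PairChunk* next;       // one link per five pairs, not one per pair
};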

The savings obtained by applying these techniques vary with object types, sizes, and workloads, but gains in performance and scalability are almost always achievable. The code becomes a little specialized, but it stays manageable if the design is thought through.

More pointers for performance improvement

Here are some more techniques for improving speed:

Use the Windows NT 5 heap. Thanks to several people's efforts and hard work, some significant improvements went into Microsoft Windows 2000 in early 1998:

Improved locking inside the heap code. The heap code uses one lock per heap. This global lock protects the heap data structures from multithreaded use. Unfortunately, under heavy traffic the heap can still bog down on this global lock, resulting in high contention and low performance. Windows 2000 reduces the critical regions of locked code to minimize the probability of contention, thereby improving scalability.

Use of lookaside lists. The heap data structure uses a fast cache for all free items of block sizes between 8 and 1,024 bytes (in 8-byte increments). The fast cache was initially protected by the global lock. Now, lookaside lists are used to access the fast-cache free lists. These lists do not require locking; instead they use 64-bit interlocked operations, thus improving performance. The internal data-structure algorithms were improved as well.

These improvements eliminate the need for an allocation cache in many cases, but they do not exclude other optimizations. Evaluate your code on the Windows NT 5 heap; it is close to ideal for blocks smaller than 1,024 bytes (1 KB), that is, blocks served by the front-end allocator. GlobalAlloc() and LocalAlloc() are built on the same heap and are common mechanisms for accessing the per-process heap. Use the Heap* APIs to access the per-process heap, or create your own heap for allocation operations when highly localized performance is needed. You can also use the VirtualAlloc()/VirtualFree() operations directly if you need to.

The improvements described above shipped in Windows 2000 Beta 2 and Windows NT 4.0 SP4. After these improvements, the contention rate on the heap lock dropped significantly. This benefits all direct users of Win32 heaps. The CRT heap is built on top of the Win32 heap, but it does not benefit from these Windows NT improvements. (Visual C++ 6.0 also has an improved heap allocator.)

Use an allocation cache. An allocation cache lets you cache allocated blocks for future reuse. This reduces the number of allocation and free calls made to the process heap (or the global heap), and lets blocks be reused to the maximum once they have been allocated. In addition, an allocation cache lets you gather statistics for a better understanding of object usage at the higher level.

Typically, a custom heap allocator is implemented on top of the process heap. A custom heap allocator behaves much like the system heap; the main difference is that it provides a cache on top of the process heap for allocated objects. The caches are designed for a fixed set of sizes (for example, 32 bytes, 64 bytes, 128 bytes, and so on). This is a good strategy, but this kind of custom heap allocator lacks semantic information about the objects being allocated and freed.

In contrast to custom heap allocators, an "allocation cache" is implemented as a per-type allocation cache. Besides providing all the benefits of a custom heap allocator, it can also retain rich semantic information. Each allocation cache handler is associated with one object type in the target binary. It can be initialized with a set of parameters indicating the concurrency level, the object size, and the number of elements to keep in the free list. An allocation cache handler object maintains its own private pool of freed objects (never exceeding the specified threshold) and protects it with a private lock. Together, the allocation cache and the private locks reduce the traffic to the main system heap, providing increased concurrency, maximum reuse, and higher scalability.
A cleanup routine is needed to scan all the cache handlers periodically and reclaim unused resources. If no activity is found, the pool of allocated objects can be freed, improving performance.

Each allocation and free activity can be audited. The first level of information includes the total counts of objects, allocations, and free calls outstanding. You can derive the semantic relationships among objects by looking at the statistics for different objects. Such a relationship can be used to reduce memory allocations through one of the many techniques just described.

An allocation cache also acts as a debugging aid that helps you track down objects that are not cleaned up completely. Beyond the un-freed objects, you can even find the exact culprits by looking at dynamic stack back-traces and signatures. (A sketch of such a cache follows.)
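Here is a hedged, minimal sketch of such a per-type allocation cache built on top of the process heap. The class name, the statistics kept, and the threshold policy are illustrative assumptions, not the article's actual implementation:

#include <windows.h>

class AllocCache {
public:
    AllocCache(size_t objSize, long maxFree)
        : m_objSize(objSize < sizeof(Node) ? sizeof(Node) : objSize),
          m_maxFree(maxFree), m_nFree(0),
          m_allocs(0), m_frees(0), m_free(0)
    {
        InitializeCriticalSection(&m_lock);
    }

    ~AllocCache()
    {
        while (m_free) {               // return pooled blocks to the heap
            Node* n = m_free;
            m_free = n->next;
            HeapFree(GetProcessHeap(), 0, n);
        }
        DeleteCriticalSection(&m_lock);
    }

    void* Alloc()
    {
        EnterCriticalSection(&m_lock); // private lock, not the heap lock
        ++m_allocs;                    // statistics for auditing
        Node* n = m_free;
        if (n) { m_free = n->next; --m_nFree; }
        LeaveCriticalSection(&m_lock);
        return n ? (void*)n
                 : HeapAlloc(GetProcessHeap(), 0, m_objSize);
    }

    void Free(void* p)
    {
        EnterCriticalSection(&m_lock);
        ++m_frees;
        if (m_nFree < m_maxFree) {     // keep the block for reuse
            Node* n = (Node*)p;
            n->next = m_free;
            m_free  = n;
            ++m_nFree;
            p = 0;
        }
        LeaveCriticalSection(&m_lock);
        if (p)                         // over the threshold: really free
            HeapFree(GetProcessHeap(), 0, p);
    }

private:
    struct Node { Node* next; };
    CRITICAL_SECTION m_lock;
    size_t m_objSize;   // size of the cached object type
    long   m_maxFree;   // free-list depth threshold
    long   m_nFree;     // current free-list depth
    long   m_allocs;    // statistics for tuning and leak hunting
    long   m_frees;
    Node*  m_free;      // private pool of freed blocks
};

One handler of this kind would be instantiated per object type, for example AllocCache g_requestCache(sizeof(Request), 64); (the names here are hypothetical).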

Use the MP heap. The MP heap is a package for multiprocessor-friendly distributed allocation, available in the Win32 SDK (Windows NT 4.0 and later). Originally implemented by JVert, this heap is built on top of the Win32 heap package. The MP heap creates several Win32 heaps and tries to distribute allocation calls among them to reduce contention on any single lock.

This package is a good step: an improved, MP-friendly custom heap allocator. However, it provides no semantic information and lacks statistics. The common way to use the MP heap is as an SDK library. If you create a reusable component using this SDK, you will benefit. If, however, you build the SDK library into every DLL, your working set will grow.

Rethink your algorithms and data structures. To scale on multiprocessor machines, the algorithms, implementations, data structures, and hardware must all scale dynamically. Look at the data structures that are allocated and freed most often, and ask yourself: "Can I get this done with a different data structure?" For example, if a read-only list of items is loaded when the application is initialized, the list does not have to be a linearly linked list: a dynamically allocated array will do. A dynamically allocated array reduces heap blocks in memory and reduces fragmentation, giving a performance boost.

Reduce the number of small objects you need, to lower the load on the heap allocator. For example, we used five different objects on the server's critical processing path, each with its own allocation and free. Caching the objects together cut the heap calls from five to one and significantly reduced the load on the heap, especially when handling more than 1,000 requests per second. (A sketch of this combining technique follows below.)

If you use Automation structures extensively, consider removing Automation BSTRs from the mainline code, or at least avoid repeated operations on BSTRs. (Concatenating BSTRs leads to excessive reallocation and allocation/free operations.)
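Here is a hedged sketch of that five-into-one combination. The five structures and their sizes are made-up stand-ins for whatever actually sits on your own critical path:

// Before: five separate allocations (and frees) per request.
struct HeaderInfo { char data[64];  };
struct AuthInfo   { char data[32];  };
struct UrlInfo    { char data[256]; };
struct FileInfo   { char data[64];  };
struct LogInfo    { char data[128]; };

// After: one block carries all five parts, so each request costs one
// allocation and one free instead of five of each.
struct RequestState {
    HeaderInfo header;
    AuthInfo   auth;
    UrlInfo    url;
    FileInfo   file;
    LogInfo    log;
};

void ProcessRequest()
{
    RequestState* rs = new RequestState;   // one heap call, not five
    // ... handle the request using rs->header, rs->auth, and so on ...
    delete rs;                             // one free, not five
}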

Summary

Heap implementations tend to stay general-purpose for all platforms, and hence carry heavy overhead. Everyone's code has its own specific requirements, but a design that follows the principles discussed here can reduce heap interaction.

Evaluate the use of the heap in your code. Improve your code so that it uses fewer heap calls: analyze the critical paths and fix the data structures. Take measurements before writing custom wrappers, to quantify the cost of heap calls. If you are dissatisfied with the performance, ask the operating-system group to improve the heap; the more such requests there are, the more energy will go into improving the heap. Ask the C run-time group to make the allocators thin wrappers over the heap provided by the operating system; the cost of C run-time heap calls will then drop as the operating-system heap improves. The operating systems (the Windows NT family) keep improving the heap; keep pace and take advantage of those improvements.

Murali Krishnan is a Senior Software Design Engineer on the Internet Information Server (IIS) team. He has worked on IIS since version 1.0 and successfully shipped versions 1.0 through 4.0. Murali organized and led the IIS Performance team for three years (1995-1998), working on the performance of IIS from day one. He holds a bachelor's degree in Computer Science from Anna University in India and a master's degree in Computer Sciences from the University of Wisconsin-Madison. When not working, he likes reading, playing volleyball, and cooking at home.
