Pile: joy and pain

zhaozj2021-02-08  253

Pile: joy and pain

Murali R. Krishnanmicrosoft Corporation

February 1999

Summary: Discuss common heap performance issues and how to prevent them. (9 pages in a total)

Foreword

Are you dynamically allocated C / C object faithful and lucky users? Do you frequently use "automation" in round-trip communication between the modules? Is your program running very slow due to a heap allocation? Not only you encounter such problems. Almost all items will encounter a pile of problems sooner or later. Everyone wants to say, "My code is really good, but it is too slow". That's just part of it. It is very useful to understand the heap and its use, and what will happen.

What is a pile?

(If you already know what is a pile, you can jump to "What is a common heap performance problem?"

In the program, use the heap to dynamically allocate and release the object. In the following cases, call the heap operation:

I don't know the number and size of the object you need in advance. Objects are too large and not suitable for stack allocation programs.

Heaps use partial memory that are allocated to code and stacks in runtime. The figure below gives the different layers of the heap allocation program.

GLOBALLOC / GLOBALFREE: Microsoft Win32 Piles, these calls are dialogue directly with the default pile of each process.

Localalloc / LocalFree: Win32 Piles (in order to compatibility with Microsoft Windows NT), these calls are dialing directly with the default pile of each process.

COM's Imalloc Assignment (or CotaskMallAlloc / CotaskMemFree): Function uses the default pile of each process. Automated Program Using the Distribution Program for Component Object Models (COM) ", and the application is used for each process heap.

C / C Runtime (CRT) Assignment: Provides Malloc () and Free () and New and Delete operators. For languages ​​such as Microsoft Visual Basic and Java also provide new operators and use garbage collection instead of pile. CRT creates its own private pile, resides on top of the Win32 pile.

In Windows NT, Win32 is a thin layer around the Windows NT runtime allocation program. All APIs forward their request to NTDLL.

The Windows NT runtime assignment provides a core heap allocation program within Windows NT. It consists of a front-end allocation program with 128 sizes from 8 to 1,024 bytes. The backend distribution program uses virtual memory to reserve and submit pages.

At the bottom of the chart is the "virtual memory allocation program", the operating system uses it to retain and submit the page. All allocations use virtual memory for data access.

Isn't it that simple to distribute and release block? Why spend so long?

Precautions for piles

Traditionally, the operating system and runtime library are coexisting with the reactors. In the beginning of a process, the operating system creates a default pile called "Process Heap". If there is no other stack available, the block allocation uses "Process Heap". Separate a heap can be created in the process while the language is running. (For example, Create its own heap when C is running.) In addition to these dedicated stacks, applications or many loaded dynamic link library (DLLs) can create and use separate stacks. Win32 provides a complete set of APIs to create and use private piles. For detailed guidance on the heap function (English), see MSDN.

These stacks are in the process space when the application or DLL creates a private bunch and is accessible within the process. The data allocated from the given pile will be released on the same heap. (You cannot assign from a heap to another pile.)

In all virtual memory systems, the pile resides at the top of the "Virtual Memory Manager" of the operating system. Language is running in the top of the virtual memory. In some cases, these stacks are layers in the operating system stack, while the language is running through large block allocation to perform their own memory management. Do not use the operating system heap, and use the virtual memory function more beneficial to the allocation and block use. A typical heap implementation consists of the front and back-end allocation procedures. The front-end allocation program maintains an idle list of fixed larger blocks. For one distribution call, the heap attempts to find a free block from the front list. If fails, the heap is forced to assign a large block from the backend (reserved and submit virtual memory) to meet the request. The general implementation has overhead of each assignment, which will consume the execution cycle, which also reduces the storage space available.

Knowledge Base Article Q10758, "Use Calloc () and Malloc () Manage Memory (Search) (Search), contain more background knowledge about these topics. In addition, detailed discussion on heap implementation and design can also be found in the following works: "Dynamic Storage Allocation: A Survey and Critical Review", Author Paul R. Wilson, Mark S. Johnstone, Michael Nely, and David Boles; "International Workshop On Memory Management, Author Kinross, Scotland, UK, September 1995 (http://www.cs.utexas.edu/users/oops/papers.html).

Windows NT implementation (Windows NT version 4.0 and updated version) uses 127 sizes from 8 to 1,024 bytes of 8-byte alignment block idle lists and a "big block" list. "Big Block" list (idle list [0]) saves a block greater than 1,024 bytes. The idle list contains an object that is linked with a two-way lin list. By default, "Process Heap" performs collection operations. (Collection is to combine adjacent empty blocks into a large block of operation.) Collecting an extra cycle, but reduces the internal fragmentation of the block.

Single full-local lock protection, prevent multi-line usage. (See "Server Performance and Scalability Killers", George Reilly, on "MSDN Online Web Workshop" (site: http://msdn.mi crosoft.com/workshop/server/iis /tencom.asp (English).) Single global lock is essentially used to protect the data structure to prevent random access across a multi-thread. If the heap operation is too frequent, the single global lock will have an adverse effect on performance.

What is a common heap performance problem?

Here are the most common questions you have encountered when you use the pile:

The speed caused by the distribution operation slows down. Light distribution takes a long time. Most likely to slow running speed is that the idle list has no block, so the runtime allocation program code will consume a larger idle block, or allocate new blocks from the backend allocation program. The speed caused by the release operation slows down. The release operation costs more cycles, mainly to enable collection operation. During the collection, each release operation "Find" its adjacent block, remove them and configure a larger block, and then insert this larger block into the idle list. During the lookup, the memory may randomly, causing the cache that can not hit, performance decrease. The speed caused by the competition is slow. When two or more threads accesses data, and a thread must wait for another thread to complete when the other thread is completed. Competition is always caused; this is also the biggest problem that is currently encountered in multiprocessor systems. When a large number of applications or DLLs that use memory blocks are run (or run on a multiprocessor system) in a multi-threaded manner, the speed is slowed down. Single locking use - commonly used solutions - means that all operations using the heap are serialized. Serialization occurs when waiting for the locking time causes the thread to switch the context. It is possible to imagine the speed of the red light that the intersection flicker is stopped. Competition usually causes the context of threads and processes to switch. The overhead of the context switch is large, but the overhead is the data from the data from the processor cache, and the data reconstruction when the thread is resurrected. The speed caused by the damage caused. The reason for causing the damage is the application of the application to the incorrect use of the stack. Usually, the situation involves releasing the released stack or uses the released stacks, and the offshore rewrite of the block. (Destroying is not within the scope of this article. For other details such as memory rewriting and leakage, see the Microsoft Visual C (R) debug document.) The speed of frequent allocation and redistribution is slowed down. This is very common when using scripting languages. If the string is repeatedly allocated, the distribution is increased and released. Don't do this, if possible, try to assign big strings and use buffers. Another method is to use less connection operation. Competition is a problem that causes speed slower in allocation and release operations. Ideally, I hope to use a pile without competition and fast distribution / release. Unfortunately, there is no such universal heap, perhaps in the future.

In all server systems (such as IIS, MsProxy, DatabaseStacks, web servers, Exchange, and other), the stack lock is really a large bottleneck. The more processors, the more competition will deteriorate.

Try to minimize the use of piles

Now you understand the problem that exists when you use the pile, don't you want to have a super magic stick that can solve these problems? I hope there is. But there is no magic to speed up the pile run - so don't expect to be greatly improved in the last week before product shipments. If you plan a stack in advance, the situation will greatly improve. Adjusting the method of using a heap, reducing the operation of the heap is a good way to improve performance.

How to reduce the use of a heap action? The number of heap operations can be reduced by using the location within the data structure. Consider the following examples:

Struct Objecta {

// Objecta's data

}

Struct ObjectB {

// Objectb data

}

/ / Use Objecta and ObjectB simultaneously

//

// Use a pointer

//

Struct ObjectB {

Struct Objecta * Pobja;

// ObjectB data

}

//

// use embedded

//

Struct ObjectB {

Struct Objecta Pobja;

// Objectb data

}

//

// Collection - use Objecta and Objectb in another object

//

Struct ObjectX {

Struct Objecta Obja;

Struct ObjectB objb;

}

Avoid using a pointer to associate two data structures. If the pointer is used to associate two data structures, objects A and B in the previous instance will be assigned and released separately. This will increase additional overhead - we have to avoid this practice. Embed a child object with a pointer into the parent object. When there is a pointer in the object, it means that there is a moving element (80%) and a new location without reference. Embedding the location thereby reducing the need for further distribution / release. This will improve the performance of the application. Merged small objects form large objects (polymerization). The polymerization reduces the number of blocks allocated and released. If there are several developers, different parts of their respective development, will eventually have many small objects to be merged. The integrated challenge is to find the correct aggregate boundary. The inline buffer can meet the needs of 80% (AKA 80-20 rules). In individual cases, the memory buffer is required to save string / binary data, but do not know the total number of bytes in advance. It is estimated that one of the inline can meet the buffer required by 80%. For the remaining 20%, a new buffer can be allocated and pointers to this buffer. This reduces the distribution and release of the position space to increase the data, and fundamentally improves the performance of the code. Assign objects (blocks) in blocks. Blocking is a method of assigning multiple objects at a group manner. If you continue to track the list of items, such as a list of {name, value} pair, there are two options: Select One is assigned to each "Name-Value" pair; selecting the second is to assign a accommodation (eg Five) "Name - Value" pair structure. For example, in general, if you store four pairs, you can reduce the number of nodes, and if an additional amount of space is required, an additional linked list pointer is required. Blocking is a friendly processor cache, especially for L1-caches, because it provides an increased location - many data blocks will be in the same virtual page for block allocation. Use _AMBLKSIZ correctly. C Runtime (CRT) has its custom front-end allocation program, which assigns the block of _AMBLKSIZ from the rear end (Win32 heap). Setting_AMBLKSIZ to a higher value potentially reduce the number of calls to the backend. This only applies to extensive programs that use CRT. The benefits of using the above techniques will vary depending on the type, size, and workload. However, it can always be gains in performance and deliverability. On the other hand, the code will be a bit special, but if you think about it, the code is still easy to manage.

Other technology to improve performance

Here are some technologies that improve speeds:

Use Windows

转载请注明原文地址:https://www.9cbs.com/read-1340.html

New Post(0)