In the best tradition of TV dramas, let's review the previous episodes. We built a class template, buffer, very similar to std::vector, except that buffer has no notion of capacity and adds a few primitives such as grow_noinit and shrink_nodestroy. We also used type traits as a vehicle for optimization. Finally, two menacing issues loomed: copying objects and memory allocation. This installment doesn't discuss buffer directly; instead, it discusses two operations you will often perform on buffers: filling a buffer with a value, and copying objects between buffers and between different containers. We will compare several fill and copy methods against one another.
Filling, as you know, means copying the same value to every object in a range. The C++ standard library provides two fill functions: std::fill and std::uninitialized_fill. The second assumes that the objects being filled are raw, uninitialized memory. The simplest fill function might look like this:
// Example 1: a simple fill routine
template <class T>
void Fill(T* beg, T* end, const T& obj)
{
    for (; beg != end; ++beg)
        *beg = obj;
}
The question is whether this is the best possible implementation. The usual answer is, "the compiler will generate optimized code" - so I tested first. I inspected the code generated by Microsoft Visual C++ 6.0 (MSVC) and Metrowerks CodeWarrior 6.8 (MWCW). Both produced assembly for a plain, simple loop. However, the x86, like many other modern processors, offers special instructions for filling blocks of memory quickly. The C library function memset may use those instructions. Unfortunately, memset can only set memory to copies of a single byte; as soon as you need to fill memory with anything wider than one byte, memset no longer helps, so it doesn't scale for generic code. (A digression: the memory primitives memset, memcpy, and memcmp enjoy an unparalleled status. Compiler vendors may optimize them heavily, up to and including detecting calls to them and replacing those calls with inline assembly - MSVC does exactly that. For the speed-hungry programmer, then, reaching for the mem* functions is considered the cool thing to do.)

Using a copy to implement a fill is another approach. The idea is this: once part of the target range is filled, you can use a fast copy routine to copy the filled part over the unfilled part - and, nicely, the filled block can double in size with each step. For example, say you must fill a range of 1,000 doubles with the value 1.0. In the first step, you assign 1.0 to the first position. In the second step, you copy that position to the adjacent one. In the third step, you copy those two values to the next two adjacent positions; in the fourth, four values, which gets you to eight - and so on. After about ten doubling steps, the entire range of 1,000 doubles is filled. Most of the actual work happens in the last step, when 512 positions are already filled and 488 of them are copied over the remaining 488 positions.
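As a sanity check on the arithmetic above, here is a small helper (my own illustration, not code from the buffer library; the name copyRounds is made up) that counts the copy rounds the doubling scheme needs after the initial assignment:

```cpp
#include <cassert>
#include <cstddef>

// Number of copy rounds the doubling scheme needs after the first
// slot is assigned: the filled region doubles until it covers n.
std::size_t copyRounds(std::size_t n)
{
    std::size_t rounds = 0;
    for (std::size_t filled = 1; filled < n; filled *= 2)
        ++rounds;
    return rounds;
}
```

For 1,000 elements it reports ten rounds, matching the walk-through above.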
Suppose the fast copy routine at your disposal is:
template <class T>
void QuickCopy(const T* src, T* dest, size_t size);
Then the FillByCopy algorithm looks like this:

template <class T>
void FillByCopy(T* beg, T* end, const T& obj)
{
    if (beg == end) return;
    *beg = obj; // fill the first slot directly
    const size_t total = end - beg;
    size_t filled = 1;
    while (filled < total)
    {
        // copy the filled part over the unfilled part,
        // doubling the filled region whenever possible
        const size_t toCopy =
            filled < total - filled ? filled : total - filled;
        QuickCopy(beg, beg + filled, toCopy);
        filled += toCopy;
    }
}
If QuickCopy is indeed fast, FillByCopy is a cool algorithm. It resembles the "Russian peasant algorithm," which computes integral powers in a minimal number of steps [1]. Many people have invented fill-by-copy independently in different settings - one incarnation of the algorithm fills a whole disk starting from a one-byte file. Had it been my original idea, I would have rushed to christen fill-by-copy the "Romanian peasant algorithm." (Translator's note: the author is Romanian.) At this point I couldn't wait to write a benchmark and collect interesting results. But first, let me introduce one more algorithm.
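For the curious, a minimal sketch of the Russian peasant algorithm referenced above - repeated squaring, with the result multiplied in whenever the exponent is odd; the function name is mine:

```cpp
#include <cassert>

// Russian peasant exponentiation: square the base and halve the
// exponent, multiplying into the result on odd exponents. Uses
// O(log n) multiplications instead of n - 1.
unsigned long power(unsigned long x, unsigned n)
{
    unsigned long result = 1;
    while (n != 0)
    {
        if (n & 1) result *= x;
        x *= x;
        n >>= 1;
    }
    return result;
}
```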
Duff's Device. Duff's Device [2] is a C coding technique for speeding up loops. Its basic insight is that if the operation performed inside a for loop is fast enough (say, ahem, an assignment), then testing the loop condition (beg != end in Example 1) accounts for a large share of the time each iteration takes. The loop should therefore be partially unrolled, so that several operations are performed for each test. For instance, if you're filling a range of objects, you may want to assign two or more consecutive objects in each iteration. You must then pay close attention to termination conditions and the like, and here Duff's Device is a novel, creative solution. Let's take a quick look at a generic fill implementation based on Duff's Device.
template <class T>
void FillDuff(T* begin, T* end, const T& obj)
{
    switch ((end - begin) % 8)
    {
    case 0:
        while (begin != end)
        {
            *begin = obj; ++begin;
    case 7: *begin = obj; ++begin;
    case 6: *begin = obj; ++begin;
    case 5: *begin = obj; ++begin;
    case 4: *begin = obj; ++begin;
    case 3: *begin = obj; ++begin;
    case 2: *begin = obj; ++begin;
    case 1: *begin = obj; ++begin;
        }
    }
}

Now, if you've never seen code like this, take a second look, because you may have missed what's going on. The function contains a switch statement whose body is a while loop, with one case label outside the loop and the rest inside it. The expression in the switch computes the length of the range modulo eight; that remainder determines at which position execution enters the while loop, and control then falls through (there are no breaks) until the loop finally exits. With Duff's Device, the boundary conditions are handled simply and beautifully. By the way, why is case 0 placed outside the loop - doesn't that spoil the aesthetics? The only reason is to handle empty sequences: when the remainder is zero, entering at case 0 performs the loop test first, which is exactly the extra test needed to cope with a possibly empty range. The net effect of all this cleverness is that beg != end is tested eight times less often, so the test's share of the loop's running time shrinks by that same factor of eight. You can, of course, experiment with other unrolling factors. The potential downsides of Duff's Device are code bloat (some processors handle compact loops better than large ones) and its unusual structure: an optimizer may go "aha!" upon meeting a familiar, simple loop, yet turn conservative when it meets a trickier construct. The one thing everyone must know about optimization (right after "don't do it" and "don't do it yet") is that only measurement proves anything. The fill algorithms above may sound good, but only a test can show whether they actually help. So I wrote a simple benchmark that fills an array of count doubles using the three algorithms described above - the for loop, fill-by-copy, and Duff's Device - and repeats each measurement repeat times. I ran it with three compilers on a Pentium III at 750 MHz: MSVC, MWCW, and GNU g++ 2.95.2, testing each algorithm several times while varying count and repeat. Here is what I found:
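The benchmark program itself is not reproduced in the text; a harness along these lines (a sketch under my own assumptions - std::clock for timing, Example 1's loop as the measured fill) conveys the idea:

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <vector>

// Example 1's straight loop fill.
template <class T>
void Fill(T* beg, T* end, const T& obj)
{
    for (; beg != end; ++beg)
        *beg = obj;
}

// Time `repeat` fills of a count-element array, in seconds.
// std::clock has coarse resolution, so keep repeat high.
double timeFill(std::size_t count, std::size_t repeat)
{
    if (count == 0) return 0.0;
    std::vector<double> buf(count);
    const std::clock_t start = std::clock();
    for (std::size_t i = 0; i != repeat; ++i)
        Fill(&buf[0], &buf[0] + count, 1.0);
    return double(std::clock() - start) / CLOCKS_PER_SEC;
}
```

The same skeleton, with the fill call swapped out, serves for fill-by-copy and the Duff version.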
* When filling large buffers (count of 100,000 and up), the straight for loop and Duff's Device perform virtually identically. The fill-by-copy algorithm actually loses by 1-5% [3].
* When filling ranges of 12,000 doubles, fill-by-copy is 23% faster under MSVC and MWCW, but g++ favors the for loop, for which it currently generates the best code of all compilers and all methods (by 20%-80%).
* When filling ranges of 1,000 doubles, MSVC and MWCW produce similar results: compared to the straight for loop, Duff's Device is 20% faster and fill-by-copy is 75% faster. Once again g++ behaves differently, generating surprising code for all methods (100% faster than the other compilers).
* At 100 doubles, MSVC and MWCW produce comparable results, and once again g++ finishes the task in half the time (with Duff's Device and fill-by-copy beating the for loop by 12%).
Let's explain these results by looking at the architecture of a modern PC. The processor is five to ten times faster than the main memory bus. To speed up memory accesses, there are two levels of cache: the first (level 1) is on the processor itself, and the second sits right next to it (on the Pentium III, inside the processor package). The ideal case is an operation whose entire working set fits in the level 1 cache. The worst case is randomly scattered memory accesses that miss the caches and go all the way out to main memory. Fill-by-copy is cache-unfriendly because every round hits two memory areas - the source and the destination. For example, if you fill 1 MB of data, the last step copies 512 KB to another location, which makes the cache manager thoroughly unhappy; a straight fill loop, by contrast, keeps it content. That's why fill-by-copy is slightly slower than brute force when filling large memory blocks. (Exercise: you can improve FillByCopy's cache friendliness with a simple modification to the code. Hint: think about locality of access.) When you fill large amounts of data, you can't benefit from the cache, so the fill speed is limited mainly by main-memory bandwidth. Optimizing the loop itself can't bring much improvement, because the bottleneck is memory, not processor operations: however you write the loop, whether or not you use registers, whether or not you unroll, the processor ends up waiting for main memory. That's why Duff's Device and the for loop perform identically on large memory blocks. The situation changes when you fill smaller amounts of data. More of the data fits in the cache, and the cache is about as fast as the processor. Now the code the processor executes determines the speed of the loop. memcpy (the copy routine underlying FillByCopy) uses a specialized string instruction (rep movs, in x86 terms).
For cached data, that instruction is faster than a loop built from jumps, assignments, and comparisons. This is why FillByCopy is the fastest contender for moderate amounts of data. Similarly, Duff's Device gains an edge over the for loop because it performs fewer comparisons.
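As a starting point for the exercise above, one possible modification (my own sketch, with an illustrative 8 KB chunk size, hard-wired to double for brevity) caps each copy chunk and copies from the block just behind the destination, so reads and writes stay in neighboring, cache-resident memory:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Fill-by-copy for doubles with each memcpy capped to a small,
// cache-sized chunk; the source is the block just behind the
// destination, so both areas stay close together in memory.
void FillByCopyBlocked(double* beg, double* end, double obj)
{
    if (beg == end) return;
    const std::size_t cap = 8 * 1024 / sizeof(double); // illustrative
    *beg = obj;
    const std::size_t total = end - beg;
    std::size_t filled = 1;
    while (filled < total)
    {
        std::size_t toCopy =
            filled < total - filled ? filled : total - filled;
        if (toCopy > cap) toCopy = cap;
        // source range [filled - toCopy, filled) is already filled
        std::memcpy(beg + filled, beg + filled - toCopy,
                    toCopy * sizeof(double));
        filled += toCopy;
    }
}
```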
Quick copy. The other typical memory operation is copying data into or out of a buffer. Looking for a fast copy method, I tested three ways of copying data of type double: a straight for loop, an implementation based on Duff's Device, and memcpy. (Strangely enough, although there is a fill-by-copy algorithm, there is no copy-by-fill algorithm.) We expect no surprises here: memcpy should leave every other method in the dust. After all, memcpy is the highly optimized, thoroughly tuned copy routine that ships with your standard library. Too bad it can't be used with all types! Also, Duff's Device versus the for loop should replay the results we saw for filling. Of course, as so often happens, the real results differ somewhat. They are as follows:
* When copying large amounts of data (on the order of megabytes), all methods (and all compilers) perform essentially the same; at these sizes, memory bandwidth is the limiting factor.
* When copying smaller amounts of data, the compilers begin to disagree. For example, when copying 100,000 doubles:
* MWCW's for loop is quite slow; Duff's Device and memcpy beat it by 20%.
* MSVC and g++ generate what might as well be the same code: all methods perform about the same - and as fast as MWCW's best.
* The differences become more pronounced as the amount of copied data shrinks. Here are the results for copying 10,000 doubles:
* With MSVC, the Duff's Device code is 25% faster than the for loop, and memcpy is 67% faster.
* With MWCW (brace yourself), Duff's Device is 9% faster and memcpy is 20% slower than the for loop, whose speed roughly matches MSVC's.
* g++ is really cool. First, its for loop came out 42% faster than MSVC's and MWCW's. Then, its memcpy performed like MSVC's memcpy, 10% faster than g++'s own for loop. But g++'s implementers must love Duff's Device, because their Duff-based copy defeated all contenders, running 11% faster than memcpy - which was supposed to be the fastest.
* When the amount of data drops to 1,000 doubles:
* MSVC's for loop is very slow: Duff's Device beats it by 50%, and memcpy by 200%.
* MWCW generates a faster for loop; Duff's Device is 5% faster than it, and memcpy is 130% faster.
* g++'s for loop matches MWCW's; Duff's Device beats it by 75%, and memcpy by 130%.
* All compilers produce similar - and the fastest - results for memcpy.
* Finally, when copying 100 doubles, I got results similar to the 1,000-double case, except that g++ again excelled with Duff's Device, which came out 75% faster than the for loop and 40% faster than memcpy.
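The Duff-based copy routine used in these measurements is not shown in the text; it follows the fill version mechanically. A sketch (the name CopyDuff is mine):

```cpp
#include <cassert>
#include <cstddef>

// Copy [src, src + n) to dest using Duff's Device: eight
// assignments per loop test, with the remainder handled by
// jumping into the middle of the unrolled body.
template <class T>
void CopyDuff(const T* src, T* dest, std::size_t n)
{
    const T* const end = src + n;
    switch (n % 8)
    {
    case 0:
        while (src != end)
        {
            *dest++ = *src++;
    case 7: *dest++ = *src++;
    case 6: *dest++ = *src++;
    case 5: *dest++ = *src++;
    case 4: *dest++ = *src++;
    case 3: *dest++ = *src++;
    case 2: *dest++ = *src++;
    case 1: *dest++ = *src++;
        }
    }
}
```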
If you're feeling dizzy, so am I. It looks hard to find a single best method for all compilers and all data sizes. It also surprised me that a free, open-source compiler is not merely on par with two famous commercial compilers, but frequently ahead of them. All these tests make me envy the people who write compiler runtime libraries: they enjoy the privilege of optimizing their code for one known compiler and one known computer architecture. But this is the Generic<Programming> column, so we need solutions that work with any type. A first step is a trait - call it SupportsMemcpy - that records whether a type may safely be copied with memcpy:
namespace TypeTraits
{
    // conservative default: don't assume memcpy is safe
    template <class T>
    struct SupportsMemcpy
    {
        enum { value = false };
    };

    // specializations for types that can be copied bit by bit
    template <>
    struct SupportsMemcpy<double>
    {
        enum { value = true };
    };
    // ... and likewise for the other primitive types
}
Given such a trait, you can then easily write a NativeCopy routine that dispatches to memcpy for types the trait approves, and to an ordinary element-by-element loop otherwise.
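Here is one way NativeCopy might be put together (a sketch, since the article's own listing is not shown here; Bool2Type and the SupportsMemcpy specializations are illustrative): a compile-time boolean selects between memcpy and an element loop via overloading.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Maps a compile-time bool to a distinct type, enabling overload
// selection at compile time.
template <bool b> struct Bool2Type {};

// Illustrative trait: conservative default, specialized for types
// known to be safe to copy bitwise.
template <class T> struct SupportsMemcpy { enum { value = false }; };
template <> struct SupportsMemcpy<double> { enum { value = true }; };

template <class T>
void NativeCopyImpl(const T* src, T* dest, std::size_t n, Bool2Type<true>)
{
    std::memcpy(dest, src, n * sizeof(T)); // bitwise copy is safe
}

template <class T>
void NativeCopyImpl(const T* src, T* dest, std::size_t n, Bool2Type<false>)
{
    for (std::size_t i = 0; i != n; ++i)   // generic fallback
        dest[i] = src[i];
}

template <class T>
void NativeCopy(const T* src, T* dest, std::size_t n)
{
    NativeCopyImpl(src, dest, n,
                   Bool2Type<SupportsMemcpy<T>::value != 0>());
}
```

The overload taken is decided entirely at compile time, so the generic fallback costs nothing for types the trait approves.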
namespace TypeTraits
{
    // A type counts as cheap to copy if copying it cannot throw
    // (no resources are acquired) and the object is small; the
    // size cutoff below is one reasonable choice.
    template <class T>
    struct CheapToCopy
    {
        enum { value = sizeof(T) <= 2 * sizeof(void*) };
    };
}
Interesting - doesn't CheapToCopy sound a bit weaselly? Formally: a type is considered cheap to copy if its copy operation cannot throw (which usually means copying it acquires no resources) and its size is below some set value. If a type is not cheap to copy, a plain loop over the elements is the normal choice. Using TypeTraits::CheapToCopy, you can arrange for the unrolled, Duff-style copy to be selected only for the types where it pays off.
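A sketch of such a selection (the trait definition and names here are illustrative, not the article's own listing): because the condition is a compile-time constant, the untaken branch is trivially dead code.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative stand-in for TypeTraits::CheapToCopy: a type is
// deemed cheap when it is no larger than two pointers (a real
// trait would also require a no-throw copy, promised per type).
template <class T>
struct CheapToCopy
{
    enum { value = sizeof(T) <= 2 * sizeof(void*) };
};

// Use an unrolled loop only when copying is cheap; otherwise the
// plain loop below handles the whole range.
template <class T>
void SmartCopy(const T* src, T* dest, std::size_t n)
{
    if (CheapToCopy<T>::value)
    {
        for (; n >= 4; n -= 4)   // four assignments per test
        {
            *dest++ = *src++; *dest++ = *src++;
            *dest++ = *src++; *dest++ = *src++;
        }
    }
    for (; n != 0; --n)          // leftovers (or the whole range)
        *dest++ = *src++;
}
```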
Summary. A bit dizzy, aren't you? First, the very fact that there is more than one way to fill and copy objects may have come as a surprise. Second, there is no single fill or copy method that wins across all compilers, data-set sizes, and hardware platforms. (I suspect that running the same code on a machine with a small cache would produce very different results - to say nothing of other hardware architectures.) As a rule of thumb, use memcpy where you can (and the same goes for filling): for large data sets memcpy makes no difference, while for smaller ones it may be much faster. For objects that are cheap to copy, Duff's Device may beat a simple for loop. And all of this ultimately depends on the whims and particular behaviors of your compiler and your hardware. There is a deep and sad truth in here. We live in the 21st century, the era of space travel. We have been developing electronic computers for more than 50 years. We strive to design ever more sophisticated systems, and the results are hardly satisfying. Software development is a messy business. Could that be because the basic tools and methods we use are low-level and inefficient? Let's step outside the circle and look at ourselves: 50 years later, we still haven't gotten filling and copying memory right. I have reached this article's word limit without having exhausted these issues - sorry, buffer. Moving objects around is another important topic, and we haven't even touched memory allocation. But, as they often say in the movies, let's not get ahead of ourselves. (Hmm? What movie?) So I have to stop now. Goodbye.

References and Notes
[1] Matt Austern. "The Russian Peasant Algorithm," C++ Report, June 2000. Matt did not invent the algorithm, but his article contains a good discussion of it.
[2] See
Andrei Alexandrescu is a Ph.D. student at the University of Washington in Seattle and the author of the book Modern C++ Design. You can contact him via www.moderncppdesign.com. Andrei is also an instructor of The C++ Seminar (