Optimize Your Applications with the Visual C++ Programming Model and Compiler
Release Date: 2/25/2005 | Updated: 2/25/2005
Kang Su Gatlin
Portions of this article are based on a prerelease version of Visual Studio, code-named "Whidbey". All information pertaining to the beta is subject to change.
This article discusses:
• Why C++ is a powerful language for .NET
• How to write high-performance programs for .NET
• How C++ and the JIT optimizer interact
• Delay loading and STL/CLI
This article uses the following technologies: C++ and Visual Studio
Contents
Optimized MSIL
JIT and Compiler Optimization Interaction
Common Subexpression Elimination and Algebraic Simplification
Whole Program Optimization
64-Bit NGen Optimizations
Double Thunk Elimination
C++ Interop
Native and Managed Code in a Single Image
High-Performance Marshaling
Templates and STL with .NET Types
Deterministic Finalization Helps Performance
Delay Loading
Why dllexport Can't Always Be Applied
Summary
Although the Microsoft® .NET Framework genuinely improves developer productivity, many people still worry about the performance of managed code. The new version of Visual C++® lets you put those concerns to rest. For Visual Studio® 2005, the C++ language itself has been greatly improved to make it faster to write. In addition, it provides a flexible language framework for interacting with the common language runtime (CLR) so you can write high-performance programs.
Many programmers assume that C++ delivers high performance only because it generates native code, but even when your code is fully managed you still get outstanding performance. Thanks to its flexible programming model, C++ does not tie you to procedural programming, object-oriented programming, generic programming, or metaprogramming.
Another common misconception is that you get the same performance in the .NET Framework no matter which language you use, on the theory that the Microsoft Intermediate Language (MSIL) generated by the various compilers is essentially equivalent. This was not true even in Visual Studio .NET 2003, and in Visual Studio 2005 the C++ compiler team has worked to ensure that the experience gained over many years of optimizing native code is applied to optimizing managed code as well. C++ gives you the flexibility to perform optimizations, such as high-performance marshaling, that cannot be done in other languages. In addition, the Visual C++ compiler generates the most highly optimized MSIL of any .NET language. The result is that the best-optimized code in .NET comes from the Visual C++ compiler.
Optimized MSIL
In the .NET environment, compilation has two distinct parts. In the first, the language compiler (C#, Visual Basic, or Visual C++) compiles and optimizes the programmer's code to generate MSIL. The second consists of feeding that MSIL to the just-in-time (JIT) compiler or to NGen, which reads the MSIL and generates optimized native code. Clearly the language compiler and the JIT are inseparable components; to generate good code, the two must work well together.
Visual C++ has always provided the highest level of optimization of any compiler, and that hasn't changed for managed code. This was apparent even in Visual C++ .NET 2003, which enabled optimization simply by having the native optimizing compiler generate the MSIL code.
In Visual C++ 2005, the compiler can perform a large subset of the standard native-code optimizations on MSIL, from data-flow-based optimizations, to expression optimizations, to loop unrolling. No other language on the platform offers this level of optimization. In Visual C++ .NET 2003, whole program optimization (WPO) was not supported with the /clr switch, but Visual C++ 2005 adds this feature for managed code. It enables cross-module optimization, which will be discussed later. One native optimization that is not available for managed code in Visual C++ 2005 is profile-guided optimization, although it may become available in a future version. For more information, see Write Faster Code with the Modern Language Features of Visual C++ 2005.
JIT and Compiler Optimization Interaction
The optimized code generated by Visual C++ is fed to the JIT or to NGen to produce native code. Whether the Visual C++ compiler is generating MSIL or unmanaged code, it uses the same code-generating optimizer, one that has been developed and tuned for more than a decade.
The optimizations performed on MSIL code are a large subset of those performed on unmanaged code. Note that the class of optimizations allowed differs depending on whether the compiler is generating verifiable code (/clr:safe) or unverifiable code (/clr or /clr:pure). In a few cases, metadata or verifiability constraints prevent the compiler from performing a transformation; these include strength reduction (such as turning multiplication into pointer addition) and merging access to a private member of one class into the method of another class.
After the Visual C++ compiler generates MSIL, it is handed off to the JIT. The JIT reads the MSIL and performs its own optimizations, and it is quite sensitive to variations in the MSIL it receives. One MSIL instruction sequence may optimize well, while another, semantically equivalent, sequence may inhibit optimization. For example, register allocation is an optimization in which the JIT optimizer attempts to map variables to registers, the actual hardware locations used as operands in arithmetic and logical operations. Code that is semantically identical but written in two different ways can change how good a job the register allocator does in the time available to it. Loop unrolling is an example of a transformation that can cause problems for the JIT's register allocation.
The loop unrolling performed by the C++ compiler can expose more instruction-level parallelism, but it also creates more live variables that the compiler needs to track for register allocation. The CLR JIT can only track a fixed number of variables for register allocation; once the number it needs to track exceeds that limit, it starts spilling register contents to memory.
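As a concrete illustration of the tradeoff just described, here is a standard-C++ sketch (the function names are mine, not from the article): unrolling a summation loop by four exposes independent additions, but it also keeps four partial sums live at once, so the register allocator has more values to track.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Straightforward loop: one live accumulator.
int sum_simple(const std::vector<int>& v) {
    int s = 0;
    for (std::size_t i = 0; i < v.size(); ++i) s += v[i];
    return s;
}

// Unrolled by four: more instruction-level parallelism, but four
// accumulators are live simultaneously, pressuring register allocation.
int sum_unrolled(const std::vector<int>& v) {
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    int s = s0 + s1 + s2 + s3;
    for (; i < v.size(); ++i) s += v[i];  // handle the remainder
    return s;
}
```

Both functions compute the same result; whether the unrolled form is faster depends on how many live values the target's register allocator can handle, which is exactly the issue the JIT faces.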
Therefore, the Visual C++ compiler and the JIT must be tuned to work together to generate the best code. The optimizations performed by the Visual C++ compiler are those that would be too time-consuming for the JIT, along with those for which too much information is lost in the compilation from C++ source code to MSIL.
Let's take a look at some of the optimizations Visual C++ performs on managed code.
Common Subexpression Elimination and Algebraic Simplification
Common subexpression elimination (CSE) and algebraic simplification are two powerful optimizations that let the compiler perform basic optimization at the expression level so developers can concentrate on algorithms and architecture. The code snippet shown here was compiled as both C# and C++, each under the release configuration. The variables a, b, and c are copied from an array passed as a parameter into the function containing this code:
int d = a + b * c;
int e = (c * b) * 12 + a + (a + b * c);
Figure 1 shows the MSIL generated from this code by the C# compiler and by the C++ compiler, with all optimizations enabled. The C# version requires 19 instructions, while the C++ version needs only 13. You can also see that the C++ compiler performs CSE on the b * c expression, and algebraically simplifies a + a to generate 2 * a, as well as (c * b) * 12 + c * b to generate (c * b) * 13. I find this CSE particularly useful because I've seen programmers pass up this kind of algebraic simplification in real code for the sake of readability. See the sidebar "C# Compiler Optimizations" for more.
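To make the transformations concrete, here is a standard-C++ sketch of what the optimizer does by hand (the function names are mine, for illustration): the common subexpression b * c is computed once, a + a is folded to 2 * a, and (c * b) * 12 + (b * c) is merged into (b * c) * 13.

```cpp
#include <cassert>

// The expressions exactly as written in the article's snippet.
int as_written(int a, int b, int c) {
    int d = a + b * c;
    int e = (c * b) * 12 + a + (a + b * c);
    return d + e;
}

// The same computation after hand-applied CSE and algebraic
// simplification, mirroring what the C++ compiler emits in MSIL.
int simplified(int a, int b, int c) {
    int t = b * c;           // common subexpression, computed once
    int d = a + t;
    int e = t * 13 + 2 * a;  // (c*b)*12 + b*c -> t*13; a + a -> 2*a
    return d + e;
}
```

The two functions are algebraically identical, which is precisely why the compiler is licensed to perform the rewrite.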
Whole Program Optimization
Visual C++ .NET added WPO for unmanaged code. In Visual C++ 2005, this feature extends to managed code. Rather than compiling and optimizing a single source file at a time, the compiler compiles and optimizes across all of the source files and header files in the program.
The compiler can now perform analysis and optimization across multiple source files. Without WPO, the compiler can only work within a single compilation unit; with WPO, it can work across all the source files in the program.
In the following example, the things the compiler can do include cross-module inlining and constant propagation, among other interprocedural optimizations:
// main.cpp
...
MSDNClass ^msdnObj = gcnew MSDNClass;
int x = msdnObj->Square(42);
return x;
...
// msdnclass.cpp
int MSDNClass::Square(int x)
{
    return x * x;
}
In this example, main.cpp calls the Square method, which is part of MSDNClass in another source file. When compiled with /O2 optimization but without whole program optimization, the MSIL generated for main.cpp is as follows:
ldc.i4.s 42
call instance int32 MSDNClass::Square(int32)
You can see that it first loads the value 42 onto the stack and then calls the Square function. By contrast, when the same program is compiled with whole program optimization turned on, the generated MSIL is as follows:
ldc.i4 0x6e4
It neither loads 42 nor calls the Square function. Instead, with whole program optimization, the compiler is able to inline the function from msdnclass.cpp and fold the constants. The end result is a single instruction that loads the result of 42 * 42, expressed in hexadecimal as 0x6E4.
Although some of the analysis and optimization performed by the Visual C++ compiler could in theory also be performed by the JIT compiler, the JIT's time constraints rule out many of the optimizations mentioned here. In general, NGen can perform more of these kinds of optimizations than the JIT compiler, because NGen does not face the same response-time constraints.
64-Bit NGen Optimizations
Up to this point I have lumped the JIT and NGen together as "the JIT". For the 32-bit version of the CLR, the JIT compiler and NGen perform essentially the same optimizations. That is not the case for the 64-bit version, where NGen performs more effective optimization than the JIT does.
64-bit NGen exploits the fact that it can spend more time compiling than the JIT can, because the JIT's throughput directly affects the application's response time. I call out 64-bit NGen specifically in this article because it has been tuned for C++-style code, and it performs some optimizations (such as double thunk elimination) that greatly help C++ and that the other JIT and NGen implementations don't have. The 32-bit and 64-bit JITs were implemented by two different teams at Microsoft. The 32-bit JIT was developed by the CLR team, while the 64-bit JIT was developed by the Visual C++ team and is based on the Visual C++ code base. Because the 64-bit JIT was developed by the C++ team, it pays more attention to issues that matter to C++.
Double Thunk Elimination
One of the most important optimizations performed by 64-bit NGen is so-called double thunk elimination. This optimization addresses a transition that occurs with function pointers and virtual calls in C++ code compiled with the /clr switch, when managed code calls a managed entry point through such an indirection. (The transition does not occur in code compiled with /clr:pure or /clr:safe.) It happens because there is not enough information at the call site of a function pointer or virtual call to determine whether the target is a managed entry point (MEP) or an unmanaged entry point (UEP).
The UEP is always chosen, for backward compatibility. But what if the call site making the call is actually managed? In that case, in addition to the initial thunk from the managed call site into the UEP, there is a second thunk from the UEP into the target managed method. This managed-to-unmanaged-to-managed thunking is commonly referred to as a double thunk.
64-bit NGen optimizes the thunk that implements the unmanaged-to-managed transition (that is, the thunk back into managed code). It performs a check to determine whether the caller is managed; if so, it skips both thunks and jumps directly to the managed code, as shown in Figure 2. This saves many instructions. On benchmarks modeled on real code, I have seen improvements of 5 to 10 percent (on contrived tests you can see improvements of more than 100 percent).
Figure 2 Double THUNK Elimination
It is important to note, however, that this optimization only takes effect when the code is running in the default application domain. A good rule of thumb to remember: code in the default AppDomain usually gets better performance.
C++ Interop
C++ Interop is the native/managed interop technology that allows standard C++ code compiled with the /clr switch to call native functions directly, without writing any additional code. The code generated when the /clr switch is used is MSIL (except in a few special cases), and the data can be either managed or unmanaged, at the user's discretion. I like to think of C++ Interop as the most important .NET feature that nobody knows about. It is a genuine breakthrough, but you have to see it in action to appreciate its power. In other .NET languages, to use native code you put the native code in a DLL and call its functions with explicit P/Invoke declarations using DllImport (or something similar, depending on the language you're using), or else you use the heavyweight COM Interop to access the native code. This is obviously inconvenient, and it often carries a much larger performance cost than C++ Interop.
C++ Interop is not generally thought of as a performance feature of the C++ language, but as you'll see, the flexibility and convenience it provides can let you get better performance out of the CLR.
Native and Managed Code in a Single Image
Visual C++ lets the programmer select, on a function-by-function basis, which functions are compiled as managed and which as native. This is accomplished with #pragma managed and #pragma unmanaged; Figure 3 shows an example. For many large tasks, compiling the core functions to native code and the rest to managed code can bring great benefits. Within a single image, C++ can mix managed code and native code, with native functions calling managed functions (and vice versa) without any special syntax. At this granularity, C++ makes it easy to control the transitions from managed code to native code and back.
When a transition from managed code to native code (or the reverse) occurs, it goes through a thunk generated by the compiler/linker. This thunk has a cost, and it is a cost programmers strive to avoid. A great deal of work has gone into the CLR and the compiler to minimize the cost of each transition, but developers can help by reducing how frequently the transitions occur.
Section A of Figure 4 shows a C++ application in which part of the code (z.cpp) is compiled to MSIL (/clr), while the other parts (x.cpp and y.cpp) are compiled to native code. In this program, certain functions in y.cpp and z.cpp call each other many times. This causes a large number of managed/native transitions, slowing down the program's execution.
Figure 4 Changing the Boundary
Section B of Figure 4 shows how to optimize the program to minimize managed/native transitions. The idea is to identify the chatty interfaces and move them to one side of the managed/native boundary, eliminating all of the transitions across those interfaces. With the interop facilities Visual C++ provides, this is easy to do.
For example, to go from section A to section B of Figure 4, you simply recompile y.cpp with the /clr switch. Now y.cpp is compiled as managed code, and calls to it from z.cpp incur no managed-to-native transition cost. Of course, you also need to weigh the performance implications of generating MSIL from y.cpp and make sure the tradeoff benefits the application.
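The payoff of moving the boundary can be sketched in standard C++. Here the boundary crossing is simulated with a counter (the function names and the counter are mine, purely for illustration): calling across the boundary once per element costs N "thunks", while moving the whole loop to the other side of the boundary costs one.

```cpp
#include <cassert>
#include <vector>

// Simulated count of managed/native transitions (illustrative only).
static int g_transitions = 0;

// Stand-in for a native function reached through a thunk.
int NativeAdd(int a, int b) {
    ++g_transitions;  // each call crosses the boundary once
    return a + b;
}

// Chatty interface: one boundary crossing per element.
int SumAcrossBoundary(const std::vector<int>& v) {
    int sum = 0;
    for (int x : v) sum = NativeAdd(sum, x);
    return sum;
}

// Boundary moved: the whole loop runs on the native side,
// so the caller pays for a single crossing.
int NativeSum(const std::vector<int>& v) {
    ++g_transitions;
    int sum = 0;
    for (int x : v) sum += x;
    return sum;
}
```

For a vector of N elements, SumAcrossBoundary pays N simulated transitions while NativeSum pays one, which is exactly the effect of recompiling y.cpp with /clr in the article's example.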
High-Performance Marshaling
Marshaling is one of the biggest costs in managed/native interop. In languages such as C# and Visual Basic .NET, marshaling happens implicitly when a P/Invoke call is made (using the default marshaler, or when you implement ICustomMarshaler). With C++ Interop, the programmer marshals data explicitly in the code. The advantage is that the programmer can marshal the data to native form once, then amortize that marshaling cost over many calls that reuse the marshaled data.
Figure 5 shows a code snippet compiled with the /clr switch. The code contains a for loop that calls a native function (GetChar) on each iteration. Figure 6 implements the same code in C#, which marshals the CSharpType class to NativeType each time GetChar is called. NativeType is shown here:
class NATIVECODE_API NativeType {
public:
    NativeType();
    int pos;
    int length;
    char* theString;
};
In C++, the user works with the native type explicitly, so no implicit marshaling is necessary. The cost of this kind of per-call marshaling is quite large; in this example, the C++ implementation is significantly faster than the C# implementation.
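The amortization idea can be sketched in standard C++ with a counter standing in for the marshaler (the names, the counter, and the naive wide-to-narrow conversion are all mine, for illustration only): converting once outside the loop replaces N conversions with one.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simulated count of marshaling operations (illustrative only).
static int g_marshal_count = 0;

// Stand-in for marshaling a managed string to a native buffer.
std::string MarshalToNative(const std::wstring& s) {
    ++g_marshal_count;
    return std::string(s.begin(), s.end());  // naive narrowing, demo only
}

// Stand-in for a native function that consumes the marshaled data.
char GetChar(const std::string& native, std::size_t pos) {
    return native[pos];
}

// Implicit-marshaling style: one conversion per call.
int CountA_Slow(const std::wstring& s) {
    int n = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        if (GetChar(MarshalToNative(s), i) == 'a') ++n;
    return n;
}

// Explicit C++ Interop style: marshal once, reuse the native copy.
int CountA_Fast(const std::wstring& s) {
    std::string native = MarshalToNative(s);
    int n = 0;
    for (std::size_t i = 0; i < native.size(); ++i)
        if (GetChar(native, i) == 'a') ++n;
    return n;
}
```

Both functions return the same answer; the fast version simply performs one marshaling operation instead of one per character, which is the optimization Figures 5 and 6 contrast.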
Templates and STL with .NET Types
Some of the more interesting new performance features in Visual C++ 2005 are templates over managed types (including STL/CLI), optimized managed code, delay loading, and deterministic finalization. Visual C++ .NET 2003 could generate MSIL from native types, but it could not use managed types as parameterized types in templates. Visual C++ 2005 corrects this: templates can now take either managed or unmanaged types as parameters. The full power of templates can now be brought to code written for .NET (you should also take a look at the work done in the Blitz++ and Boost libraries).
The C++ Standard Template Library (STL) was a major innovation in library design. It has proven that you can use containers and algorithms without sacrificing performance. In Visual C++ .NET 2003, the restriction on managed types in templates meant there was no STL that supported managed types. Along with lifting that restriction, Visual C++ 2005 introduces STL/CLI, a version of STL verified to handle managed types. The Base Class Library (BCL) introduced containers to .NET, and the Visual C++ group's plan is for STL/CLI to be competitive with them in performance. If you want to know more about STL/CLI, there is an excellent article in the Visual C++ Developer Center, Stan Lippman's STL.NET Primer.
With STL/CLI you get all the things you love about STL, including vectors, lists, deques, maps, sets, and hash maps and sets. You also get the algorithms, such as sorting, searching, and set operations. One amazing thing about the STL/CLI algorithms is that the same implementation is used for both the native and the STL/CLI versions. The good design of STL benefits every C++ programmer through portable, powerful code.
Deterministic Finalization Helps Performance
Much of what makes C++ so effective is the powerful patterns and idioms available to it. Many of these patterns and idioms, including Resource Acquisition Is Initialization (RAII), rely on a feature of the C++ language called deterministic destruction. The principle is that an object's destructor is called when the object is deleted with the delete operator (for heap-allocated objects) or goes out of scope (for stack-allocated objects). Deterministic destruction can help performance, because the longer an object holds a resource beyond the point where it truly needs it, the more performance suffers as other objects contend for the same resource.
With a CLR finalizer, the finalizer code for an object (assuming the cleanup code is in the finalizer) is run not by the thread that used the object, but by the finalizer thread, at some point after the object becomes unreachable. Obviously this is not ideal, because the finalizer may not execute when the programmer expects it to. In addition, the memory associated with the object is not reclaimed until the finalizer has run, which increases the program's memory footprint.
A common idiom in .NET-based code that helps avoid this problem is the Dispose pattern. To use it, developers implement a Dispose method on their class and call it when the object is no longer needed, at the point where a C++ programmer would call delete on the object. This works, but even in C++ it is error-prone and overly complicated. Languages such as C# add the "using" construct, which helps with both problems, but beyond the simple cases it too becomes complicated and error-prone.
The RAII idiom in C++, by contrast, acquires and releases resources automatically, and it is far less error-prone because the programmer doesn't have to write extra code to make it happen. Visual C++ .NET 2003 did not support deterministic destruction of stack-allocated .NET objects, but Visual C++ 2005 does.
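Here is a standard-C++ sketch of the RAII idiom the article describes (the Socket class and the open-socket counter are mine, for illustration): the destructor releases the resource deterministically when the object goes out of scope, even when an exception unwinds the stack, with no finalizer and no explicit Dispose call.

```cpp
#include <cassert>
#include <stdexcept>

// Simulated count of open sockets (illustrative only).
static int g_open_sockets = 0;

class Socket {
public:
    Socket()  { ++g_open_sockets; }   // acquire the resource
    ~Socket() { --g_open_sockets; }   // deterministic release
    Socket(const Socket&) = delete;
    Socket& operator=(const Socket&) = delete;
};

void use_socket(bool fail) {
    Socket s;  // acquired here
    if (fail) throw std::runtime_error("network error");
}              // released here, whether we return or throw
```

Normal return and exceptional exit both leave no socket open; this guarantee is exactly what is hard to reproduce with finalizers or a hand-written Dispose pattern.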
In the top half of Figure 7, notice that the object of type Socket_t uses stack-based syntax and gets stack-based cleanup. Given this, what happens if an exception is thrown on the third line? With stack-based semantics, the destructor for mainSock is guaranteed to run, but because backupSock was never created on the stack, there is no object to destruct.
Writing semantically equivalent code in C# is difficult and error-prone; see the bottom half of Figure 7. This example is small, of course, but as the complexity of the task grows, so does the likelihood of error.
Delay Loading
Although the .NET Framework has been fine-tuned for performance, loading the CLR still adds some delay at startup. In many applications there may be code paths and scenarios that involve no managed code at all, particularly when an existing legacy program has been augmented with .NET functionality. In these cases, your application shouldn't have to pay the associated startup delay. You can use an existing Visual C++ feature, DLL delay loading, to solve this problem. The idea is to load a DLL only when something in it is actually used, and the same idea applies to loading .NET assemblies. Using the linker option /DELAYLOAD:dll (specifying the .NET assemblies you want to delay), you can delay loading of the listed .NET assemblies, and even of the CLR itself (if all .NET assemblies are delay loaded). As a result, the application can start every bit as fast as a native one, eliminating one of the most commonly cited drawbacks of managed applications.
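As a sketch of the linker usage described above (the module name HelperLib.dll is hypothetical; delayimp.lib supplies the delay-load helper the linker expects for native DLLs), the command line looks roughly like this:

```
rem HelperLib.dll is a hypothetical delay-loaded module.
cl /clr app.cpp /link /DELAYLOAD:HelperLib.dll delayimp.lib
```

If every assembly the image references is listed this way, the CLR itself is not loaded until the first managed code path actually executes.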
Why dllexport Can't Always Be Applied
Using __declspec(dllexport) has its own pitfalls. When you have two images (DLLs or EXEs) that are both managed, but functionality is exposed through dllexport rather than #using, the dllexport problem surfaces. Because a dllexport entry point is native, every call through __declspec(dllexport) first triggers a managed-to-native transition, and then a native-to-managed transition. Getting good performance this way is difficult.
The options for solving this performance problem are limited. There is no simple switch that makes __declspec(dllexport) produce a construct with no associated thunks for managed code. The recommended fix is to wrap the exported functionality in a managed type (a ref or value class/struct) and have the importing program use #using on the DLL to access that type, so that the functionality in the DLL is accessed directly. With this change, no transition is needed when the call comes from a managed client. This is illustrated in Figure 8, where section A shows the costs associated with using __declspec(dllexport), and section B shows the optimization of using #using with the functionality wrapped in a .NET type. One potential problem with this approach is that native importers of the DLL can no longer use __declspec(dllimport) on the DLL's functionality. This should be considered before making the change.
Figure 8 Reducing Thunk Costs
Section A of Figure 8 shows the transition path of a managed function calling managed code exported with __declspec(dllexport). In section B, the function is wrapped in a managed type and accessed with #using. Compared with the path in section A, the result avoids the costly thunks.
Summary
Visual C++ has come a long way since Visual Studio .NET 2002 introduced it to the .NET Framework. C++ gives programmers great flexibility to write high-performance managed code, and to do so in a way that feels natural to C++ programmers. There are many languages available for .NET programming, but if you want the greatest performance, Visual C++ is the obvious choice.
Kang Su Gatlin is a program manager on the Microsoft Visual C++ team, where he spends most of his day trying to find systematic ways to make programs run faster. Before his life at Microsoft, he worked in high-performance and grid computing.