Undocumented features of the .NET Framework: the String class

zhaozj 2021-02-16


Translator's note: in many places the original article only states its conclusions briefly, which makes them hard to follow. The translator has therefore added supporting arguments for these points. All of them are based on the Rotor package, and each one indicates where the relevant source is located within that package. If anything is wrong, corrections are welcome.

This article analyzes the String class of the .NET Framework. It gives details of how String is implemented and describes the efficiency of the various ways of using it. Much of the information here cannot be found in MSDN or in other books, because Microsoft does not document it.

Background knowledge

String is a primitive type in .NET. The CLR and the JIT compiler give special treatment to a number of classes, and String is one of them. The others include the other primitive types, StringBuilder, Array, Type, Enum, Delegate, and some reflection classes such as MethodInfo.

In .NET 1.0, every object allocated on the heap carries two pieces of information: an object header (4 bytes) and a pointer to the method table (4 bytes). The object header provides 5 bit flags, one of which is reserved for the GC and marks whether the object is reachable ("reachable object" is a term from garbage collection algorithms; put simply, it is an object that the application is still using). The remaining 27 bits are used as an index, called the syncindex, into a table. This index has several uses. First, it is used for thread synchronization when the "lock" keyword is used. Second, it serves as the default hash code when Object.GetHashCode() is called (that is, when a derived class does not override Object.GetHashCode()). Although this index does not give the best distribution for a hash code, it does satisfy the other requirement of a hash code: objects with the same value return the same hash code. The syncindex stays unchanged for the entire lifetime of the object.

The actual memory footprint of any object can be computed from its header with the following formula (from the Rotor package, sscli20020326/sscli/clr/src/vm/object.h):

    MT->GetBaseSize() + ((OBJECTTYPEREF->GetSizeField() * MT->GetComponentSize())

For most objects the size is fixed. The String class and the Array class (including classes derived from Array) are the only two variable-length objects, which means that the length of the object can change after it has been created.

String is somewhat similar to an OLE BSTR: a Unicode character array that starts with length data and ends with a null character. String maintains the following three fields internally:

    [NonSerialized] private int  m_arrayLength;
    [NonSerialized] private int  m_stringLength;
    [NonSerialized] private char m_firstChar;

Their meanings are shown in the following table:

Table 1. Fields maintained internally by String

m_arrayLength: The actual length (in characters) allocated to the string. When a string is created, m_arrayLength is normally the same as the logical length of the string (m_stringLength). However, if the string is returned from a StringBuilder, the actual length may be larger than the logical length.

m_stringLength: The logical length of the string, available through the String.Length property. To optimize performance, some bits of m_stringLength are used as flags, so the maximum length of a String is much smaller than UInt32.MaxValue (on a 32-bit operating system). Some of these flags indicate whether the String contains only simple characters (such as plain ASCII), in which case the complex Unicode algorithms are not needed for sorting and comparison.

m_firstChar: The first character of the string. For an empty string this is the null character.

A String always ends with a null character, which improves its interoperability with unmanaged code and the traditional Win32 API.

The total memory used by a String is: 16 bytes + 2 bytes * number of characters + 2 bytes (the terminating null character). As described in Table 1, if a StringBuilder is used to create the string, the memory actually allocated may be larger than String.Length suggests.
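As a quick check of the formula above, the following sketch evaluates it for a few lengths. The helper name EstimateStringBytes is hypothetical, and the figures only apply to the .NET 1.0 (32-bit) layout described in this article; other runtime versions lay strings out differently.

```csharp
using System;

public static class StringSizeEstimate
{
    // Hypothetical helper that applies the article's formula for .NET 1.0 (32-bit):
    // 16 bytes of fixed overhead + 2 bytes per character + 2 bytes for the
    // terminating null character. charCount is the number of characters the
    // string holds (or has room for, if it came from a StringBuilder).
    public static long EstimateStringBytes(int charCount)
    {
        if (charCount < 0) throw new ArgumentOutOfRangeException(nameof(charCount));
        return 16L + 2L * charCount + 2L;
    }

    public static void Main()
    {
        Console.WriteLine(EstimateStringBytes(0));   // 18
        Console.WriteLine(EstimateStringBytes(5));   // 28
        Console.WriteLine(EstimateStringBytes(16));  // 50
    }
}
```

Even an empty string therefore costs 18 bytes under this layout, which is why many tiny strings are relatively expensive.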

The highly efficient StringBuilder

StringBuilder is closely related to String. Although StringBuilder lives in the System.Text namespace, it is not an ordinary class: the runtime and the JIT compiler treat it specially. Writing a class of your own that is as efficient as StringBuilder is not easy.

StringBuilder maintains a string variable internally, m_StringValue, and is allowed to modify it directly. By default, the m_arrayLength field of m_StringValue is 16. StringBuilder maintains the following two internal fields:

    internal int    m_MaxCapacity = 0;
    internal String m_StringValue = null;

Their meanings are shown in Table 2:

Table 2. Fields maintained internally by StringBuilder

m_MaxCapacity: The maximum capacity of the StringBuilder, that is, the maximum number of characters that can be placed in m_StringValue. The default value is Int32.MaxValue. You can specify a smaller capacity; once specified, m_MaxCapacity cannot be changed.

m_StringValue: The string maintained by the StringBuilder. (In "Applied Microsoft .NET Framework Programming", Jeffrey Richter says that StringBuilder maintains a character array. The translator agrees that it is easier to think of it as a character array, but judging from the source code in the Rotor package, what is actually maintained is a string.)
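The two values in Table 2 are visible through the public Capacity and MaxCapacity properties. The sketch below prints how the capacity changes as characters are appended and shows that a small maximum, once set, is enforced. The class and method names are made up for illustration, and the exact growth steps depend on the runtime version, so no particular output is claimed for them.

```csharp
using System;
using System.Text;

public static class StringBuilderCapacityDemo
{
    // Fills a builder up to its maximum capacity and reports whether one
    // more character is rejected with ArgumentOutOfRangeException.
    public static bool AppendBeyondMaxThrows(int maxCapacity)
    {
        var sb = new StringBuilder(maxCapacity, maxCapacity);
        sb.Append('x', maxCapacity);          // exactly full
        try
        {
            sb.Append('y');                   // one character too many
            return false;
        }
        catch (ArgumentOutOfRangeException)
        {
            return true;
        }
    }

    public static void Main()
    {
        var sb = new StringBuilder();
        Console.WriteLine("Initial capacity: " + sb.Capacity);
        Console.WriteLine("MaxCapacity: " + sb.MaxCapacity);

        // Report every change in capacity while appending 100 characters.
        int lastCapacity = sb.Capacity;
        for (int i = 0; i < 100; i++)
        {
            sb.Append('x');
            if (sb.Capacity != lastCapacity)
            {
                Console.WriteLine("Length " + sb.Length + ": capacity "
                                  + lastCapacity + " -> " + sb.Capacity);
                lastCapacity = sb.Capacity;
            }
        }

        Console.WriteLine("Limit enforced: " + AppendBeyondMaxThrows(8));
    }
}
```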

Let's look at the code of one of StringBuilder's Append methods (excerpted from the Rotor package, sscli20020326/sscli/clr/src/bcl/system/text/stringbuilder.cs):

    public StringBuilder Append(String value) {
        // If the value being added is null, eat the null and return.
        if (value == null) {
            return this;
        }
        int tid;
        // hand inlining of GetThreadSafeString
        String currentString = m_StringValue;
        tid = InternalGetCurrentThread();
        if (m_currentThread != tid)
            currentString = String.GetStringForStringBuilder(currentString, currentString.Capacity);

        int currentLength = currentString.Length;
        int requiredLength = currentLength + value.Length;
        if (NeedsAllocation(currentString, requiredLength)) {
            String newString = GetNewString(currentString, requiredLength);
            newString.AppendInPlace(value, currentLength);
            ReplaceString(tid, newString);
        } else {
            currentString.AppendInPlace(value, currentLength);
            ReplaceString(tid, currentString);
        }
        return this;
    }

    private bool NeedsAllocation(String currentString, int requiredLength) {
        // <= accounts for the terminating 0 which we require on strings
        return (currentString.ArrayLength <= requiredLength);
    }

    private String GetNewString(String currentString, int requiredLength) {
        int newCapacity;
        requiredLength++; // Include the terminating null
        if (requiredLength > m_MaxCapacity) {
            throw new ArgumentOutOfRangeException(
                Environment.GetResourceString("ArgumentOutOfRange_NegativeCapacity"),
                "requiredLength");
        }
        newCapacity = (currentString.Capacity) * 2; // To force a predictable growth of 160, 320 etc. for testing purposes
        if (newCapacity [...] m_MaxCapacity) {
            newCapacity = m_MaxCapacity;
        }
        [...] {
            throw new ArgumentOutOfRangeException(
                Environment.GetResourceString("ArgumentOutOfRange_NegativeCapacity"));
        }
        return String.GetStringForStringBuilder(currentString, newCapacity);
    }

From this code we can see that when new characters are appended and the resulting length exceeds the capacity of m_StringValue (m_arrayLength), a new string is created. The capacity of the new string is normally twice that of the original (as long as this does not exceed m_MaxCapacity).

Now let's look at the source code of StringBuilder.ToString() (from the Rotor package, sscli20020326/sscli/clr/src/bcl/system/text/stringbuilder.cs):

    public override String ToString() {
        String currentString = m_StringValue;
        int currentThread = m_currentThread;
        if (currentThread != 0 && currentThread != InternalGetCurrentThread()) {
            return String.InternalCopy(currentString);
        }

        if ((2 * currentString.Length) < currentString.ArrayLength) {
            return String.InternalCopy(currentString);
        }

        currentString.ClearPostNullChar();
        m_currentThread = 0;
        return currentString;
    }

    // Used by StringBuilder to avoid data corruption
    internal static String InternalCopy(String str) {
        int length = str.Length;
        String result = FastAllocateString(length);
        FillStringEx(result, 0, str, length); // The underlying string may have changed
        return result;
    }

When a string is obtained through StringBuilder.ToString(), the actual string is returned: the method returns a reference to the string field that the StringBuilder maintains internally (m_StringValue) instead of creating a new string. If the capacity of the StringBuilder is more than twice the actual number of characters, StringBuilder.ToString() returns a compact copy of the string instead. Modifying the StringBuilder again after calling ToString() triggers a copy, which creates a new string; all later modifications are applied to the new string, so the string that has already been returned does not change.

Besides the memory used by the string itself, a StringBuilder costs an additional 16 bytes. The same StringBuilder object can be used to create many strings, however, so this overhead is paid only once.

We can see that using StringBuilder is very efficient.

Other performance tips

1> When strings are concatenated with "+", as in a + b + c, the compiler calls the Concat(a, b, c) method, which avoids a large number of extra string copies.

2> Use StringBuilder rather than repeated string concatenation.

To minimize the work of the garbage collector, creating strings with StringBuilder can significantly reduce the number of memory allocations. It should be noted that many tests indicate that all garbage collection together takes only a fraction of a second, which is almost undetectable. It is therefore not advisable to go out of your way to avoid garbage collection without first profiling the program, although frequent collections can still hurt performance.
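The two tips above (a single "+" expression compiled to one Concat call, and StringBuilder for repeated appends) can be sketched as follows. The class and method names are hypothetical, and the comments describe what the C# compiler normally does rather than a guarantee for every compiler version.

```csharp
using System;
using System.Text;

public static class ConcatDemo
{
    // One expression with several "+" operands: the C# compiler normally emits
    // a single String.Concat(a, b, c) call, so no intermediate string is built.
    public static string JoinThree(string a, string b, string c)
    {
        return a + b + c;
    }

    // Concatenating inside a loop creates a new string on every iteration.
    public static string RepeatWithConcat(string part, int count)
    {
        string result = "";
        for (int i = 0; i < count; i++)
        {
            result += part;
        }
        return result;
    }

    // StringBuilder appends into its own buffer and produces the result once.
    public static string RepeatWithBuilder(string part, int count)
    {
        if (count < 0) throw new ArgumentOutOfRangeException(nameof(count));
        var sb = new StringBuilder(part.Length * count);
        for (int i = 0; i < count; i++)
        {
            sb.Append(part);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(JoinThree("a", "b", "c"));     // abc
        Console.WriteLine(RepeatWithConcat("ab", 3));    // ababab
        Console.WriteLine(RepeatWithBuilder("ab", 3));   // ababab
    }
}
```

Both Repeat methods return the same text; the difference lies only in how many temporary strings are left behind for the garbage collector.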
Running .NET programs sometimes pause intermittently, and it is hard to tell whether the cause is the JIT compiler, the garbage collector, or something else. Older programs (such as the Windows shell, Word, and IE) show similar pauses. .NET reclaims memory using three generations. This design rests on a hypothesis: the more recently memory was allocated, the more frequently it needs to be collected; the older the memory, the less frequently.
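The generations can be observed from managed code with the System.GC class. The following sketch allocates a large number of short-lived strings and then prints how many collections each generation has seen. The class and method names are made up for illustration, and the counts vary from run to run and between runtimes, so no specific output is shown.

```csharp
using System;

public static class GenerationDemo
{
    // Returns the number of collections performed so far, one entry per generation.
    public static int[] CountsPerGeneration()
    {
        int[] counts = new int[GC.MaxGeneration + 1];
        for (int gen = 0; gen <= GC.MaxGeneration; gen++)
        {
            counts[gen] = GC.CollectionCount(gen);
        }
        return counts;
    }

    public static void Main()
    {
        object fresh = new object();
        Console.WriteLine("Highest generation: " + GC.MaxGeneration);
        Console.WriteLine("Generation of a new object: " + GC.GetGeneration(fresh));

        // Create many short-lived strings to give the collector some work.
        long totalLength = 0;
        for (int i = 0; i < 1000000; i++)
        {
            string s = i.ToString();
            totalLength += s.Length;
        }
        Console.WriteLine("Characters produced: " + totalLength);

        int[] counts = CountsPerGeneration();
        for (int gen = 0; gen < counts.Length; gen++)
        {
            Console.WriteLine("Generation " + gen + " collections: " + counts[gen]);
        }
        GC.KeepAlive(fresh);
    }
}
```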

Generation 0 is the youngest generation. After a generation 0 collection, the survivors are moved into generation 1; likewise, after a generation 1 collection, the survivors are moved into generation 2. Garbage collection usually happens only in generation 0. According to Microsoft, a generation 0 collection takes about as long as a page fault, 0 to 10 milliseconds; a generation 1 collection takes 10 to 30 milliseconds; and the time for a generation 2 collection depends on the working set. In addition, my own analysis shows that generation 0 collections occur 10 to 100 times more often than generation 1 and generation 2 collections.

In "Applied Microsoft .NET Framework Programming", Jeffrey Richter writes that when the CLR initializes it selects a different threshold for each of the three generations: 256 KB for generation 0, 2 MB for generation 1, and 10 MB for generation 2. My own study of the CLI in the Rotor package gives a different result, however: 800 KB for generation 0 and 1 MB for generation 1 (the code below is from the Rotor package, sscli20020326/sscli/clr/src/vm/gcsmp.cpp). Of course, these values are undocumented and may change without notice. The initial values are adjusted automatically according to the actual allocation behavior of the program: if little of the generation 0 memory is reclaimed (much of it is moved to generation 1), the generation 0 threshold is increased.

    void gc_heap::init_dynamic_data()
    {
        dynamic_data* dd = dynamic_data_of(0);
        dd->current_size = 0;
        dd->promoted_size = 0;
        dd->collection_count = 0;
        dd->desired_allocation = 800 * 1024;
        dd->new_allocation = dd->desired_allocation;

        dd = dynamic_data_of(1);
        dd->current_size = 0;
        dd->promoted_size = 0;
        dd->collection_count = 0;
        dd->desired_allocation = 1024 * 1024;
        dd->new_allocation = dd->desired_allocation;

        // dynamic data for large objects
        dd = dynamic_data_of(max_generation + 1);
        dd->current_size = 0;
        dd->promoted_size = 0;
        dd->collection_count = 0;
        dd->desired_allocation = 1024 * 1024;
        dd->new_allocation = dd->desired_allocation;
    }

Inefficient String class methods

Some methods of the String class perform unnecessary memory allocations, which increases the frequency of garbage collection. For example, the ToUpper() and ToLower() methods produce a new string regardless of whether the string actually changes. A more efficient implementation would return the original string when nothing needs to change. Similarly, the Substring() method also returns a new string, even when the result is the entire string or an empty string.

The .NET class library contains a large number of hidden memory allocations, and avoiding them is difficult. For example, formatting a numeric type (such as int or float) as a string (through String.Format or Console.WriteLine) creates a new hidden string. In that case we can write our own formatting code and keep it from generating new strings. Writing such code is certainly possible, but it is rather difficult.

Other parts of the .NET class library show similar inefficiencies. In the Windows Forms library, the Text property of Control almost always returns a new string. This is understandable, because the property value may not be stored, so the Windows API function GetWindowText() has to be called to obtain the value from the control. GDI+ is the worst abuser of the garbage collector, because every call to MeasureText or DrawText creates a new string.

One of the better String methods is GetHashCode(). It takes every character of the string into account when producing the integer value, and the resulting values are very well distributed over the range of int.
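The idea mentioned above for ToUpper(), returning the original string when nothing would change, can be approximated in user code. The following is only a sketch: ToUpperIfNeeded is a hypothetical helper, not part of the .NET class library, it uses the invariant culture, and its per-character check does not consider characters outside the Basic Multilingual Plane.

```csharp
using System;

public static class CaseHelper
{
    // Hypothetical helper: returns the very same instance when no character
    // would change, and only creates a new string when at least one does.
    public static string ToUpperIfNeeded(string s)
    {
        if (s == null) throw new ArgumentNullException(nameof(s));
        for (int i = 0; i < s.Length; i++)
        {
            if (char.ToUpperInvariant(s[i]) != s[i])
            {
                // At least one character differs, so a new string is required.
                return s.ToUpperInvariant();
            }
        }
        return s; // Nothing to change: hand back the original instance.
    }

    public static void Main()
    {
        string alreadyUpper = "HELLO";
        string mixed = "Hello";
        Console.WriteLine(object.ReferenceEquals(ToUpperIfNeeded(alreadyUpper), alreadyUpper)); // True
        Console.WriteLine(ToUpperIfNeeded(mixed));                                              // HELLO
    }
}
```

The scan costs one pass over the string, which is usually cheaper than an allocation that the garbage collector later has to reclaim.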
To reduce the pressure on the garbage collector, a fixed-size string structure such as the following can be used:

    [StructLayout(LayoutKind.Explicit, Size = 514)]
    public unsafe struct Str255
    {
        [FieldOffset(0)]   char start;
        [FieldOffset(512)] byte length;

        #region Properties
        public int Length
        {
            get { return length; }
            set
            {
                length = Convert.ToByte(value);
                // Keep the string null-terminated
                fixed (char* p = &start) p[length] = '\0';
            }
        }

        public char this[int index]
        {
            get
            {
                if (index < 0 || index >= length)
                    throw new IndexOutOfRangeException();
                fixed (char* p = &start) return p[index];
            }
            set
            {
                if (index < 0 || index >= length)
                    throw new IndexOutOfRangeException();
                fixed (char* p = &start) p[index] = value;
            }
        }
        #endregion
    }

Str255 is a value type allocated on the stack, so operating on it puts no pressure on the GC. It can hold 255 characters and includes a length byte. A reference to this structure can be passed directly to a Windows API call, because the structure begins with the first character of a C string. You can of course write additional methods for editing the string, but you must make sure that the string always ends with a null character (for compatibility with Windows). To let other CLR methods use it, you need to write a conversion routine that turns the structure into a .NET string (like the StringBuilder.ToString() method). When it is used to create .NET strings, it has an advantage over StringBuilder: Str255 requires fewer memory allocations.

3) Modifying a string "in place"

    public static unsafe void ToUpper(string str)
    {
        fixed (char* pfixed = str)
            for (char* p = pfixed; *p != 0; p++)
                *p = char.ToUpper(*p);
    }

The example above shows how to modify an immutable string by using an unsafe pointer. This approach is very efficient. Compare it with str.ToUpper(): str.ToUpper() returns a new string whether or not the contents actually change, whereas the code above simply modifies the original string. The fixed keyword pins the string on the heap so that it is not moved during garbage collection, which allows the address of the string to be converted into a pointer to its first character.

Since a string can in fact be changed, why is it stressed that .NET strings are immutable?
There are two reasons. First, immutable strings have no thread synchronization problems. Second, immutable strings allow multiple references to point to the same string object, which reduces the number of strings in the system and therefore the memory overhead.

By modifying the internal field m_arrayLength of a string object, the length of the string can also be changed. This is very dangerous, however: future implementations of the CLR, or implementations on non-Windows systems, may silently change how the integrity of a string is maintained. If you do modify a string this way, note that the high-order bits of m_stringLength contain some flags that must not be changed. Some of these flags relate to the content of the string, for example whether the string contains only 7-bit ASCII characters.

Range check elimination

Indexing into an array or a string normally requires a range check.

According to Microsoft, the compiler performs some special optimizations to improve the performance of iterating over an array or a string. Let's compare the following three ways of iterating over a string and see which one is faster.

1)
    int hash = 0;
    for (int i = 0; i

Let's take a look at the following code:

    using System;

    public class abc
    {
        static void Main(string[] args)
        {
            if (args.Length > 0)
            {
                switch (args[0])
                {
                    case "a":
                        Console.WriteLine("a");
                        break;
                    default:
                        break;
                }
            }
        }
    }

This is the IL generated for it:

    .method private hidebysig static void Main(string[] args) cil managed
    {
      .entrypoint
      // Code size 52 (0x34)
      .maxstack 2
      .locals init (string V_0)
      IL_0000: ldarg.0
      IL_0001: ldlen
      IL_0002: conv.i4
      IL_0003: ldc.i4.0
      IL_0004: ble.s IL_0033
      IL_0006: ldstr "a"
      IL_000b: leave.s IL_000d
      IL_000d: ldarg.0
      IL_000e: ldc.i4.0
      IL_000f: ldelem.ref
      IL_0010: dup
      IL_0011: stloc.0
      IL_0012: brfalse.s IL_0031
      IL_0014: ldloc.0
      IL_0015: call string [mscorlib]System.String::IsInterned(string)
      IL_001a: stloc.0
      IL_001b: ldloc.0
      IL_001c: ldstr "a"
      IL_0021: beq.s IL_0025
      IL_0023: br.s IL_0031
      IL_0025: ldstr "a"
      IL_002a: call void [mscorlib]System.Console::WriteLine(string)
      IL_002f: br.s IL_0033
      IL_0031: br.s IL_0033
      IL_0033: ret
    } // end of method abc::Main

The CLR automatically maintains a hash table called the "intern pool". It contains a single instance of each unique string constant declared in the program, as well as any unique String instances added programmatically.

Now look at the IL code above. The IL first calls the IsInterned method, passing it the string args[0] named in the switch statement. If IsInterned returns null, args[0] cannot match any of the case strings, and execution jumps to the default code. If IsInterned finds args[0] in the intern pool, it returns a reference to the string object in the hash table, and that reference is then compared with the address of the string given in each case statement. Comparing addresses is much faster than comparing all the characters of the strings, so the code can quickly decide which case statement to execute.
What is described above applies when the number of cases is small. If the number of cases is large, the compiler generates a hash table. The load factor of this hash table is 0.5 and its initial capacity is twice the number of cases. (The actual fill ratio of the hash table is around 1/3, because the hash table keeps the ratio of elements to buckets at 0.72. The value 0.72 is the best balance between speed and memory, a figure obtained from Microsoft's performance tests.) The strings of the case statements are added to this hash table; the remaining comparison steps are the same as before.
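The reference comparison that the intern pool makes possible can be tried out with String.Intern and String.IsInterned. This is a minimal sketch with hypothetical class and method names; it shows that two separately built strings with the same characters map to a single pooled instance.

```csharp
using System;
using System.Text;

public static class InternDemo
{
    // Builds the string at run time so that the result is a separate object,
    // not the instance that belongs to a literal in the source code.
    public static string BuildAtRuntime(string text)
    {
        return new StringBuilder().Append(text).ToString();
    }

    public static void Main()
    {
        string first = BuildAtRuntime("case-label");
        string second = BuildAtRuntime("case-label");

        Console.WriteLine(first == second);                       // True: same characters
        Console.WriteLine(object.ReferenceEquals(first, second)); // False: two objects

        // Intern returns the pooled instance, adding the string if necessary.
        string pooledFirst = string.Intern(first);
        string pooledSecond = string.Intern(second);
        Console.WriteLine(object.ReferenceEquals(pooledFirst, pooledSecond)); // True

        // IsInterned only looks the string up; it returns null when the
        // string is not in the pool.
        string unknown = BuildAtRuntime("never-interned-" + Guid.NewGuid());
        Console.WriteLine(string.IsInterned(unknown) == null);
    }
}
```

Once both operands are pooled instances, a single address comparison is enough to decide equality, which is the effect the generated switch code relies on.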

When reposting, please cite the original address: https://www.9cbs.com/read-26523.html
