Author: Scott Mitchell
Time: November 2003
Summary: This article is the beginning of the important data structure and its application's series of articles. This series of articles have a total of six parts. We will study the data structure in .NET Framework and the data structure we created by our own. The first part we mainly pay attention to the data structure, how to effectively analyze the data structure and why this analysis is important. In this article, we will study two data structures commonly used in .NET Framework: Array and ArrayList.
Preface:
Welcome to the first part of the series of articles: Use the data structure in .NET. Although in this series of articles, we will study different data structures, and some include in the .NET Framework underlying library, there are some data structures we create. If you are not familiar with the "data structure" terminology, I will introduce it. The data structure is an abstract structure or class for tissue data and provides different operations on the data. The most commonly used, perhaps the most famous data structure is an array, which contains a continuous data set accessed through an index.
Before entering the content of the article, let me briefly introduce a total of six articles series, and have a whole understanding of it. If you think there is any topic is wrong, please send me: mitchell@4guyfromrolla.com, share your thoughts. As long as there is time, I will be very happy to increase your suggestion to the appropriate part of the article, if necessary, I can add the seventh part in this series.
In the first part of this series, we will understand why the data structure is important and its impact on the performance of algorithm. In order to determine the influence of the data structure, we need to carefully analyze the different operations completed by a data structure. Finally, we will pay attention to the two data structures in .NET Framework: Array and ArrayList. Maybe you have used these two data structures. In this article, we will study what operations provide and the efficiency of these operations.
In the second part, we will deepen the ArrayList class and the Queue and Stack class similar to it. Like ArrayList, Queue and Stack class store consecutive data sets and exist in the .NET Framework underlying library. However, unlike ArrayList to access any data items at will, Queue and Stack only allow data to be accessed in a predetermined continuous order. We will study some applications using queues and stacks, and implement these two classes by extending the ArraList class. After studying Queues and Stacks, we will learn about Hashtables, which allows direct access as ArrayList, but indexes through the string keyword.
ArrayLists is suitable for direct access to storage data, but it is not suitable for storing data that needs to be searched. In the third part, we will study the Binary Search Tree data structure, which provides a more efficient search method than ArrayList. .NET Framework does not provide a built-in Binary Search Tree data structure, we have to create itself.
Search for the efficiency of a Binary Search Tree and the order in which the data is inserted into the tree. If the data is inserted in order or in the order of neither sort, the Binary Search Tree will lose all the advantages compared to ArrayList. In order to solve this problem, in the fourth part, we will study interesting random data structures - Skiplist. Skiplist can search efficiently a binary search tree without being affected by the data insertion order.
In the fifth part, we will pay attention to the data structure of the performance graph. One figure is a collection of nodes and has a side line connecting different nodes. For example, a map can be represented by a diagram, and the city node represents the edge line between the highway nodes. Many real-world issues can be defined abstract, so the figure is a commonly used data structure. Finally, in the sixth part we will discuss the data structure indicating a collection and non-intersecting set. A collection is the sequential aggregation of the data. A collection of non-intersecting is a collection of no public elements between any two collections. These two sets are often used frequently, and we will introduce them in the last part.
Analyze the performance of data structure
When considering a specific application or programming problem, many developers (including me) have fun to enhance the user experience by writing an algorithm to solve the problem, or increasing the characteristics of Cool for the application. You rarely hear that someone is excited because of the data structure they use. However, for a specific algorithm, different data structures have a large impact on performance. A usual example is to find an element in a data structure. For arrays, this time you need to look forward to the number of lookup elements. For Binary Search Trees or Skiplists, the time spent is sublineable when searching for a large amount of data. The selection of the data structure will affect the performance of the application performance.
Since the data structure used by the algorithm has a great impact on the performance of the algorithm, it is critical to have a strict method to compare the efficiency of different data structures. For developers using data structures, main interest is how the performance of the data structure changes when the number of data stored increases. That is to say, how much affects the operation of the data structure in each new element in the data structure?
Consider such a situation, you have a program that returns a list of files in a specific directory stored in a string array with the System.io.Directory.GetFiles (Path) method. Now, if you want to discover the XML file existing in the file list by searching this array (ie, the extension is .xml file). One way is to check the entire array and set some flags when you encounter XML files. The program code is as follows:
Using system;
Using system.collections;
Using system.io;
Public Class Myclass
{
Public static void
Main
()
{
String [] fs = Directory.getFiles (@ "c: / inetpub / wwwroot);
Bool Foundxml = false;
INT i = 0;
For (i = 0; i IF (String.Compare (path.getextension (fs [i]), ".xml", true) == 0) { Foundxml = true; Break; } IF (FoundXML) Console.writeLine ("XML File Found -" FS [i]); Else Console.writeline ("NO XML Files Found."); } } Let's take a look at the worst case. When there is no XML file or XML file in the file list, we have to search for each element of the entire array. In order to analyze the efficiency of the array in a certain order, we must ask yourself "assume that there is an array of n elements. If another element is added, this array has N 1 element, how much is the new run time after adding elements "(The term" runtime "is not an absolute time of the metrics run, but the number of steps that represent the program to complete the specified task. When using an array, the number of steps is to complete how many array access.) In arrays Search for a value, we need to access each array value, so if we have n 1 array element, we must check N 1 element. That is, the time to search for an array is proportional to the number of elements in the array. This analysis method is referred to as an asymptotic analysis, even when the number of data in the data structure is close to infinity, how is the efficiency of the data structure varies. The symbol representation commonly used in the progressive analysis is BIG-OH. Use O (n) in BIG-OH to search for the performance of an array. The uppercase letter O is a dedicated symbol in the BIG-OH symbol. n Number of steps of operation of the search array, which linearly increases as the array increases. Calculate the gradual time of step segment, more systematic methods, follow the steps below: 1. Determine the steps to form an algorithm run time. As previously described, the steps considered should be to read and write access to the array. Other data structures may vary. You cannot simply think that the operation steps contained in the data structure are the atomic operations completed by the computer. That is, for the above code segment, when I analyze its runtime, only the number of operations of the access array is calculated, and the time used to create and initialize the variables and the time used by comparison characters. 2. Find the code you use to calculate the runtime. At those of the code lines, one 1. 3. For those code lines that target 1, see if they are in a loop. If so, 1 will be changed to the highest count of the loop. If there are two or more nesting loops, the count of each loop multiplied. 4. Find the maximum value you write, this is the runtime. Let us use these steps in the code segments above. We have marked the steps we are interested in, which is the array access operation, the first step is completed. Go to the second step, pay attention to the two lines of code of the array FS: in the string.compare () and console.writeline () methods are used as parameters, so a 1 is marked at each line. Now, enter Step 3, in the loop, String.comPare () accesses the number of fs is up to N times (n is the size of the array). Therefore, 1 and change to N is wiped off. Finally, we have found that the maximum value is n, so the runtime is represented by O (N). O (n) or the linear time, exhibits one of countless asymptotic runs. Other contained O (LOG2 N), O (N Log2 N), O (N2), O (2N), and the like. When the n value is increased, the smaller the value of the expression in parentheses, the better the performance of the data structure. For example, the efficiency of operations running on O (log N) is higher than operating on O (N) because log n Note: Here, you need to review mathematical knowledge. Another representation of Loga B = Y is AY = B. Therefore, log2 4 = 2, because 22 = 4. Similarly, log2 8 = 3, because 23 = 8. Obviously, the growth rate of LOG2 N is much slower than a separate N, because when n = 8, log2 n = 3. In the third part we will study Binary Search Trees, its search operation only costs O (log2 n) running time. Through this series of articles, we will calculate the progressive runtime of each new data structure and compare the running time of similar operations in different data structures. Everyone is familiar, linear, direct access, data structure of the same type of data - arrays Array is the simplest and widely used data structure in a computer program. In any programming language, arrays have some common features: The data content in the array is stored in a continuous memory space. All elements in the L array are the same type; the array is called the same data structure. L array elements can be accessed directly. (Many other data structures are not like this. For example: in the fourth part of this series, we will study the Skiplist data structure. In order to access each element of Skiplist, you must search for the elements you are looking for through other elements. For For arrays, whenever you want to access an element in an array, you just use this code: ArrayName [i]. Commonly used on arrays: l Assign space l Access element l change the size In C #, when you first declare an array, it is assigned with a null value. For example, the following code simply creates a variable Booleanarrary, which is equal to NULL: Bool [] BooleanArray; We must assign a certain number of elements before we use the array. Implement the following syntax: BooleanArray = new bool [10]; Or is more common to represent: ArrayName = New ArrayType [allocationsize]; This will assign a sufficiently large continuous memory block in the CLR-Managed pile to save the ALLOCATIONSIZE number of ARRAYTYPES. If ArrayType is a value type, the Unboxed ArrayType value of allocationsize is created. If ArrayType is a reference type, the ALLOCATIONSIZE number of ARRAYTYPE references are created. (If you are not familiar with the reference type and value type, and the difference between Managed Heap and Stack, you refer to understanding .Net's Common Type System) To help understand how .NET Framework stores internal data of arrays, consider the following example: Bool [] BooleanArray; Fileinfo [] FILES; BooleanArray = new bool [10]; FILES = New fileinfo [10]; Here, BooleanArray is an array of value type System.Boolean, and the FIES arrays are reference types.. Figure 1 shows the allocation of the CLR-Managed heap after this four lines of code. Figure 1 Continuous distribution in the managed heap in the array content in memory It is worth noting that 10 elements in the Files array are references to the FileInfo instance. Figure 2 shows the distribution of memory if we assign the FileInfo instance to some of the elements in the files array: Figure 2 Continuous distribution in the managed heap in array content in memory All arrays in .NET allow read and write. The syntax for accessing array elements is: // read an Array ELEMENT Bool B = BooleanArray [7]; // Write to an Array Element BooleanArray [0] = FALSE; The run time of an array accessed is represented by O (1) because it is a constant. That is, no matter how many elements are stored in the array, finding an element is the same time. The array runtime is unchanged because the array element is continuously stored, so looking for an element only needs to know the number of array in memory, the memory size of each array element, and an index of the elements. In Managed Code, the quotation of arrays is slightly complex than this because the CLR is checked for each array access, to ensure that the request's index is within the range of data. If the specified array index is out of range, an indexoutofrangeException exception will be generated. This inspection ensures that we enter other memory regions when we traverse an array. However, this check does not affect the run time of the array, because the time required to complete the check does not increase as the array increases. Note: In carrying out some large number of array access, index binding checks a small amount of performance consumption. For some unmanaged codes, the index crossed check can be skipped. For more information, please refer to Chapter 14 Applied Microsoft .NET Framework Programming By Jeffrey Richter. When you use an array, you may need to change the size of the array. To do this, you will need to create a new array instance and copy the contents of the old array to a new array. This process is called redimensioning, implemented by the following code: Using system; Using system.collections; Public Class Myclass { Public static void Main () { // Create an INTEGER ARRAY WITH THREE Elements int [] FIB = New Int "; FIB [0] = 1; FIB [1] = 1; FIB [2] = 2; // redimension message to a 10 Element Array Int [] temp = new int [10]; // Copy THE FIB Array to Temp FIB.COPYTO (TEMP, 0); // Assign Temp to FIB FIB = TEMP; } } After executing the last line of code, the FIB references an array containing 10 INT32 data. The value of the third to 9th elements is the default int32 value 0. The array is suitable for storing data sets that require only directly accessible. Search for a linear run time without sorted arrays. This is acceptable when a small array is used or a few search. If your application stores often require a large number of arrays, some other data structures will better meet your requirements. We will discuss such data structures in the following articles. (If you search for an array through some attributes and the data is sorted by attribute, you can use an algorithm called a Binary Search, it searches for an array to spend o (log n) run time, this is the same as the search binary search Trees time In fact, the Array class contains a static binarytsearch () method. For more information on this method, please refer to my previous article, Efficiently Searching a sorted array. Note: .NET Framework also supports multi-dimensional arrays. The multidimensional array and one-dimensional array are the same, the access element spends a constant run time. Search for the time required for one-dimensional array of N elements to represent o (n). For the two-dimensional array of NXN elements, the runtime is represented by O (N2) because the search must check N2 elements. In general, a k-dimensional number has an O (NK) search time. ArrayList: The array array of self-redimensioning in the design is some restrictions on the design because the array can only store the same type of element. When using an array, you must specify the number of elements. However, many times, developers hope to flexibly manage simple collections of different types of objects without worrying about memory allocation issues. The .NET Framework provides such a data structure, named System.Collections.ArrayList. The following code snippet illustrates that any type of element can be added to ArrayList. Moreover, we don't have to care about the change of ArrayList size. All processing is automatically completed in the background. ArrayList countdown = new arraylist (); countdown.add (5); countdown.add (4); countdown.add (3); countdown.add (2); countdown.add (1); CountDown.Add ("Blast Off!"); County Down.Add (New ArrayList ()); Use the object.Array using the Object type in the background. Since all types are inherited directly or indirectly from Object, an Object array can save any type of element. By default, ArrayList creates a 16 element Object array, with a parameter or CAPACITY property of the arraylist constructor, you can specify the size of ArrayList. When an element is added to an add () method, the number of internal array elements should comply with the capacity of the array. If a new element is added to exceed the capacity of the array, its capacity will be doubled, and the size of the array will be changed. ARRAYLIST, and the array, using the same syntax directly: // read access INT x = (int) countdown [0]; String y = (string) countdown [5]; // Write Access COUNTDOWN [1] = 5; // ** Will Generate An ArgumentOutofRange Exception ** Countdown [7] = 5; Since ArrayList is an array of storage objects, when reading values from ArrayList, it needs to explicitly convert the data type when the store is stored. Moreover, if you reference an element beyond the range of ArrayList, you will generate System.ArgumentOutofrange exception. ArrayList provides greater flexibility than a standard array, but also brings a performance sacrifice. Especially in the ArrayList type. We remember a value type (such as System.Int32, System.double, System.Boolean, etc.) in the Managed Stack Continuously stored in a unpacking manner. ArrayList is an array that stores object references. So, even if you use the ArrayList only stored value type, each ArrayList element is also a reference for a box, as shown in Figure 3. Figure 3 ArrayList contains a continuous memory storage block for an object reference When you use an arraylist that is large in your application and performs read and write operation frequently, the boxes and unpacking operations affect the performance of the program. Figure 3 illustrates that the ArrayList and array are the same when using the reference type. Compared to the array, ArrayList will not cause any performance due to its self-redimensioning. If you know the exact number of elements stored in ArrayList, you can close Self-Redimensioning by specifying initial capacity in ArrayList's constructor. For arrays, if you don't know the accurate size of the array, when the number of elements insert exceeds an array, you have to manually change the size of the array. A classic computer science problem is that if the program is run out of the cache, how much new space should be assigned. One option is to change the size of the array so that more elements are stored in a new array. That is, if the array assigns five elements in the initial time, the size of the array changes to the size of the array before the sixth element is inserted into the six elements. Obviously, this method can save a lot of memory, but after the first change array size (redimensioning), each insertion element needs to change the size size, which has a great impact on performance. Another option is to change the size of the array to the original 100 times in changing the array. That is, if the array is assigned 5 elements in the initial time, the size of the array changes to 500 elements before insertion of the sixth element. Obviously, this approach reduces the number of changes to the number of arrays, but if only a small amount of elements need to be inserted, do hundreds of useless elements, waste space. Solving this problem has a reliable, correct compromise method, is that when the space space is exhausted, the size of the array is doubled as the original. So, for an array of 5 elements initially assigned, adding the sixth element will cause the array size to become 10. This is the method used by arraylist. This is the best solution that made everything for you. The asymptotic run time of ArrayList is the same as the standard array. Even if ArrayList works in high overhead, in particular, the relationship between the number of elements and the time spent on each operation is also the same as the standard array. summary: This article begins with the discussion of the data structure by explaining the research data structure, and provides some ways to analyze the performance of the data structure. This article also tells you that when you are programming, when you encounter which data structure you should use, the runtime analysis of the operation of different data structures is an effective way. After studying how to analyze the data structure, we have studied the most common two data structures in the .NET Framework underlying library - System.Array classes and System.Collections.ArrayList classes. The array stores the same type of data in memory in the memory. Their main advantage is to quickly read and write array elements. Their disadvantage is that each array element is accessed (in the unsorganized array) when searching the array. ArrayList provides a more flexible data structure similar to array. ArrayList can store different types of data by using an object array compared to a normal array. Moreover, ArrayList does not need to explicitly perform memory allocation when adding more elements, which will automatically increase with the increase of elements. In this series of articles, we will explore the Stack and Queue class. We will also discuss Associative Arrays that indexes through the string keywords and not integers. Associative Arrays is implemented in the HashTable class in the .NET Framework underlying library.