Separation of storage and calculation

zhaozj2021-02-16  110


Ma Yili Fu Xianglin Han Xiaoming Xu Lu

1 New features of storage applications

IT has gone through three waves of development. The first wave was built on computing technology: driven by advances in processors, it created the computer industry and promoted the rapid spread and application of computers. The second wave centered on transmission technology and promoted the development and popularization of computer networks. These two waves greatly accelerated the digitization of information. More and more human information activity moved to digital form, digital information grew explosively, and this triggered the third wave of IT development: storage technology. In this new wave, data storage applications show the following new features:

1) Data has become the most valuable asset. For a company, losing data can cause losses that are incalculable or even devastating, so the data storage system must be highly reliable.

2) The total amount of data is growing explosively. People keep producing digital information, and new applications keep emerging, such as streaming multimedia, digital TV, IDC (Internet Data Center), ASP (Application Service Provider), ERP (Enterprise Resource Planning), digital imaging, transaction processing, e-commerce, and data warehousing and mining, so the total amount of data grows geometrically. Advances in computer networking, especially the spread of the Internet and web applications, have not only greatly increased humanity's capacity to produce information but also made information more global. In recent years the information produced has exceeded the sum of all information accumulated before the network era, and the rate of information production is still rising. According to data released by UC Berkeley in 2001, the data generated in the next three years will exceed the data generated in the past 40,000 years, and 93% of newly generated information exists in digital form.
The rapid development of information technology has created a huge demand for information storage. A modern storage system therefore has to be highly scalable, and it has to scale dynamically without interrupting running business.

3) I/O has become the new performance bottleneck. Early computers were used only for calculation, and CPU computing power was the bottleneck of computer technology. Later, in network applications, communication took the most time and network bandwidth became the new bottleneck. Today the main application model of computers has shifted to data storage and access. Because of the limits of mechanical components, disk access time improves by only 7 to 10% a year and the data transfer rate by only about 20% a year, while modern microprocessors and memory systems improve by 50 to 100% a year, so the performance gap between processors and disks keeps widening. According to Amdahl's law, the performance of a computer system is limited by its slowest component. The data storage system has therefore become the new performance bottleneck of computer systems, the so-called I/O bottleneck. Traditional storage structures can hardly solve this problem, which calls for new storage structures and places much higher demands on storage system performance.

4) Round-the-clock service has become the norm. In e-commerce and most network service applications, service 24 hours a day, 365 days a year is now the general trend, which requires a modern data storage system to be highly available.

5) Storage management and maintenance need to be centralized, automated, and intelligent. In the past, most storage management and maintenance was done manually.
As storage systems grow more complex, the demands on management and maintenance staff rise as well, and the risk of losing data through poor management increases greatly. Modern storage systems therefore need to be easy to manage, preferably with intelligent, automatic management and maintenance.

6) Multi-platform interoperability and data sharing are required. For historical reasons, many different information platforms coexist. The storage system has to support interoperability and data sharing across these platforms, which demands a highly open system.

7) Storage accounts for a steadily rising share of the value of mid-range and high-end computer systems.
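
The I/O gap described in item 3) can be made concrete with a little arithmetic. The sketch below is illustrative only. The function names are hypothetical, the growth rates are picked from the ranges quoted above (disk access time improving about 10% a year, processors about 50% a year), the 60/40 split between computation and disk I/O is an assumed workload, and the speedup formula is the standard statement of Amdahl's law rather than anything specific to this article.

```python
def compound_gap(fast_rate: float, slow_rate: float, years: int) -> float:
    """Ratio by which the faster-improving component pulls ahead after `years`."""
    return ((1 + fast_rate) / (1 + slow_rate)) ** years


def amdahl_speedup(improved_fraction: float, factor: float) -> float:
    """Overall speedup when only `improved_fraction` of run time gets `factor` times faster."""
    return 1.0 / ((1.0 - improved_fraction) + improved_fraction / factor)


if __name__ == "__main__":
    # Processor (+50% a year) versus disk access time (+10% a year) over a decade.
    print(f"CPU/disk gap after 10 years: {compound_gap(0.50, 0.10, 10):.1f}x")

    # Assumed workload: 60% of run time is computation, 40% is disk I/O.
    # However fast the CPU becomes, the speedup stays below 1 / 0.4 = 2.5.
    for factor in (2, 10, 1000):
        print(f"CPU {factor}x faster -> overall {amdahl_speedup(0.6, factor):.2f}x")
```

Under these assumptions the part of the run time spent waiting on disks sets a hard ceiling on overall speedup, which is the sense in which storage becomes the bottleneck.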

As shown in Figure 1, the share of storage in the IT budget has risen year by year and has exceeded 75%.

Source: Dataquest, from CIO Estimates Through 2002

The change in storage requirements can also be glimpsed in the development of the storage market in recent years. In 2001, although a depression gripped the entire global IT industry, the worldwide information storage market still grew 14.3% over the previous year, reaching 36.47 billion US dollars. Growth slowed in 2002 but still stayed close to 12%. According to IDC's forecast, the global storage market will reach 58.0343 billion US dollars by 2005. From 2000 to 2005, the compound annual growth rate of storage capacity is 80% and that of sales is 12%.

It is this continuous growth in functional and capacity requirements that leaves the traditional storage method unable to meet users' needs. (Amdahl's law: when a program is processed in parallel, the small part that must execute sequentially is what limits performance, and adding more processors does not make that part run any faster.)

The traditional storage system uses DAS (Direct Attached Storage). In DAS mode, the storage system is attached to the server. Because of the limits of server bus technology, a DAS system has limited scalability. When the number of client connections grows, the server becomes the performance bottleneck of the whole system, because:

(1) Host bandwidth is limited. Although advances in computer technology keep increasing host bus bandwidth, it still cannot keep up with the demands of modern storage applications.
(2) Host memory capacity is limited. When a sustained stream of large data access requests arrives, host memory quickly saturates and the remaining data transfer requests cannot be processed.
(3) The management overhead of the file system also increases data access time.

Research at CMU (Carnegie Mellon University) shows that under a large number of client requests, the server bottleneck can even reduce the resource utilization of the storage system to only 3%. In addition, because the data is stored on many independent servers, so-called "information islands" form. Information islands hinder the integration of information and prevent users from drawing on all of it to make correct decisions. They also force administrators to manage physically dispersed systems on different platforms, which makes information management harder, lowers management efficiency, and greatly raises the total cost of ownership of the storage system.

In short, the server-centric DAS access mode is increasingly unable to meet the needs of modern storage applications for large capacity, high reliability, high availability, high performance, dynamic scalability, easy maintenance, and openness. The key to solving this problem is to shift the access mode from server-centric to data-centric and network-centric, and this drives the separation of storage and calculation.

2 Differences between storage and calculation

The functional differences between storage and calculation are obvious. Beyond that, the two also differ in many ways from the user's point of view. In computer applications, storage resources and computing resources are both necessary. In the early days of computing, hardware costs made both kinds of resource extremely scarce, so users had little choice between resources.
With the development of computer technology, especially VLSI technology and magnetic recording technology, storage resources and computing resources have become far more plentiful. This has allowed the fields and patterns of computer application to change greatly. Computers, once used mainly for scientific research, have spread to every industry and become an indispensable tool in daily work. Computer applications have evolved from having computing power at their core to having data processing at their core. For the vast majority of users, therefore, the importance of the computer lies more in storing and processing user data.

In this environment, data is worth far more than the computer equipment itself. This is especially true in finance, telecommunications, commerce, insurance, the military, and similar sectors. The reliability and availability of the information storage system are therefore often the first things enterprise users consider. In addition, to protect data against damage from force majeure, critical data also calls for off-site backup and disaster recovery. The choice of storage resources must therefore weigh reliability, availability, scalability, capacity, and performance. Once a storage resource has been chosen, replacing it with another incurs substantial overheads, and the problem is even more serious for mass storage systems. Compared with computing resources, then, the choice of storage resources places more weight on the long term and on stability.

The choice of computing resources is different. First, what users care about most is whether the computing resource can handle their workload, that is, its computing capacity. The development of computer technology has made computing resources more abundant and has given users more choices and more flexibility. Second, the choice of computing resources can be temporary: the computing resource used for one run of a system can differ from the one used for the previous run. Replacing a computing resource involves no data migration. Even when the computing resource is switched in the middle of a run, the cost of migrating data is far lower than that of switching storage resources, because the data involved is held in registers, which are much faster than mechanical devices.
Because storage resources and computing resources differ this much, and there is no necessarily fixed relationship between the two, they can be separated. This is described in more detail below.

3 The possibility of system separation

The separation of storage resources and computing resources can be considered at two levels: physical separation and logical independence. In the early days of personal computers and servers, limits in software and hardware technology kept computing resources and external storage resources physically tied together and logically tightly coupled, so they could not be separated. The main reasons were:

■ Hardware communication technology was immature and communication capability was weak. A local host could not access remote storage with the same performance as local storage, so the physical distance between calculation and storage was limited to the reach of the local bus.

■ System software (including motherboard firmware) was designed around calculation. Remote communication modules (such as network components) were not primary function modules of the system, or were not required at all. As a result, local external storage devices were always the main or even the only carrier of the system's computing environment, and it was hard to establish a logical association between local computing resources and remote storage resources.

With the rapid development of information technology, this situation has changed:

■ In hardware, with the development of network communication technology, long-distance high-bandwidth access has matured and network bandwidth is gradually approaching local bus bandwidth. High-speed network connections (Gigabit Ethernet, optical fiber, and so on) can now reach storage systems a few hundred meters to a few kilometers away with the same performance as local access.
Moreover, more powerful processors make efficient and flexible communication protocols feasible. Together these provide the hardware guarantee for system separation and make the physical separation of computing resources and storage resources possible.

■ In software, as operating systems have moved toward networked distributed systems, network protocol support has become a required function of modern operating system software. Computer motherboards have also added support for network boot. During system startup, the carrier of the computing environment can therefore be chosen at a certain stage, and a network storage resource can serve as the carrier on which the computing environment is built. Unlike local I/O bus protocols, network protocols are layered, and high-level network connections are independent of the physical link. Connections established over the network are therefore naturally dynamic: the relationship between the computing system and the storage system at the two ends of the network can change dynamically without any change to the physical connection.

The current software environment therefore allows computing resources and storage resources to be logically independent and dynamically combined. In summary, the conditions for separating computing resources and storage resources are now in place, and with existing technology a reconfigurable system can be built on separated resources.

4 The necessity of separating storage and calculation

Traditionally, the storage system has been tightly dependent on the computing system. For technical and historical reasons, this subordination of storage to computing not only laid the practical foundation of computer applications but also became ingrained in computer design thinking. As computer applications have spread, storage has clearly been moving away from traditional direct attachment (DAS), and network storage is becoming the mainstream. The main reasons are as follows:

■ Users' demands for high performance, scalability, and high availability of data services have risen greatly, and security, redundancy, backup, and recovery have become increasingly important. In the traditional mode, a data storage resource is associated with a fixed computing device and constrained by it at run time, so changing the storage device requires changing the computing device. The limited number of storage devices that a single computing resource can attach, together with the DAS mode of operation, makes it impossible to meet users' need for online expansion. Once computing resources and storage resources are separated, multiple computing devices and multiple storage devices can form computing resource groups and storage resource groups over the network. These groups can provide calculation and storage services independently, and can also be combined dynamically to build a computing environment. This fundamentally solves the problem above.
■ As the amount of information grows, storage management costs rise sharply along with storage capacity. According to Gartner Group, for end users the ratio of spending on storage capacity to spending on storage management is approximately 1:3. Low storage management efficiency drives up management costs. In DAS mode, storage is scattered across independent computing systems and is hard to manage centrally, which inevitably raises the labor cost of storage management. In a network storage environment, the storage devices are freed from the physical location of the host, so management efficiency improves greatly. As shown in Figure 2, the storage capacity that each person can manage is far larger than in a conventional storage environment.

Source: International Data Corporation
Figure 2 The amount of data managed per person

Because their storage management efficiency differs, traditional storage and network storage also differ in purchase cost and in total cost of ownership (TCO). Figure 3 shows a research result from Gartner Group. Although traditional storage has a lower purchase cost, its operating cost is much higher. The result shows that over 3 years the total cost of ownership of different forms of storage system differs greatly, and the TCO of the traditional approach is more than double.

Source: Gartner Group 2000
Figure 3 The cost of storage devices

■ Users' pursuit of information sharing is another important factor driving network storage. In the Internet age, data sharing has become a basic need. In DAS mode, operations such as data storage, data replication, and data migration must all be carried out by the server.
On the one hand, this mode seriously reduces the efficiency of the server. On the other hand, in complex heterogeneous computing environments, data sharing and interoperability between different types of host are hard to achieve. Network storage resolves these issues.

These factors have driven the rapid development of network storage in recent years, and the trend is fully reflected in the market.

5 Separation of storage and calculation

Separation is only the outward form. What matters more is the content being stored, that is, the data. Data can broadly be divided into two parts: system data and user data. System data is the software environment that computing resources need in order to run, including operating systems, applications, and so on. Once created, system data is generally stable, changes little, and is relatively easy to rebuild.

Although some system data, such as registry information, is generated when users customize the system, we classify it as system data for now. User data is produced by users in their actual work for a specific purpose, and it is the most valuable part, the part users truly care about. On this premise, a computer system can be divided into the three parts shown in Figure 4.

Figure 4 Computer system composition

A computing resource has no properties of its own. It must be combined with system data to form a complete, specific computing environment before it can process user data. The traditional computer system binds the three tightly together: system data and user data are both stored on the local disk of the computing resource, and the computing resource can process data only in a fixed way, because local system data can build only a limited computing environment.

However, as the total amount of data has grown and data types have multiplied, people have long recognized the need to centralize data. Through the efforts of storage system vendors, user data has gradually been separated from the traditional computer system and is now stored centrally, backed up uniformly, and protected by disaster recovery, forming the computer application model known as centralized storage with distributed computing. This model is a big step forward in both data security and availability. But the computing resource and the system data that make up the computing environment are still bound together, so the computing environment they define remains fixed. In a sense, this application model has not broken free of the traditional architecture, and the fixed computing environment still severely constrains the performance and flexibility of the computer system.
If system data can also be separated from the conventional computer system, so that system data and computing resources are separated and connected over a high-speed network, then storage and calculation are fully separated. A computer system built this way can make the most of its flexibility and availability and can support some new computer application modes. The model after system separation is shown in Figure 5.

Figure 5 System model with computing resources and storage resources separated

Computing resources and storage resources are connected through the network. System data, user data, and public data are all stored in the storage resources, and the mapping between system data and computing resources is maintained by a management function. Each computing resource can bind a given set of system data to form a computing system with the corresponding function, and by binding M different sets of system data it can form M different computing systems.

6 Storage mode after separation

The separation of storage and calculation has a profound impact on traditional storage patterns. In the traditional DAS storage mode, the storage subsystem and the computing subsystem are tightly coupled and connected by bus technology. The storage subsystem is subordinate to the computing subsystem and in many cases exists only as an accessory to calculation, or even just as a peripheral device of the computing subsystem. After storage is separated from calculation, the storage subsystem leaves the original computing system and becomes a system in its own right. The two are interconnected by network technology, and their relationship becomes a peer relationship. Storage moves from the background to the foreground and receives more and more attention.
The introduction of the network turns the master-slave relationship between storage and calculation into a loosely coupled peer relationship. The two are highly autonomous and develop independently. Storage is developing rapidly toward a network storage mode centered on data and the network. At present the two mainstream network storage technologies are NAS (Network Attached Storage) and SAN (Storage Area Network).

NAS is a data-centric storage structure, as shown in Figure 6. According to the definition of the Storage Networking Industry Association (SNIA), NAS is a storage device that connects directly to the network and provides file access services to users. In effect, NAS separates user data from the computing environment (computing resources and system data together form the computing environment).

Its structure and the protocols it uses give NAS the following advantages:

(1) File sharing across heterogeneous platforms: clients on different operating system platforms can easily share the same file on the NAS.
(2) It uses the existing LAN structure and so protects existing investment.
(3) It is easy to install, use, and manage, with plug-and-play installation.
(4) Wide connectivity: because it uses IP/Ethernet and the standard NFS (Network File System) and CIFS (Common Internet File System) protocols, it adapts to complex network environments.
(5) Lower total cost.

Figure 6 NAS structure schematic

SAN is a network-centric storage structure, as shown in Figure 7. According to the SNIA definition, a SAN is a network that transfers data directly between servers and storage systems using high-speed interconnect protocols such as FC (Fibre Channel). A SAN is a dedicated storage network, separate from the existing LAN and built with its own technology (such as FC). This architecture and construction give SANs the following advantages:

(1) High performance and high-speed access: current Fibre Channel provides 2 Gbps of bandwidth, and a new 10 Gbps standard has also been developed. For transaction-oriented applications such as databases, the advantages of SAN are even more pronounced.
(2) High scalability: servers and storage devices are separated, so each can be expanded independently.
(3) Centralized storage and management: integrating the various storage devices forms a unified storage pool that serves users, and storage capacity can be expanded easily.
(4) Support for a large number of devices, in theory 15 million addresses.
(5) LAN-free backup: data backup does not consume LAN bandwidth.

In addition, SAN also separates the storage subsystem from the computing subsystem, that is, it separates the computing resource from both system data and user data.
Figure 7 SAN structure schematic

In terms of market performance, the traditional DAS storage mode has begun to decline, while the new network storage systems represented by NAS and SAN have begun to grow rapidly. According to data released by IDC in December 2001, DAS declines in the global market from 2001 to 2005, with a compound annual growth rate of -26.5%, while network storage grows significantly, with a compound annual growth rate of 26.3%. According to market research by Gartner Dataquest, the information storage market will continue to grow rapidly, and the fastest-growing segment, with the strongest demand, is networked information storage systems. Figure 8 illustrates this.

Source: "Worldwide External RAID Controller-Based Storage Forecast, 2000-2006", Gartner
Figure 8 Storage equipment market trends

Figure 9 shows the forecast for the Chinese storage market over the next few years. The growth of SAN and NAS is very clear in the figure, while the DAS segment gradually declines and will in future be much smaller than the SAN or NAS market.

Source: IDC Asia/Pacific 2001
Figure 9 Forecast of changes in the China storage market

In practical applications, both NAS and SAN have their own limitations. Take NAS as an example:

(1) Low file access speed: it is not suitable for applications that need fast access, such as database applications and online transaction processing.
(2) Limited scalability: capacity and performance expansion are constrained by the NAS head, which easily becomes a performance bottleneck or a single point of failure.
(3) Data backup consumes LAN bandwidth, wastes valuable network resources, and can even interfere with customer applications.

For FC-SAN, the limitations are mainly the following:

(1) Poor device interoperability: different manufacturers implement the Fibre Channel protocol differently, so products from different manufacturers are hard to operate together, and users face many restrictions in choosing and using products.
(2) As a new type of network technology, Fibre Channel brings many system management difficulties, such as the need to learn a new network technology and the lack of management tools. Building and maintaining a SAN therefore requires specially trained professionals with extensive experience, which greatly increases construction and maintenance costs.
(3) Sharing: resource sharing in a SAN generally means sharing storage space among different platforms, and sharing data files in a heterogeneous environment is hard to achieve.

More importantly, the Fibre Channel interconnect devices used in current storage area networks are very expensive, which hinders the adoption and promotion of SAN technology. One hot topic in SAN is building SANs with other high-speed interconnect technologies, and IP-SAN has emerged as a result.

The advantages of NAS and SAN each make up for the other's weaknesses. Technically, NAS and SAN do not compete but complement each other, and only by combining them can each be used to the full. At present the convergence of NAS and SAN is the general trend: the concept of network storage now covers more than NAS and SAN and embraces everything that combines storage and networking. Similar products are appearing across the industry. EMC's HighRoad, for example, has the file sharing, data locking, and simplicity of NAS while inheriting the high performance and high scalability of SAN.
HighRoad achieves high-performance data access by separating the control channel from the data channel. The user first sends a request to the Celerra File Server, and the data is then returned directly over the SAN. HighRoad is a multi-channel, multi-protocol file system. It can access data on the storage device through channels or through the network, supports the CIFS and NFS protocols, supports multiple data access methods such as DAS, NAS, and SAN, and can switch dynamically between them. Similar products include IBM Tivoli SANergy, IBM Network Attached Storage 300G, Veritas's SANPoint, FalconStor's IPStor, TrueSAN's FusionOS, and Auspex's NetServer 3000. All of them pursue the fusion of NAS and SAN to varying degrees. In addition, under a national 863 project, the National High Performance Computer Engineering Technology Research Center has developed the Blue Whale network storage system. With its self-developed virtual storage and highly parallel file system, the system fully combines NAS and SAN storage technology and expands on demand.

The existing NAS and SAN storage modes achieve the physical separation of storage resources from computing resources, but not their logical separation. In these modes, storage resources can be physically located apart from the customer's application system (the computing resource). A "mapping" exists between the customer's application system and these storage resources, and the customer accesses storage resources through it. Typically this mapping is determined and fixed in the system configuration. Once configuration is complete, the computing resource binds the same storage resources every time it starts. Logical separation uses dynamic mapping instead.
For a given system configuration, the mapping is not fixed. Instead, startup information such as the user's identity is used when the system starts to decide which storage resources to bind. This brings flexibility in reconfiguration and a wider range of choices. Once storage resources are separated out, the future storage mode should provide both physical and logical separation of storage resources, and should separate user data and system data individually.
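
The difference between a mapping fixed at configuration time and one resolved at startup can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of any product named above: the `Volume` class, the node and user names, and both lookup tables are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Volume:
    """A storage resource reachable over the network (hypothetical model)."""
    name: str
    kind: str  # "system" or "user"


# Physical separation only: the mapping is fixed in the system configuration,
# so a given compute node binds the same volumes on every start.
STATIC_MAP = {
    "node-1": [Volume("linux-image", "system"), Volume("shared-data", "user")],
}

# Logical separation: the mapping is resolved at startup from who is asking.
USER_PROFILES = {
    "alice": [Volume("linux-image", "system"), Volume("alice-home", "user")],
    "bob": [Volume("windows-image", "system"), Volume("bob-home", "user")],
}


def bind_static(node: str) -> list[Volume]:
    """Return the volumes fixed for `node` at configuration time."""
    return STATIC_MAP[node]


def bind_dynamic(node: str, user: str) -> list[Volume]:
    """Return the volumes for `user`; the physical node does not affect the result."""
    return USER_PROFILES[user]


if __name__ == "__main__":
    print([v.name for v in bind_static("node-1")])
    print([v.name for v in bind_dynamic("node-1", "alice")])
    print([v.name for v in bind_dynamic("node-1", "bob")])
```

In the dynamic case the same node can come up with a different system image and different user data depending on the startup information, which is the flexibility the text attributes to logical separation.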

Separating system data allows system services to be separated as well, and logical separation makes the combination of system services and storage resources flexible and varied. Users can decide which services and storage resources to bind according to their own needs, which gives scalability and reconfiguration flexibility for both system services and data.

7 Computing mode after separation

In a traditional computer, system data is bound to local computing resources. The computing environment is fixed: users cannot build it to suit their own applications and can only choose among existing configurations. Once calculation and storage are separated at both the physical and the logical level, the relationship between computing nodes and storage nodes changes from static tight coupling to dynamic loose coupling. The working properties of each computing node are then no longer determined by local storage but are decided and changed dynamically by the user at run time. Users can build a computing environment for a specific application dynamically, while the system is running, according to their own needs. The process is as follows:

■ Select a suitable group of computing nodes as the calculation component, according to the needs of the application.
■ Select a suitable group of storage nodes as the storage component.
■ Connect the computing nodes and the storage nodes to form the application's computing environment dynamically.

In this arrangement, the computing resources in the system form a scalable virtual computing pool that provides data processing services according to users' needs. The objects of calculation (data) and the method of calculation (programs) are determined by the storage resources the user specifies.
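
The three steps above, together with the idea of a shared computing pool, can be sketched as follows. This is an illustrative model only: `ComputePool`, `Environment`, `build_environment`, and `tear_down` are hypothetical names, and a real system would perform network boot and storage attachment where this sketch simply records names.

```python
from dataclasses import dataclass, field


@dataclass
class ComputePool:
    """Free compute nodes that any application may borrow (hypothetical model)."""
    free: set[str] = field(default_factory=set)

    def acquire(self, count: int) -> list[str]:
        if count > len(self.free):
            raise RuntimeError("not enough free compute nodes")
        nodes = sorted(self.free)[:count]
        self.free.difference_update(nodes)
        return nodes

    def release(self, nodes: list[str]) -> None:
        self.free.update(nodes)


@dataclass
class Environment:
    """Compute nodes bound to the storage that supplies programs and data."""
    nodes: list[str]
    system_image: str
    user_volume: str


def build_environment(pool: ComputePool, node_count: int,
                      system_image: str, user_volume: str) -> Environment:
    nodes = pool.acquire(node_count)  # step 1: choose the computing nodes
    # step 2: the storage component (programs and data) is named by the caller
    # step 3: binding the two yields the application's computing environment
    return Environment(nodes, system_image, user_volume)


def tear_down(pool: ComputePool, env: Environment) -> None:
    """Return the nodes to the pool; the data stays on the storage side."""
    pool.release(env.nodes)
    env.nodes = []


if __name__ == "__main__":
    pool = ComputePool({"n1", "n2", "n3"})
    env = build_environment(pool, 2, "render-image", "project-a")
    print(env.nodes, "free:", sorted(pool.free))
    tear_down(pool, env)
    print("free after tear-down:", sorted(pool.free))
```

Because nothing about the application lives on the nodes themselves, the same nodes can be handed to a different system image and user volume on the next request.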
As shown in Figure 10, users, computing resources, and storage resources are connected by a high-speed interconnect network to form a three-dimensional reconfigurable computing environment. In this environment, calculation takes the form of a data-driven, on-demand computing mode.

"On-demand" means that computing resources and storage resources in the environment are fully virtualized. Users do not need to own computing or storage entities. They only need to set up a computing environment for their own application and then carry out the calculation and storage of data. When the application finishes, the computing environment is torn down and the computing resources are released for other applications to use.

"Data-driven" means that during calculation the behavior of the computing resource is controlled by the data in the storage resource. By changing the data and programs in the storage resource, the same computing resources can form a new application. This mode therefore turns calculation from a relatively fixed operation into a dynamically scalable capability. Freed from the limits of physical location and operating environment, computing resources can serve users over the widest possible range, just as electricity is available everywhere in modern society.

Figure 10 Structure schematic
Figure 11 Classification of storage and calculation separation

We can classify the separation of storage and calculation along three dimensions: data access method, data content, and resource binding method, as shown in Figure 11.
The data access method can be block-level access (as in a SAN environment) or file access (as in a NAS environment). The data content can be system data (relatively stable and highly shareable) or user data (changes often and is less shareable). The resource binding method can be physical separation (static binding) or logical separation (dynamic binding). Figure 11 also shows where the common forms of separation fall. NAS is the physical separation of user data accessed by file (as shown in the figure). SAN is the physical separation of user data accessed at block level (as shown in the figure). The usual diskless workstation is the physical separation of system data accessed by file (not marked in the figure). SOND, a new product developed by the National High Performance Computer Engineering Technology Research Center, implements the logical separation of system data.
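
The three-dimensional classification can be written down as a small data model. The sketch below is only a restatement of the paragraph above in code: the type and function names are hypothetical, and the access method for SOND is left as `None` because the text does not state it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Access(Enum):
    BLOCK = "block-level access"
    FILE = "file access"


class Content(Enum):
    SYSTEM = "system data"
    USER = "user data"


class Binding(Enum):
    PHYSICAL = "physical separation (static binding)"
    LOGICAL = "logical separation (dynamic binding)"


@dataclass(frozen=True)
class Separation:
    access: Optional[Access]  # None where the text does not state the access method
    content: Content
    binding: Binding


# Classification of the examples exactly as given in the text.
EXAMPLES = {
    "NAS": Separation(Access.FILE, Content.USER, Binding.PHYSICAL),
    "SAN": Separation(Access.BLOCK, Content.USER, Binding.PHYSICAL),
    "diskless workstation": Separation(Access.FILE, Content.SYSTEM, Binding.PHYSICAL),
    "SOND": Separation(None, Content.SYSTEM, Binding.LOGICAL),
}


def find(content: Optional[Content] = None,
         binding: Optional[Binding] = None) -> list[str]:
    """Names of the examples that match the given dimensions (None matches anything)."""
    return [name for name, s in EXAMPLES.items()
            if (content is None or s.content is content)
            and (binding is None or s.binding is binding)]


if __name__ == "__main__":
    print("system data:", find(content=Content.SYSTEM))
    print("logical separation:", find(binding=Binding.LOGICAL))
```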

