Chen Xun Software Studio R&D Manager
This is the third article in the series "The scalability of clusters and their distributed architecture". It introduces the layered model of a cluster's hardware and software structure, the main classification methods, and the four major elements that determine cluster design: HA, SSI, job management, and communication. By approaching the subject from several different angles, the author aims to build an abstract model of a cluster that readers can use to analyze and design real ones.
Key questions to consider first in cluster design
The preceding sections covered several mainstream ways of classifying clusters. But when you are deciding what kind of system your cluster should be and what functions it should have, classification matters less than which needs the cluster must meet. Before using or building a cluster, first analyze the following major questions. They are not independent of one another; they are interrelated factors that influence each other.
Availability support
This is what we usually call HA (High Availability). A cluster provides cost-effective high availability through redundant processors, memory, disks, I/O devices, networks, and operating system images. Exploiting the potential of these redundant resources requires specific availability techniques.
From the perspective of mission-critical transaction processing, such as a commercial server or an important data server, a cluster is a set of independent servers configured so that they can be managed as a single system, share a namespace, and tolerate failures. Users access the computing resources transparently. The focus here is not performance but high availability.
HA is implemented differently in different applications. In OLTP (Online Transaction Processing), online hot backup is the usual way to achieve fault tolerance. It is a form of redundancy and conceptually very simple: when hardware or software in the primary system fails, the important applications and the tasks in progress are transferred to the "slave" server, which avoids an outage and preserves the overall availability of the service. This process is called failover. It may temporarily reduce server performance, but it ensures that tasks continue normally. This differs from component-level redundancy, which completely masks the faulty component: a cluster uses system-level redundancy, whereas component-level redundancy is usually implemented in hardware to keep a service running continuously.
The difficulty of system-level fault tolerance lies in process and transaction migration. For online transactions to be fault tolerant, processes or transactions must be migrated at the level where those transactions run. Different vendors offer targeted products at the database layer, the OS layer, or the middleware layer. As far as I know, Oracle's database products, IBM's operating system and middleware products, and other third-party components all implement HA to varying degrees.
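The failover idea described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `Server` and `HotStandbyPair` names are hypothetical, a boolean health flag stands in for a real heartbeat with timeouts, and "migration" is reduced to copying a task list, which sidesteps exactly the process and transaction migration problems that make real products hard to build.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """A hypothetical server; `healthy` stands in for a heartbeat result."""
    name: str
    healthy: bool = True
    tasks: list = field(default_factory=list)


class HotStandbyPair:
    """Illustrative primary/standby pair that clients see as one system."""

    def __init__(self, primary: Server, standby: Server):
        self.active = primary
        self.standby = standby

    def check_and_failover(self) -> bool:
        """Switch to the standby if the active server has failed.

        Returns True when a failover took place.
        """
        if self.active.healthy:
            return False
        if not self.standby.healthy:
            raise RuntimeError("no healthy server available")
        # Hand the tasks in progress to the standby, then swap roles.
        self.standby.tasks.extend(self.active.tasks)
        self.active.tasks.clear()
        self.active, self.standby = self.standby, self.active
        return True

    def submit(self, task: str) -> str:
        """Accept a task and return the name of the server that took it."""
        # Clients address the pair, never a specific machine.
        self.check_and_failover()
        self.active.tasks.append(task)
        return self.active.name
```

For example, after `pair = HotStandbyPair(Server("primary"), Server("standby"))`, a call to `pair.submit("txn-1")` is handled by the primary. Setting `pair.active.healthy = False` and submitting again moves the pending task to the standby, which then handles the new one as well. The caller never changes the address it talks to.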
When designing a robust, highly available system, three properties should be considered together: reliability, availability, and maintainability (commonly abbreviated RAS, where the S stands for serviceability). Of these, availability is the most interesting measure, because it combines the concepts of reliability and maintainability:
Reliability: how long a system can run without failing.
Availability: the percentage of time the system is usable by its users, that is, the percentage of time it runs normally.
Maintainability: how easy the system is to service, including hardware and software maintenance, repair, and upgrades.
Reliability can be expressed as the mean time to failure (MTTF), that is, the average time the system (or a component) operates before it fails. Maintainability is measured by the mean time to repair (MTTR), that is, the average time needed to repair the system and restore it to working order after a failure. The availability of the system is defined as:
Availability = MTTF / (MTTF + MTTR)
There are basically two ways to increase availability: increase MTTF or reduce MTTR. Increasing MTTF means increasing the reliability of the system. The computer industry already builds reliable systems; the MTTF of today's workstations ranges from hundreds to thousands of hours. Raising MTTF further, however, is very difficult and costly. A cluster gains availability by reducing MTTR instead. The MTTF of a multi-node cluster is lower than that of a single workstation, because with more nodes a fault somewhere is more likely, so the cluster's reliability is actually lower. But if those faults can be handled quickly, availability still improves. High availability, then, is the technique of making failover within the cluster fast and smooth so that the system keeps running.
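The trade-off between MTTF and MTTR can be checked with a small calculation. The sketch below applies the availability definition directly. The figures are illustrative assumptions, not measurements: a workstation that fails rarely but takes hours to repair, and a cluster that fails twice as often but recovers in half an hour through failover.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is usable: MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Assumed figures for illustration only.
workstation = availability(mttf_hours=1000, mttr_hours=10)  # slow manual repair
cluster = availability(mttf_hours=500, mttr_hours=0.5)      # frequent faults, fast failover

print(f"workstation: {workstation:.4%}")
print(f"cluster:     {cluster:.4%}")
```

With these assumed numbers the cluster comes out ahead (roughly 99.90% against 99.01%) even though its MTTF is half that of the workstation, because its MTTR is twenty times smaller.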
Single system image
SSI (single system image) is an essential capability of a cluster system. It does not mean, as on an SMP machine or a workstation, that only one system image resides in memory. It means that, to the user, the cluster feels like one single computer system. It therefore depends on who the user is, what their purpose and needs are, and at which level they use the system. The basic characteristics of SSI are roughly as follows:
Single system: Ideally, the entire cluster looks like one multiprocessor, as if it were a huge SMP workstation. This differs from a distributed system. In a distributed environment, nodes are used in the "part-time" manner described earlier: they take part in parallel tasks mainly under the direction of a central scheduler, and the relationship between them is "collaboration" rather than "unity", while each node otherwise runs its local jobs as usual. Such a system therefore appears only weakly "single" and cannot offer true SSI.
Single point of control: Logically, management and control of the cluster should happen in one well-defined place. All management operations go through one control entry point (such as a dedicated monitoring terminal), and hardware and software resources across the cluster are requested through the task queue. A single point of control is regarded as important in cluster design. In practice, a cluster without it only exhausts the administrator, who is kept busy maintaining nodes and scheduling jobs by hand, turning the whole cluster into a "semi-automatic" affair.
Symmetry: Related to the previous point, if the cluster allows users to log in from any node, the services they obtain must be the same from each, with no node treated differently. Apart from functions tied to permissions, management, and security, all functions and services are therefore equivalent on every node.
Resource access transparency: This is arguably the essence of SSI. When using the cluster, users do not know the specific location of the server that is serving them; every operation feels as if it were performed locally.
Unifying resources does cost some performance. But from the standpoint of user convenience, there is no need to deal with complicated concepts such as realms, volumes, and domains; there is only one root, one process space, one IP address, and one I/O resource, so the necessary performance loss is worth paying.
It is not hard to see that HA and SSI, two seemingly unrelated features, are implicitly linked. Without SSI, a cluster cannot really be called a cluster. Even the simplest online hot backup system presents the same "shape" to external users whether it is running normally or taking over after a fault: it is a single system (although there are two machines) that provides a transparent, smooth service. If the resource unification of SSI is not implemented at the OS level or above, this kind of high availability cannot be achieved. SSI can thus be called the cornerstone of cluster technology: it is needed not only for high availability but also for further performance gains. When designing a cluster, SSI is therefore the first thing to consider.
Job management
Job management mainly involves task distribution, load balancing, and parallel processing. Compared with traditional standalone workstations or PC nodes, a cluster must achieve high system utilization, and the job management software must provide these functions. In implementing job management, the following concepts are very important: what counts as a resource, what counts as a job, how many kinds of jobs there are, how load is measured, which states exist, and which elements each state contains. All of these need to be defined and reflected in the cluster system. How well the job management system is designed directly determines how well the cluster performs. A well-designed job management and scheduling system scales better than a generically designed cluster, and its effect on performance is far greater than that of other factors. We will analyze this in subsequent chapters.
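To make "load" and "task distribution" concrete, here is a minimal sketch of one possible policy: send each job to the node with the smallest current load. It is an assumption-laden illustration, not a description of any particular job management system. The `LeastLoadedScheduler` class is hypothetical, load is a single abstract cost number supplied by the caller, and jobs never finish, so load only grows.

```python
import heapq


class LeastLoadedScheduler:
    """Assign each job to the node with the smallest accumulated load."""

    def __init__(self, node_names):
        # Min-heap of (load, node name); ties are broken by name.
        self._heap = [(0.0, name) for name in node_names]
        heapq.heapify(self._heap)

    def dispatch(self, job_cost: float) -> str:
        """Place a job of the given cost and return the chosen node."""
        load, name = heapq.heappop(self._heap)
        heapq.heappush(self._heap, (load + job_cost, name))
        return name

    def loads(self) -> dict:
        """Return the current load of every node."""
        return {name: load for load, name in self._heap}
```

A real system would also have to define job states, release load when jobs complete, and measure load from several resources (CPU, memory, I/O) rather than from one number, which is where the definitions listed above come in.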
Efficient communication
Building an efficient communication subsystem for clusters, especially for loosely coupled workstation clusters, is more challenging than for tightly coupled systems such as MPPs.
Because cluster nodes are more complex, they cannot be packed together as densely as MPP nodes. Loosely coupled clusters are comparatively common, and the physical links between cluster nodes are longer than those between MPP nodes, even in a centralized cluster. Long links mean longer interconnect latency. More importantly, long links bring more problems with reliability, clock skew, and crosstalk. Solving these problems requires reliable and secure communication protocols, and those protocols add system overhead. Clusters generally use commodity networks such as Ethernet and ATM with standard communication protocols such as TCP/IP. Commodity components generally follow Moore's law, but the TCP/IP protocols add system overhead. Low-level communication protocols are more efficient than the standard ones, but there is no unified standard for them.
The pursuit of efficiency often conflicts with cluster scalability. A highly scalable cluster has to use commodity networks that are less efficient, more general-purpose hardware platforms, and popular operating systems. Ensuring scalability therefore inevitably limits how far performance can be optimized, although some of these problems can be addressed by using an open source operating system.
Ideal cluster model
Like the OSI reference model, the ideal cluster exists only in concept; there are too many constraints for it to be fully realized. Still, this ideal structure serves as a theoretical basis for studying clusters and is a useful reference when analyzing and designing real ones.
Figure: An ideal cluster system that supports complete SSI and HA capabilities
The figure shows that the ideal cluster supports many kinds of nodes, including workstations, PCs, SMP servers, and even supercomputers, and that the node operating systems are multi-user, multitasking, and multithreaded. Nodes may be homogeneous or heterogeneous.
Nodes are interconnected by one or more high-speed commodity networks. These networks use standard communication protocols, and their transmission speed should be more than two orders of magnitude higher than that of TCP/IP over Ethernet. The network does more than let cluster nodes communicate. It is also the basis for unified access to the SAN (storage area network), consistent distributed I/O, consistent memory access, and unified access to other cluster hardware resources. The network is only the physical implementation, however; control of the resources still has to be exercised through the operating system. The network interface circuit of each node is attached to the node's standard I/O bus (such as PCI), and all driver modules are hot-pluggable and can be loaded dynamically. When the processor or operating system changes, it is enough to modify and reload the driver software; neither the network and its interfaces nor the rest of the system needs to be modified.
On top of the node platform sits a set of software subsystems that together form a cluster operating system and provide the most basic core functions of an operating system. This cluster operating system is a special extension of, or a middleware layer over, the node operating systems, and it provides the support needed for HA and SSI.
Above the middleware layer is an availability subsystem that provides high availability services. There is also a single system image layer that provides a single user entry point, a single file hierarchy, a single point of control, and an efficient job management system. A single memory space can be achieved with the help of compiler or runtime library techniques, but a cluster does not necessarily need to support a single process space.
The uppermost layer implements management, control, and application extensions: the user's entry point, the administrator's controls, and job scheduling are all implemented here. Concretely, this layer is built on the standard API or dynamic runtime library provided by the SSI layer. Other extension subsystems, such as a distributed OLTP (online transaction processing) database, are also implemented in this layer.
Conclusion
Having covered the distributed architecture, concepts, scalability, classification methods, and major design elements of clusters, I trust readers now have a preliminary understanding of the basic ideas. Cluster technology is a very complex field, and any single point could fill several books. That was not my intention, however. These three articles only pave the way for the case analyses that follow. Future installments will focus on several mainstream types of cluster, and I hope their in-depth analysis will show you how clusters are implemented and applied in practice.
About the Author:
Lin Fan is currently engaged in Linux-related research at Xiamen University. He is greatly interested in cluster technology and hopes to exchange ideas with like-minded friends. You can contact him by email at iamafan@21cn.com.