A Complete Guide to Linux Clusters
Rawn Shah
LinuxWorld columnist
May 2000
Expert Rawn Shah guides you through the open source and closed source cluster solutions available for Linux today.
Counting the number of Linux cluster projects is like counting the number of startups in Silicon Valley. Unlike Windows NT, which has been hampered by its own closed nature, Linux offers a large number of cluster systems to choose from, suited to different uses and needs. Deciding which cluster to use, however, is not straightforward.
Part of the problem is that the term cluster is used in different contexts. An IT manager may care about keeping servers up longer or making applications run faster, while a mathematician may care more about running large-scale numerical computations across servers. Both need clusters, but each needs one with different characteristics.
This article surveys the different forms of clusters and the many implementations available, some commercial and some free software. Although not all of the solutions listed are open source, most follow the common practice of distributing source for Linux, in particular because those who implement clusters often want to tune the system's performance to their own needs.
Hardware
Clusters always involve a hardware connection between machines. In most cases this is just a Fast Ethernet network card and a hub. But on the cutting edge of science, there are several network interface cards designed specifically for clustering.
They include Myricom's Myrinet, Giganet's cLAN, and the IEEE 1596 standard Scalable Coherent Interface (SCI). These cards not only provide high bandwidth between the nodes of a cluster, but also reduce latency (the time it takes to send a message). Low latency is critical for exchanging status information between nodes to keep them synchronized.
Myricom
Myricom offers NICs and switches with one-way interconnect speeds of up to 1.28 Gbps. The NICs come in two forms, copper and fiber. Copper links can communicate at full speed over distances of up to 10 feet, and at half speed up to 60 feet. Fiber Myrinet can run over 6.25 miles of single-mode fiber or 340 feet of multimode fiber. Myrinet offers only direct point-to-point, hub-based, or switch-based network configurations, but there is no limit on the number of switch fabrics that can be connected together; adding switch fabrics only increases the latency between nodes. The average latency between two directly connected nodes is 5 to 18 microseconds, much faster than Ethernet.
Cluster types
The three most common types of clusters are high-performance scientific clusters, load-balancing clusters, and high-availability clusters.
Scientific cluster
The first type typically involves developing parallel programming applications on a cluster to solve complex scientific problems. This is the essence of parallel computing, but without the specialized parallel supercomputers built from ten to ten thousand dedicated processors. Instead, it uses commodity systems, such as single- or dual-processor PCs linked by high-speed connections and communicating over a common message-passing layer to run parallel applications. Hence you will often hear of another cheap Linux supercomputer coming out. In reality it is a cluster of computers whose processing power can equal that of a true supercomputer, and a complete cluster configuration often costs in excess of $100,000. That may seem expensive to the average person, but it is cheap compared with a dedicated supercomputer costing millions of dollars.
Load balancing cluster
Load-balancing clusters provide a more practical system for business needs. As the name implies, the system distributes the processing load as evenly as possible across a cluster of computers. That load may be application processing load or network traffic load that needs to be balanced. Such a system is ideal for large numbers of users running the same set of applications. Each node can handle part of the load, and the load can be dynamically allocated between nodes to keep it balanced. The same goes for network traffic. Often a web server application receives more incoming traffic than it can handle and needs to send some of that traffic to web server applications running on other nodes. The distribution can also be optimized according to the particular capabilities of each node.

High availability cluster
High-availability clusters exist to keep the cluster's overall services available as much as possible, taking into account the fallibility of computing hardware and software. If the primary node in a high-availability cluster fails, it is replaced for that time by a secondary node. The secondary node is usually a mirror of the primary node, so when it takes over it can assume the primary's identity completely, and the system environment therefore remains consistent from the user's point of view.
Hybrids of these three basic cluster types are common. You can find high-availability clusters that also balance user load across their nodes while still trying to maintain high availability. Likewise, you can find parallel clusters in which load balancing between nodes is programmed into the application. Although the cluster system itself is independent of the software or hardware it uses, the hardware interconnect plays a key role in how effectively the system runs.
Giganet
Giganet is the first vendor of Virtual Interface (VI) architecture cards for the Linux platform, with its cLAN cards and switches. The VI architecture is a platform-independent software and hardware system that Intel has been promoting for creating clusters. It uses its own network communication protocol, rather than IP, to exchange data directly between servers, and it is not intended to be a WAN-routable system. The future of VI now rests with the work of the System I/O Group, which is the merger of the Intel-led Next-Generation I/O group and the Future I/O group led by IBM and Compaq. Giganet products currently provide 1 Gbps unidirectional communication between nodes, with a minimum latency of 7 microseconds.
IEEE SCI
The IEEE standard SCI has even lower latency (under 2.5 microseconds), and its unidirectional speed can reach 400 MB/sec (3.2 Gbps). SCI is a ring-based network system, unlike Ethernet's star topology, which makes communication between distant nodes faster. Even more useful is a torus topology, which has many rings between the nodes. A two-dimensional torus can be pictured as an N-by-M grid with a ring network running across each row and each column. A three-dimensional torus is similar, with a ring network across each layer of a three-dimensional mesh of nodes. Massively parallel supercomputing systems use torus topologies to provide the relatively fastest paths for communication between hundreds of nodes.
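To picture how a two-dimensional torus wires each row and each column into a ring, here is a minimal illustrative sketch (not part of any SCI product) that computes a node's four neighbors on an N-by-M torus, with wraparound at the edges:

    #include <stdio.h>

    /* Illustrative only: neighbors of node (row, col) on an N-by-M 2D torus.
     * Each row and each column forms a ring, so indices wrap around. */
    static void torus_neighbors(int row, int col, int N, int M)
    {
        int up    = (row - 1 + N) % N;   /* ring within the column */
        int down  = (row + 1) % N;
        int left  = (col - 1 + M) % M;   /* ring within the row */
        int right = (col + 1) % M;

        printf("node (%d,%d): up (%d,%d), down (%d,%d), left (%d,%d), right (%d,%d)\n",
               row, col, up, col, down, col, row, left, row, right);
    }

    int main(void)
    {
        torus_neighbors(0, 0, 4, 4);  /* even a corner node has four neighbors */
        return 0;
    }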
In most systems the limiting factor is not the operating system or the network interface, but the server's internal PCI bus. Nearly all desktop PCs and most low-end servers have only basic 32-bit, 33-MHz PCI, which provides 133 MB/sec (roughly 1 Gbps) and limits what these network cards can do. Some expensive high-end servers, such as the Compaq ProLiant 6500 and the IBM Netfinity 7000 series, have 64-bit, 66-MHz PCI slots that can run at four times that speed. Unfortunately, the contradiction is that more companies use low-end systems, so most vendors end up building and selling more of the low-end PCI network cards. Special 64-bit, 66-MHz PCI network cards do exist, but they are considerably more expensive. For example, Intel offers a Fast Ethernet card of this type for about $400 to $500, almost five times the price of the ordinary PCI version.
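The bandwidth figures above follow directly from the bus width and clock rate; a rough back-of-the-envelope calculation (illustrative only) is sketched below:

    #include <stdio.h>

    /* Theoretical peak PCI throughput: bus width (bits) x clock (MHz) / 8 bits per byte.
     * Illustrative arithmetic only; real-world throughput is lower. The nominal
     * 33-MHz figure is used here; the exact 33.33-MHz clock gives the commonly
     * quoted 133 MB/sec. */
    static double pci_peak_mb_per_sec(int width_bits, int clock_mhz)
    {
        return (double)width_bits * clock_mhz / 8.0;
    }

    int main(void)
    {
        printf("32-bit, 33-MHz PCI: %.0f MB/sec\n", pci_peak_mb_per_sec(32, 33)); /* ~132 MB/sec, ~1 Gbps   */
        printf("64-bit, 66-MHz PCI: %.0f MB/sec\n", pci_peak_mb_per_sec(64, 66)); /* ~528 MB/sec, 4x faster */
        return 0;
    }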
Scientific cluster
Some parallel cluster systems can achieve such high bandwidth and low latency because they usually bypass network protocols such as TCP/IP entirely. While the Internet Protocol is important for wide area networks, it carries a great deal of overhead that is unnecessary in a closed cluster of nodes that already know about one another. Indeed, some of these systems can use direct memory access (DMA) between nodes, much as graphics cards and other peripherals use it within a single machine. So, across the cluster, you get a form of distributed shared memory that can be accessed directly by any processor on any node. They can also use low-overhead messaging systems for communication between nodes.
The Message Passing Interface (MPI) is the most common implementation of the message-passing layer in parallel cluster systems. Several variants of MPI exist, but in all cases it gives developers a public API for writing parallel applications, so that they do not have to hand-craft the distribution of code segments among the cluster nodes. The Beowulf system, for one, first adopted MPI as its common programming interface.
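As a taste of the programming model, here is a minimal MPI sketch in C (assuming an MPI implementation such as MPICH or LAM/MPI is installed); each worker node sends a short greeting, and rank 0 collects and prints them:

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        char msg[64];

        MPI_Init(&argc, &argv);                 /* start the MPI runtime on every node   */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id within the cluster  */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of cooperating processes */

        if (rank != 0) {
            /* worker nodes: send a short greeting to the master */
            snprintf(msg, sizeof(msg), "hello from node %d of %d", rank, size);
            MPI_Send(msg, (int)strlen(msg) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        } else {
            /* master node: collect one message from every worker */
            int i;
            for (i = 1; i < size; i++) {
                MPI_Recv(msg, sizeof(msg), MPI_CHAR, i, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                printf("%s\n", msg);
            }
        }

        MPI_Finalize();
        return 0;
    }

A program like this is typically launched across the cluster with a command such as mpirun, and the MPI library takes care of routing the messages over whatever interconnect (Ethernet, Myrinet, and so on) the cluster uses.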
Deciding which high-performance clustering package to use is difficult. Many of them provide similar services, so the specific requirements of your computation become the deciding factor. In many cases, the research work behind these systems meets only half of real-world demands, and the software requires special help from, and cooperation with, the cluster's developers.
Beowulf
When people talk about Linux clusters, Beowulf is many people's first thought. It is the best-known Linux scientific software cluster system. No package is actually called Beowulf; it is, rather, a term that applies to a set of public software tools running on the Linux kernel. These include popular message-passing APIs such as the Message Passing Interface (MPI) or Parallel Virtual Machine (PVM), modifications to the Linux kernel that allow several Ethernet interfaces to be combined, high-performance network drivers, changes to the virtual memory manager, and the Distributed Interprocess Communication (DIPC) service. A common global process identification space allows any process to be accessed from any node using the DIPC mechanism. Beowulf also supports a range of hardware options for connecting the nodes.
Beowulf may be the first high-performance cluster system you notice when considering Linux, simply because it is so widely used and supported; there are many documents and books on the subject. The differences between Beowulf and the scientific cluster systems that follow may be real, or may exist only in the product name. For example, Alta Technologies' AltaCluster, despite the different name, is a Beowulf system. Some vendors, such as the German company ParTec AG, offer derivative versions of the Beowulf model that add other management interfaces and communication protocols.
Giganet cLAN
Giganet provides a custom hardware-based solution for communicating between the nodes of a scientific cluster using a non-IP protocol. As mentioned earlier, the Virtual Interface protocol strips out the overhead of protocols such as IP to support faster communication between servers. In addition, the hardware runs at gigabit speeds with low latency, which makes it ideal for building scientific clusters of up to 256 nodes. The vendor supports MPI, so many parallel applications written for similar systems (such as Beowulf) will run here as well. It also shares Beowulf's shortcoming: it cannot be used as a network load-sharing system unless you want to write an application that monitors and distributes the network packets sent between the servers.
Legion
Legion attempts to build a true multicomputer system: a cluster in which each node is a separate system, but in the user's view the entire system is just one computer. Legion is designed to support a worldwide computer made up of millions of hosts and trillions of software objects. In Legion, users can create their own cooperative groups.
Legion provides high-performance parallelism, load balancing, distributed data management, and fault tolerance. It supports high availability through fault-tolerant management and dynamic reconfiguration among its member nodes. It also has an extensible core that can be dynamically replaced or upgraded as new improvements and advances appear. The system is not under single control; it can be managed by any number of organizations, each supporting an autonomous part of the whole. The Legion API provides high-performance computation through its built-in parallelism.
Legion requires software written specially to use its API libraries. It sits on top of the operating system of each computer, coordinating local and distributed resources. It automates resource scheduling and security, and manages a context space in which the hundreds of millions of possible objects in the whole system are described and accessed. It does not, however, need to run with system administrator privileges on each node; it can work under an unprivileged user account. This adds flexibility for the nodes and users that join a Legion system.
Cplant
The Computational Plant at Sandia National Laboratories is a large-scale parallel cluster aimed at teraflop (trillion floating-point operations per second) performance and built from commodity components. The whole system consists of "scalable units" that can be assigned to different purposes (compute, disk I/O, network I/O, service administration). Each node in the cluster is a Linux system with specially developed kernel-level modules, and the function of each partition can be changed by loading and unloading kernel modules.
The project is being carried out in phases. The first phase was a prototype of 128 systems based on the 433-MHz DEC Alpha 21164, each with 192 MB of RAM and a 2-GB drive, connected with Myrinet network cards and 8-port SAN switches. Phase 1 expanded this to 400 Alpha 21164 workstations running at 500 MHz, each with 192 MB of RAM and no disks, connected with 16-port SAN switches in a hypercube structure and running Red Hat 5.1. The current Phase 2 has 592 machines based on the DEC Alpha 21264, running at 500 MHz with 256 MB of RAM and no drives. Each node uses a 64-bit, 33-MHz PCI Myrinet NIC, still connected through 16-port switches in a hypercube structure. Applications that run on Cplant include solving optimization problems, molecular mechanics, computational fluid dynamics and structural mechanics, finite-element analysis of linear structural mechanics, and a dynamic load-balancing library for parallel applications.
Jessica 2
The University of Hong Kong has a Java-based cluster called the Java-Enabled Single-System-Image Computing Architecture (JESSICA), which uses a middleware layer to maintain the illusion of a single system image. The layer is a global thread space, built on a distributed shared memory (DSM) system, for all threads running on the nodes. The project has used the TreadMarks DSM, but eventually moved to its own creation, JUMP (JiaJia Using Migrating-home Protocol). The team uses its custom Java-based ClusterProbe software to manage the cluster's 50 nodes.
PARIS
The Programming Parallel and Distributed Systems for Large Scale Numerical Simulation Applications (PARIS) project at the French IRISA institute provides tools for creating clusters of Linux servers. The project consists of three parts: resource management software for the cluster, a runtime environment for parallel programming languages, and software tools for distributed numerical simulation.
The resource management software includes the Globelins distributed system for sharing memory, disk, and processor resources, along with its Duplex and MOME distributed shared memory systems.
Load balancing cluster
Load-balancing clusters distribute network or computing processing loads across multiple nodes. The distinguishing factor here is the absence of a single parallel program that runs across the nodes. In most cases, each node in such a cluster is an independent system running separate software. But there is a common relationship between the nodes, whether they communicate with one another directly or through a central load-balancing server that controls each node's load. Usually a specific algorithm is used to distribute that load.
Network traffic load balancing is the process of examining the traffic coming into a cluster and then distributing it to individual nodes for appropriate processing. It works best for large network applications such as web or FTP servers. Load balancing application services, on the other hand, requires the cluster software to examine the current load on each node and determine which nodes can accept new jobs. This works best for serial and batch jobs such as data analysis. These systems can also be configured to take into account the particular hardware or operating system capabilities of each node, so the nodes in the cluster need not be identical.
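As a rough illustration of the second approach (a sketch only, not taken from any of the products below), a central dispatcher might poll each node's reported load and hand the next job to the least-loaded node that is still alive; the node names and load values here are hypothetical:

    #include <stdio.h>

    /* Hypothetical node table: in a real system the load values would be
     * reported by agents running on each node, not hard-coded. */
    struct node {
        const char *name;
        double      load;      /* e.g. 1-minute load average          */
        int         active;    /* 0 if the node failed a health check */
    };

    /* Pick the least-loaded node that is still alive. */
    static int pick_least_loaded(const struct node *nodes, int count)
    {
        int best = -1;
        for (int i = 0; i < count; i++) {
            if (!nodes[i].active)
                continue;
            if (best < 0 || nodes[i].load < nodes[best].load)
                best = i;
        }
        return best;   /* -1 means no usable node */
    }

    int main(void)
    {
        struct node cluster[] = {
            { "node1", 0.75, 1 },
            { "node2", 0.20, 1 },
            { "node3", 0.05, 0 },   /* down: skipped even though idle */
        };
        int chosen = pick_least_loaded(cluster, 3);
        if (chosen >= 0)
            printf("dispatching next job to %s\n", cluster[chosen].name);
        return 0;
    }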
Linux Virtual Server
The Linux Virtual Server project has implemented a number of kernel patches that create a load-balancing system for incoming TCP/IP traffic. The LVS software examines incoming traffic and then redirects it to a set of servers according to a load-balancing algorithm. This allows network applications, such as web servers, to run on a cluster of nodes and support a large number of users.
LVS supports cluster nodes connected directly to the same LAN as the load-balancing server, but it can also reach remote servers by tunneling IP packets. The latter method involves encapsulating the balanced requests inside IP packets that are sent from the load-balancing server directly to the remote cluster node. Although LVS can thus support load balancing for remote web sites, the load-balancing algorithms it uses are still not effective for wide-area web servers in a virtual cluster, so LVS works best as a load-balancing server when the web servers are on the same LAN. Several hardware implementations of load balancing are faster than those running in a general-purpose operating system such as Linux. These include hardware from Alteon and Foundry, whose hard-wired logic and minimal operating systems perform traffic management in hardware, faster than pure software can. Their prices are also high, typically starting around $10,000. If you need a simple, cheap solution, a mid-range Linux system with plenty of memory (256 MB) makes a good load-balancing system.
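To give a flavor of the kind of algorithm such a balancer applies, here is a minimal illustrative sketch (not LVS source code) of weighted round-robin selection, where a node with weight 2 receives twice as many requests as one with weight 1; the server addresses are hypothetical:

    #include <stdio.h>

    /* Illustrative weighted round-robin: each pass through the server list
     * hands a node as many requests as its weight before moving on. */
    struct server { const char *addr; int weight; };

    static struct server pool[] = {
        { "192.168.1.10", 2 },   /* hypothetical real-server addresses */
        { "192.168.1.11", 1 },
        { "192.168.1.12", 1 },
    };
    static const int pool_size = sizeof(pool) / sizeof(pool[0]);

    static const char *next_server(void)
    {
        static int index = 0, used = 0;

        if (used >= pool[index].weight) {       /* node has had its share */
            used = 0;
            index = (index + 1) % pool_size;    /* move to the next node  */
        }
        used++;
        return pool[index].addr;
    }

    int main(void)
    {
        for (int i = 0; i < 8; i++)
            printf("request %d -> %s\n", i, next_server());
        return 0;
    }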
TurboLinux TurboCluster and Enfuzion
TurboLinux has a product called TurboCluster that was originally based on the kernel patches developed by the Linux Virtual Server project. It therefore gains most of that project's advantages, but it also shares the same drawbacks. TurboLinux has also developed some tools to monitor cluster behavior, which make the product more practical. Commercial support from a major vendor also makes it more attractive for large sites.
Enfuzion, TurboLinux's forthcoming scientific cluster product, is not based on Beowulf. It can, however, support hundreds of nodes and many different non-Linux platforms, including Solaris, Windows NT, HP-UX, IBM AIX, SGI IRIX, and Tru64. Enfuzion is very interesting because it runs existing software as is and does not require custom parallel applications to be written for the environment. It supports automatic load balancing and resource sharing between nodes, and can automatically reschedule failed jobs.
Platform Computing LSF Batch
Platform Computing is an old hand in the field of cluster computing, and it now offers its Load Sharing Facility (LSF) Batch software on the Linux platform. LSF Batch allows a central controller to schedule jobs to run on any number of nodes in the cluster. Conceptually it is similar to TurboLinux's Enfuzion software, and it supports any type of application running on the nodes.
This approach is very flexible with respect to cluster size, because you can explicitly select the number of nodes, or even which nodes, an application uses. A 64-node cluster can thus be divided into smaller logical clusters, each running its own batch applications. Moreover, if an application or a node fails, the job can be rescheduled on other servers.
Platform's products run on the major Unix systems and on Windows NT. So far only its LSF Batch product has been ported to Linux; eventually the rest of the LSF Suite components will follow.
Resonate Dispatch Series
Resonate offers a software-based load-balancing approach similar to the Linux Virtual Server, but it supports more features and some better load-balancing algorithms. For example, with Resonate you can load an agent on each cluster node to determine that node's current system load. The load-balancing server then checks each node's agent to determine which node is least loaded and sends new traffic to it. In addition, Resonate can support geographically distributed servers through its Global Dispatch product. Resonate has been thoroughly tested on Red Hat Linux and believes it will run on other distributions as well. Resonate's software also runs on various other platforms, including Solaris, AIX, and Windows NT, and it can perform load balancing in a mixed environment.
MOSIX
MOSIX uses a modified version of the Linux kernel to implement a process load-balancing cluster system. In this cluster, any server or workstation can join or leave at will, adding its processing power to the cluster's total or removing it. According to its documentation, MOSIX uses adaptive process load-balancing and memory ushering algorithms to maximize overall performance. Application processes can migrate between nodes to take advantage of the best resources, much as a symmetric multiprocessing system switches applications between individual processors.
MOSIX is completely transparent at the application level, and applications need not be recompiled or relinked against new libraries, because everything happens at the kernel level. It can be configured as a multiuser shared-environment cluster in several ways. All the servers can share a single pool, a system can take part in only part of the cluster, or the cluster can be dynamically divided into several subclusters, each approach serving a different purpose. A Linux workstation can also be part of the cluster, either permanently or temporarily, or simply as a batch job submitter. As a temporary cluster node, a workstation can add its processing power to the cluster when it is idle. A cluster can also be used in batch-only mode, in which it is configured to accept batch jobs through a queue; a daemon then takes the jobs and sends them out to cluster nodes for processing.
Beyond high-performance scientific computing, MOSIX offers an interesting option for creating a cluster environment in an organizational setting. By drawing on the idle resources of servers and workstations over the network, it can build and run applications faster and more efficiently. Because multiple servers are involved, it can also provide highly available services by dynamically adjusting the size of the cluster and changing the load-balancing rules. MOSIX's drawback is that it changes the behavior of some core parts of the Linux kernel, so system-level applications may not run as expected. MOSIX is also limited for network applications that use socket connections bound to the address of a single server. This means that once a network application starts running on one server node, it must continue to run on that node if its IP address is bound to the socket. Evidently MOSIX is also beginning to work on migrating sockets, so this limitation may soon disappear.
High availability cluster
High-availability (HA) clusters strive to keep a server system running and responding for as much of the time as possible. They typically use redundant nodes and services running on multiple machines that track one another. If a node fails, its replacement takes over its responsibilities within seconds or less, so that, to the user, the cluster never appears to go down.
Some HA clusters can also maintain redundant applications across nodes, so a user's application keeps running even if the node it was using fails. The running application migrates to another node within seconds, and all the user notices is a slightly slower response. This application-level redundancy, however, requires the software to be designed to be cluster-aware and to know what to do when a node fails. For Linux, much of this is still not possible, because Linux has no HA cluster standard and no public API for building cluster-aware software. HA clusters can do load balancing, but usually the primary server runs the jobs while the system keeps the secondary server idle. The secondary server is usually a mirror of the primary server's operating system setup, even if the hardware itself differs slightly. The secondary node actively monitors the primary, or listens for its heartbeat, to see whether it is still running. If the heartbeat timer gets no response from the primary server, the secondary node takes over its network and system identity (on a Linux system, the IP hostname and address).
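As a rough sketch of the heartbeat idea (illustrative only, not the Linux-HA project's actual code), the secondary node might check the primary periodically and invoke a takeover script once several checks in a row fail; the hostname and the takeover script shown here are hypothetical:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define HEARTBEAT_INTERVAL 2   /* seconds between checks      */
    #define MAX_MISSES         3   /* failures before taking over */

    /* Hypothetical check: returns nonzero if the primary answered. Here we
     * simply shell out to ping; a real monitor would use a dedicated
     * heartbeat link between the nodes. */
    static int primary_alive(const char *host)
    {
        char cmd[128];
        snprintf(cmd, sizeof(cmd), "ping -c 1 -W 1 %s > /dev/null 2>&1", host);
        return system(cmd) == 0;
    }

    int main(void)
    {
        const char *primary = "primary.example.com";   /* hypothetical name */
        int misses = 0;

        for (;;) {
            if (primary_alive(primary)) {
                misses = 0;
            } else if (++misses >= MAX_MISSES) {
                fprintf(stderr, "primary unreachable, taking over its identity\n");
                /* Hypothetical script that claims the primary's IP address
                 * (for example via ifconfig/ip and gratuitous ARP) and
                 * starts the services. */
                system("/usr/local/sbin/takeover.sh");
                break;
            }
            sleep(HEARTBEAT_INTERVAL);
        }
        return 0;
    }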
In this area, however, Linux still lags behind. The good news is that the well-known vendors are hurrying to deliver high-availability clustering as quickly as possible, since it is an expected capability of enterprise-class servers.
Linux-HA project
The High Availability Linux project, according to its mission statement, aims to provide a high-availability solution for Linux that improves reliability, availability, and serviceability through a community development effort. It is an attempt to make Linux as competitive, where high-availability clustering is concerned, as the advanced Unix systems such as Solaris, AIX, and HP-UX. The project's goal is therefore to reach the specific levels of cluster functionality analyzed by the D. H. Brown group.
Software already exists that maintains a heartbeat between the nodes and can take over the IP address of a failed node. If a node fails, the "fake" redundant IP package adds the failed node's address to a working node so that it can assume the failed node's responsibilities. The failed node can thus be replaced automatically within milliseconds. In practice, the heartbeat interval is usually a few seconds unless there is a dedicated network link between the nodes. User applications that were running on the failed system still have to be restarted on the new node.
Unwanted clusters
Many cluster systems are available for Linux. At the same time, several of those projects are non-commercial, even experimental, in nature. That is not a problem for academia and certain organizations, but large companies usually prefer commercially supported platforms from well-known vendors. Vendors such as IBM, SGI, HP, and Sun offer products and services for building scientific clusters on Linux, because clustering is popular and lets them sell large numbers of server units. Once commercial organizations come to trust other forms of clusters, those same server vendors will likely build their own products around the open source cluster solutions.
Linux's importance as a server platform depends on its ability to support larger servers and server clusters. That is what lets it compete with the Unix servers from Sun, HP, IBM, and other companies. And although Windows NT and 2000 do not support the range of clustering that Linux does, the availability of a formal HA clustering approach and of an API for building cluster-aware software lets them compete as well.
If you are considering building a cluster, you should examine these possibilities carefully and compare them with your needs. You may find that what you want to achieve has no complete solution yet, or you may find a solution that works extremely well. Either way, rest assured that many existing companies already entrust their applications to Linux clusters, whether for intensive computation or for serving large web sites. Clustering is an enterprise system service that has been successfully proven under Linux. And although new clusters will keep appearing, the diversity of choices is already an advantage Linux holds over other systems such as Windows NT.

About the author
Rawn Shah is an independent consultant based in Tucson, Arizona. He has dealt with multiplatform issues for many years, and is often baffled by how few people know about the useful system tools available to them.