A Look at High-End UNIX Server Technology
Author: Chen Zhihong, Haina
Reprinted from: Said.com technology world
For servers (whether PC servers or UNIX servers), further improving the computing and processing power of a single processor is becoming increasingly difficult. Many manufacturers have made unremitting efforts in materials, process, and design, and CPU clock speeds have so far maintained rapid growth; but the side effects of high frequency, such as power consumption and heat dissipation, are in turn pushing the limits of single-processor performance. Clearly, raising the speed and performance of a single processor is reaching its end, and parallel processing with multiple CPUs is the effective way to truly improve the processing power and computing speed of modern servers. That is also why multiprocessor servers are no longer the preserve of UNIX systems but have been widely adopted in PC servers as well. At present, the main parallel processing technologies in the industry are SMP, MPP, COMA, cluster, and NUMA.
1. SMP technology  SMP (Symmetrical Multiprocessing) is multiprocessing as opposed to asymmetric multiprocessing, and is the most widely used parallel technology. In this architecture, multiple processors run a single copy of the operating system and share the memory and other resources of one computer. All processors have equal access to memory, I/O, and external interrupts.
In an asymmetric multiprocessing system, tasks and resources are managed by different processors: some CPUs handle only I/O, others handle only tasks submitted to the operating system. Clearly, an asymmetric multiprocessing system cannot achieve load balancing. In a symmetric multiprocessing system, system resources are shared by all CPUs, and the workload can be distributed evenly across all available processors.
Currently, in most SMP systems the CPUs access data over a shared system bus to implement symmetric processing. Some RISC server vendors instead connect multiple CPUs with a crossbar or switch; although their performance and scalability are better than those of the Intel architecture, the scalability of SMP remains limited.
The difficulty in adding more processors to an SMP system is that the system must spend resources handling the processors' contention for memory and keeping memory synchronized; these are the two main issues. Contention means that when multiple processors access the same data in memory, they cannot all read and write it at the same time. While one CPU is reading a piece of data, other CPUs may read it as well; but when a CPU is modifying the data, it locks the data, and the other CPUs must wait until the modification completes before they can operate on it.
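The locking discipline described above, in which many CPUs may read a datum concurrently but a writer must hold it exclusively, can be sketched as a generic readers-writer lock. This is an illustrative model, not code from any particular UNIX kernel:

```python
# Readers-writer lock sketch: many concurrent readers, one exclusive
# writer.  All names here are illustrative.
import threading

class RWLock:
    def __init__(self):
        self._readers = 0
        self._count_lock = threading.Lock()   # protects the reader count
        self._write_lock = threading.Lock()   # held exclusively by a writer

    def acquire_read(self):
        with self._count_lock:
            self._readers += 1
            if self._readers == 1:
                self._write_lock.acquire()    # first reader blocks writers

    def release_read(self):
        with self._count_lock:
            self._readers -= 1
            if self._readers == 0:
                self._write_lock.release()    # last reader lets writers in

    def acquire_write(self):
        self._write_lock.acquire()            # exclusive: waits for readers

    def release_write(self):
        self._write_lock.release()

# Single-threaded demonstration of the discipline: a "CPU" takes the
# exclusive lock to modify the shared datum, readers take the shared lock.
shared = {"value": 0}
rw = RWLock()

def cpu_write(n):
    rw.acquire_write()
    shared["value"] = n                       # no reader sees a torn update
    rw.release_write()

def cpu_read():
    rw.acquire_read()
    v = shared["value"]
    rw.release_read()
    return v

cpu_write(42)
print(cpu_read())   # 42
```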
Obviously, the more CPUs there are, the more serious this waiting problem becomes; system performance may not only fail to improve but can even decline. In order to add as many CPUs as possible, current SMP systems basically rely on enlarging the server's cache to reduce memory contention, because the cache is the CPU's "local memory" and data exchange between it and the CPU is much faster than over the memory bus. Also, because caches are not shared, multiple CPUs do not contend for the same cache resource, and many data operations can be completed entirely in the cache built into the CPU.
However, although enlarging the cache relieves memory contention in an SMP system, it creates another hard problem, so-called "memory synchronization". In an SMP system each CPU accesses memory through its cache, and the system must keep the data in memory consistent with the data in the caches: if the content of a cache is updated, the corresponding content in memory should be updated as well, otherwise system data consistency is affected. Since each update occupies the CPU and also requires locking the updated region of memory, an update frequency that is too high inevitably hurts system performance, while an update interval that is too long risks inconsistency; the update algorithm is therefore very important. Current SMP systems use a snooping protocol to keep the data in the CPU caches consistent with memory. The larger the cache, the lower the probability of contention for memory, and because of the cache's high transfer speed, enlarging the cache also raises the CPU's operating efficiency; but it makes keeping memory synchronized harder for the system as well. In hardware terms, SMP can be implemented on the UltraSPARC, SPARCserver, Alpha, and PowerPC architectures, and also with all Intel chips from the 486 onward.
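The snooping idea above, where every cache watches the shared bus and a write by one CPU invalidates the stale copies held by the others, can be sketched as a minimal write-invalidate model. This is a teaching sketch with invented class names, not a real MESI implementation:

```python
# Write-invalidate snooping sketch: caches register on a shared "bus"
# list; a writer invalidates every other cache's copy of the line.
class Memory:
    def __init__(self):
        self.data = {}

class SnoopyCache:
    def __init__(self, memory, bus):
        self.memory = memory
        self.bus = bus
        self.lines = {}                 # address -> cached value
        bus.append(self)                # join the snooped bus

    def read(self, addr):
        if addr not in self.lines:      # miss: fetch from memory
            self.lines[addr] = self.memory.data.get(addr)
        return self.lines[addr]

    def write(self, addr, value):
        for cache in self.bus:          # broadcast invalidate on the bus
            if cache is not self:
                cache.lines.pop(addr, None)
        self.lines[addr] = value
        self.memory.data[addr] = value  # write-through, for simplicity

mem, bus = Memory(), []
c0, c1 = SnoopyCache(mem, bus), SnoopyCache(mem, bus)

c0.write(0x10, 1)
print(c1.read(0x10))   # 1  (miss, fetched from memory)
c0.write(0x10, 2)      # invalidates c1's copy
print(c1.read(0x10))   # 2  (re-fetched, not the stale 1)
```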
2. Cluster technology  A cluster is a technique developed in recent years for building high-performance computers: a set of independent computers joined by a high-speed communication network into a single computer system and managed as a single system. Its starting point is to provide high reliability, scalability, and disaster recovery. A cluster contains multiple servers sharing data storage space, and the servers communicate with one another over an internal local area network. When one server fails, the applications it was running are automatically taken over by another server. In most configurations, all computers in the cluster share a common name, and a service running on any member of the cluster can be used by all network clients. Cluster systems are usually adopted to improve system stability and the data processing and service capacity of a network center.
Common cluster technologies are:
(1) Server mirroring technology
(2) Application failover cluster technology
Failover cluster technology connects two or more servers on the same network through cluster software. Each server node in the cluster runs a different application and has its own broadcast address, providing services to front-end users, while at the same time monitoring the running state of the other servers and providing hot-standby backup for a designated server.
Failover cluster technology usually requires an external storage device, a disk array cabinet: two or more servers are connected to the disk array through SCSI cables or fiber, and the data is stored on the disk array cabinet. In such a cluster system, two nodes typically back each other up, rather than several servers backing one another up simultaneously, and the nodes monitor each other's heartbeat over a serial port, a shared disk partition, or an internal network.
Failover cluster technology is often used for clusters of database servers, mail servers, and the like. Because of the shared storage device, it increases the cost of peripherals, but it can cluster up to 32 machines and greatly improves the availability and scalability of the system. At present the most widely used technique for improving system availability is application failover, that is, the familiar dual-machine cluster sharing a disk array over SCSI cables.
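The heartbeat monitoring and takeover described above can be sketched as a small simulation. A fake tick clock is used so the run is deterministic; the names (`Node`, `TIMEOUT`) and the takeover rule are illustrative, not taken from any cluster product:

```python
# Heartbeat failover sketch: each node records a heartbeat; a monitor
# moves services off any node whose heartbeat has gone stale.
TIMEOUT = 3   # missed-heartbeat threshold, in ticks (illustrative)

class Node:
    def __init__(self, name):
        self.name = name
        self.last_beat = 0
        self.active_services = set()

    def beat(self, now):
        self.last_beat = now

    def is_alive(self, now):
        return now - self.last_beat < TIMEOUT

def monitor(nodes, now):
    """Move services from dead nodes onto the first surviving node."""
    alive = [n for n in nodes if n.is_alive(now)]
    for n in nodes:
        if not n.is_alive(now) and n.active_services and alive:
            alive[0].active_services |= n.active_services
            n.active_services.clear()

a, b = Node("a"), Node("b")
a.active_services = {"database"}
b.active_services = {"mail"}

# Ticks 0-1: both nodes beating, nothing moves.
for now in (0, 1):
    a.beat(now); b.beat(now)
    monitor([a, b], now)

# Node a stops beating; by tick 4 its heartbeat is stale and
# node b takes over the database service.
b.beat(4)
monitor([a, b], now=4)
print(sorted(b.active_services))   # ['database', 'mail']
```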
(3) Fault-tolerant cluster technology
A typical application of fault-tolerant cluster technology is the fault-tolerant machine, in which every component has a redundant design. Implementing fault-tolerant cluster technology often requires special hardware and software design, so the cost is high; but a fault-tolerant system maximizes system availability and is the best choice for the financial and security sectors.
3. NUMA technology  Today's 64-bit UNIX parallel computing servers can be divided into two categories: distributed shared memory (DSM) structures and cluster systems. Non-uniform memory access (NUMA) is a parallel model belonging to DSM. NUMA's physical memory is distributed across different nodes, and accessing data on a remote node takes much longer than accessing local data on the same node, hence "non-uniform" memory access. Symmetric multiprocessing is also a shared-memory multiprocessor structure; its single address space and simple, convenient programming model are the main reasons it spread so easily. Massively parallel processing (MPP) belongs to the cluster class of systems; its advantage is good scalability, but it requires message-passing programming and parallel compilation, which is inconvenient. The NUMA system combines the advantages of SMP and the cluster: it retains SMP's programmability while gaining the cluster's scalability. In fact, this "combination of advantages" is a trade-off, and the key is to find the right way to combine them and to determine the point of combination.
The NUMA3 system architecture  NUMA (Non-Uniform Memory Access architecture) was born as a research project at Stanford University in the late 1980s. In 1995, SGI designed the first-generation NUMA system together with Stanford University's DASH project; in 1996, SGI launched the second-generation NUMA system, whose processor count could be extended from 32 to 512 and whose system bandwidth could expand with the number of processors. With further additions and improvements to the NUMA architecture, SGI launched the third-generation NUMA system, known as the NUMA3 system, in the fall of 2000. The new system allows computing power, memory capacity, storage capacity, graphics performance, and networking capacity to be expanded independently, and the NUMA3 system has three major characteristics: multi-functionality, modularity, and flexibility. Users can configure the system as needed, so their investment is fully protected.
The third-generation NUMA system is composed of different functional modules (bricks), which are smaller, more specialized, more scalable, and more standardized than the modules of the second generation, further increasing the system's flexibility.
In a third-generation NUMA server, all processors and memory are connected through a higher-performance crossbar switch called Bedrock. The combination of these processors, memory, and crossbar switches constitutes an interconnect structure called NUMAlink. In addition, a more advanced router chip is used in the third-generation NUMA server: by providing an interconnect network with high bandwidth and extremely low latency, the router chips link the nodes into a single, continuous memory space of up to 1TB. The communication bandwidth (bidirectional) between a processor and local or remote memory has been raised from the original 1.6Gbps to 3.2Gbps. In addition, the power supply is redundant in an N+1 arrangement, so reliability is further improved.
In the NUMA architecture, each processor is connected to its own local memory and cache, and the processors are linked together by a processor/memory interconnect network. The processors also access shared I/O and peripheral devices through a processor/I/O network, and an optional inter-processor communication network handles communication between processors. NUMA technology holds an irreplaceable position in scientific and engineering computing, and is increasingly important in online transaction processing (OLTP), decision support systems (DSS), and intranet and Internet applications. At present a NUMA machine can have up to 512 processors, and its bandwidth can expand substantially with the processor count. Such a large number of processors makes a single NUMA system sufficient to cover most applications. First, since it has the same programming model as SMP, it is irreplaceable in scientific and engineering computing; second, because it offers shared memory together with good scalability, it adapts well to the applications of an enterprise data center. Today, NUMA systems can run some of the world's largest UNIX database applications and are widely accepted as mainstream technology for e-commerce, offering strong processing power, large-scale scalability, high availability, broad flexibility in workload and resource management, and no need to change the SMP programming model.
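The non-uniform access that gives NUMA its name can be made concrete with a toy cost model: a processor reads its own node's memory cheaply, while a read that crosses the interconnect to a remote node pays extra per hop. The latency numbers below are purely illustrative, not measurements of any machine:

```python
# Toy NUMA access-cost model.  LOCAL_LATENCY and HOP_LATENCY are
# arbitrary units chosen for illustration.
LOCAL_LATENCY = 1      # cost of hitting the node's own memory
HOP_LATENCY = 3        # extra cost per interconnect hop

class NumaNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.memory = {}

def access_cost(cpu_node, data_node, hops=1):
    """Cost for cpu_node to read memory that lives on data_node."""
    if cpu_node.node_id == data_node.node_id:
        return LOCAL_LATENCY                      # local access
    return LOCAL_LATENCY + hops * HOP_LATENCY     # remote: cross the network

n0, n1 = NumaNode(0), NumaNode(1)
print(access_cost(n0, n0))   # 1  (local)
print(access_cost(n0, n1))   # 4  (remote, one hop)
```

The point of the model is simply that placement matters: the same load instruction costs more the further the data lives from the issuing processor, which is why NUMA operating systems try to allocate memory on the requesting CPU's node.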
Storage consistency and ccNUMA  In a NUMA parallel machine, although the memory is physically distributed across the nodes, it can be accessed and shared by all processors in the system. The storage consistency problem arises when multiple processors share the same storage unit. SGI uses cache coherence technology to resolve the storage consistency problem, which is the origin of the "cc" in "ccNUMA".
In a ccNUMA system, each CPU has a private cache, and to get better performance the CPU keeps instructions and data in it. In such a system, the content of one memory address can exist in many separate copies: if several CPUs reference the same memory address, each CPU's cache holds a copy of that address's content. However, when one CPU modifies the content of the address, the other CPUs must be prevented from using the now "outdated" data; this is the so-called cache consistency problem. So how do you guarantee that all caches reflect the true state of memory?
Cache consistency is not implemented in software; it must be managed by hardware. Nor is it implemented by the CPUs themselves, but by auxiliary logic in the hub chip. In order to improve the bandwidth and scalability of the system, a server with a ccNUMA architecture does not use a bus-based broadcast method, but a directory-based cache coherence scheme. Whenever a node requests access to a cache line in memory, its hub records in the directory that this node has accessed the line and copies the line's memory data into the node's cache. As long as the cache line is not held exclusively, other nodes can read the line's data in the same way. When a CPU wants to modify a cache line, it must first obtain exclusive rights: the hub retrieves the line's status bits from the directory, sends an invalidation message to every node holding a copy of the line, and records in the directory the node that is modifying the content. When a CPU reads a cache line that is held exclusively, the hub asks the owning node to write the line back, and the other nodes then copy the latest contents of the cache line from the owner's cache over the interconnect network. If two nodes request access to the same cache line at the same time, a protocol ensures that only one node at a time accesses it.
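The directory-based scheme described above can be sketched in a few dozen lines: the home node's directory records which nodes hold a copy of each line, so invalidations go point-to-point to the sharers instead of being broadcast on a bus, and an exclusively held line is written back before another node reads it. This is an illustrative model only; in real ccNUMA hardware the hub implements this logic in silicon:

```python
# Directory-based cache coherence sketch (invented class names).
class Directory:
    def __init__(self):
        self.memory = {}        # line address -> value at the home node
        self.sharers = {}       # line address -> set of nodes with a copy
        self.owner = {}         # line address -> node with exclusive rights

    def read(self, node, addr):
        if addr in self.owner:                      # line held exclusively:
            owner = self.owner.pop(addr)            # owner writes it back
            self.memory[addr] = owner.cache[addr]
            self.sharers[addr] = {owner}
        self.sharers.setdefault(addr, set()).add(node)
        node.cache[addr] = self.memory.get(addr)
        return node.cache[addr]

    def write(self, node, addr, value):
        for sharer in self.sharers.get(addr, set()):
            if sharer is not node:
                sharer.cache.pop(addr, None)        # targeted invalidation
        self.sharers[addr] = {node}
        self.owner[addr] = node                     # grant exclusive rights
        node.cache[addr] = value

class CcNode:
    def __init__(self):
        self.cache = {}

home = Directory()
n0, n1 = CcNode(), CcNode()

home.memory[0x40] = 7
home.read(n0, 0x40)            # both nodes come to share the line
home.read(n1, 0x40)
home.write(n0, 0x40, 8)        # directory invalidates only n1's copy
print(0x40 in n1.cache)        # False
print(home.read(n1, 0x40))     # 8  (written back by the exclusive owner)
```

Note the design point the sketch illustrates: because the directory knows the exact set of sharers, coherence traffic scales with the number of copies rather than with the number of nodes, which is what lets the scheme reach hundreds of processors where a snooped bus cannot.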
The ccNUMA structure suits users in performance, flexibility, and availability, and has become an outstanding Internet/Web server in today's Internet economy, especially as a core server for multimedia applications on the broadband Internet. It differs from a cluster: a cluster is a loose combination of machines communicating with one another, with long internal exchange times and high overhead, and managing several machines as one system increases the management burden. A ccNUMA computer is different: no matter how many processors it has, it remains simply one computer. All in all, ccNUMA overcomes some of the shortcomings of both SMP and clusters, and shines where they cannot.
Modular structure  A modular server mainly comprises computing modules, I/O modules, and mass-storage modules, which work together to form a modular server system. In a modular server system, each module can be upgraded or, on failure, replaced with a new one, and additional modules of the same kind can be added to the server at any time to expand the system.
One of the biggest benefits of modular servers is protecting the customer's investment: a modular server is a scalable server, and customers can expand their server system as business needs grow by adding modules to it. Another significant advantage is that maintenance and management are very convenient, and modular servers enhance the availability and fault tolerance of the system. From the perspective of high-performance multiprocessor computer architecture, the ccNUMA architecture interconnects many processors through router fiber links, and system bandwidth grows with system scale, thereby overcoming the bottleneck of the bus-based SMP architecture. The ccNUMA structure adopts the multi-dimensional interconnect characteristics of the hypercube, plus the flexibility of modular computing, so the scalability of the system reaches an unprecedented level while saving costs. Modular NUMA servers have therefore reached a new level of flexibility and economy.
Hardware partitioning  Hardware partitioning is an architecture in which the hardware of one server is divided into multiple partitions. The hardware resources configured in the system, such as processors, memory, and I/O controllers, are assigned to the partitions, and a different OS can run in each partition; this is what "partitioning" provides. With hardware partitioning, one system can support a variety of different operating systems to meet growing needs on the same physical hardware. System partitioning was initially static: when a resource was moved from one partition to another, the applications and operating systems in both partitions had to stop, and only after reconfiguration at the operating console could they restart. As operating systems have further improved, they now also support dynamic partitioning along with hot-swap and hot-add capabilities, which means resources can be moved between partitions without interrupting the applications running in them.
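The difference between static and dynamic partitioning described above can be sketched as a toy resource model: a static move requires both partitions to be stopped first, while a dynamic move happens while they run. The names and rules here are illustrative, not any vendor's partitioning interface:

```python
# Toy model of static vs. dynamic hardware partitioning.
class Partition:
    def __init__(self, name, cpus):
        self.name = name
        self.cpus = cpus
        self.running = True            # is an OS up in this partition?

def move_cpus_static(src, dst, count):
    """Static scheme: both partitions must be stopped before moving."""
    if src.running or dst.running:
        raise RuntimeError("static repartition: stop both partitions first")
    src.cpus -= count
    dst.cpus += count

def move_cpus_dynamic(src, dst, count):
    """Dynamic scheme: hot resource move, no shutdown needed."""
    src.cpus -= count
    dst.cpus += count

p1 = Partition("oltp", cpus=16)
p2 = Partition("batch", cpus=8)

try:
    move_cpus_static(p1, p2, 4)        # fails: partitions still running
except RuntimeError as e:
    print(e)

move_cpus_dynamic(p1, p2, 4)           # succeeds with both partitions up
print(p1.cpus, p2.cpus)                # 12 12
```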
For a long time there has been a debate over whether UNIX servers can replace proprietary systems. Opinions differ, but the fact is that UNIX manufacturers keep bringing out new products and raising their technical content; their high-end systems now reach or exceed mainframes, serve as the concentrated expression of each vendor's UNIX technology, and compete with the offerings of the other UNIX vendors.
Of course, the development of UNIX technology will never be limited to performance and functionality; in security and manageability, UNIX servers still have much to learn from mainframes. Nevertheless, high-end UNIX systems have basically met the requirements of mainframe users, and they are progressing quickly under each manufacturer's efforts, so it is inevitable that UNIX will take part of the mainframe market, much as NT servers have competed with low-end UNIX servers.
With ever-fiercer market competition, large-scale mergers of IT companies have occurred, and of the dozens of UNIX vendors that once competed, only a few remain; they now determine the market and technical direction of UNIX servers. Let's take a look at several representative products.
HP 9000 SuperDome High-end UNIX Server
The SuperDome server is mainly aimed at large Internet companies, Internet service providers, and enterprises pursuing e-commerce strategies that need to handle large amounts of data.
SuperDome is HP's first server after the V series to scale to 64 microprocessors and beyond, and it can support PA-RISC and IA-64 processors at the same time. A 64-CPU configuration supports 256GB of memory and up to 192 I/O slots. SuperDome leads in open-system integration: the same system can be partitioned to run HP-UX 11i, Linux, MPE, and Windows NT, and its capacity can be divided through hardware physical partitions, multi-level partitions, software virtual partitions, and dynamic partitioning functions. HP has also developed a new sales, pricing, and service model for SuperDome, the "comprehensive customer service model": SuperDome customers receive a complete set of solutions from HP, so they can enjoy HP's support services through the four phases of system innovation, construction, operation, and upgrade.
IBM pSeries 680 Server
The pSeries 680 server is IBM's most powerful UNIX symmetric multiprocessor system, able to handle many different mission-critical e-commerce applications (including Web serving, ERP, SCM, CRM, etc.).
The microprocessor in the 680 uses silicon-on-insulator (SOI) technology, which reduces current-induced heating by coating a layer of insulator over the chip circuitry. The reduced chip heat not only improves running speed but also reduces runtime errors and the chance of a system crash. The 680's CPUs run at 600MHz, each microprocessor has 16MB of cache, system memory capacity reaches 96GB, and an SMP configuration can include up to four 6-way processor modules.
Sun Enterprise 10000 Server
The Sun Enterprise 10000 runs in the Solaris operating environment and is an ideal general-purpose or data server for host-based or client/server applications such as online transaction processing, decision support systems, data warehousing, communication services, or multimedia services.
The Sun E10000 is a high-end UNIX server with hardware partitioning, and the first UNIX machine to offer 64-way SMP computing. It uses 400MHz Sun UltraSPARC II processors and scales to 64 CPUs, with a maximum memory capacity of 64GB; it can be divided into up to 16 partitions (or domains); online disk storage capacity reaches 64TB; and it offers dynamic reconfiguration (online serviceability) and dynamic system domain features. The core of the system is interconnected by the Gigaplane-XB, which provides a data bandwidth of up to 12.8GB per second.
SGI Origin 3000 Server
The new SGI Origin 3000 high-performance servers adopt the SGI NUMAflex structure, a new modular computing architecture, that is, the third-generation NUMA architecture. The Origin 3000 server can scale from a minimum configuration of two 64-bit MIPS RISC CPUs to a shared-memory multiprocessor system of 512 CPUs. In addition, SGI's InfiniteReality3 graphics subsystem can be integrated into this structure.