Taking the CC-NUMA Express
By Xu Yaochang, reprinted from Network World
Editor's note: NUMA, the Non-Uniform Memory Access architecture, sounds like a complex and unfamiliar name. It was born as a research project at Stanford University in the late 1980s and reached the market in the 1990s. Today, NUMA systems run some of the world's largest UNIX database applications, and the technology is being widely accepted as a mainstream platform for e-commerce, offering strong processing power, large-scale scalability, high availability, and wide flexibility in workload and resource management, all without changing the SMP programming model. In this issue's technical close-up, we interpret NUMA technology from the perspectives of bandwidth and architecture, and compare it with SMP and clusters, to give readers a deeper understanding.
The importance of computer bandwidth
Every signal occupies a certain frequency range, and we call this frequency range its bandwidth. To keep distortion as small as possible when a signal passes through a channel, the channel's bandwidth should be as wide as possible. Shannon's theorem tells us that the limiting data rate of a channel is proportional to its bandwidth. It is well known that different channels have different bandwidths. For digital transmission, the rate is usually measured in bits per second: coaxial cable, for example, offers a data transfer rate of about 20 Mbps, while optical fiber can reach thousands of Mbps.
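For reference, Shannon's theorem for a noisy channel can be stated as follows (the formula is standard and is added here only for clarity; it does not appear in the original article):

\[
  C = B \log_2\!\left(1 + \frac{S}{N}\right)
\]

where C is the maximum error-free data rate of the channel in bits per second, B is the channel bandwidth in hertz, and S/N is the signal-to-noise ratio.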
The bandwidth of a computer system refers to the number of operations it can perform per unit time, while channel or memory bandwidth refers to their data transfer rates. The data transfer rates among the components of a computer should be kept in balance.
The server is the center of computation, management, and service in a network system, and it is one of the system's key, core devices. A server must not only offer fast processing, large capacity, good fault tolerance, and strong scalability; to guarantee an adequate data transfer rate, it must also have the necessary bandwidth.
Memory bandwidth and balance
According to statistics, microprocessor CPU speed increases by roughly 80% per year, while memory access speed improves by only about 7% per year, so the gap between CPU and memory performance grows geometrically. Many studies define the concept of machine balance as the ratio of the number of floating-point operations per CPU cycle to the number of memory accesses per CPU cycle, namely: balance = (floating-point operations / CPU cycle) / (memory accesses / CPU cycle). Because this does not take into account the true cost of memory access in most systems, the result is not very meaningful.
To overcome the shortcoming of the above definition, machine balance has been redefined in terms of the number of memory accesses that can be sustained for non-cached, unit-stride vector operands, giving: balance = (peak floating-point operations / CPU cycle) / (sustained memory accesses / CPU cycle).
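As a purely hypothetical illustration (the numbers are invented for this example and do not describe any machine discussed in the article), a processor with a peak of 2 floating-point operations per cycle that can sustain 0.5 non-cached memory accesses per cycle has

\[
  \text{balance} = \frac{2\ \text{flops/cycle}}{0.5\ \text{accesses/cycle}} = 4,
\]

meaning a program must perform about 4 floating-point operations per memory reference to keep the floating-point units busy.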
According to the above definition, test results for computers of the current major architectures are as follows.
Single processor: balance generally good; performance low to moderate.
Shared memory: balance poor; scalability fair; performance medium.
Vector machine: balance good; scalability medium; performance good.
Distributed memory: balance average; scalability good; performance good.
Overview of computer architectures
Single processor architecture
In a computer with hierarchical storage, a key factor determining sustained memory bandwidth is the waiting time (latency) of a cache miss. Cache-based storage systems have changed significantly, and the proportions of waiting time and transfer time in a memory access have shifted greatly: on a 20 MHz machine in 1990, waiting time and transfer time were roughly equal, whereas on a 100 MHz machine in 1995, waiting time accounted for most of the access.
Shared memory architecture
The vector machine belongs to the shared-memory class of architectures (distributed shared-memory machines excepted). It greatly simplifies cache consistency and waiting time (processing delay). However, vector machines are more expensive than superscalar machines with shared or hierarchical memory. They have a fixed memory bandwidth limit, which means the machine balance value rises as processors are added, so there is an upper limit on the number of processors. Typically, a shared-memory system is non-blocking among its processors, allowing multiple CPUs to work at once, which compensates for the larger delays caused by waiting time. When multiple processors are used, the handling of cache misses is bounded by waiting time, by bandwidth limits, and by the limits of the bus, network, or crossbar controller. In a vector computer, the limit is mainly bandwidth rather than waiting time.
Symmetric multiprocessing (SMP) shared memory systems
A symmetric multiprocessing (SMP) node contains two or more identical processors, with no master/slave relationship. Each processor has equal access to the node's computing resources. The interconnection between processors and memory within the node must use a scheme that maintains cache consistency. Consistency means that the processors can hold or share only a single unique value for each datum in memory.
An SMP shared-memory system connects multiple processors to a centralized memory. In an SMP environment, all processors access the same memory over the bus, which means an SMP system runs only one copy of the operating system, and applications written for single-processor systems can run on an SMP system unchanged. The SMP system is therefore sometimes called a uniform memory access system: for all processors, the time required to access any address in memory is the same.
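This single-address-space programming model can be sketched with a minimal POSIX threads example (a generic illustration, not code from any system in this article; the array size and thread count are arbitrary assumptions). Several threads, which the operating system may schedule on different processors, read and write the same arrays directly, and the SMP hardware keeps their caches coherent.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

static double data[N];              /* one array, shared by all threads */
static double partial[NTHREADS];    /* one result slot per thread */

/* Each thread sums its own slice of the shared array. */
static void *worker(void *arg) {
    long id = (long)arg;
    long lo = id * (N / NTHREADS), hi = lo + N / NTHREADS;
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += data[i];
    partial[id] = s;
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long i = 0; i < N; i++)
        data[i] = 1.0;
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    double total = 0.0;
    for (long i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);
        total += partial[i];
    }
    printf("sum = %.0f\n", total);
    return 0;
}

On most UNIX systems this compiles with the pthread library (for example, cc file.c -lpthread); the same source runs unchanged whether the machine has one processor or many.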
The disadvantage of the SMP architecture is limited scalability: once the memory interface is saturated, adding processors does not achieve higher performance. The number of processors in an SMP system usually tops out at about 32.
The new distributed-memory CC-NUMA architecture
Recently, some vendors have begun to launch new systems that connect SMP nodes so that they can be expanded more easily than bus-based SMP. The nodes are interconnected by fiber channels, and their access latencies can differ: typically, "near" memory is faster than "far" memory, which is the meaning of the "N" (Non-uniform) in NUMA. UMA means uniform memory access, that is, every CPU takes essentially the same time to reach memory; NUMA means non-uniform memory access. NUMA retains the SMP system's single operating-system copy, simple application programming model, and ease of management, while effectively expanding the scale of the system. The "CC" stands for cache coherent: when the contents of a memory cell are rewritten by one CPU, the system can quickly (through dedicated ASIC chips and the fiber channel) notify the other CPUs. Practice shows that moderate non-uniformity works well, and the difference between remote and local memory access times does not force programmers to fall back on a network-style message-passing mechanism for communication. The memory distributed near each CPU is physically separate but logically unified, so large applications can run without reprogramming or parallel recompilation.
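To make the local-versus-remote distinction concrete, here is a minimal sketch using the Linux libnuma library (an assumption for illustration only: the systems in this article predate this API, and the node numbers and buffer size are arbitrary). The program runs on node 0, allocates one buffer on node 0 and one on the highest-numbered node, and touches both; accesses to the second buffer cross the node interconnect and therefore take longer, yet both buffers live in the same logical address space.

#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    int last = numa_max_node();          /* highest node number in the system */
    size_t size = 64 * 1024 * 1024;      /* 64 MB test buffers */

    numa_run_on_node(0);                 /* run this process on node 0 */

    char *local  = numa_alloc_onnode(size, 0);     /* memory on the local node */
    char *remote = numa_alloc_onnode(size, last);  /* memory on another node, if any */
    if (!local || !remote) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    /* Touching the buffers forces pages onto the requested nodes; writes to
       "remote" must travel over the node interconnect. */
    memset(local, 1, size);
    memset(remote, 1, size);

    printf("allocated %zu bytes on node 0 and node %d\n", size, last);

    numa_free(local, size);
    numa_free(remote, size);
    return 0;
}

On Linux this compiles with -lnuma; on a single-node machine both buffers simply land on node 0.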
Letting CC-NUMA "turn"
To understand how CC-NUMA works, first start from the traditional symmetric multiprocessing (SMP) structure: in SMP, a number of processors communicate with a shared memory bank through a transport mechanism called an interconnect bus.
CC-NUMA is similar to SMP in that it can handle multiple processors, each of which can access a common pool of memory. The structure divides the processors into several nodes that are interconnected and communicate with one another; within a node, processors work with local memory, which relieves the bus congestion of SMP. For example, a 64-processor server can be divided into two large nodes of 32 processors each, each with its own memory bank. A processor can also access the memory banks in all other nodes, though the access time varies with the distance between nodes. CC-NUMA is more scalable than SMP, and because there is only one operating system it remains relatively easy to manage. This is unlike a cluster, which is a loose combination of several machines communicating over a network, so exchanges between them take much longer.
Moreover, managing several machines as one system increases the difficulty of administration. A CC-NUMA computer is different: no matter how many processors it contains, it is simply one computer.
All in all, CC-NUMA overcomes some of the shortcomings of SMP and clusters, and shines in places where they cannot show their strengths.
Typical CC-NUMA systems include Convex's Exemplar, the HP V2600 NUMA series, IBM NUMA-Q, and the SGI Origin series. The Convex and IBM systems use a "ring" interconnect structure with 32-bit Intel chips: each SMP node is attached to the ring network, and message requests and responses must travel around the ring.
Product reviews
HP V2600 NUMA Server
The V2600 runs in the HP-UX 11 operating environment and can support 15,000 applications, including the major databases and ERP applications. The basis of the V-Series server architecture is HP's SCA (Scalable Computer Architecture), which is built on highly scalable nodes, cache coherence, and non-uniform memory access technology.
The V2600 operating environment is based on the 64-bit HP-UX UNIX operating system, which provides a complete 64-bit environment, including a 64-bit system kernel and address space, 64-bit file and file-system sizes, and 64-bit file data types.
The V-Class systems have the following features. The HP HyperPlane crossbar: a transmission system with a crossbar bandwidth of 61.2 Gb/s. The crossbar switch gives the CPUs and I/O channels smooth access to memory, and using HP HyperPlane avoids the system-performance penalty of handling memory and I/O traffic over a system bus. Each node contains up to 32 processors, 8 memory access boards, and 7 I/O channels supporting 28 PCI I/O controller interfaces. A system comprises 2 to 128 64-bit PA-8600 processors, up to 128 GB of SDRAM, a maximum of 7.68 GB/s of I/O pipe bandwidth, and up to 112 industry-standard PCI I/O controller interfaces.
To further expand scalability, HP uses its own memory-interconnect-based V-Series server systems to provide true support for CC-NUMA. This combines the advantages of SMP and distributed-memory architectures, including the SMP programming model and characteristics and the scalability of distributed subsystems. The SCA interconnect is a multi-level memory subsystem: the first level is formed by traditional SMP memory, and the second level is created by linking the first-level memories with a dedicated interconnect that provides multiple bidirectional loops for high bandwidth and fault resilience. SCA HyperLink is implemented as a series of one-way links between nodes, allowing accesses across the loops in a way similar to crossbar memory requests; in addition, the technology allows multiple outstanding requests at any given time to reduce delay. Because of HP's strength in SMP, each node is large but the number of nodes is small, so the structure is simpler, transmission efficiency is improved, and the probability of failure is reduced. To improve access speed, purely local memory traffic stays entirely within the node, and a local CPU that needs remote data first checks the local cache, which greatly reduces access time.
IBM NUMA-Q architecture (formerly Sequent)
The IBM NUMA-Q system uses CC-NUMA, the cache-coherent NUMA scheme. In a CC-NUMA system, memory is placed close to the multiprocessor SMP building blocks to maximize overall system speed, and these building blocks are connected by an interconnect device to form a single system image.
Hardware cache coherence means there is no need for software to keep multiple copies of data or to transfer data between multiple operating systems and applications. Everything is managed at the hardware level, just as in any SMP node where a single instance of the operating system uses multiple processors.
The NUMA-Q system uses Intel four-processor, or "quad", building blocks that include memory and seven I/O slots. At present, IBM NUMA-Q supports up to 16 building blocks, or 64 processors, connected by a hardware-based, cache-coherent, high-speed interconnect into a single NUMA system; adding building blocks is like adding processor boards to a traditional large bus-based SMP system. The existing NUMA-Q architecture can support 64 building blocks, or 256 processors, in a single system node.
NUMA-Q provides a UNIX system with an integrated, MVS-style, multipath switched Fibre Channel SAN (Storage Area Network). This feature is a key enabling technology for e-commerce and customer relationship management (CRM) applications in large-scale, high-performance transaction and data warehouse environments. The Fibre Channel SAN allows a large back-end UNIX database machine and hundreds of front-end UNIX or NT application servers to use a common switched fabric, sharing data-center disk storage and tape libraries cost-effectively.
NUMA-Q routes I/O directly through the switched Fibre Channel SAN fabric to the attached storage devices rather than over the memory-access interconnect. On a NUMA-Q system, this eliminates the resource contention that erodes large SMP system throughput as processors are added. Moreover, thanks to the I/O multipathing supported at the operating system level, NUMA-Q provides a uniquely fault-tolerant SAN.
In a 24x7 e-commerce environment, being able to manage resources and take them online and offline without interrupting the system is an important advantage. NUMA-Q will implement this capability in switch-based next-generation systems running UNIX and Windows NT. This advanced system design can provide levels of performance, scalability, availability, and manageability for business-critical online UNIX and Windows NT systems that other architectures cannot support.
SGI DSM CC-NUMA
SGI's DSM (distributed shared memory) CC-NUMA system is quite different: it uses 64-bit RISC chips, crossbar switches, and CRAYLINK fiber channels, interconnected in a "fat" hypercube structure, so it operates with wide bandwidth and low latency. SGI's DSM CC-NUMA adopts a modular structure with distributed memory, dedicated router chips, and distributed I/O ports; I/O is connected through an intelligent crossbar switch to dedicated chip interfaces, which can attach PCI, VME, SCSI, Ethernet, ATM, FDDI, and other I/O devices, thus providing high data transfer rates to the network. With the support of SGI's 64-bit operating system IRIX, the bandwidth and memory performance of its DSM CC-NUMA system grow with the number of CPUs; currently SGI's CC-NUMA machines scale to 512 CPUs, with bandwidth growing linearly with the CPU count. It can therefore be called an excellent "bandwidth machine". SGI's bandwidth machine, the Origin server, offers memory bandwidth of up to 26 Gb/s and I/O bandwidth of up to 102.4 Gb/s, supports up to 256 GB of memory, and its online Fibre Channel disk capacity can reach 400 TB. Whether connected to a data warehouse, used for storage and retrieval, for product data management, or for serving the Web to thousands of customers, Origin servers are up to the task.
Concepts
Massively parallel processing (MPP)
A Massively Parallel Processing, or "shared-nothing", node traditionally consists of a single CPU, a small amount of memory, some I/O, a node interconnect, and an instance of the operating system on each node. The interconnect between nodes (and between the operating system instances residing on each node) does not require hardware consistency, because each node has its own operating system and its own unique physical memory address space. Consistency is therefore implemented in software through "message passing".
MPP performance tuning involves partitioning the data so as to minimize the amount of data that must be transmitted between nodes. Applications whose data partitions naturally, such as video-on-demand applications, can run well on large MPP systems.
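Because MPP nodes share nothing, all coordination happens through explicit messages. The following minimal MPI sketch (MPI is used here only as a generic message-passing illustration; it is not claimed to be the interface of any product named in this article) shows each process working on its own data partition and combining results purely by message exchange.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each process holds only its own partition of the data. */
    double local_sum = (double)(rank + 1);   /* stand-in for work on the local partition */
    double total = 0.0;

    /* Consistency across nodes is achieved purely by exchanging messages. */
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d partitions = %.1f\n", nprocs, total);

    MPI_Finalize();
    return 0;
}

The less data each process must send to the others, the better such a program scales, which is exactly the partitioning goal described above.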
COMA
COMA stands for Cache Only Memory Architecture. It is a competitor of the CC-NUMA architecture: both have the same goal, but the implementations differ. Like CC-NUMA, COMA distributes memory components among the nodes and keeps the entire system consistent through the interconnect; the COMA node, however, has no conventional memory, only a large-capacity cache configured in each building block. The interconnect still has to maintain consistency, and one copy of the operating system runs across all the building blocks, but there is no "local" memory unit that is the home of a given piece of data. COMA hardware can compensate for unsuitable operating-system algorithms for memory allocation and process scheduling. However, it requires modifications to the operating system's virtual memory subsystem and, in addition to the cache-coherent interconnect, special custom memory boards.
Cluster
A cluster (or cluster system) consists of two or more nodes; each node runs its own copy of the operating system and its own copy of the application, and the nodes share a pool of other common resources. In contrast, the nodes of an MPP system do not share storage resources; this is the main difference between a clustered SMP system and a traditional MPP system. It is important to note that a node must lock a location in the shared repository (database) before trying to update it, in order to keep the database consistent. It is precisely this requirement that makes clusters harder to manage and expand than a single SMP node.
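This lock-before-update discipline can be illustrated with a minimal single-machine sketch using POSIX advisory file locks (only an analogy for the distributed lock managers that real clusters use; the file name is hypothetical): a process must hold an exclusive lock on the shared file before modifying it, and any other process requesting the lock waits until it is released.

#include <sys/file.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(void) {
    /* Stand-in for a shared database file visible to every node. */
    int fd = open("shared.db", O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Acquire an exclusive lock before updating the shared data;
       any other process asking for the lock will wait here. */
    if (flock(fd, LOCK_EX) < 0) {
        perror("flock");
        return 1;
    }

    const char msg[] = "update\n";
    if (write(fd, msg, sizeof msg - 1) < 0)   /* the protected update */
        perror("write");

    flock(fd, LOCK_UN);                       /* release so others can proceed */
    close(fd);
    return 0;
}

The waiting that every other process endures while the lock is held is the overhead that makes clusters harder to scale than a single SMP node.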
Getting more performance out of a cluster system is harder than expanding within a node. The main barrier is the communication cost that does not exist in a single-node environment: messages passed between nodes must endure the longer delays of software-maintained consistency. Applications with heavily interacting processes run better within an SMP node, where communication is very fast. If the demand for communication between processes on different nodes is reduced, an application can scale more efficiently on cluster and MPP systems.
Reflective memory clusters
A Reflective Memory Cluster (RMC) is a clustering system with a memory replication or dump mechanism between nodes and an interconnect that carries locking information. Dumps are executed using software consistency techniques. A reflective memory cluster system gives applications faster message transmission and allows each node to obtain the same memory page without going through a disk. On an RMC system, data obtained from the memory of another node arrives a hundred times faster than data returned from disk. Obviously, as long as the nodes need to share data, applications can use the shared data to improve performance.
Reflective memory clusters are also faster than traditional network-based message passing, because once the connection is established, messages can be sent by the application on a node without involving the operating system.
NUMA
The NUMA category includes several different architectures. In a broad sense, the following structures can all be considered to have non-uniform memory access delays: RMC, MPP, CC-NUMA, and COMA, but the differences among them are considerable. RMC and MPP have multiple nodes, and their "NUMA" aspect lies in the software consistency between nodes. As for CC-NUMA and COMA, consistency is maintained in hardware across the nodes, and their "NUMA" component lies in the differing access times between a node's local memory and the memory of other nodes.