Parallel Loop Self-Scheduling in a Grid Computing Environment

zhaozj, 2021-02-16

1. Abstract

The Internet and grid computing technologies are changing the way we tackle complex problems. They enable large-scale aggregation and sharing of computing, data, and other resources across institutional boundaries, and they are being adopted in disciplines ranging from high-energy physics to the life sciences. In this article, we build a grid computing environment over multiple PC clusters using the Globus Toolkit (GT) and Sun Grid Engine (SGE), and we use matrix multiplication to demonstrate its performance. At the same time, methods for scheduling and load balancing over multiple heterogeneous PC clusters are still immature. Previously proposed self-scheduling schemes for parallel loops with independent iterations were designed for multi-cluster computer systems, but in extremely heterogeneous environments schemes such as FSS, GSS, and TSS cannot achieve good load balance. We therefore propose a two-phase approach to scheduling regular parallel loops in extremely heterogeneous grid computing environments.

Keywords: parallel loop, self-scheduling, grid computing, Globus

2. Introduction

Simply put, grid computing carries distributed computing into its next stage of development. Its goal is to create the illusion of a simple yet large and powerful self-managing virtual computer out of a collection of interconnected heterogeneous systems sharing various resources. Just as the standardization of communication among heterogeneous systems led to the explosive growth of the Internet, the emerging standards for sharing resources, together with the availability of high bandwidth, are pushing grid computing into a period of rapid development.

In this article, we revise known loop self-scheduling schemes so that they fit an extremely heterogeneous environment of commodity PCs. We propose a method that partitions the loop iterations into two phases and achieves good performance in any heterogeneous environment. In the first phase, α% of the workload is partitioned among the nodes according to their performance, weighted by CPU clock speed. The remaining (100 − α)% of the workload is scheduled in the second phase by a known self-scheduling scheme. The experiments were run in a grid environment with six nodes in which the fastest CPU clock is six times the slowest one. Several α values were tried for matrix multiplication, and α = 75 performed best. We also applied our scheme to two simulated loops with gradually increasing and gradually decreasing workloads and again obtained a significant performance improvement. Therefore, our approach is suitable for all regular parallel loop applications.

The rest of this article is organized as follows. Section 3 gives background on parallel loop self-scheduling and on cluster and grid computing. Section 4 describes the software and hardware configuration of the grid computing environment we built with GT on multiple Linux PC clusters, presents our approach, and discusses the experimental results for matrix multiplication. Finally, conclusions are given in Section 5.

3. Background

3.1 Self-scheduling schemes

In a parallel processing system, there are two approaches to scheduling parallel loops: static scheduling, decided at compile time, and dynamic scheduling, decided at run time. Static scheduling is usually applied when the iterations can be distributed uniformly across processors. Its disadvantage is that it cannot cope with loops whose iterations have non-uniform workloads, with loop bounds that are unknown at compile time, or with systems whose load varies, and it cannot react to local imbalance. In contrast, dynamic scheduling is better suited to load balancing, but the runtime scheduling overhead must be taken into account. In general, a parallelizing compiler assigns each loop a single scheduling algorithm, either static or dynamic.

We now focus on self-scheduling, an adaptive and dynamic centralized loop scheduling approach. Figure 1 illustrates a generic self-scheduling scheme: P denotes the number of processors, N the total number of iterations, and f() a function that produces the chunk size at each step. At step i, the master computes the chunk size Ci and the remaining number of iterations Ri:

R0 = N, Ci = f(i, P), Ri = Ri-1 − Ci,

where f may depend on parameters other than i and P, such as Ri-1. The master assigns Ci iterations to an idle slave processor, and the quality of load balancing depends on the gaps between the execution times Tj (j = 1, ..., P) of the slaves.

Figure 1. A master/slave model
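
To make the master/slave model concrete, the following minimal MPI sketch in C (an illustration under our own assumptions, not the authors' implementation) shows one way to organize such a self-scheduling loop. The GSS-style chunk function, the message tags, and the empty work() routine are placeholders.

    #include <stdio.h>
    #include <mpi.h>

    #define TAG_WORK 1
    #define TAG_STOP 2

    /* Placeholder for the loop body applied to iterations [start, start+len). */
    static void work(long start, long len)
    {
        (void)start;
        (void)len;
    }

    int main(int argc, char *argv[])
    {
        int rank, size;
        long n = 1000;                      /* total number of iterations */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* needs at least one slave */

        if (rank == 0) {                    /* master */
            long remaining = n, next = 0;
            int p = size - 1, ready;
            MPI_Status st;

            while (remaining > 0) {
                /* Wait for any idle slave, then send it the next chunk. */
                MPI_Recv(&ready, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                long chunk[2];
                chunk[0] = next;                        /* first iteration of chunk */
                chunk[1] = (remaining + p - 1) / p;     /* Ci = f(i, P), GSS-style  */
                next += chunk[1];
                remaining -= chunk[1];
                MPI_Send(chunk, 2, MPI_LONG, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
            }
            /* All iterations handed out: tell every slave to stop. */
            for (int s = 1; s < size; s++) {
                long stop[2] = {0, 0};
                MPI_Recv(&ready, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                MPI_Send(stop, 2, MPI_LONG, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
            }
        } else {                            /* slave */
            MPI_Status st;
            for (;;) {
                int ready = 1;
                long chunk[2];
                MPI_Send(&ready, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
                MPI_Recv(chunk, 2, MPI_LONG, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP)
                    break;
                work(chunk[0], chunk[1]);
            }
        }
        MPI_Finalize();
        return 0;
    }

Any of the schemes described below can be obtained by changing only the line that computes chunk[1].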

Different scheduling schemes compute Ci in different ways. The best-known examples are PSS (pure self-scheduling), CSS (chunk self-scheduling), GSS (guided self-scheduling), FSS (factoring self-scheduling), and TSS (trapezoid self-scheduling).

Table 1 shows the partition sizes produced by these schemes for the same problem with N = 1000 iterations and P = 4 processors; a code sketch that generates such chunk sequences follows the table.

Table 1. Sample partition sizes (N = 1000, P = 4)

Scheme      Partition size
PSS         1, 1, 1, 1, 1, 1, 1, ...
CSS(125)    125, 125, 125, 125, 125, 125, 125, 125
FSS         125, 125, 125, 125, 63, 63, 63, 63, 31, ...
GSS         250, 188, 141, 106, 79, 59, 45, 33, 25, ...
TSS         125, 117, 109, 101, 93, 85, 77, 69, 61, ...
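
For illustration, the commonly cited chunk formulas for CSS, GSS, and FSS can be coded directly; the C sketch below reproduces the corresponding rows of Table 1 for N = 1000 and P = 4 (PSS is simply CSS with a chunk size of 1, and TSS, whose chunk sizes decrease linearly, is omitted here). This is our own sketch, not code from the paper.

    #include <stdio.h>

    /* CSS(k): fixed chunks of k iterations until the loop is exhausted. */
    static void css(long n, long k)
    {
        while (n > 0) {
            long c = (n < k) ? n : k;
            printf("%ld ", c);
            n -= c;
        }
        printf("\n");
    }

    /* GSS: each chunk is ceil(remaining / P). */
    static void gss(long n, int p)
    {
        while (n > 0) {
            long c = (n + p - 1) / p;
            printf("%ld ", c);
            n -= c;
        }
        printf("\n");
    }

    /* FSS: in each round, half of the remaining iterations are split
       evenly into P chunks of ceil(remaining / (2P)). */
    static void fss(long n, int p)
    {
        while (n > 0) {
            long c = (n + 2 * p - 1) / (2 * p);
            for (int i = 0; i < p && n > 0; i++) {
                long chunk = (c < n) ? c : n;
                printf("%ld ", chunk);
                n -= chunk;
            }
        }
        printf("\n");
    }

    int main(void)
    {
        /* Parameters of Table 1: N = 1000 iterations, P = 4 processors. */
        css(1000, 125);   /* 125 125 125 ...                   */
        gss(1000, 4);     /* 250 188 141 106 79 ...            */
        fss(1000, 4);     /* 125 125 125 125 63 63 63 63 31 ...*/
        return 0;
    }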

3.2 Metacomputing and grid computing

The term "yuan calculation" is the NCSA instructor Larry Smarr proposed around 1987. When the center was established in 1986, the origin of Yuan calculation has appeared. The SMARR's goal is to provide a user interface with a seamless page between the workstation and supercomputers. With the arrival of network technologies such as Ethernet and ATM, it is already possible to connect to those wide computers. High-performance local area networks and wide area networks are no longer expensive, the price of the computer has declined, and the computer related to a high-speed Internet connection has become possible, which will form a partial distributed cluster.

Grid computing uses software to divide a program and distribute its pieces among many computers, sometimes up to thousands of them. Grid computing is distributed, large-scale cluster computing, a form of network-distributed parallel processing. It may be confined to a network of workstations within an organization, or it may take the form of a public collaboration.

Grid computing appears promising for three reasons: (1) it is a cost-effective way to exploit existing computing resources; (2) it can solve problems for which no single computer has enough computing power; and (3) it recognizes that computing resources can be managed cooperatively by collaborators working toward a common goal. In some grid computing systems, the computers cooperate rather than being directed by a single controlling computer. Grid computing applications are also expected to spread into pervasive computing, where computers blend into our surroundings without users necessarily being aware of them.

Establishing, managing, and dynamically evolving sharing relationships among organizations requires new technology: the grid architecture and its supporting software protocols and middleware.

3.3 Grid computing middleware

3.3.1 Globus Toolkit

The Globus Toolkit provides software tools that make it easier to build computational grids and grid-based applications; collectively, these tools are known as the Globus Toolkit (GT). Many organizations have adopted GT to build computational grids and to support their applications.

GT consists of three pillars: resource management, information services, and data management. Each pillar represents an essential GT component built on a common security foundation: GRAM implements a resource management protocol, MDS implements an information service protocol, and GridFTP implements a data transfer protocol. All three rely on the GSI security protocol at the connectivity layer.

3.3.2 MPICH-G2

MPI is a message-passing library standard that was released in May 1994. The MPI standard was agreed upon by the participants of the MPI Forum, an organization of more than 40 groups, including vendors, researchers, universities, software library developers, and users. MPI offers portability, standardization, performance, and functionality.

MPICH-G2 is a grid-enabled implementation of the MPI v1.1 standard. It obtains services such as job startup and security from GT, and it allows machines of different architectures, possibly in different locations, to be coupled to run a single MPI application. MPICH-G2 automatically converts data in messages sent between machines of different architectures, and it supports multiple protocols by automatically selecting TCP for inter-machine messages and, where available, vendor-supplied MPI for intra-machine messages. Existing parallel programs written in MPI can therefore be executed over the Globus infrastructure.
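
For instance, a program written against the plain MPI API, such as the minimal sketch below (illustrative only, with no Globus-specific calls), can be compiled with MPICH-G2 and launched across both clusters without modification.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &len);

        /* Each process reports where the grid middleware placed it. */
        printf("process %d of %d running on %s\n", rank, size, host);

        MPI_Finalize();
        return 0;
    }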

4. Our approach and experimental results

4.1 System configuration

For our test environment, we built two PC clusters to form a multi-cluster grid environment. Each cluster has one master node and two slave nodes. All nodes are interconnected through 3Com 3C905 10/100 Fast Ethernet cards and an Accton Cheetah AC-EX3016B switch. Each master node runs the SGE qmaster and scheduling daemons, which manage and monitor incoming jobs, together with Globus Toolkit v2.4. The slave nodes run only the SGE execution daemon to execute incoming jobs.

Table 2. Hardware configuration

CLUSTER 1

Host Name   Grid                        Grid1 *                     Grid2
FQDN        grid.hpc.csie.thu.edu.tw    grid1.hpc.csie.thu.edu.tw   grid2.hpc.csie.thu.edu.tw
IP          140.128.101.172             140.128.101.188             140.128.101.188
CPU         Intel Pentium 3 1GHz × 2    Intel Celeron 1.7GHz        Intel Celeron 300MHz
RAM         512MB SDRAM                 512MB DDR RAM               256MB SDRAM

CLUSTER 2

Host Name   Grid3 *                     Grid4                       Grid5
FQDN        grid3.hpc.csie.thu.edu.tw   grid4.hpc.csie.thu.edu.tw   grid5.hpc.csie.thu.edu.tw
IP          140.128.102.187             140.128.102.188             140.128.102.189
CPU         Intel Celeron 1.7GHz        Intel Pentium 3 866MHz × 2  Intel Pentium 3 366MHz
RAM         256MB DDR RAM               512MB DDR RAM               256MB SDRAM

* marks the master node of each cluster; the other hosts are slave nodes.

4.2 Our method

Intuitively, for a regular loop program in a heterogeneous environment, we could partition the loop among the nodes in proportion to their CPU clock speeds. However, CPU clock speed is not the only factor that affects computer performance: many other factors, such as the amount of available memory, the cost of memory accesses, and the communication overhead between processors, also have a striking impact. With this intuitive method, if the performance prediction is inaccurate the overall performance degrades, and the most overrated computer will be the last one to finish its assigned work.

With our method, we do not need to know the real performance of each computer in advance. A computer that finishes its assigned work simply receives more work. The parameter α should be neither too small nor too large: if α is too small, the dominant (faster) computers are not given enough work in the first phase to exploit their capacity; if α is too large, too little work remains for the dynamic second phase to stay flexible. In both cases good performance cannot be obtained; only a suitable α value yields good performance. Moreover, a dynamic load balancing method does not need to know the runtime behavior of the application before execution. However, applying GSS or TSS directly in an extremely heterogeneous environment is not well suited, because to obtain good performance the computers would have to be arranged in order of performance. This problem does not arise in our approach.

In this article, the notation FSS-80 means that α = 80 and the remaining iterations are partitioned using FSS; GSS-80 and TSS-80 are defined analogously.

Example 1

Consider a five-node cluster whose CPU clocks are 200MHz, 200MHz, 233MHz, 533MHz, and 1.5GHz. Table 3 shows the partition sizes produced by the different schemes for the same problem with 2048 iterations. The number in parentheses is the number of scheduling steps.
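
As a rough illustration of how such a two-phase partition could be computed, the C sketch below distributes α% of the iterations in proportion to the CPU clocks of Example 1 and hands out the remainder with GSS-style chunks of size ceil(remaining / P). The rounding details are our own assumptions, so the output may differ slightly from the GSS-80 row of Table 3.

    #include <stdio.h>
    #include <math.h>

    #define P 5   /* number of nodes in Example 1 */

    /* Phase 1: distribute alpha% of the N iterations in proportion to
       CPU clock speed. Phase 2: hand out the rest with GSS-style chunks. */
    static void two_phase_partition(long n, double alpha, const double clock_mhz[P])
    {
        double total_clock = 0.0;
        for (int i = 0; i < P; i++)
            total_clock += clock_mhz[i];

        long phase1_total = (long)(n * alpha / 100.0);
        long assigned = 0;

        /* Phase 1: one weighted chunk per node. */
        for (int i = 0; i < P; i++) {
            long chunk = (long)round(phase1_total * clock_mhz[i] / total_clock);
            assigned += chunk;
            printf("phase 1, node %d: %ld iterations\n", i, chunk);
        }

        /* Phase 2: remaining iterations scheduled dynamically, GSS-style. */
        long remaining = n - assigned;
        while (remaining > 0) {
            long chunk = (remaining + P - 1) / P;   /* ceil(remaining / P) */
            printf("phase 2 chunk: %ld iterations\n", chunk);
            remaining -= chunk;
        }
    }

    int main(void)
    {
        /* CPU clocks (MHz) of the five nodes in Example 1. */
        const double clocks[P] = {1500.0, 533.0, 233.0, 200.0, 200.0};
        two_phase_partition(2048, 80.0, clocks);   /* corresponds to GSS-80 */
        return 0;
    }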

We use the following notation to simulate our scheme on loops with gradually changing workloads:

T: Total workload of all iterations

W: α% of the total workload

B: The minimum workload of a single iteration in a gradually increasing or decreasing workload loop, i.e., the workload of the first iteration in an increasing loop or of the last iteration in a decreasing loop.

H: The workload difference between adjacent iterations. H is a positive integer.

X: The number of iterations whose accumulated workload amounts to α% of the total workload. X is a positive real number. Since an iteration cannot be split, we need a suitable integer M to approximate X; to achieve load balancing, M > X may be chosen.
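
As a sketch of how X might be derived from these definitions (our reconstruction, assuming a gradually increasing workload loop whose i-th iteration costs B + (i − 1)H):

    W = \frac{\alpha}{100}\,T ,
    \qquad
    \sum_{i=1}^{X} \bigl( B + (i-1)H \bigr) \;=\; BX + \frac{H\,X(X-1)}{2} \;=\; W

Solving this quadratic for the positive root gives

    X = \frac{-\bigl(B - \tfrac{H}{2}\bigr) + \sqrt{\bigl(B - \tfrac{H}{2}\bigr)^{2} + 2HW}}{H} ,
    \qquad
    M = \lceil X \rceil

Choosing M = ⌈X⌉ guarantees M ≥ X; a gradually decreasing workload loop can be handled analogously.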

Table 3. Sample partition sizes (N = 2048, five nodes)

GSS      410, 328, 262, 210, 168, 134, 108, 86, 69, 55, 44, 35, 28, 23, 18, 14, 12, 9, 7, 6, 5, 4, 3, 2, 2, 2, 1, 1, 1, 1 (n = 30)

GSS-80   923, 328, 144, 123, 121, 82, 66, 53, 42, 34, 27, 21, 17, 14, 11, 9, 7, 6, 4, 4, 3, 2, 2, 1, 1, 1, 1, 1 (n = 28)

FSS      205, 205, 205, 205, 205, 103, 103, 103, 103, 103, 51, 51, 51, 51, 51, 26, 26, 26, 26, 26, 13, 13, 13, 13, 13, 6, 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 1, 1, 1 (n = 43)

FSS-80   923, 328, 144, 123, 121, 41, 41, 41, 41, 41, 21, 21, 21, 21, 21, 10, 10, 10, 10, 10, 5, 5, 5, 5, 5, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1 (n = 39)

TSS      204, 194, 184, 174, 164, 154, 144, 134, 124, 114, 104, 94, 84, 74, 64, 38 (n = 16)

TSS-80   923, 328, 144, 123, 121, 40, 38, 36, 34, 32, 30, 28, 26, 24, 22, 20, 18, 16, 14, 12, 10, 8, 1 (n = 23)

4.3 Experimental results

The experiments cover three situations: a single personal computer, a cluster environment, and a grid environment. In the first step, we ran the MPI program on a personal computer or SMP system to evaluate its performance. In the second step, we connected three PCs to form a cluster (our test bed includes Cluster 1 and Cluster 2) and ran the same MPI program to evaluate its performance. In the third step, we connected Cluster 1 and Cluster 2 through the WAN to form a grid computing environment and again ran the same MPI program. First, we ran an experiment comparing the performance of cluster computing and grid computing (see Figure 2 and Table 4). Next, we ran a second experiment comparing the three self-scheduling schemes, each combined with our two-phase approach, in the grid computing environment (see Figure 3 and Table 5). In these figures, NFSS denotes the FSS scheme based on our two-phase approach.

Table 4. Execution time of 1024 × 1024 matrix multiplication

Platform \ α   0%       60%      70%      75%      80%      90%
Cluster 1      100.312  90.887   90.214   90.213   90.164   100.419
Cluster 2      100.357  100.954  90.092   89.922   99.040   100.630
Grid           66.177   65.912   61.616   61.454   62.433   70.142

Table 5. Execution time of 2048 × 2048 matrix multiplication in the grid environment

Scheme \ α   0%       60%      70%      75%      80%      90%
FSS          275.622  275.898  281.222  282.091  271.182  272.567
TSS          325.498  286.093  290.645  282.401  285.064  282.134
GSS          370.199  310.193  274.505  282.875  283.216  283.598

Figure 2. Execution time of 1024 × 1024 matrix multiplication on different platforms

Figure 3. Execution time of 2048 × 2048 matrix multiplication in the grid computing environment

5. Conclusions and future work

In extremely heterogeneous environments, known self-scheduling schemes cannot achieve good load balance. In this article, we proposed an approach to scheduling parallel loops in extremely heterogeneous environments and achieved good performance. In our approach, 80% of the workload is distributed statically according to performance weights based on CPU clock speed, and the remaining 20% is assigned using a known self-scheduling method. For the 2048 × 2048 matrix multiplication example, our scheme achieved a significant performance improvement over GSS. Therefore, our approach is suitable for all regular loop applications.

From the experimental results, we found that grid computing can deliver more computing performance than a traditional PC cluster or SMP system. In future work, we will extend the test bed and exploit idle CPU time through job scheduling policies, so that the system can be fully utilized.

