Journal of Computer Science
Year: 2009  |  Volume: 5  |  Issue: 10  |  Page No.: 760 - 763

Load Allocation Model for Scheduling Divisible Data Grid Applications

Monir Abdullah, Mohamed Othman, Hamidah Ibrahim and Shamala Subramaniam    

Abstract: Problem statement: In many data grid applications, data can be decomposed into multiple independent sub-datasets and distributed for parallel execution and analysis. Approach: This property has been successfully exploited using Divisible Load Theory (DLT), which has proved to be a powerful tool for modeling divisible load problems in data-intensive grids. Results: Several scheduling models have been studied, but no optimal solution has been reached due to the heterogeneity of grids. This study proposed a new optimal load allocation model based on DLT; recursive numerical closed-form solutions are derived to find the optimal workload assigned to the processing nodes. Conclusion/Recommendations: Experimental results showed that the proposed model obtained a better (almost optimal) solution than the other models in terms of Makespan.

Figure 1 shows how the logical input Data (D) is decomposed onto networks and their computing resources.

The scheduling problem is to decompose D into datasets (Di for all i = 1, 2, ..., M) across M virtual sites in a Virtual Organization (VO), given its initial physical decomposition. We assume that the decomposed data can be analyzed on any site.

The notations and definitions used in this research are stated in Table 1.


Fig. 1: Data decomposition and their processing

The execution time of a subtask allocated to site i (Ti) and the turnaround time of a job J (Tturn_around_time) can be expressed as follows. The cost (Ti) includes the input data transfer (Tinput_cm(i)), the computation (Tcp(i)) and the output data transfer to the client at the destination site d (Toutput_cm(i,d)):

Ti = Tinput_cm(i) + Tcp(i) + Toutput_cm(i,d)

We assume that data from multiple data sources can be transferred to a site i concurrently in the wide area environment and that computation starts only after the assigned dataset has been completely transferred to the site. Hence, the problem of scheduling a divisible job onto M sites can be stated as deciding the portion of the original workload (D) to be allocated to each site, that is, finding a distribution of the Di that minimizes the turnaround time of the job. The proposed approach uses this cost model when evaluating candidate solutions.
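A minimal sketch of this cost model is given below, assuming that the per-site cost is simply the sum of the three terms above and that the turnaround time is the latest site finish time (consistent with the optimality criterion discussed next); the function and variable names are illustrative, not taken from the paper.

# Sketch of the cost model above (names are illustrative assumptions).
def site_cost(t_input_cm, t_cp, t_output_cm):
    # Ti = Tinput_cm(i) + Tcp(i) + Toutput_cm(i, d)
    return t_input_cm + t_cp + t_output_cm

def turnaround_time(site_costs):
    # The job completes when its slowest participating site completes.
    return max(site_costs)

# Example: three sites with different transfer and computation costs (s).
costs = [site_cost(5.0, 20.0, 1.0), site_cost(8.0, 15.0, 2.0), site_cost(3.0, 30.0, 0.5)]
print(turnaround_time(costs))   # 33.5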

In all the literature related to the divisible load scheduling domain so far, an optimality criterion[11] is used to derive an optimal solution. It states that in order to obtain the optimal processing time, it is necessary and sufficient that all the sites participating in the computation stop at the same time; otherwise, load could be redistributed to improve the processing time. The timing diagram for this distributed system in the optimal case is depicted in Fig. 2. In this timing diagram, communication time appears above the axis and computation time appears below the axis.
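In the notation used here, and assuming (as is standard in DLT) that the load fractions sum to one, the criterion amounts to requiring a common finish time T across all M participating sites:

T1 = T2 = ... = TM = T,   with   α1 + α2 + ... + αM = 1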


Table 1: Terminology, definitions and notations

Fig. 2: Optimal case

MATERIALS AND METHODS

The load scheduling problem is to decompose D into datasets (Di for all i = 1, 2, ..., M) across M virtual sites in a Virtual Organization (VO), given its initial physical decomposition. This model includes two steps:

New optimal closed-form formulas for scheduling divisible load on a large scale data grid system are proposed. Closed-form expressions for the processing time and the fraction of workload for each processing node are derived. The previous closed form did not consider the communication time[5]; the new model considers computation time as well as communication time.

A new closed-form solution is proposed for obtaining the optimal fraction (αi) as follows:

α1.W1 + α1.Z1 = T1        (1)

Where:

W1 = The node speed
Z1 = Link speed between the source and the node 1.

And similarly:

α2.W2 + α2.Z2 = T2        (2)

In this case:

(3)

(4)

(5)

(6)

From Eq. 5 and 6 we obtain:

(7)

(8)

Thus, the closed-form expression of processing time (Makespan) is given as:

Moreover, we can add the application type (ccRatio) to the equation as:

After we obtain T, we can obtain αi from Eq. 7. Once the αi are calculated, the optimal time follows.
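As a minimal sketch of the closed-form allocation this derivation leads to, assuming Ti = αi.Wi + αi.Zi for every participating site and that the αi sum to one (the function below and its names are illustrative, not the paper's code):

# Minimal sketch (not the paper's code): closed-form DLT allocation under
# the equal-finish-time criterion, assuming Ti = alpha_i*(Wi + Zi) for
# every participating site and sum(alpha_i) = 1.

def closed_form_allocation(W, Z):
    """W[i]: computation time per unit load on site i;
    Z[i]: communication time per unit load on the link to site i.
    Returns (T, alphas): the common finish time and the load fractions."""
    inv = [1.0 / (w + z) for w, z in zip(W, Z)]     # 1/(Wi + Zi)
    T = 1.0 / sum(inv)                              # follows from sum(alpha_i) = 1
    alphas = [T * x for x in inv]                   # alpha_i = T/(Wi + Zi)
    return T, alphas

# Example with three heterogeneous sites:
W = [2.0, 4.0, 8.0]
Z = [1.0, 1.0, 2.0]
T, alphas = closed_form_allocation(W, Z)
# Every site stops at the same time T, as the optimality criterion requires:
assert all(abs(a * (w + z) - T) < 1e-12 for a, w, z in zip(alphas, W, Z))
assert abs(sum(alphas) - 1.0) < 1e-12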

RESULTS AND DISCUSSION

To measure the performance of the proposed model against the CDLT, ADLT and A2DLT models, randomly generated experimental configurations were used. The estimated expected execution time for processing a unit dataset on each site, the network bandwidth between sites, the input data size and the ratio of output data size to input data size were randomly generated with uniform probability over predefined ranges. The network bandwidth between sites is uniformly distributed between 1 and 10 Mbps.

The location of m data sources DSk is randomly selected and each physical dataset size Lk is randomly selected with a uniform distribution in the range of 1 GB - 1 TB. It is assumed that the computing time spent in a site i to process a unit dataset of size 1 MB is uniformly distributed in the range 1-10/rcb seconds, where rcb is the ratio of computation speed to communication speed.
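A rough sketch of how one such random configuration could be generated under the stated ranges follows; the uniform distributions, units, field names and the reading of "1-10/rcb" as the interval [1/rcb, 10/rcb] are assumptions for illustration only.

import random

# Sketch: one randomly generated grid configuration using the ranges above.
def random_config(n_sites, n_sources, rcb, seed=None):
    rng = random.Random(seed)
    return {
        # Network bandwidth between every pair of sites: 1-10 Mbps, uniform.
        "bandwidth_mbps": [[rng.uniform(1.0, 10.0) for _ in range(n_sites)]
                           for _ in range(n_sites)],
        # Physical dataset sizes Lk: 1 GB - 1 TB, uniform (stored in GB).
        "dataset_gb": [rng.uniform(1.0, 1000.0) for _ in range(n_sources)],
        # Computing time per 1 MB unit dataset: read as 1/rcb to 10/rcb sec.
        "unit_compute_s": [rng.uniform(1.0 / rcb, 10.0 / rcb)
                           for _ in range(n_sites)],
        # Each data source DSk is placed on a randomly chosen site.
        "source_site": [rng.randrange(n_sites) for _ in range(n_sources)],
    }

cfg = random_config(n_sites=100, n_sources=20, rcb=100, seed=1)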


Fig. 3: Makespan vs ccRatio for CDLT, ADLT, A2DLT and IDLT (M = 100)

We examined the overall performance of each model by running them under 100 randomly generated grid configurations. The following parameters were varied: ccRatio (0.001-1000), M (20-100), N (20-100), rcb (10-500) and data file size (1 GB-1 TB). To show how these models perform on different types of applications (different ccRatio), we plot the results in Fig. 3.

In Fig. 3, the Makespan for the CDLT, ADLT, A2DLT and the proposed models is plotted against the application type (ccRatio). The ccRatio is varied up to 1000 and the number of nodes M is fixed at 100. Fig. 3 shows that the proposed model is the best for any type of application, as expected, because the proposed model produces an almost optimal solution for scheduling load produced from a single source.

CONCLUSION

In this study, we developed an effective iterative model for optimal workload allocation to processors and links when scheduling divisible data grid applications. The experimental results showed that the proposed model is capable of producing an almost optimal solution for single-source scheduling; hence, it can balance the processing loads efficiently. We plan to adapt the proposed model to multiple sources. With such improvements, the proposed model can be integrated into existing data grid schedulers in order to improve their performance.

" target="_blank">View Fulltext    |   Related Articles   |   Back
 
 
   
 
 
 
  Related Articles

No Article Found
 
 
 
Copyright   |   Desclaimer   |    Privacy Policy   |   Browsers   |   Accessibility