Figure 1 shows how the logical input data D is decomposed onto networks and their computing resources. The scheduling problem is to decompose D into datasets (D_{i} for all i = 1, 2,...,M) across M virtual sites in a Virtual Organization (VO), given its initial physical decomposition. We assume that the decomposed data can be analyzed on any site. The notation and definitions used in this research are stated in Table 1.
Fig. 1: Data decomposition and its processing
The execution time of a subtask allocated to site i (T_{i}) and the turnaround time of a job J (T_{turn_around_time}) can be expressed as follows. The cost T_{i} includes the input data transfer time (T_{input_cm}(i)), the computation time (T_{cp}(i)) and the output data transfer time to the client at the destination site d (T_{output_cm}(i,d)):

T_{i} = T_{input_cm}(i) + T_{cp}(i) + T_{output_cm}(i,d)

T_{turn_around_time} = max_{i} T_{i}
We assume that data from multiple data sources can be transferred to a site i concurrently in the wide area environment and that computation starts only after the assigned dataset has been completely transferred to the site. Hence, the problem of scheduling a divisible job onto M sites can be stated as deciding the portion of the original workload (D) to be allocated to each site, that is, finding a distribution {α_{1}, α_{2},...,α_{M}} which minimizes the turnaround time of the job. The proposed approach uses this cost model when evaluating candidate solutions.
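As an illustrative sketch of this cost model (the function names and the per-MB cost parameters are our assumptions, not the paper's), each site's finish time is the sum of its input transfer, computation and output transfer times, and the job's turnaround time is the maximum over all sites:

```python
def site_time(d_i, z_in, w, z_out, out_ratio):
    """Time for site i to receive d_i MB, process it and return the output.

    z_in      : input transfer cost per MB (s/MB), i.e. T_input_cm(i) / d_i
    w         : computation cost per MB (s/MB),    i.e. T_cp(i) / d_i
    z_out     : output transfer cost per MB to destination site d (s/MB)
    out_ratio : output data size / input data size
    """
    t_input = d_i * z_in                  # T_input_cm(i)
    t_cp = d_i * w                        # T_cp(i)
    t_output = d_i * out_ratio * z_out    # T_output_cm(i, d)
    return t_input + t_cp + t_output


def turnaround(allocation, sites):
    """All sites work in parallel; the job finishes when the last one does."""
    return max(site_time(d_i, *s) for d_i, s in zip(allocation, sites))
```

A schedule is then just the allocation vector passed to `turnaround`, which makes comparing candidate distributions straightforward.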
In the literature on divisible load scheduling, an optimality criterion^{[11]} is used to derive the optimal solution. It states that, in order to obtain the optimal processing time, it is necessary and sufficient that all sites participating in the computation stop at the same time; otherwise, load could be redistributed to improve the processing time. The timing diagram for this distributed system in the optimal case is depicted in Fig. 2, where communication time appears above the axis and computation time below it.
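A small numeric illustration of this criterion, using two sites with hypothetical per-unit costs: an equal split leaves one site finished while the other still works, and shifting load until both finish together reduces the makespan.

```python
# Hypothetical per-unit (transfer + computation) costs of two sites, in s/MB.
c = [0.8, 0.4]
D = 100.0  # total workload (MB)


def makespan(d1):
    """Both sites run in parallel: site 1 gets d1, site 2 gets the rest."""
    return max(d1 * c[0], (D - d1) * c[1])


equal_split = makespan(50.0)          # sites finish at 40 s and 20 s
# Equal-finish split: d1*c[0] = (D - d1)*c[1]  =>  d1 = D*c[1] / (c[0] + c[1])
d1_opt = D * c[1] / (c[0] + c[1])
equal_finish = makespan(d1_opt)       # both sites stop at the same time
```

Here `equal_finish` is strictly smaller than `equal_split`, matching the criterion: whenever finish times differ, load can be moved from the later site to the earlier one.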
Table 1: Terminology, definitions and notations

MATERIALS AND METHODS
The load scheduling problem is to decompose D into datasets (D_{i} for all i = 1, 2,...,M) across M virtual sites in a Virtual Organization (VO), given its initial physical decomposition. The model includes two steps: new optimal closed-form formulas for scheduling divisible load on a large-scale data grid system are proposed, and closed-form expressions for the processing time and the fraction of workload for each processing node are derived. The earlier closed-form solution did not consider communication time^{[5]}; the new model considers computation time as well as communication time.
A new closed-form solution is proposed for obtaining the optimal fraction (α_{i}) as follows:

α_{1}·W_{1} + α_{1}·Z_{1} = T_{1} (5)
Where:
W_{1} = The speed of node 1
Z_{1} = The link speed between the source and node 1
And similarly:

α_{2}·W_{2} + α_{2}·Z_{2} = T_{2} (6)
In this case, the optimality criterion requires all participating nodes to stop at the same time, so T_{1} = T_{2} = T. From Eq. 5 and 6 we obtain:

α_{1}(W_{1} + Z_{1}) = α_{2}(W_{2} + Z_{2}) = T

Since the workload fractions sum to one (α_{1} + α_{2} +...+ α_{M} = 1), each fraction is:

α_{i} = T/(W_{i} + Z_{i}) (7)

Thus, the closed-form expression of the processing time (makespan) is given as:

T = 1/(Σ_{i=1}^{M} 1/(W_{i} + Z_{i}))

Moreover, we can add the application type (ccRatio) to the equation. After T is obtained, we can get α_{i} by Eq. 7; by calculating the α_{i}, the optimal time is calculated.
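The derivation above can be sketched directly in code. This is an illustrative sketch under the stated assumptions (α_{i}(W_{i} + Z_{i}) = T for every node and Σα_{i} = 1); the function name is ours:

```python
def closed_form(W, Z):
    """Closed-form makespan T and fractions alpha_i for one divisible load.

    W[i] : computation cost of node i (time per unit load)
    Z[i] : link cost between the source and node i (time per unit load)
    """
    inv = [1.0 / (w + z) for w, z in zip(W, Z)]
    T = 1.0 / sum(inv)             # makespan: T = 1 / sum_j 1/(W_j + Z_j)
    alpha = [T * x for x in inv]   # fractions: alpha_i = T / (W_i + Z_i), Eq. 7
    return T, alpha
```

By construction every node finishes at the same time T and the fractions sum to one, which is exactly the optimality criterion from the previous section.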
RESULTS AND DISCUSSION

To measure the performance of the proposed model against the CDLT, ADLT and A^{2}DLT models, randomly generated experimental configurations were used. The estimated expected execution time for processing a unit dataset on each site, the network bandwidth between sites, the input data size and the ratio of output data size to input data size were randomly generated with uniform probability over predefined ranges. The network bandwidth between sites is uniformly distributed between 1 and 10 Mbps. The location of each of the m data sources DS_{k} is randomly selected and each physical dataset size L_{k} is randomly selected with a uniform distribution in the range of 1 GB to 1 TB. It is assumed that the computing time spent in a site i to process a unit dataset of size 1 MB is uniformly distributed in the range 1-10/r_{cb} sec, where r_{cb} is the ratio of computation speed to communication speed.
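A generator for such configurations might look as follows; this is a hypothetical sketch that only reproduces the parameter ranges stated above (the function name and dictionary keys are ours):

```python
import random


def random_config(n_sites, m_sources, r_cb, seed=None):
    """Draw one random grid configuration from the stated uniform ranges."""
    rng = random.Random(seed)
    return {
        # network bandwidth between sites: uniform in 1-10 Mbps
        "bandwidth_mbps": [rng.uniform(1, 10) for _ in range(n_sites)],
        # physical dataset sizes L_k: uniform in 1 GB - 1 TB (here in GB)
        "dataset_gb": [rng.uniform(1, 1024) for _ in range(m_sources)],
        # time to process a 1 MB unit dataset: uniform in (1-10)/r_cb seconds
        "unit_time_s": [rng.uniform(1, 10) / r_cb for _ in range(n_sites)],
    }
```

Seeding the generator makes individual experiments reproducible while still sampling each parameter uniformly from its range.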
Fig. 3: Makespan vs. ccRatio for CDLT, ADLT, A^{2}DLT and IDLT (M = 100)
We examined the overall performance of each model by running them under 100 randomly generated Grid configurations. These parameters were varied: ccRatio (0.001-1000), M (20-100), N (20-100), r_{cb} (10-500) and data file size (1 GB-1 TB). To show how these models perform on different types of application (different ccRatio), the makespan of the CDLT, ADLT, A^{2}DLT and proposed models is plotted against the application type (ccRatio) in Fig. 3. The ccRatio is varied from 0.001-1000 and the number of nodes M is fixed at 100. Figure 3 shows that the proposed model is the best for any type of application, as expected, because it produces an almost optimal solution for scheduling load produced from a single source.

CONCLUSION

In this study, we have developed an effective iterative model (IDLT) for optimal workload allocation to processors and links when scheduling divisible data grid applications. The experimental results showed that the proposed model is capable of producing an almost optimal solution for single-source scheduling; hence, it can balance the processing loads efficiently. We plan to adapt the proposed model to multiple sources. With such improvements, the proposed model can be integrated into existing data grid schedulers in order to improve their performance.