Distributed heterogeneous systems have emerged as a viable alternative to dedicated parallel computing (Keane, 2004). A distributed system is made up of a set of sites that cooperate with each other for resource sharing, exchanging information over a communication network. When the processing power varies from one site to another, the distributed system is heterogeneous in nature (Karatza and Hilzer, 2002). Heterogeneous systems have been shown to produce higher performance at lower cost than a single large machine. The factors that degrade performance during parallel execution are: frequent communication among processes; the overhead incurred during communication; synchronizations during computations; infeasible scheduling decisions; and load imbalance among processors (Dhandayuthapani et al., 2005). In short, load balancing and scheduling are crucial factors for grid-like distributed heterogeneous systems (Radulescu and van Gemund, 2000).
Scheduling is all about keeping processors busy by efficiently distributing the workload. Load balancing attempts to ensure that the workload on each host is within a balance criterion of the workload present on every other host in the system. Dynamic load balancing assumes no prior knowledge of the tasks at compile-time. Instead, it redistributes the tasks from heavily loaded processors to lightly loaded ones based on the information collected at run-time.
Although considerable attention has been given to the issues of load balancing and scheduling in distributed heterogeneous systems, few researchers have addressed the problem from the viewpoint of learning and adaptation. The Multi-Agent Systems (MAS) and distributed Artificial Intelligence (AI) communities have shown that groups of autonomous learning agents can successfully solve different load balancing and resource allocation problems (Weiss and Schen, 1996; Stone and Veloso, 1997; Weiss, 1998; Kaya and Arslan, 2001). Multi-agent techniques provide scalability and robustness, while learning lets the system draw on its past experience and generate better results over time from limited information. For such core issues of learning, planning and decision making, Reinforcement Learning (RL) is a well-suited approach and an active area of AI.
Reinforcement learning: Reinforcement Learning (RL) is an active area of research in AI because of its widespread applicability in both accessible and inaccessible environments. The model of the reinforcement learning problem is based on the theory of Markov Decision Processes (MDP) (Stone and Veloso, 1997).
In RL, an agent learns by interacting with its environment, performing actions and receiving rewards, and tries to maximize its long-term return, as shown in Fig. 1. This branch of machine learning captures the behavior of a dynamic environment through trial and error. The trial-and-error learning and the concept of reward distinguish reinforcement learning from other learning techniques.
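The interaction loop of Fig. 1 can be sketched in a few lines. The toy one-state environment and binary action set below are illustrative assumptions, not part of the proposed system:

```python
import random

class ToyEnv:
    """Toy one-state environment: action 1 pays reward 1, action 0 pays 0."""
    def step(self, action):
        return 1.0 if action == 1 else 0.0

env = ToyEnv()
total_return = 0.0
for t in range(100):                  # the agent-environment loop of Fig. 1
    action = random.choice([0, 1])    # agent selects an action (trial and error)
    reward = env.step(action)         # environment responds with a reward
    total_return += reward            # agent accumulates its long-term return
```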
Q-learning: Q-learning is a widely used form of Reinforcement Learning. It works by maintaining an estimate of the Q-function and adjusting Q-values based on actions taken and rewards received (Kaelbling et al., 1996; Sutton and Barto, 1998). It is a model-free variant of Reinforcement Learning and does not need a model of its environment. Q-learning gradually reinforces those actions that contribute to positive rewards by increasing the associated Q-values. The Q-value is updated by Eq. 1:

Qt+1(s, a) = Qt(s, a) + α [r + γ maxa' Qt(s', a') - Qt(s, a)]    (1)

where Qt+1(s, a) denotes the state-action value at time t+1, s' the next state, r the immediate reinforcement and α the learning rate of the agent. One typically starts with a high learning rate, which allows fast changes, and lowers it as time progresses. The action a is chosen to maximize Q(s, a); when in each state the best-rewarded action is chosen according to the stored Q-values, this is known as the greedy method. γ is the discount factor: the closer γ is to 1, the greater the weight given to future reinforcements.
The motivation for using this technique is that Q-Learning converges to the optimal Q-function (Even-Dar and Monsour, 2003). It uses the observed information to approximate the optimal function, from which the optimal policy can be constructed.
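As a concrete illustration, the update of Eq. 1 combined with an ε-greedy action choice can be sketched as follows. The states ('idle', 'busy'), actions and parameter values are hypothetical, chosen only to make the example self-contained:

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = defaultdict(float)   # Q[(state, action)], initialised arbitrarily to 0

def choose_action(state, actions):
    """Epsilon-greedy: mostly exploit the best-known action, sometimes explore."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(s, a, r, s_next, actions):
    """One step of the Q-learning rule of Eq. 1."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# One hypothetical transition: action 'run' in state 'idle' earns reward 1.
q_update('idle', 'run', 1.0, 'busy', ['run', 'wait'])
# Starting from Q = 0, the new estimate is alpha * r = 0.5
```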
Related work: Extensive research has been done in developing scheduling algorithms for load balancing of parallel and distributed systems. These algorithms are broadly classified as non-adaptive and adaptive algorithms.
Guided Self-Scheduling (GSS) (Polychronopoulos and Kuck, 1987) and Factoring (FAC) (Hummel et al., 1993) are examples of non-adaptive scheduling algorithms. GSS addresses the problem of uneven processor starting times and is applicable to executions with both constant-length and variable-length iterates (Polychronopoulos and Kuck, 1987). In FAC, iterates are scheduled in batches, where the size of a batch is a fixed ratio of the unscheduled iterates and each batch is divided into P chunks (Hummel et al., 1993).
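A minimal sketch of how these two schemes carve iterates into chunks. The FAC batch ratio of 0.5 is a commonly used choice assumed here for illustration; the original papers derive and parameterize these quantities more carefully:

```python
import math

def gss_chunks(n_iters, n_procs):
    """Guided self-scheduling: each successive chunk is ceil(remaining / P),
    so chunk sizes shrink as the loop nears completion."""
    chunks, remaining = [], n_iters
    while remaining > 0:
        c = math.ceil(remaining / n_procs)
        chunks.append(c)
        remaining -= c
    return chunks

def fac_chunks(n_iters, n_procs, ratio=0.5):
    """Factoring: schedule a batch (a fixed ratio of the unscheduled
    iterates), split evenly into P chunks of equal size."""
    chunks, remaining = [], n_iters
    while remaining > 0:
        batch = max(n_procs, math.ceil(remaining * ratio))
        per_chunk = max(1, math.ceil(batch / n_procs))
        for _ in range(n_procs):
            if remaining <= 0:
                break
            c = min(per_chunk, remaining)
            chunks.append(c)
            remaining -= c
    return chunks

print(gss_chunks(100, 4))  # → [25, 19, 14, 11, 8, 6, 5, 3, 3, 2, 1, 1, 1, 1]
```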
Banicescu et al. (2000) proposed the Adaptive Weighted Factoring (AWF) algorithm, which is applicable to time-stepping applications; it uses equal processor weights in the initial computation and adapts the weights after every time step. Adaptive Factoring (AF) (Banicescu and Liu, 2000) dynamically estimates the mean and standard deviation of iterate execution times at runtime.
The scheduling problem is known to be NP-complete. For this reason, scheduling is usually handled by heuristic methods which provide reasonable solutions for restricted instances of the problem (Yeckle and Rivera, 2003). Most research on scheduling has dealt with the case where the tasks, inter-processor communication costs and precedence relations are fully known. We now turn specifically to multi-agent RL techniques. Majercik and Littman (1997) evaluated how the load balancing problem can be formulated as a Markov Decision Process (MDP) and described some preliminary attempts to solve this MDP using guided on-line Q-learning and a linear value function approximator, tested over a small range of runs. Little emphasis was placed on the exploration phase, and heterogeneity was not considered.
Zomaya et al. (1998) proposed five Reinforcement-Based Schedulers (RBSs): (1) random RBS, (2) queue-balancing RBS, (3) queue-minimizing RBS, (4) load-based RBS and (5) throughput-based RBS. The load-based and throughput-based RBSs were not effective at dynamic scheduling. The random scheduler and the queue-balancing RBS proved capable of providing good results in all situations. The queue-balancing RBS had the advantage of being able to schedule for a longer period before any queue overflow took place. The random scheduler was capable of extremely efficient dynamic scheduling when the processors were relatively fast; under more difficult conditions, its performance was significantly and disproportionately reduced.
Parent et al. (2002) implemented a reinforcement learner for distributed load balancing of data-intensive applications in a heterogeneous environment. The results showed considerable improvement over a static load balancer. There was no information exchange between the agents in the exploration phase. Later, Parent et al. (2004) extended the application into a multi-agent reinforcement learning framework to address communication overhead. This algorithm is receiver-initiated and works locally on the slaves.
Galstyan et al. (2004) proposed a minimalist decentralized algorithm for resource allocation in a simplified grid-like environment. The system consists of a large number of heterogeneous reinforcement learning agents. This technique neglected the need for co-allocation of different resources; the present work is an enhancement of it. Verbeeck et al. (2005) described how multi-agent reinforcement learning algorithms can be applied in practice to common-interest and conflicting-interest problems. They proposed a new algorithm called Exploring Selfish Reinforcement Learning (ESRL), based on two phases: an exploration phase and a synchronization phase.
Problem description: The aim of this research is to solve the scheduling and load balancing problem and to extend the work of Galstyan et al. (2004) by handling co-allocation. Some further challenges and issues considered by this research are:
- Some existing scheduling middlewares are not efficient, as they assume knowledge of all the jobs in a heterogeneous environment.
- Process redistribution cost and reassignment time are high in the case of non-adaptive algorithms.
- The complex nature of the application causes unrealistic assumptions about node heterogeneity and workload.
- Allocating a large number of independent tasks to a heterogeneous computing platform is still a hindrance, due to the different computation and communication speeds of resources.
- A further challenge to load balancing lies in the lack of accurate resource status information at the global scale. Redistributing tasks from heavily loaded processors to lightly loaded ones in dynamic load balancing requires quick information collection at runtime so that the scheduling technique can be rectified.
The goal of this study is to apply a multi-agent Reinforcement Learning technique to the problem of scheduling and load balancing in a grid-like environment, dynamically distributing the workload over all available resources in order to obtain maximum throughput.
MATERIALS AND METHODS
The key features of our proposed solution are: support for a wide range of parallel applications; use of advanced Q-Learning techniques in the architectural design and development; multiple reward calculation; and QL-based analysis, learning and prediction. The architecture diagram of our proposed system is shown in Fig. 2.
Fig. 2: Architecture diagram of proposed system
Modules description: The Resource Collector communicates directly with the Linux kernel to gather resource information in the grid. The Resource Analyzer displays the load statistics. Tasks submitted from outside the boundary are buffered by the Task Collector. The Task Manager handles user requests for task execution and communication with the grid.
The Log Generator saves the collected information about each grid node and about executed tasks. The Performance Monitor monitors resource and task information and signals load imbalance and task completion to the Q-Learning Load Balancer in the form of an RL (Reinforcement Learning) signal (described after the sub-module description). It is also responsible for backup in case of system failure.
Before scheduling the tasks, the QL Scheduler and Load balancer dynamically gets a list of available resources from the global directory entity. In Q-Learning, the states and the possible actions in a given state are discrete and finite in number. A detailed view of QL Scheduler and Load balancer is shown in Fig. 3.
Sub-module description of QL scheduler and load balancer:
- The QL Analyzer receives the list of executable tasks from the Task Manager and the list of available resources from the Resource Collector. It analyzes the submission time and size of each input task and forwards this information to the State Action Pair Selector.
- The State Action Pair Selector searches the nearest matched states of the current input and gets its action set At for each agent from the Log History. This information reflects the efficiency and performance of that resource in the past. (Nearest matched states are searched on the basis of task submission time and size.)
- The Reward Calculator calculates the reward using the five vectors of Eq. 2 as reward parameters: CPU speed, load, memory, status and task execution time. The reward is calculated for each resource on the basis of information collected by the State Action Pair Selector.

Fig. 3: Detailed view of Q-Learning Scheduler and Load Balancer
Here Tw is the task wait time and Tx the task execution time; a, b, c, d and e are constants determining the weight of each contribution from history.
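Since Eq. 2 itself is not reproduced in the text, the following is only a hypothetical weighted-sum form consistent with the description above; the function name, the signs of the contributions and the default weights are all assumptions:

```python
def reward(cpu_speed, load, free_memory, status, t_wait, t_exec,
           a=1.0, b=1.0, c=1.0, d=1.0, e=1.0):
    """Hypothetical weighted-sum reward over the five parameters named in
    the text. Fast, lightly loaded, available resources that finish tasks
    quickly score higher; a, b, c, d, e weight each contribution."""
    return (a * cpu_speed
            + b * free_memory
            + d * status              # e.g. 1 if the resource is up, else 0
            - c * load                # heavier load lowers the reward
            - e * (t_wait + t_exec))  # Tw + Tx: slower completion lowers it
```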
- The Q-Table Generator generates the Q-Table and the Reward-Table and places reward information in the Reward-Table.
- The Q-Value Calculator follows the Q-Learning algorithm to calculate the Q-value for each node and update these Q-values in the Q-Table. The γ value is zero and an ε-greedy policy is used in our proposed approach. The algorithm is:

  Initialize Q(s, a) arbitrarily
  Repeat (for n episodes), where 0.1 < α < 0.3
    Repeat for each step of the episode (learning):
      Take action a, observe reward r, move to next state s'
      Qi,t+1 ← Qi,t + α (r - Qi,t),  i = 1...n
      s ← s'
  Until learning is done
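Because γ = 0 in the listing above, the update target reduces to the immediate reward. A minimal sketch of that simplified update together with the ε-greedy choice over per-resource Q-values (the α and ε values are examples within the stated range):

```python
import random

alpha, epsilon = 0.2, 0.1   # 0.1 < alpha < 0.3, as in the listing above

def select_resource(q_values):
    """Epsilon-greedy choice over per-resource Q-values."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))      # explore
    return max(range(len(q_values)), key=lambda i: q_values[i])  # exploit

def update(q_values, i, r):
    """Q_{i,t+1} = Q_{i,t} + alpha * (r - Q_{i,t}); with gamma = 0 the
    target is just the immediate reward r."""
    q_values[i] += alpha * (r - q_values[i])

q = [0.0, 0.0, 0.0]
update(q, 1, 1.0)   # resource 1 returned reward 1
# q[1] moves from 0 toward 1 by a factor alpha, i.e. to 0.2
```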
- The QL History Generator stores state-action pairs (st, at) along with their Q-values (calculated by the Q-Value Calculator) as episodic memory in the QL History.
- Task Mapping Engine: co-allocation is done by the Task Mapping Engine; it calculates the average distribution of tasks and distributes them over the selected resources. The average distribution of tasks for resource Ri (Avg. Dist. Ri) is calculated by dividing the Q-value of that particular resource by the Cumulative Q-Value (CQV), as in Eq. 7:

  Avg. Dist. Ri = Q(Ri) / CQV    (7)

  CQV is the sum of the Q-values of all resources, calculated by Eq. 8:

  CQV = Σi Q(Ri)    (8)

  Equation 9 defines how many subtasks will be given to each resource; β is a constant for determining the number of sub-jobs, calculated by averaging over all submitted sub-jobs from history.
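The proportional split of Eq. 7-8 can be sketched as follows. Since Eq. 9 is not reproduced in the text, the rounding of fractional shares and the handling of leftover tasks below are assumptions made only to keep the example runnable:

```python
def distribute_tasks(q_values, n_tasks):
    """Split n_tasks across resources in proportion to their Q-values:
    share_i = Q_i / CQV (Eq. 7), with CQV = sum of all Q-values (Eq. 8).
    Fractional counts are floored and the remainder goes to the best resource."""
    cqv = sum(q_values)                       # cumulative Q-value (Eq. 8)
    shares = [qv / cqv for qv in q_values]    # Avg. Dist. per resource (Eq. 7)
    counts = [int(s * n_tasks) for s in shares]
    counts[counts.index(max(counts))] += n_tasks - sum(counts)  # leftover
    return counts

print(distribute_tasks([4.0, 2.0, 2.0], 8))  # → [4, 2, 2]
```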
- The Task Analyzer shows the distribution and runtime performance of tasks on grid resources. Output is displayed after successful execution.
- Finally, the Log Generator generates a log of successfully executed tasks.
Reinforcement learning signals:
Task completion signal: After successful execution of a task, the Performance Monitor signals the Reward Calculator (a sub-module of the QL Scheduler and Load Balancer) with the task completion time. After receiving the RL signal, the Reward Calculator calculates the reward and updates the Q-value in the Q-Table.
Load imbalance signal: The Performance Monitor keeps track of the maximum load on each resource in the form of a threshold value. This threshold is calculated from the resource's historical performance on the basis of average load, and indicates overloading or under-utilization of resources. On finding a load imbalance, the Performance Monitor signals the QL Load Balancer to start working and remap the subtasks onto under-utilized resources.
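A minimal sketch of such a threshold check. The exact threshold formula is not given in the text, so the 1.5x margin over the historical average load is a hypothetical choice for illustration:

```python
def imbalance_signal(load_history, current_load):
    """Flag overload when the current load exceeds a threshold derived
    from the resource's historical average load (hypothetical 1.5x margin)."""
    threshold = 1.5 * (sum(load_history) / len(load_history))
    return current_load > threshold

# A node averaging load 2.0 raises the signal when its load reaches 4.0,
# but not at a mild 2.5.
assert imbalance_signal([2.0, 2.0, 2.0], 4.0)
assert not imbalance_signal([2.0, 2.0, 2.0], 2.5)
```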
RESULTS AND DISCUSSION
The experiments were conducted on a Linux operating system kernel patched with OpenMosix as the fundamental base for the Resource Collector. For comparison we use Guided Self-Scheduling (GSS) and Factoring (FAC) as non-adaptive algorithms, and Adaptive Factoring (AF) and Adaptive Weighted Factoring (AWF) as adaptive algorithms.
The experiments to verify and validate the proposed algorithm are divided into two categories. The first category of experiments is based on learning with varying load and resources. The second category describes the effect of load and resources on Q-Scheduling versus other scheduling (adaptive and non-adaptive). Experiments were conducted for different numbers of processors, episodes and task input sizes. Multidimensional computational matrices and POV-Ray are used as benchmarks to observe the optimized performance of our system. Cost is used as the performance metric to assess our Q-Learning based grid application; it is calculated by multiplying the number of processors P by the parallel execution time Tp.
Starting with the first category, Table 1-2 and Fig. 4 show the execution time comparison for different numbers of episodes and processors. The tables show that execution time decreases as the number of episodes increases, since the system learns better from more experience. The results obtained from these comparisons highlight the achievement of the goal of this research work: attaining performance improvements by increasing learning.
For the second category of experiments, Fig. 5-7 show the cost comparison for 500, 5000 and 10000 episodes, respectively. These graphs show that the proposed approach performs better than non-adaptive techniques such as GSS and FAC, and even than advanced adaptive techniques such as AF and AWF. Consistent cost improvement can be observed as the number of processors increases.
Table 1: Execution time for 10000 episodes vs. 6000 episodes with 30 input tasks and increasing number of processors
Table 2: Execution time for 5000 episodes vs. 200 episodes with 60 input tasks and increasing number of processors
Fig. 4: Execution time for 8000 episodes vs. 4000 episodes with 30 input tasks and increasing number of processors
Fig. 5: Cost comparison of QL Scheduling vs. Other Scheduling with increasing number of processors for 500 episodes
For Q-learning, there is a significant drop in cost when the number of processors is increased from 2 to 8. However, Tp does not change significantly as processors are further increased from 12 to 32. This validates the hypothesis that the proposed approach provides better scheduling solutions than the other adaptive and non-adaptive techniques. The proposed technique also handles load distribution overhead, which is the major cause of performance degradation in traditional dynamic schedulers.
Fig. 6: Cost comparison of QL Scheduling vs. Other Scheduling with increasing number of processors for 5000 episodes
Fig. 7: Cost comparison of QL Scheduling vs. Other Scheduling with increasing number of processors for 10000 episodes
Fig. 8: Cost comparison of Q Scheduling vs. Other Scheduling with increasing number of tasks for 500 episodes and 8 processors
Figure 8 shows the cost comparison with an increasing number of tasks for 8 processors and 500 episodes. Again, this graph shows the better performance of the QL scheduler compared with the other scheduling techniques. The results of Fig. 8 highlight the achievement of attaining maximum throughput using Q-Learning while increasing the number of tasks. By outperforming the other approaches, QL-Scheduling achieves the design goals of dynamic scheduling, cost minimization and efficient utilization of resources.
CONCLUSION AND FUTURE WORK
The aspiration of this research was fundamentally a challenge to machine learning: computer systems that optimize their own performance by learning from experience, without human assistance. To adjust repeatedly in response to a dynamic environment, such systems need the adaptability that only machine learning can offer. In this regard, the use of Reinforcement Learning is more precise and potentially computationally cheaper than other approaches. This research has shown the performance of the QL Scheduler and Load Balancer on distributed heterogeneous systems. Co-scheduling is done by the Task Mapping Engine on the basis of the cumulative Q-value of agents. The Performance Monitor is responsible for backup in case of system failure and for signaling load imbalance.
From the learning point of view, performance analysis was conducted over a large number of task sizes, processors and episodes for Q-Learning. The optimality and scalability of QL-Scheduling were analyzed by testing it against adaptive and non-adaptive scheduling for varying numbers of tasks and processors. As each agent learns from the environment's response, taking five vectors into consideration for reward calculation, the QL Load Balancer can provide enhanced adaptive performance. Ultimately, the outcome indicates an appreciable and substantial performance improvement in an application built using this approach.
In future work we will enhance this technique using the SARSA algorithm, another form of Reinforcement Learning, and will try to merge our methodology with the algorithm proposed by Verbeeck et al. (2005).