It is assumed that the CPU loads of both the primary processes and the
backup processes running in the system are known in advance. The assumption
is valid in the computing environment where the occurrence, load and duration
can be predicted. Examples of such systems include on-line transaction
processing and real-time systems, in which most of the processes are running
continuously or repeating in a periodic manner (Kim et al., 1997;
Mullender, 1993).
Furthermore, it is also assumed in this model that every backup process
is assigned with an id and when a fault occurs, the role of the primary
process Pi is taken over by the one with the smallest id until
no more backup process for Pi is available (i.e., the backup
process startup sequence for Pi is Bi-1, Bi-2,...,
Bi-r). This can be achieved by using some sophisticated leader
election algorithms such as bully algorithm or ring-based election algorithm
(Garcia-Molina, 1982; Singh and Kurose, 1994; Stoller, 2000).
The primary-backup process model considered in this study has been used
heavily in the high available distributed systems such as Grid Service,
Network File Server, Application Server Clusters and fault tolerant computer
systems such as the Tandem Nonstop System and Delta System (Powell, 1994;
Zhang et al., 2004; Liu et al., 2005)
Notation: The following notation is used to formulate the allocation
problem:
 |
| Fig. 1: |
Primary-backup based fault tolerant process |
| N |
= No. of all the nodes |
| M |
= No. of primary processes |
| aj |
= 0 if node j has failed, 1 otherwise. |
| V |
= The set of all the possible node status vectors in the form [a1a2a3...an] |
| V |
= {[a1a2a3...an]} |
| R |
= No. of the backup processes for each primary process 1≤R
|
| F |
= No. of current faulty nodes. 0≤F≤R. Thus,  |
| Pi |
= Load of primary process i |
| Bi |
= Load of the backup processes of primary process i (Assume every
backup process belonging to one primary-backup process group has equal
load) |
| Xij |
= 1 if primary process i is allocated to node j, 0 otherwise |
 |
= 1 if rth backup replica of primary process i is allocated to node
j, 0 otherwise |
 |
= 1 if rth backup replica of process i is available, 0 otherwise.
Thus,  |
| τj |
= Load increment to be added to node j when the fault occurs |
| μj |
= The node set that process j can be allocated on |
| P(j) |
=Total load of node j |
| φ(Vi) |
=Difference between the maximum and the minimum load when given
a node status vector Vi (Vi∈V) |
Formal problem description: In this part of research, we formally
describe the load balancing allocation for replicated R-Resilient fault
tolerant processes.
The load-balancing process allocation problem is represented as a constrained
optimization problem, which aims to minimize an objective function subject
to constraints on the possible values of the independent variable. The
constraints and objective function are described below.
Let us assume that there are N nodes and M primary processes in the system.
Because each primary process has R backup processes, the total number
of primary and backup processes is (R+1)M. Primary processes can be allocated
to any node. However the primary and backup processes that belong to the
same primary-backup group cannot be allocated to the same node. Thus,
The load increment to be added to node j in the event of a fault in node
k can be represented as
where, f(i) represents the id of the elected backup process replica to
take over the role of the primary process i (i.e., f(i) equals 2 if the
backup replica with id 1 is unavailable and the backup replica with id
2 is available). Thus,
Hence, the total load increment for node j is:
The load of node j, denoted as P(j), is the sum of the load before the
fault occurrence and the load increment τj to be incurred
upon the occurrence of a fault.
There are two metrics for evaluating load balance between nodes. One
is the standard deviation of the processor loads used in Mullender (1993).
Another one is the load difference between the node with the heaviest
load and the node with the lightest load (Kim et al., 1997). We
use the second one in order to keep the metric same as the two-stage allocation
algorithm.
The objective function of load balance evaluation is formally described
as below.
For a given node status vector Vi (Vi = [a1a2a3...an](ViεV)),
the load difference denoted as φ(Vi), can be represented
as follows:
The node status vector Vi is divided into Vnon
and Vfaulty according to whether a faulty node exits. Moreover,
Vfaulty is divided into Vf 1,Vf 1, …
et al., according to the number of faulty nodes exit.
The objective function φ is defined as:
Wi (0≤i≤size of (V)) denotes the relative weights of
importance before and after faults occurrence, respectively. The value
for each Wi depends on the possibility of the pertinent node
status occurrence in reality. However, to give an obvious comparison of
the load distribution performance for the process allocation algorithms,
we assume that the weights have the same value, i.e., W0 =
W1 = W2 = ... = Wi = 1. Thus, the objective
function becomes:
Hence, the load balancing process allocation is the problem of finding
values of Xij and
for all possible i (1≤i≤M), j(1≤j≤N), k(1≤k≤R)
that minimize the objective functions φ in our fault tolerant system
model with the constraint given in Eq. 1.
HEURISTIC PROCESS ALLOCATION ALGORITHMS
In this part of study, two-stage allocation algorithm is firstly
introduced with an example and then R-stage allocation algorithm is presented
using the same example. The result shows that RSA improve the load distribution
performance than two-stage allocation algorithm when primary processes
are replicated more than once. Then the time complexity of the RSA is
analyzed.
Two-stage allocation: Here we present a simple example. Assuming
there are 20 primary processes (M = 20) running on 4 nodes (N = 4). Each
primary process has 2 backup processes (R = 2). The processes loads are
shown in Table 1.
The two-stage allocation algorithm works as follows.
In the first stage, a greedy method is used to allocate the primary process
with the highest load to the node with the lowest load. Table
2 shows the allocation results after the first stage.
| Table 1: |
Process and their loads |
 |
| Table 2: |
Allocated primary process |
 |
| Table 3: |
Grouping backup processes on node 1 |
 |
In the second stage, for each node, the backup processes are firstly
sorted in descending order using the load difference between the primary
process and the backup. Secondly, the backup processes of the primary
processes on each node are divided into (N-1) groups having approximately
equal incremental load by assigning each r backup processes to the r group
with the smallest load, in the order of the sorted list in the previous
step. This load difference is the amount of load increment to be incurred
upon the occurrence of a fault. The total number of backup groups generated
in this step is N(N-1). Thirdly, the algorithm computes the actual load
of each group using the actual load of the backup processes. Greedy method
is used again to allocate the group with the highest actual load to the
node with the lowest load. However, when allocating each backup group,
the algorithm checks whether there is a pre-allocated backup group that
comes from the same node as the to-be allocated backup group. In such
a case, the node with the next-to-the-minimum load is selected (Lee et
al., 1999).
Table 3 shows the grouping results on node 1 in the
example and Table 4 shows the final allocation results
computed by the two-stage allocation algorithm.
In the two-stage allocation algorithm, the purpose of dividing the backup
processes into (N-1) groups for each node is to guarantee that each node
has an approximately equal amount of load increment. Hence, the system
will have a balanced load when a fault occurs. The purpose of computing
the actual load of each group and assigning groups based on their actual
loads is to guarantee that each processor`s load is balanced before the
occurrence of a fault. Therefore, the algorithm can balance the processor
load before as well as after the occurrence of a fault (Kim et al.,
1997).
| Table 4: |
Final results of two-stage allocation |
 |
However, for a more prevalent situation when each primary process is
duplicated more than once, simply applying the two-stage allocation algorithm
would reveal two notable defects:
| • |
Allocating all the backup processes in one stage decreases
the load-balancing performance since the point that backup processes
start at different time according to their assigned id is ignored.
In other words, comparing to Bi-2, Bi-1 has
higher election priority but is now treated equally. In the previous
example as shown in Table 3 and 4,
when Node 1 is failed, those preferential backup processes (B1-1,
B6-1, B15-1, B16-1, B20-1)
are only started on Node 2 and 4 while no load is added to Node 3.
It is more reasonable to allocate Bi-1s and Bi-2s
in different stages since for any fixed i, Bi-1 and Bi-2
cannot start with each other at the same time although they are both
the backup replicas for P2. |
|
• |
Grouping backup processes only based on each node cannot well balance
the load when multiple faults happen. For example, as shown in Table
4, P19, B20-1, B16-1 are allocated on Node 2
while P20, P16 and B19-1 are allocated on Node 1. Consider
when Node 1 and Node 2 both fail, P20, P16, P19 would lose their primary
and first-preference backup processes which lead to require start
up their second-preference backup processes. Thus, B20-2,
B16-2, B19-2 need to be considered grouping
together although their primary processes and first-preference backup
processes are not allocated on the same node. |
Therefore, the two-stage allocation algorithm lacks scalability due to
it pays little consideration to the backup replicas startup sequence and
only groups the backup processes on each node. Its load-balancing performance
decreases when the number of backup processes for each primary process
(R) increases, as its grouping algorithm cannot guarantee that each node
has an approximately equal amount of load increment when faults occur.
A new allocation algorithm named R-stage allocation that solves the above
problems is presented as follows.
R-stage allocation: To solve the defects of two-stage allocation
algorithm described earlier, the new allocation algorithm has to consider
two issues. First, since Bi-1 and Bi-2 need to be
allocated in different stages separately, which one should be allocated
prior than the other? Second, to group the backup processes on all the
nodes rather than on each node, what grouping rule should the new allocation
algorithm to use? The following lemmas which form the core of the R-stage
allocation algorithm address these questions one by one.
LEMMA 1: If 1≤xallocate(allocate(nodes,
Bi-x),Bi-y)) ≤φ (allocate (allocate
(nodes, Bi-y), Bi-x).
The function allocate(nodes, Bi-x) represents allocating
Bi-x to the proper nodes for all i such that 1≤i≤M.
Lemma 1 implies that the backup processes with smaller id should be allocated
in the stage prior to those with bigger id to minimize φ (In present
model, the backup processes with smaller id have higher priority to take
over the role of the primary processes).
LEMMA 2: If size of (μi∩μj)≥size
of(μi∩μk), then φ(group(group(i,
j), k))≤φ(group(group(i, j), j).
The function group (i, j)represents dividing processes i and j into groups
having approximately equal incremental load.
Lemma 2 implies that one process should be considered grouping with the
processes having the maximum μ intersection with it to minimize φ.
This is because the bigger the μ intersection of processes is, the
more possible that these processes would start up at the same time. For
example, processes with μ intersection as {2, 3, 4} means they would
start up if one node (Node 1) fails (Assuming there are only 4 nodes).
Thus, they have more possibility to start up at the same time than the
processes with μ intersection as {1, 3} which means the processes
would start up if two nodes (Node 2 and 4) both fail.
According to the two lemmas described above, RSA firstly allocates primary
processes using the same way as the two-stage allocation algorithm. But
for the backup processes allocation, RSA needs R rounds to accomplish
(R is the number of the backup process replica). The main principal in
RSA is to allocate the backup process replica groups one by one in their
election sequence. The algorithm is formally described below.
R-stage algorithm
Stage-0: Allocate primary processes:
| • |
Sort primary processes in descending order of CPU load. |
| • |
Allocate each primary process to the node with the minimum load
from the highest load to the lowest. |
According to Lemma 1, the backup processes are allocated based on their
election sequence (Bi-1, Bi-2,..., Bi-r)
in different stages. For example, Stage-2 allocates Bi-2, Stage-R
allocates Bi-r. The stage-1 which allocates Bi-1
is presented below. The following stages after stage-1 use the similar
steps as it.
Stage-1 Allocate Bi-1: Firstly, backup groups are generated
by the steps below:
| • |
Compute the load difference between each primary process and its
corresponding Bi-1. |
| • |
Sort Bi-1 in descending order using the load difference.
|
According to Lemma 2, the backup processes with maximum μ intersection
are grouped first; a hashmap is constructed in the next step to facilitate
the grouping operations.
| • |
Construct a hashmap whose key is Bi-1 and value is a
list including intersections section of μ (Each intersection
has at least one element). |
| • |
Go to step (7) if the hashmap value
set is empty, otherwise, merge the same μ intersection for the
value of each key in the hashmap, the fake code is described below,
|
|
• |
Sort the μ intersections in the value set of the hashmap in
descending order of the μ intersection size. |
| • |
Divide Bi-1 into groups according to the μ intersections
in the order of the sorted list in the previous step. The fake code
of this step is described below: |
While (sorted list size is not empty)
{
Let σ be the first μ intersection in the sorted
list generated in the previous step
Let Partn be a temporary partition. Select out all the
Bi-1 appeared in σ into it.
Divide the backup processes in Partn into min (size of
σ, size of Partn) groups having approximately equal incremental load
by assigning each backup process with the maximum load difference to the
group with the smallest difference load.
Remove all the intersections in the sorted list if they include one of
the Bi-1 in Partn. (This can be quickly done by utilizing the
previous constructed hashmap).
}
| • |
After the previous steps, the left non-grouped Bi-1s
only have the empty intersection of μ with each other; hence,
each of them forms a group. |
Secondly, do the following for the backup groups that are generated by
the steps above.
| • |
Sort all the backup groups using the actual loads in descending
order. |
|
• |
Sort the N nodes using their current loads in ascending order. |
| • |
Allocate each backup group to the node with the minimum load. However,
if the node does not belong to the μ intersections of the backup
processes in this group or if one of the backup groups that has already
been allocated to this node comes from the same Partnas the backup
group to be allocated, choose the node with the next-to-the-minimum
load. |
The left backup processes are allocated using the same algorithm described
in the stage-1. Stage-2 allocates Bi-2, Stage-R allocates Bi-r.
The same example described previously in the two-stage algorithm is used
here. Since each primary process has two backup replicas, two stages are
used to allocate the backup process replicas.
After primary processes have been allocated, RSA allocates all the Bi-1
on the nodes according to their μ intersections. Table
5 shows the grouping results for the μ intersection {2, 3, 4}
during stage-1. Table 6 shows the allocation result in stage-1.
| Table 5: |
Grouping backup processes in stage one |
 |
| Table 6: |
Allocation results in stage one |
 |
| Table 7: |
Grouping backup processes in stage two |
 |
| Table 8: |
Final results of r-stage allocation |
 |
In stage-2, Bi-2 are grouped as shown in Table
7. After allocating each group to the nodes, RSA gets the final result
as shown in Table 8.
Let us compare the load distribution in Table 4 (two-stage
allocation results) with Table 8 (RSA results). When Node 1 fails, the
preferential backup processes (Bi-1, B6-1, B15-1
B16-1, B20-1) are only started on Node 2 and Node
4 in Table 4 while in Table 8, these
backup processes are started on the left available nodes; hence φ
is minimized in Table 8. Apparently, RSA balances the
load better after faults occurrence.
Algorithm complexity:
| • |
Primary process allocation: A sorting
algorithm whose running time is O(M lg M)is used to sort M primary
processes. Allocating M primary processes to N nodes requires O(M
lg N) time. |
| • |
Backup process allocation: Let us first consider the time complexity
for one stage. |
Sorting M backup processes takes O(M lg M) time.
It requires
time to select arbitrarily two backup processes and takes at most 2N
time to compute their μ intersections. Thus, the hashmap construction
requires O(M2N). Merging and sorting its value set in descending
order of the size of the intersection requires O(M2 lg M).
The worst case for step 5 and step 6 is that every two backup processes
forms a separate Partn, thus requires O(M.
There are at most M groups generated by the previous steps. Allocating
these groups to N nodes requires o(M lg N) time.
So the worst time complexity for the backup process allocation in one
stage is O(M lg M+M2N+M2lgM+M+M lg N).
The above steps described are repeated in R stages, which together require
O(R(M lg M+M2N+M2 lgM+M+M lg N).
Hence, plus the primary process allocation and R Stages of backup process
allocation, the total time complexity of this algorithm is
Assuming that the number of processes is much larger than the number
of nodes (M>N2), the execution time is bounded by O(M2R
lg M).
Comparing Eq. 8 with the time complexity of two stage allocation algorithm
O(NM lg NM+N3 lg N), we can see the time overload of RSA is
acceptable and it is due to using more allocation stages (R stage instead
of two stage), hashmap construction and its value set sorting.
PERFORMANCE COMPARISON
In this part of study, we compare the load distribution performance
of RSA with the two-stage algorithm using simulations.
The environment parameters used in the simulation are as follows. To
keep the total load of each node below 100%, the load of the primary processes
is chosen randomly in the range of 1 to 100(N-R)/M based on a uniform
distribution. The loads of the backup processes are also chosen randomly
between 5~10% of the load of their primaries, also based on a uniform
distribution.
Figure 2 shows the simulation results on how φ
in Eq. 10 is affected by R in two-stage allocation algorithm and RSA algorithm.
It is assumed that the number of nodes (N) is eight. The Y-axis represents
the value of φ while the X-axis represents R (the number of the backup
processes replica). In two-stage algorithm, φ increases dramatically
when R increases especially when R reaches N-1 due to the fact that the
algorithm does not group the backup replicas on each node well. RSA algorithm
minimizes φ hence it has much better scalability and load distribution
performance than two-stage allocation algorithm.
Assuming the number of process is 400 and the node number is eight, Fig.
3 shows the maximum and minimum load using the two-stage algorithm
and RSA after one node fails. The X-axis represents R while the Y-axis
represents the CPU load. The difference between maximum load and minimum
load in RSA is smaller than the two-stage algorithm especially when R
increases. Figure 4 shows the simulation results after
two nodes fail based on another generated random test data but with the
same parameters as Fig. 3.
 |
| Fig. 2: |
φ affected by R (No. of backup process) |
 |
| Fig. 3: |
Maximum and minimum load when one fault occurs |
 |
| Fig. 4: |
Maximum and minimum load when two faults occur |
CONCLUSION
In this study, we considered the static process allocation for load
distribution in the primary-backup based Fault-Tolerant system.
In the primary-backup fault tolerant model, only primary process receives
the invocations from the client process and sends the response back. Hence
the load of primary process is much bigger than its backup processes.
And when a fault occurs, one of the backup processes will be elected to
take over the role of the primary process. Therefore, the load of this
elected backup process varies before and after the occurrence of a fault.
Previous research shows it is an NP-hard problem to find out a static
process allocation algorithm to balance the system load before as well
as after faults occur.
The main contribution of this study is the presentation and analysis
of a new heuristic approximation static load-balancing algorithm for replicated
R-Resilient process in the primary-backup based Fault Tolerant System.
The proposed algorithm has better scalability and load distribution performance
than the previous allocation algorithms in this area. In this study, we assume
there is no difference between each node. We are currently working on
extending this algorithm to handle heterogeneous distributed systems.
Also, we plan to study the process allocation in the dynamic situation
in the future.