
Information Technology Journal

Year: 2008 | Volume: 7 | Issue: 4 | Page No.: 599-606
DOI: 10.3923/itj.2008.599.606
Replicated R-Resilient Process Allocation for Load Distribution in Fault Tolerant System
Jian Wang, Jianling Sun, Xinyu Wang and Hang Chen

Abstract: Process allocation for load distribution can improve system performance by utilizing resources efficiently. For primary-backup based fault tolerant systems, a classic load-balancing process allocation method (the two-stage allocation algorithm) has been proposed that balances the load before as well as after faults occur. However, the two-stage allocation algorithm scales poorly: its load-balancing performance degrades dramatically when each primary process is duplicated more than once (i.e., has more than one backup process). In this study, we present an improved algorithm named RSA (R-Stage Allocation algorithm) that keeps the load well balanced no matter how many backup processes each primary process owns. Simulations comparing the proposed algorithm with the two-stage allocation algorithm show that, when extended to replicated R-Resilient processes, RSA has significantly better load distribution performance.


How to cite this article
Jian Wang, Jianling Sun, Xinyu Wang and Hang Chen, 2008. Replicated R-Resilient Process Allocation for Load Distribution in Fault Tolerant System. Information Technology Journal, 7: 599-606.

Keywords: process allocation, replicated processes, Primary-backup, fault-tolerant and load-balancing

INTRODUCTION

Process replication is heavily used by software-based fault tolerant systems, and how to allocate the replicated processes for load distribution is an important research area in the study of such systems. Bannister and Trivedi (1983) first presented an algorithm that evenly distributes the system load over all nodes, improving system performance by utilizing resources efficiently. An assumption of their work is that all the primary processes and the replicas play the same role: the invocations of a client process are received and processed by the non-faulty replicas in the same order. This fault tolerant approach is also called Active Replication or the State-Machine Approach (Mullender, 1993).

In addition, there is another important fault tolerant approach called Primary-Backup (also known as Passive Replication), in which one of the replicas, called the primary, plays a special role (Mullender, 1993): it receives the invocations from the client process and sends the responses back. The backup replicas interact with the primary and do not interact directly with the client process.

The allocation algorithm presented by Bannister and Trivedi cannot be applied to primary-backup based fault tolerant systems, because active replication uses more resources than the primary-backup approach by having all process replicas carry the same load and execute the same client invocations. In the primary-backup approach, only the primary process executes the client invocation and then sends a state update message to the backup processes; a backup process only needs to update its state, so its load is much lower than that of the primary process, typically 5-10% of the primary's load.

Considering this difference, Kim et al. (1995, 1997) proposed another static allocation algorithm for primary-backup based fault tolerant systems that balances the load before as well as after faults occur. They first formalized the problem and proved that this kind of primary-backup process allocation is NP-hard; they then gave a heuristic called the two-stage allocation algorithm, under the assumption that each primary process is duplicated only once (has only one backup process). Later, Guo et al. (2005) extended the two-stage allocation algorithm to heterogeneous distributed systems.

However, the two-stage allocation algorithm gives little consideration to the more prevalent situation in which each primary process is duplicated more than once. In this case, when a fault occurs, one of the backup replicas is elected to take over the role of the primary process, until no more backup processes are available. Although the two-stage allocation algorithm was extended (Lee et al., 1999) to handle this case, the critical point that the backup replicas of the same primary process start up at different times is ignored.

In this study, we present a new process allocation algorithm named the R-Stage Allocation algorithm (RSA) that balances the load after faults occur better than the two-stage allocation algorithm, no matter how many backup replicas each primary process owns. RSA specifies the backup process startup sequence by assigning an id to every replica and using an election algorithm to select the replica with the smallest id to take over the role of the primary process when a fault occurs.

FAULT-TOLERANT PROCESS ALLOCATION

Fault-tolerant system model: In this study, we use a system model similar to that considered by Lee et al. (1999). As shown in Fig. 1, the fault-tolerant system in this model consists of N nodes. To tolerate faults, each process is replicated R times (1≤R<N) and executed as a group, referred to as a primary-backup process group. With R backup processes, the system can tolerate at most R faulty nodes (assuming nodes only have fail-stop failures; all the processes running on a faulty node become unavailable). Primary processes can be allocated to any node. However, there is one restriction on the placement of backup processes: the primary and backup processes that belong to one primary-backup group cannot be allocated to the same node.

It is assumed that the CPU loads of both the primary processes and the backup processes running in the system are known in advance. This assumption is valid in computing environments where the occurrence, load and duration of processes can be predicted. Examples of such systems include on-line transaction processing and real-time systems, in which most processes run continuously or repeat in a periodic manner (Kim et al., 1997; Mullender, 1993).

Furthermore, it is also assumed in this model that every backup process is assigned an id and, when a fault occurs, the role of the primary process Pi is taken over by the available backup with the smallest id, until no more backup processes for Pi are available (i.e., the backup process startup sequence for Pi is Bi-1, Bi-2, …, Bi-r). This can be achieved using leader election algorithms such as the bully algorithm or ring-based election algorithms (Garcia-Molina, 1982; Singh and Kurose, 1994; Stoller, 2000).
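The smallest-id takeover rule can be sketched in a few lines; this is an illustrative helper under our own naming, not code from the paper (a real system would determine availability through a leader election algorithm such as the bully algorithm):

```python
def elect_takeover(backup_ids, available):
    """Return the id of the available backup replica with the smallest id,
    or None when no backup is left (illustrative helper, not from the paper)."""
    candidates = [b for b in backup_ids if available.get(b, False)]
    return min(candidates) if candidates else None
```

For example, if Bi-1 has failed but Bi-2 and Bi-3 survive, the replica with id 2 takes over.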

The primary-backup process model considered in this study has been used heavily in highly available distributed systems such as Grid services, network file servers and application server clusters, and in fault tolerant computer systems such as the Tandem NonStop system and the Delta-4 system (Powell, 1994; Zhang et al., 2004; Liu et al., 2005).

Notation: The following notation is used to formulate the allocation problem:

Fig. 1: Primary-backup based fault tolerant process

N = No. of nodes
M = No. of primary processes
aj = 0 if node j has failed, 1 otherwise
V = The set of all possible node status vectors of the form [a1a2a3…aN], V = {[a1a2a3…aN]}
R = No. of backup processes for each primary process, 1≤R<N
F = No. of currently faulty nodes, 0≤F≤R; thus, Σj aj = N−F
Pi = Load of primary process i
Bi = Load of the backup processes of primary process i (every backup process belonging to one primary-backup process group is assumed to have equal load)
Xij = 1 if primary process i is allocated to node j, 0 otherwise
Xrij = 1 if the rth backup replica of primary process i is allocated to node j, 0 otherwise
Ari = 1 if the rth backup replica of primary process i is available, 0 otherwise
τj = Load increment to be added to node j when a fault occurs
μj = The node set that process j can be allocated on
P(j) = Total load of node j
Φ(Vi) = Difference between the maximum and the minimum load for a given node status vector Vi (Vi∈V)

Formal problem description: Here, we formally describe the load-balancing allocation of replicated R-Resilient fault tolerant processes.

The load-balancing process allocation problem is represented as a constrained optimization problem, which aims to minimize an objective function subject to constraints on the possible values of the independent variables. The constraints and the objective function are described below.

Let us assume that there are N nodes and M primary processes in the system. Because each primary process has R backup processes, the total number of primary and backup processes is (R+1)M. Primary processes can be allocated to any node. However, the primary and backup processes that belong to the same primary-backup group cannot be allocated to the same node. Thus,

Σ(j=1..N) Xij = 1,  Σ(j=1..N) Xrij = 1  and  Xij + Σ(r=1..R) Xrij ≤ 1  for all i, j    (1)

The load increment to be added to node j in the event of a fault in node k comes from the backup replicas on node j that are elected to take over failed primaries, where f(i) represents the id of the backup replica elected to take over the role of primary process i (i.e., f(i) equals 2 if the backup replica with id 1 is unavailable and the backup replica with id 2 is available). Thus,

τjk = Σ(i=1..M) (Pi − Bi) Xik Xf(i)ij    (2)

Hence, the total load increment for node j is:

τj = Σ(k: ak=0) τjk    (3)

The load of node j, denoted as P(j), is the sum of the load before the fault occurrence and the load increment τj to be incurred upon the occurrence of a fault.

P(j) = Σ(i=1..M) Pi Xij + Σ(i=1..M) Σ(r=1..R) Bi Xrij Ari + τj    (4)

There are two metrics for evaluating load balance between nodes. One is the standard deviation of the processor loads, used in Mullender (1993). The other is the load difference between the node with the heaviest load and the node with the lightest load (Kim et al., 1997). We use the second to keep the metric the same as in the two-stage allocation algorithm.

The objective function of the load balance evaluation is formally described below.

For a given node status vector Vi (Vi = [a1a2a3…aN], Vi∈V), the load difference, denoted as Φ(Vi), can be represented as follows:

Φ(Vi) = max{j: aj=1} P(j) − min{j: aj=1} P(j)    (5)

The node status vectors Vi are divided into Vnon and Vfaulty according to whether a faulty node exists. Moreover, Vfaulty is further divided into Vf1, Vf2, … according to the number of faulty nodes.

The objective function φ is defined as:

φ = Σ(Vi∈V) Wi Φ(Vi)    (6)

Wi (0≤i≤size of(V)) denotes the relative weight of importance before and after faults occur. The value of each Wi depends on the probability of the pertinent node status occurring in reality. However, to give a clear comparison of the load distribution performance of the process allocation algorithms, we assume that the weights have the same value, i.e., W0 = W1 = W2 = … = Wi = 1. Thus, the objective function becomes:

φ = Σ(Vi∈V) Φ(Vi)    (7)

Hence, the load-balancing process allocation problem is that of finding values of Xij and Xrij, for all possible i (1≤i≤M), j (1≤j≤N), r (1≤r≤R), that minimize the objective function φ in our fault tolerant system model under the constraint given in Eq. 1.
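As a concrete illustration of the quantities in Eq. 4-5, the following sketch computes node loads and the load difference under our own naming (`primary[i]` holds the node of primary i, `backup[i]` the nodes of its backups in id order, and the smallest-id surviving backup is promoted on a fault); it is an illustration, not the authors' implementation:

```python
def node_loads(N, P, B, primary, backup, failed=frozenset()):
    """P(j) for every non-faulty node j, after electing backups on faults."""
    load = [0.0] * N
    for i, node_p in enumerate(primary):
        if node_p not in failed:
            load[node_p] += P[i]                     # primary runs normally
            for node_b in backup[i]:
                if node_b not in failed:
                    load[node_b] += B[i]             # backups carry backup load
        else:
            promoted = False
            for node_b in backup[i]:                 # ordered by replica id
                if node_b not in failed:
                    load[node_b] += B[i] if promoted else P[i]
                    promoted = True                  # smallest id takes over
    return [load[j] for j in range(N) if j not in failed]

def phi(loads):
    """Load difference of Eq. 5: heaviest minus lightest non-faulty node."""
    return max(loads) - min(loads)
```

With a node failed, the promoted backup contributes the full primary load Pi instead of Bi, which is exactly the load increment the objective function penalizes.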

HEURISTIC PROCESS ALLOCATION ALGORITHMS

Here, the two-stage allocation algorithm is first introduced with an example and then the R-stage allocation algorithm is presented using the same example. The result shows that RSA improves the load distribution performance over the two-stage allocation algorithm when primary processes are replicated more than once. The time complexity of RSA is then analyzed.

Two-stage allocation: Here we present a simple example. Assume there are 20 primary processes (M = 20) running on 4 nodes (N = 4) and each primary process has 2 backup processes (R = 2). The process loads are shown in Table 1.

The two-stage allocation algorithm works as follows.

In the first stage, a greedy method repeatedly allocates the primary process with the highest remaining load to the node with the lowest current load. Table 2 shows the allocation results after the first stage.
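The first-stage greedy method can be sketched with a heap; the function name and interface are ours, a sketch rather than the authors' code:

```python
import heapq

def greedy_allocate(loads, N):
    """Allocate processes to N nodes: heaviest remaining process first,
    always onto the currently lightest node (illustrative sketch)."""
    heap = [(0.0, j) for j in range(N)]      # (current node load, node id)
    heapq.heapify(heap)
    assign = {}
    for i in sorted(range(len(loads)), key=lambda i: -loads[i]):
        load, j = heapq.heappop(heap)        # lightest node so far
        assign[i] = j
        heapq.heappush(heap, (load + loads[i], j))
    return assign
```

For instance, loads [5, 4, 3, 2] on 2 nodes end up with both nodes at load 7.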

Table 1: Process and their loads

Table 2: Allocated primary process

Table 3: Grouping backup processes on node 1

In the second stage, for each node, the backup processes are first sorted in descending order of the load difference between the primary process and its backup; this load difference is the amount of load increment to be incurred upon the occurrence of a fault. Second, the backup processes of the primary processes on each node are divided into (N-1) groups having approximately equal incremental load by assigning each backup process, in the order of the sorted list, to the group with the smallest load. The total number of backup groups generated in this step is N(N-1). Third, the algorithm computes the actual load of each group using the actual loads of the backup processes. The greedy method is used again to allocate the group with the highest actual load to the node with the lowest load. However, when allocating each backup group, the algorithm checks whether a pre-allocated backup group comes from the same node as the to-be-allocated backup group; in such a case, the node with the next-to-the-minimum load is selected (Lee et al., 1999).

Table 3 shows the grouping results on node 1 in the example and Table 4 shows the final allocation results computed by the two-stage allocation algorithm.

In the two-stage allocation algorithm, the purpose of dividing the backup processes into (N-1) groups for each node is to guarantee that each node receives an approximately equal amount of load increment; hence, the system will have a balanced load when a fault occurs. The purpose of computing the actual load of each group and assigning groups based on their actual loads is to guarantee that each processor's load is balanced before the occurrence of a fault. Therefore, the algorithm can balance the processor load before as well as after the occurrence of a fault (Kim et al., 1997).

Table 4: Final results of two-stage allocation

However, in the more prevalent situation where each primary process is duplicated more than once, simply applying the two-stage allocation algorithm reveals two notable defects:

Allocating all the backup processes in one stage decreases the load-balancing performance, since the fact that backup processes start at different times according to their assigned ids is ignored. In other words, compared with Bi-2, Bi-1 has higher election priority but is treated equally. In the previous example, as shown in Tables 3 and 4, when Node 1 fails, the preferential backup processes (B1-1, B6-1, B15-1, B16-1, B20-1) are started only on Nodes 2 and 4, while no load is added to Node 3. It is more reasonable to allocate the Bi-1s and Bi-2s in different stages since, for any fixed i, Bi-1 and Bi-2 can never start at the same time even though they are both backup replicas of Pi.
Grouping backup processes only per node cannot balance the load well when multiple faults happen. For example, as shown in Table 4, P19, B20-1 and B16-1 are allocated on Node 2 while P20, P16 and B19-1 are allocated on Node 1. When Node 1 and Node 2 both fail, P20, P16 and P19 lose their primary and first-preference backup processes, which requires starting up their second-preference backup processes. Thus, B20-2, B16-2 and B19-2 should be considered for grouping together, although their primary processes and first-preference backup processes are not allocated on the same node.

Therefore, the two-stage allocation algorithm lacks scalability because it pays little attention to the backup replica startup sequence and only groups the backup processes per node. Its load-balancing performance decreases as the number of backup processes per primary process (R) increases, since its grouping algorithm cannot guarantee that each node receives an approximately equal load increment when faults occur.

A new allocation algorithm, named R-stage allocation, that solves the above problems is presented below.

R-stage allocation: To fix the defects of the two-stage allocation algorithm described earlier, the new allocation algorithm has to consider two issues. First, since Bi-1 and Bi-2 need to be allocated in separate stages, which one should be allocated before the other? Second, to group the backup processes across all nodes rather than per node, what grouping rule should the new allocation algorithm use? The following lemmas, which form the core of the R-stage allocation algorithm, address these questions one by one.

LEMMA 1: If 1≤x<y≤R, then φ(allocate(allocate(nodes, Bi-x), Bi-y)) ≤ φ(allocate(allocate(nodes, Bi-y), Bi-x)).

The function allocate(nodes, Bi-x) represents allocating Bi-x to the proper nodes for all i such that 1≤i≤M.

Lemma 1 implies that the backup processes with smaller ids should be allocated in a stage prior to those with bigger ids to minimize φ (in the present model, the backup processes with smaller ids have higher priority to take over the role of the primary processes).

LEMMA 2: If size of(μi∩μj) ≥ size of(μi∩μk), then φ(group(group(i, j), k)) ≤ φ(group(group(i, k), j)).

The function group(i, j) represents dividing processes i and j into groups having approximately equal incremental load.

Lemma 2 implies that a process should be grouped with the processes having the maximum μ intersection with it to minimize φ. This is because the bigger the μ intersection of processes is, the more likely these processes are to start up at the same time. For example, processes with μ intersection {2, 3, 4} would start up if one node (Node 1) fails (assuming there are only 4 nodes). Thus, they are more likely to start up at the same time than processes with μ intersection {1, 3}, which would start up only if two nodes (Nodes 2 and 4) both fail.
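The example above can be made concrete with a small sketch (the function name and inputs are ours): with four nodes, a backup whose group already occupies Node 1 has μ = {2, 3, 4}, while one whose group occupies Nodes 2 and 4 has μ = {1, 3}.

```python
def mu(N, occupied_nodes):
    """μ of a process: the nodes it may be placed on, i.e. all nodes except
    those already holding a replica of its primary-backup group (sketch)."""
    return set(range(1, N + 1)) - set(occupied_nodes)
```

The two example sets intersect only in {3}, so those processes are less likely to start together than two processes sharing the full {2, 3, 4}.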

According to the two lemmas above, RSA first allocates the primary processes in the same way as the two-stage allocation algorithm. The backup process allocation, however, takes R rounds (R is the number of backup replicas per primary process). The main principle of RSA is to allocate the backup replica groups one by one in their election sequence. The algorithm is formally described below.

R-stage algorithm
Stage-0: Allocate primary processes:

Sort the primary processes in descending order of CPU load.
Allocate each primary process, from the highest load to the lowest, to the node with the minimum load.

According to Lemma 1, the backup processes are allocated in different stages based on their election sequence (Bi-1, Bi-2, …, Bi-r); for example, Stage-2 allocates the Bi-2 and Stage-R allocates the Bi-r. Stage-1, which allocates the Bi-1, is presented below; the subsequent stages follow the same steps.

Stage-1 Allocate Bi-1: First, backup groups are generated by the steps below:

1. Compute the load difference between each primary process and its corresponding Bi-1.
2. Sort the Bi-1 in descending order of this load difference.

According to Lemma 2, the backup processes with the maximum μ intersection are grouped first; a hashmap is constructed in the next step to facilitate the grouping operations.

3. Construct a hashmap whose key is a Bi-1 and whose value is a list of its μ intersections with the other Bi-1 (each intersection has at least one element).
4. If the hashmap value set is empty, go to step (7); otherwise, merge identical μ intersections in the value list of each key in the hashmap.
5. Sort the μ intersections in the value set of the hashmap in descending order of intersection size.
6. Divide the Bi-1 into groups according to the μ intersections, in the order of the sorted list from the previous step. The pseudocode of this step is:

  While (the sorted list is not empty) {
    Let σ be the first μ intersection in the sorted list
    Let Partn be a temporary partition; select into it all the Bi-1 that appear in σ
    Divide the backup processes in Partn into min(size of σ, size of Partn) groups having approximately equal incremental load, by assigning each backup process, in descending order of load difference, to the group with the smallest total load difference
    Remove from the sorted list all intersections that include one of the Bi-1 in Partn (this can be done quickly using the previously constructed hashmap)
  }

7. After the previous steps, the remaining ungrouped Bi-1s have pairwise empty μ intersections; each of them forms its own group.

Second, do the following for the backup groups generated above:

Sort all the backup groups in descending order of actual load.
Sort the N nodes in ascending order of current load.
Allocate each backup group to the node with the minimum load. However, if that node does not belong to the μ intersection of the backup processes in the group, or if a backup group already allocated to that node comes from the same Partn as the group being allocated, choose the node with the next-to-the-minimum load.

The remaining backup processes are allocated in the same way as stage-1: Stage-2 allocates the Bi-2, …, Stage-R allocates the Bi-r.
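The grouping by μ intersections can be approximated as follows; this is a simplified sketch under our own naming (`mus` maps a backup id to its μ set, and only pairwise intersections are considered), not the paper's exact procedure:

```python
def group_by_mu(mus):
    """Partition backups so that those sharing the largest μ intersections
    are grouped first; leftovers become singleton groups (sketch)."""
    ids = list(mus)
    pairs = []
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            inter = mus[ids[a]] & mus[ids[b]]
            if inter:                                  # non-empty intersection
                pairs.append((inter, {ids[a], ids[b]}))
    pairs.sort(key=lambda p: len(p[0]), reverse=True)  # biggest μ overlap first
    grouped, partitions = set(), []
    for inter, members in pairs:
        members = members - grouped                    # skip already-grouped
        if len(members) > 1:
            partitions.append(sorted(members))
            grouped |= members
    partitions += [[i] for i in ids if i not in grouped]
    return partitions
```

Backups whose μ sets overlap heavily (and would therefore start together) end up in the same partition, matching the intent of Lemma 2.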

The same example described previously for the two-stage algorithm is used here. Since each primary process has two backup replicas, two stages are used to allocate the backup process replicas.

After the primary processes have been allocated, RSA allocates all the Bi-1 to the nodes according to their μ intersections. Table 5 shows the grouping results for the μ intersection {2, 3, 4} during stage-1. Table 6 shows the allocation result of stage-1.

Table 5: Grouping backup processes in stage one

Table 6: Allocation results in stage one

Table 7: Grouping backup processes in stage two

Table 8: Final results of r-stage allocation

In stage-2, the Bi-2 are grouped as shown in Table 7. After allocating each group to the nodes, RSA obtains the final result shown in Table 8.

Let us compare the load distribution in Table 4 (two-stage allocation results) with Table 8 (RSA results). When Node 1 fails, the preferential backup processes (B1-1, B6-1, B15-1, B16-1, B20-1) are started only on Nodes 2 and 4 in Table 4, while in Table 8 these backup processes are started on all the remaining available nodes; hence φ is smaller in Table 8. Clearly, RSA balances the load better after faults occur.

Algorithm complexity:

Primary process allocation: A sorting algorithm with running time O(M lg M) is used to sort the M primary processes. Allocating M primary processes to N nodes requires O(M lg N) time.
Backup process allocation: Let us first consider the time complexity of one stage.

Sorting M backup processes takes O(M lg M) time.
Selecting every pair of backup processes requires M(M−1)/2 steps and computing each μ intersection takes at most 2N time; thus, the hashmap construction requires O(M²N) time. Merging and sorting its value set in descending order of intersection size requires O(M² lg M).

The worst case for step 5 and step 6 is that every two backup processes form a separate Partn, which requires O(M) time.

There are at most M groups generated by the previous steps. Allocating these groups to N nodes requires O(M lg N) time.

So the worst-case time complexity of the backup process allocation in one stage is O(M lg M + M²N + M² lg M + M + M lg N).

The above steps are repeated over R stages, which together require O(R(M lg M + M²N + M² lg M + M + M lg N)).

Hence, adding the primary process allocation and the R stages of backup process allocation, the total time complexity of the algorithm is:

O(M lg M + M lg N + R(M lg M + M²N + M² lg M + M + M lg N))    (8)

Assuming that the number of processes is much larger than the number of nodes (M > N²), the execution time is bounded by O(M²R lg M).

Comparing Eq. 8 with the time complexity of the two-stage allocation algorithm, O(NM lg NM + N³ lg N), we can see that the time overhead of RSA is acceptable; it comes from using more allocation stages (R stages instead of two), the hashmap construction and the sorting of its value set.

PERFORMANCE COMPARISON

Here, we compare the load distribution performance of RSA with that of the two-stage algorithm using simulations.

The environment parameters used in the simulation are as follows. To keep the total load of each node below 100%, the load of each primary process is chosen randomly in the range 1 to 100(N-R)/M based on a uniform distribution. The loads of the backup processes are chosen randomly between 5 and 10% of the loads of their primaries, also based on a uniform distribution.
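This load generation can be reproduced as follows (the function name and the seed are ours; a sketch of the stated set-up, not the authors' simulator):

```python
import random

def gen_loads(M, N, R, seed=0):
    """Primary loads uniform in [1, 100(N-R)/M]; backup loads uniform in
    5-10% of the corresponding primary load (simulation set-up sketch)."""
    rng = random.Random(seed)
    hi = 100 * (N - R) / M
    primaries = [rng.uniform(1, hi) for _ in range(M)]
    backups = [p * rng.uniform(0.05, 0.10) for p in primaries]
    return primaries, backups
```

With M = 400, N = 8 and R = 2, each primary load falls in [1, 1.5], so the total primary load stays safely below the aggregate node capacity.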

Figure 2 shows the simulation results on how φ in Eq. 7 is affected by R in the two-stage allocation algorithm and in RSA. The number of nodes (N) is assumed to be eight. The Y-axis represents the value of φ while the X-axis represents R (the number of backup replicas). In the two-stage algorithm, φ increases dramatically as R increases, especially when R reaches N-1, because its per-node grouping cannot balance the load increments well. RSA minimizes φ and hence has much better scalability and load distribution performance than the two-stage allocation algorithm.

Assuming the number of processes is 400 and the number of nodes is eight, Fig. 3 shows the maximum and minimum load using the two-stage algorithm and RSA after one node fails.

Fig. 2: φ affected by R (No. of backup process)

Fig. 3: Maximum and minimum load when one fault occurs

Fig. 4: Maximum and minimum load when two faults occur

The X-axis represents R while the Y-axis represents CPU load. The difference between the maximum and minimum load in RSA is smaller than in the two-stage algorithm, especially as R increases. Figure 4 shows the simulation results after two nodes fail, based on another randomly generated test data set with the same parameters as Fig. 3.

CONCLUSION

In this study, we considered static process allocation for load distribution in primary-backup based fault tolerant systems.

In the primary-backup fault tolerant model, only the primary process receives the invocations from the client process and sends the responses back; hence the load of the primary process is much higher than that of its backup processes. When a fault occurs, one of the backup processes is elected to take over the role of the primary process, so the load of this elected backup process varies before and after the occurrence of a fault. Previous research shows that finding a static process allocation that balances the system load before as well as after faults occur is an NP-hard problem.

The main contribution of this study is the presentation and analysis of a new heuristic static load-balancing allocation algorithm for replicated R-Resilient processes in primary-backup based fault tolerant systems. The proposed algorithm has better scalability and load distribution performance than the previous allocation algorithms in this area. In this study, we assume there is no difference between nodes; we are currently working on extending the algorithm to heterogeneous distributed systems. We also plan to study process allocation in dynamic situations in the future.

REFERENCES

  • Bannister, J.A. and K.S. Trivedi, 1983. Task allocation in fault-tolerant distributed systems. Acta Inform. J., 20: 261-281.
    CrossRef    Direct Link    


  • Garcia-Molina, H., 1982. Elections in a distributed computing system. IEEE Trans. Comput. J., 31: 48-59.
    CrossRef    Direct Link    


  • Guo, H., Z.G. Wang and J.L. Zhou, 2005. Load balancing based process scheduling with fault-tolerance in heterogeneous distributed system. Chin. J. Comput., 28: 1807-1816.
    Direct Link    


  • Kim, J., H. Lee and S. Lee, 1995. Process allocation for load distribution in fault-tolerant multicomputers. Proceedings of the 25th International Symposium on Fault-Tolerant Computing, June 27-30, 1995, IEEE Computer Society, Washington, DC., USA., pp: 174-183.


  • Kim, J., H. Lee and S. Lee, 1997. Replicated process allocation for load distribution in fault-tolerant multicomputers. IEEE Trans. Comput. J., 46: 499-505.
    CrossRef    Direct Link    


  • Lee, H., J. Kim and S.J. Hong, 1999. Evaluation of two load-balancing primary-backup process allocation schemes. IEICE Trans. Inform. Syst. J., E82-D: 1535-1544.
    Direct Link    


  • Liu, A.F., Z.G. Chen and G.P. Long, 2005. Research on fault tolerant scheduling algorithms of web cluster based on probability. Wuhan Univ. J. Natural Sci. J., 10: 70-74.
    CrossRef    Direct Link    


  • Mullender, S.J., 1993. Distributed Systems. ACM Press Publishing.


  • Powell, D., 1994. Distributed fault tolerance: Lessons from delta-4. IEEE Micro, 14: 36-47.
    CrossRef    Direct Link    


  • Singh, S. and J.F. Kurose, 1994. Electing good leaders. J. Parallel Distrib. Comput., 21: 184-201.
    CrossRef    


  • Stoller, S.D., 2000. Leader election in asynchronous distributed systems. IEEE Trans. Comput. J., 49: 283-284.
    CrossRef    Direct Link    


  • Zhang, X.N., D. Zagorodnov and M. Hiltunen, 2004. Fault-tolerant grid services using primary-backup: Feasibility and performance. Proceedings of the International Conference Cluster Computing, September 20-23, 2004, IEEE Computer Society, Washington, DC., USA., pp: 105-114.
