Designing an Adaptive Fault Tolerance Structure in Distributed Real Time Systems

Journal of Applied Sciences

Year: 2009 | Volume: 9 | Issue: 6 | Page No.: 1114-1120
DOI: 10.3923/jas.2009.1114.1120

N. Mosharraf and M.R. Khayyambashi

Abstract: This study reviews the Fault Tolerant CORBA (FT-CORBA) structure, which supports fault tolerant programs, together with the parameters that most affect its performance and its adaptation to real time distributed systems, notably the replication style and the number of replicas. Based on these specifications, a structure adapted to real time systems is built that performs better than the FT-CORBA structure. Finally, the implementation of this structure, the determination of the number of replicas and of the object replication style, and the significance of the related parameters are investigated.


How to cite this article

N. Mosharraf and M.R. Khayyambashi, 2009. Designing an Adaptive Fault Tolerance Structure in Distributed Real Time Systems. Journal of Applied Sciences, 9: 1114-1120.

Keywords: replication, trade-offs, fault tolerant, Middleware and real time

INTRODUCTION

Middleware such as CORBA (Common Object Request Broker Architecture) supports fault tolerant structures and real time distributed systems through the FT-CORBA and RT-CORBA structures, but it does not support programs that require real time and fault tolerance specifications at the same time (Narasimhan et al., 2005). Real time behavior and reliability are system-level properties that are difficult to combine on a distributed system because they often impose conflicting requirements on it (Narasimhan et al., 2000, 2005; Balasubramanian et al., 2007).

A real time system must have bounded request processing times and predictable, specified task deadlines (Narasimhan et al., 2002, 2005; Kim, 2001). Its behavior is based on actual time: the frequency of client invocations, the relative priorities of the various invocations, the worst case execution times of invocations at the server and the availability and allocation of resources for the application’s execution must all be known ahead of run time. With this information, the real time infrastructure computes a schedule ahead of run time (Narasimhan et al., 2002, 2005; Kim, 1998). Predictability is therefore the major specification of a real time system, whereas in a reliable, fault tolerant system a program must recover and continue in the presence of unanticipated events such as faults (Sahoo and Ekka, 2007). The two kinds of system thus differ fundamentally, and when both specifications are required of one system a trade-off must be made between the sensitive specifications of both structures, so that the real time system continues to operate correctly while still detecting and recovering from faults whenever they occur (Narasimhan et al., 2005, 2002; Felber and Narasimhan, 2004).

Fault tolerant structures are usually built by replicating objects on the distributed system. When a fault occurs in the current object, a replica is used in its place and the system recovers by generating a new replica of the object, provided that consistency between the old and the current object is maintained.

Thus, when a fault tolerant structure is designed, the replication style and the number of replicas on the distributed system are its main specifications, especially when the distributed system is real time.

Fault tolerant structures created through FT-CORBA are fundamentally static, i.e., the specifications of the structure, including the object replication style, cannot be altered at execution time. A mechanism for dynamically changing the object replication style and the number of replicas is a great advantage: for programs with bounded time, it allows the structure to be changed according to the current state of the system whenever a fault occurs.

Therefore, considering the features stated above, this study examines the determination of the number of replicas and of the object replication style in a real time distributed system, and their effect on the performance of adaptive fault tolerance structures under the restrictions described, as a guideline for improving performance and reliability.

FT-CORBA

The FT-CORBA infrastructure addresses four main concerns: object replication, management and distribution of replicas, storage and recovery of object state and consistency of replicas. The FT-CORBA architecture has three basic modules: the Logging and Recovery Service (LRS), the Replication Management Service (RMS) and the Fault Management Service (FMS), shown separately in Fig. 1 (Narasimhan et al., 2005; Object Management Group, 2004). The logging and recovery service is located underneath the ORB, and each application object inherits the Checkpointable interface so that its state can be retrieved and assigned for the purpose of recovery. The get_state() function saves the current state of the object, and when a failure or fault requires recovery, the set_state() function assigns that state to another replica object, which is then activated. The Replication Management Service (RMS) consists of three services. The most important of them is the property manager service, which receives the specifications of the real time system and determines the main properties of the fault tolerant structure accordingly. The main properties that can be set are the replication style, minimum number of replicas, consistency style, membership style and checkpoint interval, among others (Narasimhan et al., 2005; Object Management Group, 2004). The generic factory service creates new objects: when a fault is detected, it receives a message from the fault manager and, in cooperation with the Object Group Manager (OGM), creates and deletes the members of an object group.
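The role of get_state() and set_state() in recovery can be illustrated with a minimal Python sketch. It is not the CORBA IDL interface itself: the class names Checkpointable and Counter and the dictionary-based state are assumptions made for illustration only.

```python
import copy


class Checkpointable:
    """Hypothetical Python stand-in for the FT-CORBA Checkpointable interface."""

    def get_state(self):
        raise NotImplementedError

    def set_state(self, state):
        raise NotImplementedError


class Counter(Checkpointable):
    """Toy application object whose whole state is one dictionary."""

    def __init__(self):
        self._state = {"count": 0}

    def increment(self):
        self._state["count"] += 1
        return self._state["count"]

    def get_state(self):
        # Return a copy so that later operations do not alter the checkpoint.
        return copy.deepcopy(self._state)

    def set_state(self, state):
        self._state = copy.deepcopy(state)


primary = Counter()
primary.increment()
primary.increment()
checkpoint = primary.get_state()   # what a recovery service would log

backup = Counter()                 # replica activated after a failure
backup.set_state(checkpoint)
resumed = backup.increment()       # continues from the logged state
```

The backup resumes counting from the checkpointed value rather than from zero, which is the property the recovery service relies on.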

The fault manager consists of three services. The fault detector periodically checks for faults by sending is_alive() messages to the objects of the program. It can detect faults in one or more objects of a process, or in one or more processes of a host, and it informs the Fault Notifier Service (FNS) of every fault it detects (Narasimhan et al., 2005; Object Management Group, 2004).
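A single polling round of such a fault detector might look like the following Python sketch. The classes FaultNotifier and Replica and the function poll_once are hypothetical names chosen for this illustration, and the sketch assumes that an object which raises an exception when probed counts as faulty.

```python
class FaultNotifier:
    """Collects fault reports (hypothetical stand-in for the notifier service)."""

    def __init__(self):
        self.reports = []

    def push_fault(self, object_id):
        self.reports.append(object_id)


class Replica:
    def __init__(self, object_id, alive=True):
        self.object_id = object_id
        self.alive = alive

    def is_alive(self):
        return self.alive


def poll_once(replicas, notifier):
    """Send one round of is_alive() probes and report every object that fails."""
    for replica in replicas:
        try:
            ok = replica.is_alive()
        except Exception:
            ok = False  # an unreachable object is treated as faulty
        if not ok:
            notifier.push_fault(replica.object_id)


notifier = FaultNotifier()
group = [Replica("r1"), Replica("r2", alive=False), Replica("r3")]
poll_once(group, notifier)
```

In a running system this round would be repeated at a fixed monitoring interval; the interval is left out here to keep the sketch deterministic.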


Fig. 1:	FT-CORBA infrastructure

The fault notifier service receives fault reports from the fault detector and the fault analyzer. Based on the reports passed on by the fault notifier, the fault analyzer draws the conclusions needed to predict further faults and sends them back to the fault notifier. As described above, the replication manager service contains a central service of the fault tolerant structure, the property manager, which specifies the number of replicas, replication style, membership style, consistency style and checkpoint interval. Using the specifications sent by the real time system, appropriate decisions can be made for the fault tolerant structure and adjusted to the current state of the real time system. For example, depending on the response time that the real time system requires, the replication style is chosen so that the fault tolerant structure can answer requests in the shortest time. With these features, the FT-CORBA structure can therefore be adjusted to a real time system according to its given specifications and restrictions.

REPLICATION STYLES AND DETERMINING THE NUMBER OF REPLICAS

A major feature of the FT-CORBA structure, and of the replication manager service in particular, is the choice of replication style. The two common approaches to maintaining replicas are active and passive replication. In active replication, shown in Fig. 2 (Fraga et al., 2003), every replica performs every requested operation, so that when a fault occurs in one object the remaining objects are ready to answer the request. All objects therefore have the same state, and their consistency is maintained at every moment. As a result, each command is executed by several objects, each of which responds to the request.


Fig. 2:	(a) Active replication and (b) Passive replication

Active replication may lead to duplicate operations because every replica of an actively replicated object sends the invocation or response, so duplicate invocations and responses must be suppressed (Fraga et al., 2003). In passive replication, only one object, designated the primary, processes the operations and returns the reply to the sender of the invocation. The remaining passive replicas, known as backups, are preloaded into memory and synchronized periodically with the primary. If the primary replica fails, one of the backup replicas is chosen as the new primary and must be restored to the state that the old primary held. The state of the primary must therefore be captured and stored continuously or periodically so that it is available for recovery. Under normal operation, the cost of checkpointing the primary replica’s state to a log must be taken into account, and it can become quite expensive if the state of the object is large. In return, passive replication reduces network bandwidth use and requires only one replica to be operational, so it conserves processing power, i.e., the operation does not have to be performed by each of the replicas (Narasimhan et al., 2002). When recovery time and response time are tightly bounded, the active style should be used, because all replicas of an actively replicated object perform every operation and, even if one of them fails, the other operational replicas can continue processing (Gokhale et al., 2004).
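The difference between the two styles can be made concrete with a small Python sketch. All names (Replica, invoke_active, invoke_passive) are hypothetical, and two simplifications are assumed: duplicate suppression is reduced to keeping the first reply, and the passive backups are synchronized after every operation rather than periodically.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.total = 0
        self.alive = True

    def add(self, amount):
        if not self.alive:
            raise RuntimeError(f"{self.name} is down")
        self.total += amount
        return self.total


def invoke_active(replicas, amount):
    """Active style: every live replica executes; only the first reply is kept."""
    replies = []
    for replica in replicas:
        try:
            replies.append(replica.add(amount))
        except RuntimeError:
            continue  # a failed replica is masked by the others
    if not replies:
        raise RuntimeError("all replicas failed")
    return replies[0]  # the remaining duplicate replies are suppressed


def invoke_passive(primary, backups, amount):
    """Passive style: only the primary executes, then backups are synchronized."""
    result = primary.add(amount)
    for backup in backups:
        backup.total = primary.total  # checkpoint transfer to each backup
    return result


active_group = [Replica("a1"), Replica("a2"), Replica("a3")]
invoke_active(active_group, 5)
active_group[0].alive = False            # one replica fails
active_result = invoke_active(active_group, 2)

primary, backup = Replica("p"), Replica("b")
passive_result = invoke_passive(primary, [backup], 5)
```

In the active group the failure of one replica needs no recovery step, at the price of executing every operation three times. In the passive pair only the primary executes, but the backup depends on the state transfer.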

Combining these two styles yields a third style, called semi-active (Sahoo and Ekka, 2007). In this style, the replicated objects are arranged in an ordered list and are connected to the primary object. Every message is dispatched to both the primary object and the first object of the list, which execute it, and after the command has been executed the remaining objects are updated to the new state. In this way, there is always one replica consistent with the primary object, and when the primary fails, the first object in the list of replicas is chosen as the new primary. As a result, fault recovery becomes faster and fewer messages are sent over the network than in the active method, so the semi-active method combines advantages of both active and passive replication (Sahoo and Ekka, 2007). Given the advantages and disadvantages of these methods and the restrictions and parameters of the real time system, an appropriate object replication method must be chosen for the fault tolerant structure. Because of the time constraints and the sensitivity of real time systems, the active method is usually selected, since in such systems the time taken to respond to a request is a basic and important property (Sahoo and Ekka, 2007).
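The semi-active style described above can be sketched in Python as follows. The class SemiActiveGroup and its methods are hypothetical, and the sketch assumes that the remaining replicas are updated by copying the primary's log after each message.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.log = []

    def execute(self, message):
        self.log.append(message)


class SemiActiveGroup:
    def __init__(self, primary, followers):
        self.primary = primary
        self.followers = list(followers)  # ordered list of replicas

    def send(self, message):
        # The primary and the first follower both execute the message.
        self.primary.execute(message)
        if self.followers:
            self.followers[0].execute(message)
        # The remaining followers are brought up to date by state transfer.
        for replica in self.followers[1:]:
            replica.log = list(self.primary.log)

    def fail_primary(self):
        # The first follower is already consistent, so it takes over directly.
        self.primary = self.followers.pop(0)


group = SemiActiveGroup(Replica("p"), [Replica("f1"), Replica("f2")])
group.send("set_speed 80")
group.fail_primary()
group.send("set_speed 90")
```

After the failover, the former first follower serves as primary without any state restoration, and the next replica in the list moves up to take its place.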

FT-CORBA builds fault tolerant structures statically. To improve the performance of the FT-CORBA structure, the object replication style and the number of replicas can therefore be examined against the current state of the objects in the structure, in order to select the best style with the fewest replicas. AFT-CCM is one structure that adjusts the object replication style according to the programs being executed on FT-CORBA.

In the AFT-CCM structure, the object replication style is set dynamically according to the number of faults that occur during program execution, so that one major specification of the fault tolerant structure is adjusted optimally to the current state of the system.


Fig. 3:	State machine of the QoS manager in AFT-CCM structure


Fig. 4:	All parts of the cruise control system

In this method, whose state machine for determining the object replication style is shown in Fig. 3 (Lung et al., 2006), the passive method is selected as the replication style if only one error is observed in 15 sec. If two errors occur in the network within 15 sec, the passive method is replaced by the semi-active method, and if the number of faults keeps rising and two further faults occur within 15 sec, the active method is selected. These changes continue in the same way throughout program execution.
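The escalation rule just described can be approximated by the following Python sketch. It is not a reproduction of the state machine in Fig. 3: the class StyleSelector is hypothetical, only escalation is modeled (no return to a weaker style), and it is assumed that each escalation requires two new faults inside one 15 sec window.

```python
from collections import deque

STYLES = ["passive", "semi-active", "active"]
WINDOW_SECONDS = 15.0


class StyleSelector:
    """Escalates the replication style when two faults fall within one window."""

    def __init__(self):
        self.level = 0          # index into STYLES, starting at passive
        self.recent = deque()   # timestamps of faults not yet acted on

    @property
    def style(self):
        return STYLES[self.level]

    def report_fault(self, timestamp):
        self.recent.append(timestamp)
        # Forget faults that are older than the observation window.
        while timestamp - self.recent[0] > WINDOW_SECONDS:
            self.recent.popleft()
        if len(self.recent) >= 2 and self.level < len(STYLES) - 1:
            self.level += 1
            self.recent.clear()  # the next escalation needs two new faults
        return self.style


selector = StyleSelector()
history = [selector.report_fault(t) for t in (0.0, 40.0, 45.0, 50.0, 52.0)]
```

With the fault times used here, the two isolated faults at 0 and 40 sec leave the style passive, the pair at 40 and 45 sec switches it to semi-active, and the pair at 50 and 52 sec switches it to active.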

In this structure, the replication style is thus selected on the basis of the number of faults, which improves the performance and function of the FT-CORBA structure on distributed real time systems. This behavior, shown in Fig. 4, is a result of the AFT-CCM method, which improves the function of the fault tolerant structure compared with earlier structures (Narasimhan et al., 2005; Lung et al., 2006).

ADAPTIVE FAULT TOLERANCE STRUCTURE

Given the specifications of the distributed real time system described above, if the system also requires a fault tolerant structure, an adaptive fault tolerance structure should be created that adjusts itself to the real time system specifications and improves its performance by trading off the specifications of the real time and fault tolerant systems. In addition to the object replication style, the number of replicas is an important parameter that affects the performance of the structure. If the number of replicas is chosen properly, the fault tolerant structure executes well while the increase in the response time of the real time system stays at an acceptable level. The adaptive fault tolerance structure is therefore placed on the real time system so that it derives the fault tolerance specifications from the real time system specifications. The resulting trade-off between the real time and fault tolerance specifications makes it possible to improve the performance and function of the fault tolerant structure under these conditions.

In the structure shown in Fig. 1, the real time specifications can be used to regulate the fault tolerant structure by passing them to the property manager unit as input parameters. If the active replication style is used, the number of object replicas can be increased dynamically, subject to the response time limit of each invocation, until the fault tolerance conditions are met without violating the time limits of the real time system.
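One way to read this rule is as a loop that adds replicas one at a time and stops at whichever limit is reached first. The Python sketch below follows that reading; the function choose_replica_count, its parameters and the linear cost model toy_response_time are assumptions for illustration and do not come from the measurements of this study.

```python
def choose_replica_count(measure_response_time, required_replicas, max_response_time):
    """Grow the group one replica at a time.

    Stops when the fault tolerance requirement is met, or earlier if adding
    another replica would push the response time past the real time bound.
    Returns (replica_count, requirement_met).
    """
    count = 1
    while count < required_replicas:
        if measure_response_time(count + 1) > max_response_time:
            return count, False
        count += 1
    return count, True


def toy_response_time(replicas):
    # Hypothetical cost model: 10 ms base plus 2 ms per replica.
    return 10.0 + 2.0 * replicas


met = choose_replica_count(toy_response_time, required_replicas=3, max_response_time=20.0)
capped = choose_replica_count(toy_response_time, required_replicas=8, max_response_time=20.0)
```

In the first call the requirement of three replicas is reached within the bound. In the second call the loop stops at five replicas, because a sixth would exceed the 20 ms limit of the toy model, and the caller is told that the requirement could not be met.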

During system execution, if an invocation sent to any of the distributed objects takes longer than the maximum response time, this is treated as a fault in the structure. The object concerned is then recovered: the current object is removed from the structure and a new object with the current state is added to the system in place of the faulty one. Hence, the number of replicas always remains optimal, which minimizes the increase in response time that the fault tolerant structure imposes on the real time system.
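The replacement step can be sketched in Python as below. The names Replica, invoke_group and spawn are hypothetical, response times are simulated by a fixed latency value per replica instead of being measured, and the current state is assumed to be taken from the first replica that met the deadline.

```python
class Replica:
    def __init__(self, name, state, latency):
        self.name = name
        self.state = state
        self.latency = latency  # simulated response time in seconds

    def invoke(self, value):
        self.state += value
        return self.state, self.latency


def invoke_group(replicas, value, max_response_time, spawn):
    """Invoke every replica and replace those that answer after the deadline."""
    survivors, replaced = [], []
    for replica in replicas:
        _, elapsed = replica.invoke(value)
        if elapsed > max_response_time:
            replaced.append(replica.name)  # a late reply is treated as a fault
        else:
            survivors.append(replica)
    if not survivors:
        raise RuntimeError("no replica met the deadline")
    current_state = survivors[0].state
    # Each faulty replica is replaced by a new one holding the current state.
    for name in replaced:
        survivors.append(spawn(name + "-new", current_state))
    return survivors, replaced


group = [Replica("r1", 0, 0.02), Replica("r2", 0, 0.30), Replica("r3", 0, 0.03)]
group, replaced = invoke_group(
    group, 5, max_response_time=0.10,
    spawn=lambda name, state: Replica(name, state, 0.02),
)
```

The group size stays at three throughout: the slow replica is dropped and its replacement starts from the same state as the replicas that met the deadline.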

Once the replication style and the number of object replicas have been defined in this way, they are treated as specifications regulated by the real time system and are added to the FT-CORBA structure. An adaptive FT-CORBA structure based on the real time system is thereby created, and the other parts of FT-CORBA that communicate with replication management perform more efficiently.

To evaluate the fault tolerant structure, a cruise control system was taken as the distributed real time system. A cruise control system is a real time system that regulates the engine speed of an automobile, and because response time and the occurrence of faults are important in such systems, it was chosen to test the function of the fault tolerant structure. All parts of the cruise control system are shown in Fig. 4. It consists of six parts: the control system, brake sensor, engine sensor, wheel shaft sensor, throttle accelerator and gear sensor. The user reaches the desired speed by executing the accelerator command, can at any speed execute a command that holds the system at that speed and, if necessary, can decrease the speed by executing the brake function.

Each object in the real time system has its own response time, and each operation is sent to all replicated objects in the fault tolerant system for execution, i.e., the active replication style is used. If the execution time of an invocation exceeds the desired response time, this is treated as a fault in the structure and the response is not accepted. The replication style and number of replicas determined in this way are entered into the FT-CORBA structure through the property manager unit of the replication manager, which adjusts the main specifications and parameters on the basis of the distributed real time system.

The performance and function of the programs were reviewed in different situations, following the method described above for executing the fault tolerant structure on the real time system. Figure 5 shows the case in which the servers are central and the objects are replicated within one system. When the number of replicas exceeds nine, the average response time rises sharply and the number of faults in the structure increases as well. Therefore, if the real time system is to keep an appropriate response time in addition to fault tolerance capability, the objects should have at most nine replicas. This quantity is set in the property manager of the FT-CORBA structure as an input from the real time system, and the fault tolerant structure is controlled on the basis of the performance of the system so that neither the response time of the real time system nor the number of replicas increases unnecessarily.

Figure 6 shows the case in which the objects are replicated on several computers. According to the resulting diagram, up to eleven replicas of each object can be created while respecting the response time restriction and keeping the number of faults at an acceptable level. Comparing Fig. 5 and 6 shows that, in terms of response time, when fewer than nine replicas are used it is better to replicate the objects on a single system, because their responses are received in a shorter time. Figures 5 and 6 thus examine the effect of an increasing number of object replicas on response time and on the adjustment and improved function of the fault tolerant structure.

Figures 7 and 8 deal with the case in which the number of object replicas is fixed and optimal, and examine the effect of an increasing number of clients on the response time of the system.
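The way a replica limit is read off such response time curves can be sketched in Python. The function max_replicas_within_bound is a hypothetical helper, and the numbers in the two dictionaries are made up for illustration; they are not the data of Fig. 5 and 6. The sketch also assumes that response time does not decrease as replicas are added.

```python
def max_replicas_within_bound(measurements, max_response_time):
    """Return the largest replica count whose average response time meets the bound.

    measurements maps replica count -> average response time.
    Returns None when no measured configuration meets the bound.
    """
    acceptable = [n for n, rt in measurements.items() if rt <= max_response_time]
    return max(acceptable) if acceptable else None


# Made-up numbers for illustration only; they are not the data of this study.
central = {7: 40.0, 8: 45.0, 9: 50.0, 10: 120.0, 11: 200.0}
distributed = {7: 55.0, 8: 58.0, 9: 60.0, 10: 63.0, 11: 66.0, 12: 150.0}

central_limit = max_replicas_within_bound(central, 70.0)
distributed_limit = max_replicas_within_bound(distributed, 70.0)
```

The value obtained this way is what would be passed to the property manager as the upper limit on the number of replicas for the given deployment.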


Fig. 5:	Objects replication vs (a) Response time and (b) No. of faults on central servers


Fig. 6:	Objects replication vs (a) Response time and (b) No. of faults on several computers


Fig. 7:	Replication of clients vs (a) Response time and (b) No. of faults on central servers


Fig. 8:	Replication of clients vs (a) Response time and (b) No. of faults on several computers

Table 1:	Number of faults in adaptive fault tolerance system

Figure 7 covers the case in which the servers are central and the number of replicas is 8. As the number of clients increases, the response time increases unacceptably and the number of faults in the structure increases as well; at most eight clients can work on the central servers with an acceptable response time. When the objects are replicated on several servers, however, up to fifteen clients can be served at the same time, as shown in Fig. 8. Comparing Fig. 7 and 8 shows that when the number of object replicas is fixed and optimal and the number of clients grows, it is better to replicate the objects on a distributed system than on central servers, so that the responses of the objects are received in a shorter time. Tables 1 and 2 show all data (number of faults and response time) obtained in the simulation of the adaptive fault tolerance system.

Table 2:	Response time in adaptive fault tolerance system

CONCLUSION

Considering the specifications and restrictions involved in building fault tolerant structures for distributed real time systems, this study reviewed the implementation of an adaptive fault tolerance structure that improves performance and can be adjusted to the specifications of a real time system. Based on the specifications received from the real time system, the object replication style and the number of replicas were studied as effective and sensitive parameters in the function of the fault tolerant structure and were applied to the FT-CORBA structure. The implementation and the results obtained show that this improves the function of the FT-CORBA structure for distributed real time systems and makes the fault tolerant structure adjustable on the basis of the specifications of the distributed real time system.

REFERENCES

Balasubramanian, J., S. Tambe, A. Gokhale, D.C. Schmidt, C.H. Lu and C.H. Gill, 2007. FLARe: A fault tolerance lightweight adaptive real time middleware for distributed real time and embedded system. Proceedings of the 4th International Symposium on Middleware Doctoral, November 26-30, 2007, Newport Beach, California, pp: 320-330.

Fraga, J., F. Siqueira and F. Favarim, 2003. An adaptive fault tolerance component model. Proceedings of the 9th International Workshop on Object Oriented Real Time Dependable System, October 1-3, 2003, Santa Catarina, Brazil, pp: 179-187.

Felber, P. and P. Narasimhan, 2004. Experiences strategies and challenges in building fault tolerance CORBA systems. IEEE Trans. Comput., 53: 497-511.

Gokhale, A.S., B. Natarajan, J.K. Cross and D.C. Schmidt, 2004. Towards real-time fault-tolerant CORBA middleware. Cluster Comput. J., 7: 331-346.

Kim, K.H., 2001. Middleware of real time object based fault tolerance distributed computing systems. Proceedings of the Pacific Rim International Symposium on Dependable Computing, December 17-19, 2001, Washington, DC. USA., pp: 3-8.

Kim, K.H., 1998. ROAFTS: A middleware architecture for real time object oriented adaptive fault tolerance support. Proceedings of the 3rd International Symposium on High-Assurance Systems Engineering, November 13-14, 1998, Washington, DC., USA., pp: 50-57.

Lung, L.C., F. Favarim, G.T. Santos and M. Correia, 2006. An infrastructure for adaptive fault tolerance on FT-CORBA. Proceedings of the 9th International Symposium on Object and Component Oriented Real Time Distributed, April 24-26, 2006, Washington, DC., USA., pp: 504-511.

Narasimhan, P., T.A. Dumitras, A.M. Paulos, S.M. Pertet and C.F. Reverte et al., 2005. MEAD: Support for real-time fault-tolerance CORBA. Concurrency Comput: Practice Exp., 17: 1527-1545.

Narasimhan, P., L.E. Moser and P.M. Melliar-Smith, 2002. Strongly consistent replication and recovery of fault-tolerant CORBA applications. Comput. Syst. Sci. Eng. J., 17: 103-114.

Object Management Group, Fault Tolerant CORBA, 2004. Chapter 25, CORBA v3.0.3, OMG Document formal/04-03-10 edition. http://www.omg.org/cgi-bin/doc?formal/01-09-29.

Sahoo, B. and A.A. Ekka, 2007. Backward fault recovery in real time distributed system of periodic task with timing and precedence constrain. Proceedings of the International Conference on Emerging Trends in High Performance Architecture, Algorithms and Computing, July 12-13, 2007, SSN Institutions, pp: 124-130.
