HOME JOURNALS CONTACT

Information Technology Journal

Year: 2004 | Volume: 3 | Issue: 3 | Page No.: 275-282
DOI: 10.3923/itj.2004.275.282
Detecting the Faulty Communication Links under the Loose Coupled Distributed System
K.Q. Yan and S.C. Wang

Abstract: Our previous study solved the Byzantine Agreement (BA) problem in the presence of mixed faults on processors and links in the general network by proposed protocol GPBA. After that, they also solve the Fault Diagnosis Agreement (FDA) problem with mixed faults on processor by proposed protocol FDAMIX. Nevertheless, the protocol FDAMIX is designed for the FDA problem with faulty processors only. Therefore, the FDA problem with faulty links remains to be solved. That is, when the agreement is achieved by the GPBA, then executes the protocol FDAMIX can let the general network without faulty processors, but the faulty links are still existed in the general network. In this study, the proposed protocol FDAML can detect/locate the faulty links to solve the FDA problem and reconfigure the unstable general network be a stable general network.

Fulltext PDF Fulltext HTML

How to cite this article
K.Q. Yan and S.C. Wang, 2004. Detecting the Faulty Communication Links under the Loose Coupled Distributed System. Information Technology Journal, 3: 275-282.

Keywords: distributed system, Fault diagnosis agreement, fault tolerance, mixed fault model and loose coupled

INTRODUCTION

A loose coupled distributed system is a collection of processors that do not share memory or a clock and each processor can communicate with each other processors through communication links in the un-fully connected network[1]. To cope the influences from faulty components, a distributed system must provide a mechanism to allow all fault-free processors in the distributed system to reach a common agreement[2]. Such an agreement problem was also called as the Byzantine Agreement (BA) problem[3-7]. Siu et al.[8] solve the BA problem in the presence of mixed faults on processors and links in the general network by protocol GPBA. The general network is that the network topology may not be fully connected and processors and links can both be subjected to different types of failures (dormant fault and arbitrary fault) simultaneously[9-11]. After reaching a common agreement by the protocol GPBA, each fault-free processor in the network can cope with the influences from faulty components.

In order to provide more stable environment, reaching a common agreement is not enough, so we must visit another related problem which called as the Fault Diagnosis Agreement (FDA) problem[12-14]. The FDA is used to detect/locate the faulty components in the network. Therefore, Siu et al.[12] proposed the protocol FDAMIX to solve the FDA problem with mixed faults on processors in the general network. That is, the protocol FDAMIX can detect/locate the dormant and arbitrary faulty processors in the general network.

However, the faulty components in the network can be classified into two catalogues; they are the processors and communication links[6,8]. The FDAMIX can solve the FDA problem with mixed faults on processors. However, the FDA problem with mixed faults on links in the general network remains to be solved. Therefore, this study, focused the FDA problem with mixed faults on links in the general network and the proposed protocol which called fault diagnosis agreement with mixed faults on links (FDAML) can detect/locate the maximum number of dormant faulty links and arbitrary faulty links.

The conditions for FDA problem: The FDA problem is considered in a synchronous network, so the processing time and communication time of each fault-free components are finite[15]. Moreover, the symptoms of a faulty communication link can be divided into two types. They are dormant fault (include crash and stuck-at) and arbitrary fault. In the synchronous system, each fault-free processor can detect the messages from dormant faulty communication link, if the protocol appropriately encodes a transmitted message by Manchester code[16] before transmission. Because, in the Manchester encoding, logic 0 is indicated by a 0 to 1 (upward transition at bit centre) transition at the centre of the bit and logic 1 is indicated by a 1 to 0 (downward transition at bit centre) transition at the centre of the bit[16]. Due to a crash fault happens when a component is broken, so the component would not send any signal to the receiver processor. A stuck-at fault happens when the signal from a certain communication link is always constant. Therefore, a fault-free processor can detect the crash fault and stuck-at fault if the protocol encodes a transmitted message by Manchester code before transmission. However, the message from the arbitrary faulty communication link is arbitrary and unrestricted, so it is the most damaging failure types of all and causes the worst problem.

The assumptions and parameters of our protocols

The underlying network is synchronous.
Let Þ be the set of all processors in the network and .Þ.= n, where, n is the number of processors in the general network and n.4.
Each processor in the network can be identified uniquely.
Each processor in the network are fault-free, this can be achieved by using the FDA protocol FDAMIX[16].
The communication links in the network are assumed fallible.
A processor that transmits messages is called a sender processor.
Each processor in the network has the same initial value; this can be achieved by using the BA protocol GPBA[6].
The proposed protocol encodes a transmitted message by Manchester code before transmission.
Let a be the maximum number of arbitrary faulty communication links allowed.
Let d be the maximum number of dormant faulty communication links allowed.
Let c be the connectivity of the general network.
A processor does not know the faulty status of communication links, but the message(s) received through dormant faulty communication links can be detected.

The constraints: Since the FDAML is based on the BA protocol GPBA and FDA protocol FDAMIX. Therefore, the constraints of the FDAML are also based on the GPBA and FDA. The proposed FDAML can solve the FDA problem in the general network if c > 2a+d.

The proposed protocol: In this section, the proposed FDAML is used to detect/locate the dormant faulty communication link and arbitrary link in the general network. There are three phases in the FDAML; there are the message exchange phase, fault diagnosis phase and reconfiguration phase. The message exchange phase is used to collect the messages received from all processors in the network. The fault diagnosis phase is used to diagnosis the messages received in the message exchange phase to find out the faulty communication links. The reconfiguration phase is used to reconfiguration the network by eliminating the faulty communication links logically. The definition of FDAML is shown in Fig. 1.

The message exchange phase: In the message exchange phase, the FDAML transmits the message(s) by using the concept of virtual channel; the virtual channel can let an un-fully connected network work just like a fully connected network. Because, each processor in the network has the common knowledge of the graphic information of the entire general network and there are at least c (c>2a+ d) node-disjoint paths to each processor. An example of virtual channels between the processor P1 and processor P3 is in Fig. 2. Figure 2a shows a general network. The connection state of the sender processor P1 and connection state of the destination processor P3 is shown in Fig. 2b. The node-disjoint paths from sender processor P1 to destination processor P3 is shown in Fig. 2c.

In the first round of message exchange phase, each destination processor Pj collects other processors’ initial value to construct the MATj. According to the MATj, processor Pj can find out the dormant faulty paths by examining the value λ in the MATj and sets the d_Pathj = d_Pathj .{dormant faulty path}. The processor Pj also can find out the arbitrary faulty path if the message in the MATj is the non-agreement value and non-λ.value and sets the a_Pathj = a_Pathj. {arbitrary faulty path}, where, 1 ≤ j ≤ n.

In the second round of message exchange phase, each processor Pi (for 1≤ i ≤ n) transmits its d_Pathi and a_Pathi to all processors through c node-disjoint paths. Then each destination processor Pj can take the majority message from c paths to get the d_Pathi and m_Pathi. Then each destination processor Pj constructs the same D_Path = d_Path1 . d_Path2 . …. d_Pathn-1 . d_Pathn and A_Path = a_Path1 . a_Path2 . …. a_Pathn-1 . a_Pathn.

The fault diagnosis phase: In the fault diagnosis phase, each processor can detect/locate the dormant faulty links by examining the D_Paths to find out the paths which only contain one link and set the D_Links = D_Links . {dormant faulty links}.

Fig. 1: The proposed protocol FDAML

Each processor also can detect/locate the arbitrary link by examining the A_Paths to find out the links which appear times is the top (c-.D_Links.-1)/2 of the A_Paths and set the A_Links = A_Links . {arbitrary faulty links}. Because, there are at most (c-.D_Links.-1)/2 arbitrary faulty links in the network if c>2a +d. This treatment can ensure that all the arbitrary faulty communication links are detected.

The reconfiguration phase: Each processor uses the results of D_Links and A_Links from the fault diagnosis phase to reconfigure the network by isolating the faulty communication links logically. After the reconfiguration, the distributed system can be more stable and the performance and integrity of the network can be guaranteed.

An example of executing FDAML: In this section, we give an example of executing FDAML. In Fig. 3a, there is a network with six processors and the connectivity c of the network is four. There is a dormant faulty link between processor P3 and processor P4 (Link34) and an arbitrary faulty link between Processor P1 and P3 (Link13). The initial value of each processor is assumed to be the same (This can be achieved by the protocol GPBA).

Fig. 2: An example of virtual channel between processor P1 and processor P3

In this scenario, we assume that the initial value of each processor is “1” (Fig. 3b).

The message exchange phase: In the first round of message exchange phase, each processor Pi (for 1≤ i ≤ n) in the network transmits its initial value to all other processors through c node-disjoint paths, then each destination processor Pj can construct the matrix MATj by the messages received from other processors as shown in Fig. 3c, e, g, i, k and m. Each processor Pj examines the values in the MATj to find out the dormant faulty paths and arbitrary faulty paths. And sets the d_Pathj = d_Pathj .{dormant faulty paths} and a_Pathj = a_Pathj .{arbitrary faulty paths}. In Fig. 3d, f, h, j, l and n, shows the d_Pathj and m_Pathj of each destination processor Pj.

In the second round of message exchange phase, each processor Pi (for 1≤ i ≤ n) exchanges its d_Pathi and a_Pathi to all processors through c node-disjoint paths. Then, each processor Pj can construct the same D_Paths and A_Paths. Example of D_Paths and A_Paths are shown in Fig. 3o.

The fault diagnosis phase: Each processor examines the D_Paths in Fig. 3o to find out the Link34 is in dormant fault then set the D_Links={Link34}. There are at most 1((c-.D_Links.-1)/2 = (4-1-1)/2) arbitrary faulty link in the network for c>2a +d. Therefore, each processor examines the A_Paths to find out the link, which appear times is the top 1 at the A_Paths. So set the A_Links={Link13}.

The reconfiguration phase: According to the D_Links and A_Links, to eliminate the Link34 and Link13 and then can reconfigure the network logically as shown in Fig. 3p. After reconfiguration, the performance and integrity of the network can be guaranteed.

The correctness and complexity of DFAML: The following lemmas and theorems are used to prove the correctness of DFAML.

Lemma 1: Each processor in the general network is fault-free and the initial value of each processor is the same.

Proof: Executing the BA protocol GPBA by Siu et al.[8] can let each fault-free processor has the same initial value and after executing the FDA protocol FDAMIX by Siu et al.[12] can detect/locate the faulty processors in the general network.

Theorem 1: Each processor can detect/locate the dormant faulty link in the general network.

Proof: Due to the message received from dormant faulty link can be detected if the protocol encodes a transmitted message by Manchester code before transmission.

Fig. 3: An example of executing FDAML

If the message received from the path is λ, then we can ensure that there is at least one dormant faulty link in the path, we call such path as the dormant faulty path. In DFAML, D_Paths is used to store the dormant faulty path. Therefore, we can examine the D_Paths to find out the paths, which only contain one link. In case of only one link in the path, then the link is a dormant faulty link.

Lemma 2: If the path between the sender processor Pi and destination processor Pj is fault-free, then the received messages should be the same as the initial value of each processor by GPBA.

Proof: If the path between sender processor Pi and destination processor Pj is fault-free, then the message sent though the path is transmitted correctly. That is, if the sender processor Pi sends its initial value vi to the destination processor Pj, then the destination processor should receive the value vi correctly. According to the initial value of each processor is the same, so the received value vi should be the same as the processor Pj’s initial value.

Lemma 3: Each processor can detect the arbitrary faulty path in the general network.

Proof: If the message received from the sender processor Pi is not the initial value of each processor by GPBA and also not the value λ, then the path between the sender processor Pi and destination processor Pj is in arbitrary.

Lemma 4: There are at most (c-.D_Links.-1)/2 arbitrary faulty links in the general network if c>2a +d.

Proof: By Theorem 1, the dormant faulty link can be detected and according to the constraint c>2a +d , we can ensure that there are at most (c-.D_Links.-1)/2 arbitrary faulty links in the network, where, d =.D_Links..

Theorem 2: Each processor can detect/locate the arbitrary faulty links in the general network.

Proof: By Lemma 3, each processor can detect the arbitrary faulty paths in the general network. In DFAML, A_Paths is used to store the arbitrary faulty paths. According to Lemma 4, there are at most (c-.D_Links.-1)/2 arbitrary faulty links in the network if c>2a +d. So that, examines the A_Paths to find out the arbitrary links which appear times is the top (c-.D_Links.-1)/2 at the A_Paths.

Theorem 3: The FDA protocol DFAML can detect/locate the faulty links in a general network.

Proof: By Theorem 1 and Theorem 2, this theorem is proved.

CONCLUSION

By using the BA protocol GPBA[8] and FDA protocol FDAMIX[16] by Siu et al.[8] can solve the BA problem on processors and communication links with mixed faults and solve the FDA problem on processors with mixed faults. Then using our FDA protocol FDAML can solve the FDA problem on communication links with mixed faults. That is, after executing the protocol GPBA, FDAMIX and FDAML can reconfigure the unreliable general network to be a stable general network without faulty components. In addition, the proposed FDAML can detect/locate the maximum number of faulty communication links with mixed faults in a general network.

REFERENCES

  • Silberschatz, A., P. Galvin and G. Gagne, 2000. Applied Operating System Concepts. John Wiley and Sons Inc., New York, pp: 469-480


  • Silberschatz, A., P. Galvin and G. Gagne, 2002. Operating Systems Concepts. 6th. Edn., John Wiley and Sons Inc., New York


  • Dolev, D., 1982. The byzantine generals strike again. J. Algorithms, 1: 14-30.


  • Lamport, L., R. Shostak and M. Pease, 1982. The byzantine generals problem. ACM Trans. Programm Languages Syst., 4: 382-401.
    Direct Link    


  • Pease, M., R. Shostak and L. Lamport, 1980. Reaching agreement in the presence of faults. ACM J., 2: 228-234.
    Direct Link    


  • Wang, S.C., Y.H. Chin and K.Q. Yan, 1995. Byzantine agreement in a generalized connected network. IEEE Trans. Parallel Distributed Syst., 6: 420-427.
    Direct Link    


  • Yan, K.Q., Y.H. Chin and S.C. Wang, 1992. Optimal agreement protocol in malicious faulty processors and faulty links. IEEE Trans. Data Knowledge Eng., 4: 266-280.
    Direct Link    


  • Siu, H.S., Y.H. Chin and W.P. Yang, 1998. Byzantine agreement in the presence of mixed faults on processors and links. IEEE Trans. Parallel Distribution Syst., 9: 335-345.
    Direct Link    


  • Meyer, F.J. and D.K. Pradhan, 1991. Consensus with dual failure modes. IEEE Trans. Parallel Distributed Syst., 2: 214-222.
    CrossRef    


  • Siu, H.S., Y.H. Chin and W.P. Yang, 1998. Reaching strong consensus in the presence of mixed failure types. J. Inform. Sci., 108: 157-180.
    Direct Link    


  • Siu, H.S., Y.H. Chin and W.P. Yang, 1996. A note on consensus on dual failure modes. IEEE Trans. Parallel Distributed Syst., 7: 225-230.
    Direct Link    


  • Hsiao, H.S., Y.H. Chin and W.P. Yang, 2000. Reaching fault diagnosis agreement under a hybrid fault model. IEEE Trans. Comput., 49: 980-986.
    Direct Link    


  • Ramarao, K.V.S. and J.C. Adams 1988. On the diagnosis of byzantine faults. Proceedings of the 7th Symposium on Reliable Distributed Systems, Oct. 10-12, Columbus, OH. USA., pp: 144-153.


  • Wang, S.C., Y.H. Chin and K.Q. Yan, 1990. Reaching a fault detection agreement. Proc. Int. Conf. Parallel Process., 2: 251-258.


  • Fischer, M., M. Paterson and N. Lynch, 1985. Impossibility of distributed consensus with one faulty process. ACM J., 32: 374-382.
    Direct Link    


  • Halsall, F., 1995. Data Links, Computer Networks and Open System. 4th Edn., Addison Wesley Publishers Ltd., London, UK., pp: 112-125


  • Deo, N., 1974. Graph Theory with Applications to Engineering and Computer Science. Prentice-Hall, Englewood Cliffs, New Jersey, USA

  • © Science Alert. All Rights Reserved