Detecting the Faulty Communication Links under the Loose Coupled Distributed System

Information Technology Journal

Year: 2004 | Volume: 3 | Issue: 3 | Page No.: 275-282
DOI: 10.3923/itj.2004.275.282

Detecting the Faulty Communication Links under the Loose Coupled Distributed System

K.Q. Yan and S.C. Wang

Abstract: Our previous study solved the Byzantine Agreement (BA) problem in the presence of mixed faults on processors and links in the general network by proposed protocol GPBA. After that, they also solve the Fault Diagnosis Agreement (FDA) problem with mixed faults on processor by proposed protocol FDAMIX. Nevertheless, the protocol FDAMIX is designed for the FDA problem with faulty processors only. Therefore, the FDA problem with faulty links remains to be solved. That is, when the agreement is achieved by the GPBA, then executes the protocol FDAMIX can let the general network without faulty processors, but the faulty links are still existed in the general network. In this study, the proposed protocol FDAML can detect/locate the faulty links to solve the FDA problem and reconfigure the unstable general network be a stable general network.

Fulltext PDF Fulltext HTML

How to cite this article

K.Q. Yan and S.C. Wang, 2004. Detecting the Faulty Communication Links under the Loose Coupled Distributed System. Information Technology Journal, 3: 275-282.

Keywords: distributed system, Fault diagnosis agreement, fault tolerance, mixed fault model and loose coupled

INTRODUCTION

A loose coupled distributed system is a collection of processors that do not share memory or a clock and each processor can communicate with each other processors through communication links in the un-fully connected network^[1]. To cope the influences from faulty components, a distributed system must provide a mechanism to allow all fault-free processors in the distributed system to reach a common agreement^[2]. Such an agreement problem was also called as the Byzantine Agreement (BA) problem^[3-7]. Siu et al.^[8] solve the BA problem in the presence of mixed faults on processors and links in the general network by protocol GPBA. The general network is that the network topology may not be fully connected and processors and links can both be subjected to different types of failures (dormant fault and arbitrary fault) simultaneously^[9-11]. After reaching a common agreement by the protocol GPBA, each fault-free processor in the network can cope with the influences from faulty components.

In order to provide more stable environment, reaching a common agreement is not enough, so we must visit another related problem which called as the Fault Diagnosis Agreement (FDA) problem^[12-14]. The FDA is used to detect/locate the faulty components in the network. Therefore, Siu et al.^[12] proposed the protocol FDAMIX to solve the FDA problem with mixed faults on processors in the general network. That is, the protocol FDAMIX can detect/locate the dormant and arbitrary faulty processors in the general network.

However, the faulty components in the network can be classified into two catalogues; they are the processors and communication links^[6,8]. The FDAMIX can solve the FDA problem with mixed faults on processors. However, the FDA problem with mixed faults on links in the general network remains to be solved. Therefore, this study, focused the FDA problem with mixed faults on links in the general network and the proposed protocol which called fault diagnosis agreement with mixed faults on links (FDAML) can detect/locate the maximum number of dormant faulty links and arbitrary faulty links.

The conditions for FDA problem: The FDA problem is considered in a synchronous network, so the processing time and communication time of each fault-free components are finite^[15]. Moreover, the symptoms of a faulty communication link can be divided into two types. They are dormant fault (include crash and stuck-at) and arbitrary fault. In the synchronous system, each fault-free processor can detect the messages from dormant faulty communication link, if the protocol appropriately encodes a transmitted message by Manchester code^[16] before transmission. Because, in the Manchester encoding, logic 0 is indicated by a 0 to 1 (upward transition at bit centre) transition at the centre of the bit and logic 1 is indicated by a 1 to 0 (downward transition at bit centre) transition at the centre of the bit^[16]. Due to a crash fault happens when a component is broken, so the component would not send any signal to the receiver processor. A stuck-at fault happens when the signal from a certain communication link is always constant. Therefore, a fault-free processor can detect the crash fault and stuck-at fault if the protocol encodes a transmitted message by Manchester code before transmission. However, the message from the arbitrary faulty communication link is arbitrary and unrestricted, so it is the most damaging failure types of all and causes the worst problem.

The assumptions and parameters of our protocols

•	The underlying network is synchronous.
•	Let Þ be the set of all processors in the network and .Þ.= n, where, n is the number of processors in the general network and n.4.
•	Each processor in the network can be identified uniquely.
•	Each processor in the network are fault-free, this can be achieved by using the FDA protocol FDAMIX^[16].
•	The communication links in the network are assumed fallible.
•	A processor that transmits messages is called a sender processor.
•	Each processor in the network has the same initial value; this can be achieved by using the BA protocol GPBA^[6].
•	The proposed protocol encodes a transmitted message by Manchester code before transmission.
•	Let a be the maximum number of arbitrary faulty communication links allowed.
•	Let d be the maximum number of dormant faulty communication links allowed.
•	Let c be the connectivity of the general network.
•	A processor does not know the faulty status of communication links, but the message(s) received through dormant faulty communication links can be detected.

The constraints: Since the FDAML is based on the BA protocol GPBA and FDA protocol FDAMIX. Therefore, the constraints of the FDAML are also based on the GPBA and FDA. The proposed FDAML can solve the FDA problem in the general network if c > 2a+d.

The proposed protocol: In this section, the proposed FDAML is used to detect/locate the dormant faulty communication link and arbitrary link in the general network. There are three phases in the FDAML; there are the message exchange phase, fault diagnosis phase and reconfiguration phase. The message exchange phase is used to collect the messages received from all processors in the network. The fault diagnosis phase is used to diagnosis the messages received in the message exchange phase to find out the faulty communication links. The reconfiguration phase is used to reconfiguration the network by eliminating the faulty communication links logically. The definition of FDAML is shown in Fig. 1.

The message exchange phase: In the message exchange phase, the FDAML transmits the message(s) by using the concept of virtual channel; the virtual channel can let an un-fully connected network work just like a fully connected network. Because, each processor in the network has the common knowledge of the graphic information of the entire general network and there are at least c (c>2a+ d) node-disjoint paths to each processor. An example of virtual channels between the processor P₁ and processor P₃is in Fig. 2. Figure 2a shows a general network. The connection state of the sender processor P₁and connection state of the destination processor P₃ is shown in Fig. 2b. The node-disjoint paths from sender processor P₁ to destination processor P₃ is shown in Fig. 2c.

In the first round of message exchange phase, each destination processor P_jcollects other processors’ initial value to construct the MAT_j. According to the MAT_j, processor P_jcan find out the dormant faulty paths by examining the value λ in the MAT_j and sets the d_Path_j = d_Path_j .{dormant faulty path}. The processor P_jalso can find out the arbitrary faulty path if the message in the MAT_j is the non-agreement value and non-λ.value and sets the a_Path_j = a_Path_j. {arbitrary faulty path}, where, 1 ≤ j ≤ n.

In the second round of message exchange phase, each processor P_i (for 1≤ i ≤ n) transmits its d_Path_iand a_Path_i to all processors through c node-disjoint paths. Then each destination processor P_jcan take the majority message from c paths to get the d_Path_iand m_Path_i. Then each destination processor P_jconstructs the same D_Path = d_Path₁ . d_Path₂ . …. d_Path_n-1 . d_Path_nand A_Path = a_Path₁ . a_Path₂ . …. a_Path_n-1 . a_Path_n.

The fault diagnosis phase: In the fault diagnosis phase, each processor can detect/locate the dormant faulty links by examining the D_Paths to find out the paths which only contain one link and set the D_Links = D_Links . {dormant faulty links}.


Fig. 1:	The proposed protocol FDAML

Each processor also can detect/locate the arbitrary link by examining the A_Paths to find out the links which appear times is the top (c-.D_Links.-1)/2 of the A_Paths and set the A_Links = A_Links . {arbitrary faulty links}. Because, there are at most (c-.D_Links.-1)/2 arbitrary faulty links in the network if c>2a +d. This treatment can ensure that all the arbitrary faulty communication links are detected.

The reconfiguration phase: Each processor uses the results of D_Links and A_Links from the fault diagnosis phase to reconfigure the network by isolating the faulty communication links logically. After the reconfiguration, the distributed system can be more stable and the performance and integrity of the network can be guaranteed.

An example of executing FDAML: In this section, we give an example of executing FDAML. In Fig. 3a, there is a network with six processors and the connectivity c of the network is four. There is a dormant faulty link between processor P₃ and processor P₄ (Link₃₄) and an arbitrary faulty link between Processor P₁ and P₃ (Link₁₃). The initial value of each processor is assumed to be the same (This can be achieved by the protocol GPBA).


Fig. 2:	An example of virtual channel between processor P₁ and processor P₃

In this scenario, we assume that the initial value of each processor is “1” (Fig. 3b).

The message exchange phase: In the first round of message exchange phase, each processor P_i (for 1≤ i ≤ n) in the network transmits its initial value to all other processors through c node-disjoint paths, then each destination processor P_jcan construct the matrix MAT_j by the messages received from other processors as shown in Fig. 3c, e, g, i, k and m. Each processor P_j examines the values in the MAT_j to find out the dormant faulty paths and arbitrary faulty paths. And sets the d_Path_j = d_Path_j .{dormant faulty paths} and a_Path_j = a_Path_j .{arbitrary faulty paths}. In Fig. 3d, f, h, j, l and n, shows the d_Path_jand m_Path_j of each destination processor P_j.

In the second round of message exchange phase, each processor P_i (for 1≤ i ≤ n) exchanges its d_Path_iand a_Path_ito all processors through c node-disjoint paths. Then, each processor P_jcan construct the same D_Paths and A_Paths. Example of D_Paths and A_Paths are shown in Fig. 3o.

The fault diagnosis phase: Each processor examines the D_Paths in Fig. 3o to find out the Link₃₄is in dormant fault then set the D_Links={Link₃₄}. There are at most 1((c-.D_Links.-1)/2 = (4-1-1)/2) arbitrary faulty link in the network for c>2a +d. Therefore, each processor examines the A_Paths to find out the link, which appear times is the top 1 at the A_Paths. So set the A_Links={Link₁₃}.

The reconfiguration phase: According to the D_Links and A_Links, to eliminate the Link₃₄ and Link₁₃ and then can reconfigure the network logically as shown in Fig. 3p. After reconfiguration, the performance and integrity of the network can be guaranteed.

The correctness and complexity of DFAML: The following lemmas and theorems are used to prove the correctness of DFAML.

Lemma 1: Each processor in the general network is fault-free and the initial value of each processor is the same.

Proof: Executing the BA protocol GPBA by Siu et al.^[8] can let each fault-free processor has the same initial value and after executing the FDA protocol FDAMIX by Siu et al.^[12] can detect/locate the faulty processors in the general network.

Theorem 1: Each processor can detect/locate the dormant faulty link in the general network.

Proof: Due to the message received from dormant faulty link can be detected if the protocol encodes a transmitted message by Manchester code before transmission.




Fig. 3:	An example of executing FDAML

If the message received from the path is λ, then we can ensure that there is at least one dormant faulty link in the path, we call such path as the dormant faulty path. In DFAML, D_Paths is used to store the dormant faulty path. Therefore, we can examine the D_Paths to find out the paths, which only contain one link. In case of only one link in the path, then the link is a dormant faulty link.

Lemma 2: If the path between the sender processor P_iand destination processor P_jis fault-free, then the received messages should be the same as the initial value of each processor by GPBA.

Proof: If the path between sender processor P_i and destination processor P_j is fault-free, then the message sent though the path is transmitted correctly. That is, if the sender processor P_i sends its initial value v_i to the destination processor P_j, then the destination processor should receive the value v_i correctly. According to the initial value of each processor is the same, so the received value v_i should be the same as the processor P_j’s initial value.

Lemma 3: Each processor can detect the arbitrary faulty path in the general network.

Proof: If the message received from the sender processor P_i is not the initial value of each processor by GPBA and also not the value λ, then the path between the sender processor P_i and destination processor P_j is in arbitrary.

Lemma 4: There are at most (c-.D_Links.-1)/2 arbitrary faulty links in the general network if c>2a +d.

Proof: By Theorem 1, the dormant faulty link can be detected and according to the constraint c>2a +d , we can ensure that there are at most (c-.D_Links.-1)/2 arbitrary faulty links in the network, where, d =.D_Links..

Theorem 2: Each processor can detect/locate the arbitrary faulty links in the general network.

Proof: By Lemma 3, each processor can detect the arbitrary faulty paths in the general network. In DFAML, A_Paths is used to store the arbitrary faulty paths. According to Lemma 4, there are at most (c-.D_Links.-1)/2 arbitrary faulty links in the network if c>2a +d. So that, examines the A_Paths to find out the arbitrary links which appear times is the top (c-.D_Links.-1)/2 at the A_Paths.

Theorem 3: The FDA protocol DFAML can detect/locate the faulty links in a general network.

Proof: By Theorem 1 and Theorem 2, this theorem is proved.

CONCLUSION

By using the BA protocol GPBA^[8] and FDA protocol FDAMIX^[16] by Siu et al.^[8] can solve the BA problem on processors and communication links with mixed faults and solve the FDA problem on processors with mixed faults. Then using our FDA protocol FDAML can solve the FDA problem on communication links with mixed faults. That is, after executing the protocol GPBA, FDAMIX and FDAML can reconfigure the unreliable general network to be a stable general network without faulty components. In addition, the proposed FDAML can detect/locate the maximum number of faulty communication links with mixed faults in a general network.

REFERENCES

Silberschatz, A., P. Galvin and G. Gagne, 2000. Applied Operating System Concepts. John Wiley and Sons Inc., New York, pp: 469-480

Silberschatz, A., P. Galvin and G. Gagne, 2002. Operating Systems Concepts. 6th. Edn., John Wiley and Sons Inc., New York

Dolev, D., 1982. The byzantine generals strike again. J. Algorithms, 1: 14-30.

Lamport, L., R. Shostak and M. Pease, 1982. The byzantine generals problem. ACM Trans. Programm Languages Syst., 4: 382-401.
Direct Link

Pease, M., R. Shostak and L. Lamport, 1980. Reaching agreement in the presence of faults. ACM J., 2: 228-234.
Direct Link

Wang, S.C., Y.H. Chin and K.Q. Yan, 1995. Byzantine agreement in a generalized connected network. IEEE Trans. Parallel Distributed Syst., 6: 420-427.
Direct Link

Yan, K.Q., Y.H. Chin and S.C. Wang, 1992. Optimal agreement protocol in malicious faulty processors and faulty links. IEEE Trans. Data Knowledge Eng., 4: 266-280.
Direct Link

Siu, H.S., Y.H. Chin and W.P. Yang, 1998. Byzantine agreement in the presence of mixed faults on processors and links. IEEE Trans. Parallel Distribution Syst., 9: 335-345.
Direct Link

Meyer, F.J. and D.K. Pradhan, 1991. Consensus with dual failure modes. IEEE Trans. Parallel Distributed Syst., 2: 214-222.
CrossRef

Siu, H.S., Y.H. Chin and W.P. Yang, 1998. Reaching strong consensus in the presence of mixed failure types. J. Inform. Sci., 108: 157-180.
Direct Link

Siu, H.S., Y.H. Chin and W.P. Yang, 1996. A note on consensus on dual failure modes. IEEE Trans. Parallel Distributed Syst., 7: 225-230.
Direct Link

Hsiao, H.S., Y.H. Chin and W.P. Yang, 2000. Reaching fault diagnosis agreement under a hybrid fault model. IEEE Trans. Comput., 49: 980-986.
Direct Link

Ramarao, K.V.S. and J.C. Adams 1988. On the diagnosis of byzantine faults. Proceedings of the 7th Symposium on Reliable Distributed Systems, Oct. 10-12, Columbus, OH. USA., pp: 144-153.

Wang, S.C., Y.H. Chin and K.Q. Yan, 1990. Reaching a fault detection agreement. Proc. Int. Conf. Parallel Process., 2: 251-258.

Fischer, M., M. Paterson and N. Lynch, 1985. Impossibility of distributed consensus with one faulty process. ACM J., 32: 374-382.
Direct Link

Halsall, F., 1995. Data Links, Computer Networks and Open System. 4th Edn., Addison Wesley Publishers Ltd., London, UK., pp: 112-125

Deo, N., 1974. Graph Theory with Applications to Engineering and Computer Science. Prentice-Hall, Englewood Cliffs, New Jersey, USA

HOME JOURNALS CONTACT

Information Technology Journal

Year: 2004 | Volume: 3 | Issue: 3 | Page No.: 275-282 DOI: 10.3923/itj.2004.275.282

Detecting the Faulty Communication Links under the Loose Coupled Distributed System

K.Q. Yan and S.C. Wang

How to cite this article

K.Q. Yan and S.C. Wang, 2004. Detecting the Faulty Communication Links under the Loose Coupled Distributed System. Information Technology Journal, 3: 275-282.

Keywords: distributed system, Fault diagnosis agreement, fault tolerance, mixed fault model and loose coupled

REFERENCES

Year: 2004 | Volume: 3 | Issue: 3 | Page No.: 275-282
DOI: 10.3923/itj.2004.275.282