HOME JOURNALS CONTACT

Information Technology Journal

Year: 2011 | Volume: 10 | Issue: 12 | Page No.: 2268-2275
DOI: 10.3923/itj.2011.2268.2275
On Out-of-sequence Problem in Contention-tolerant Crossbar Switches
Jianfei Zhang, Zhiyi Fang, Guannan Qu, Xiaohui Zhao and S.Q. Zheng

Abstract: Recently, we proposed an innovate crossbar switch named Contention-tolerant Crossbar Switches (CTC(N)). Different with conventional crossbar switches which resolve output contentions by building conflict-free connections among inputs and outputs before switching, CTC(N) tolerates output contentions and avoids cell lost by reconfiguring bus in switch fabric. In this way, complex hardware which is used to complete the arbitrations is released and schedulers are fully distributed over all input ports and operate independently. CTC(N) opens a new perspective on cell switching. For the asymmetric property of original CTC(N) architecture, DiaCTC(N), an improved architecture was proposed. However, interceptions which happen in contention-tolerant mechanism lead to cell out of their original sequence. In this study, we focus on this problem in CTC(N) and DiaCTC(N). Two methods including re-sequencing and Self-adjust scheduling algorithm will be discussed.

Fulltext PDF Fulltext HTML

How to cite this article
Jianfei Zhang, Zhiyi Fang, Guannan Qu, Xiaohui Zhao and S.Q. Zheng, 2011. On Out-of-sequence Problem in Contention-tolerant Crossbar Switches. Information Technology Journal, 10: 2268-2275.

Keywords: Contention-tolerant, switches, out-of-sequence and scheduling algorithm

INTRODUCTION

Crossbar switches are commonly used in high-speed Internet routers and switches. In order to simplify scheduling and memory management, the synchronized switching in ATM switches is still used in IP switches. Variable IP length packets arriving at input ports are segmented into fixed-size cells and time is slotted with one cell transmission per time slot within the crossbar switch. The cells are reassembled into packets in output ports.

Without internal speedup, cells are buffered in input ports only and such switches are called Input Queued (IQ) switches. In order to overcome Head-of-line (HOL) blocking and enable higher throughput, input buffers in an IQ switch are usually arranged as Virtual Output Queues (VOQs). The performance of an IQ switch depends on how cells are scheduled to be switched through the switch which is reduced to matching problem in Bipartite Graphs. Many scheduling algorithms based on maximized matching have been proposed, including maximum size matching (Hopcroft and Karp, 1973) and maximum weight matching (McKeown, 1999). Since global information is required, a central scheduler operates on time complexity of O(N3) (or O(N2.5)) to find maximum (size or weight) matchings. It is too high for practical use. Thus, iterative heuristics for finding maximal matchings (Tamir and Chi, 1993) were considered instead. For such a switch, each input port and output port has an arbiter and 2 N arbiters work in Request-Grant-Accept (RGA) fashion with signal exchange. Representative works are PIM (Anderson et al., 1993), iRRM (McKeown et al., 1993), iLQF, iOCF (McKeown, 1995), iSLIP (McKeown et al., 1999) and DRRM (Chao, 2000). It is proved that O(log N) iterations (which involve global information exchange) is required to obtain a maximal matching between inputs and outputs. Though, implemented in hardware, these schedulers are considered too slow with too high cost for high-speed networks.

To reduce scheduling complexity, a small size of buffer is proposed to be added at each crosspoint. Combined with input queues, scheduling process operates in two phases. In the first phase, each input port selects a cell to place into a crosspoint buffer in its corresponding row and in the second phase, each output port selects a crosspoint in its corresponding column to take a cell from. Input (resp. output) ports operate independently and in parallel in the first (resp. second) phase, eliminating a single centralized scheduler. Some work on CICQ with and without internal speedup include, for example Rojas-Cessa et al. (2001, 2005), Chrysos and Katevenis (2003), Chuang et al. (2005), Lin and McKeown (2005), Chrysos and Katevenis (2007) and He et al. (2008). However, the cost of crosspoint buffers which requires at least O(c N2) memory space, where c is the number of bits in a cell, is used to trade for reduced control complexity. Meanwhile, for output scheduling, N separated N-input arbiters are required, resulting an additive O(N2) wiring complexity which is the same as that of an unbuffered crossbar switch with a scheduler based on maximal matching. Crosspoint buffers and the circuit for schedulers take a large chip area which severely restricts the scalability of buffered crossbar switches.

In summary, all previous crossbar switches require complex control such that N requests must be communicated to and processed by arbiters, including the case of crossbar with crosspoint buffers. We opened a new perspective on designing switches by proposing an innovative agile crossbar switch architecture called Contention-tolerant Crossbar which is denoted by CTC(N) (Qu et al., 2011). Unlike the conventional crossbars, CTC(N) can tolerate output contentions by a pipelining mechanism, with pipeline stages implemented as buffers in the input ports. These buffers are used to decouple the scheduling task into N independent parts in such a way that N schedulers are located in the N input ports and they operate independently and in parallel. Without using arbiters and crosspoint buffers that require additional chip area, the CTC(N) switch is more scalable.

In our previous study, we proposed a fully distributed scheduling algorithm scheme call Staggered Polling (SP) (Qu et al., 2010) and showed that schedulers located at input ports are able to cooperate “smartly” with zero knowledge of other input ports. On the other hand, we presented an improved CTC(N) architecture called Diagonalized Contention-tolerant Crossbar Switch (DiaCTC(N)) (Qu et al., 2011). DiaCTC(N) overcomes the asymmetric data flows problem in CTC(N) and largely enhance the performance without any additional cost. However, CTC(N) and DiaCTC(N) suffer from out-of-sequence problem which is caused by intercepting cells. In this study, we discuss this issue from two ways, i.e., re-sequencing cells and scheduling algorithm. Self-Adjusting scheduling algorithm is proposed to prevent cell out-of-sequence. Simulations showed the performance of these schemes. Since, the property of contention-tolerant, we called CTC(N) and DiaCTC(N) as CTC(N)-class switches in the rest of this study.

OUT-OF-SEQUENCE PROBLEM IN CTC(N)-CLASS SWITCHES

The switch fabric of CTC(N) is comprised of N2 crosspoints (Switching Element, SE) arranged as an N*N array.

The SE at row i and column j is denoted as SEi,j who has three inputs, three outputs and two states, as shown in Fig. 1a. Input port Ii is equipped with a scheduler Si. In one time slot, if Ii wants to transmit a cell to output port j, Sj sets the state of corresponding SEi,j to Receive-and-transmit (RT) state. The remaining SEs in the same row will be kept in Cross (CR) state. If more than one input ports set their SEs as RT in the same output column, the output column is configured as a pipeline, as shown in Fig. 1b. Cells transmitted from upstream input ports will be intercepted and buffered in downstream input ports. In this way, output contentions are tolerated in CTC(N). Without avoiding output conflicts,schedulers operate independently and in parallel which largely reduces scheduling complexity.

Buffer in each input ports can be arranged as single FIFO queue or Virtual Output Queues according to queueing management policies which contains cells both from outside of switch and from upstream input ports (if exist).

Fig. 1 (a-b): (a) A SE and its two states and (b) Each output line of CTC(N) is a reconfigurable bus

Since cells flows from upstream input ports to downstream input ports in CTC(N), input ports with larger input port number has more severe traffic load. The throughput of an input port goes down once it is overloaded. Thus, we proposed an improved CTC(N) architecture called Diagonalized Contention-tolerant Crossbar Switch denoted as DiaCTC(N) by Qu et al. (2011) DiaCTC(N) is exactly the same in all aspects of CTC(N), except the connections in each SE column. In CTC(N), SEs in each output column form a unidirectional alignment. SEi',j is an upstream node of SEi'',j in output column j, where 0< = I'<i''< = N-1 and 0< = j< = N-1. Therefore, input port 0 is the top input for any output destination and it only has traffic from outside. A cell transmitted out from input port 0 could be intercepted by N-1 downstream input ports. Input port N-1 is the bottom input for any output destination. Cells buffered in input port N-1 are from outside and N-1 possible upstream input ports by N possible output columns. While in DiaCTC(N), consider output column j, 0< = j< = N-1, SEs in output column j are classified into the following three classes:

Head SE: Sei,j is the head SE of output column j when I = j. The associated Ii is the top input for output destination Oj. Cell transmitted from Ii to Oj possibly is intercepted by Id, d = (i+k) mod N and 1≤k≤N-1
Tail SE: SEi,j is the tail SE when i = (j-1) mod N. The associated Ii is the bottom input for output destination Oj. Cells transmitted from Ii to Oj arrives at Oj without being intercepted
Internal SE: SEi,j is an Internal SE when i = (j+k) mod N, 1≤k≤N-2. Ii has upstream inputs and downstream inputs for output destination Oj. A cell transmitted from Ii to Oj could be intercepted by its downstream input Id, i.e., d = (I+k) mod N and 1≤k≤(j-1-i) mod N

The structure of DiaCTC(4) is shown in Fig. 2b. Its counterpart CTC(4) is shown in Fig. 2a for comparison, with SEs serving as Heads and Tails being labeled.

CTC(N)-class switches suffer from out-of-sequence problem. Let's consider this problem from an example, as shown in Fig. 3.

Let Ii be input port i. Cki,j is the kth cell arriving at input port i from outside of switch with output port j as its destination. In time slot t, Iu and Id select cells to be transmit to output port j, i.e., C1u,j and C1d,j, as shown in Fig. 3a. According to the contention tolerating process, C1u,j is intercepted by and buffered in Id and C1d,j arrives at its output destination at the end of t (Fig. 3b). In time slot t+1, Iu and Id transmit cells to output j and k, respectively (Fig. 3c). Without being intercepted, C2u,j arrives at output j at the end of t+1 but C1u,j which arrived at Iu earlier is delayed by Id, as shown in Fig. 3d.

The cells cannot leave output ports unless all their preceding cells leave. Thus, additional mechanisms are required to resolve this problem. In our previous study, we mainly focused on switch throughput and mean cell delay without considering cell’s sequence. In this study, we analyze how to resolve this problem in two ways.

One quick fix is adding re-sequence buffers in each output port and they are arranged as N queues. RVIQi,j denotes the re-sequencing queue containing cells from input port i in output port j and cells can be inserted into any place of RVIQi,j. Cells will be re-ordered in RVIQi,j according to their time stamps. In this way, cells leave RVIQi,j sequentially. It increases hardware complexity and additional re-sequence delay, since cells could be blocked in RVIQi,j till all of cells arriving at switches earlier left. Under following assumptions: 1. no cell lost during the switching process; 2. Cells arriving at output port can be inserted into any place of re-sequence buffer; 3. A cell is ready when all of cells who have earlier time stamp are ready. Let Ta denote the time stamp of cell C, i.e., C’s time stamp. Tr denotes the time when C is ready. Then the re-sequencing delay of C is defined as: Tr-Ta.

The second way is designing smart scheduling algorithm to prevent cell disordering. We will present a fully distributed scheduling algorithm call Self-Adjust Scheduling Algorithm in next section and simulation results show the performance of CTC(N) and DiaCTC(N) with both two out-of-sequence resolving method.

SELF-ADJUSTING SCHEDULING ALGORITHM

We present a fully distributed scheduling algorithm called Self-adjusting Scheduling Algorithm (SA) and we will show that out-of-sequence problem doesn’t exist with SA and high performance achieves by desynchronizing the scheduling pointers.

In input port i, 0 <=i <= N-1, VOQs are used to buffer cells from outside. A small separated queue UQi is used to buffer cells from its upstream inputs, as shown in Fig. 4. In order to simplify our work, we assume that one time slot is cut into three phases: cells are placed in VOQs from outside in the first phase; cells are transmitted through switch fabric in 2nd phase; cells are intercepted and placed in Uqs in 3rd phase. Under admissible traffic, no more than one cell arrives at one input port from outside.

Fig. 2 (a-b): (a) CTC(4) and (b) DiaCTC(4)

Fig. 3 (a-d): Example of out-of-sequence problem

Fig. 4: Self-adjusting scheduling mechanism in input port i

Without internal speedup, at most one cell can be transmitted and received by one input port and one output port, respectively. We would like to mention that there are at most two cells arrive at one input port and a special designed buffer can deal with this situation without any speedup (Zheng, 2011).

Let Ci,k denote the cell from input ports i originally and with output j as its destination. Scheduler Si maintains a scheduling pointer Ai operating in round-robin fashion with initial place being VOQi,0. During time slot t, Si selects cells to be transmitted in next time slots according to following algorithm:

If there is cell Cu,k being intercepted and placed in UQi, Si selects Cu,k as its scheduling results and update scheduling pointer to k
If there is no cell in UQi, Si looks for the first non-empty VOQ by moving the place of scheduling pointer from its current location. When the scheduling pointer Ai obtains VOQi,j, it stops moving and keeps its location on j and Si selects the head-of-line cell of VOQi,j as its scheduling result

Reconsider the example shown in Fig. 3. With Self-adjusting scheduling algorithm, C1u,j is intercepted by and buffered in Id at time slot t and will be retransmitted at time slot t+1.

Fig. 5 (a-d): Scheduling Example in CTC(4) with SA scheduling algorithm

It ensures C2u,j arriving at output port j be forward after C1u,j. Cells disordering is avoided efficiently.

Schedulers are distributed over all of the input ports and they operate independently according to their local information. Through output conflicts occurring in switch fabric can be tolerant and cell lost can be avoided, throughput is blocked by cell intercepting. However, the scheduling pointers can be desynchronized smartly, like iSLIP. Take CTC(4) with SA scheduling algorithm as an example, as shown in Fig. 5. We assume that all of VOQs in each input ports are not empty and all of the scheduling pointers locate at VOQi,0 at time slot t. Si, 0≤i≤3, selects the head-of-line cell of VOQi,0 as its scheduling result and updates its scheduling pointer to 1. However, since 4 input ports transmit cells to the same output port, 3 downstream input ports (1, 2 and 3) intercept and buffer cell in their UQs. At time slot t+1, input port 1, 2 and 3 update their scheduling pointer to 0 but input port 0 moves its scheduling pointer to 1. In this way, at time slot t+3, scheduling pointers distributed over all of the input ports are desynchronized and avoid output conflicts.

SIMULATION ANALYSIS

The performances in terms of mean cell delay under uniform and non-uniform traffic are investigated by simulations as shown in Fig. 6 and 7. Uniform traffic includes Bernoulli arrivals and Bursty arrivals with burst lengh is 16. For non-uniform traffic, we choose Chang’s traffic model and Asymmetric traffic model which are used in other works commonly. The switch size is 32*32. Crossbar switch with famous iSLIP algorithm with one iteration is used as comparison.

CTC(N)_SP_Random (DiaCTC(N)_SP_Random) denotes mean cell delay of CTC(N) (DiaCTC(N)) with SP algorithm scheme and Random pattern as its secondary scheduling strategy ignoring re-sequencing delay. In detail of SP scheduling algorithm, CTC(N)_SP_Random_R (DiaCTC(N)_SP_Random_R) represents the mean cell delay of CTC(N) with SP_Random scheduling algorithm and re-sequencing cell order. The difference between them is re-sequencing delay, as shown in the first graph of Fig. 6. CTC(N)_SA and DiaCTC(N)_SA denotes CTC(N) and DiaCTC(N) with Self-Adjusting scheduling algorithm.

Ignoring re-sequencing delay, DiaCTC(N) shows the best performance under all of traffic models. It can be seen as the optimal performance of DiaCTC(N) with SP scheduling algorithm. However, high re-sequencing delay happens when cells are re-ordered at output ports. DiaCTC(N) with SA algorithm has similar performance as iSLIP. But the performance of CTC(N) with SA algorithm is not desirable. Asymmetric data flow in CTC(N) make the scheduling pointers distributed over input ports be desynchronized slowly and it leads to high delay. It is worth to notice that, the delay of DiaCTC(N) with SA algorithm increases sharply when offered load is between 0.25 and 0.6 under bursty traffic and performs same performance as iSLIP when offered load is over 0.60. It implies that the scheduling pointers are trapped into bad pattern and are hard to be desynchronized under low offered load and bursty arrivals. However, with increasing offered load, all of VOQs becomes non-empty and schedulers operate efficiently. Considering the fully distributed controlling property and zero signal exchange, the performance of DiaCTC(N) with SA algorithm is really significant.

Fig. 6 (a-b): Mean cell delay under uniform traffic; (a) Bernoulli uniform traffic and (b) Bursty uniform traffic

Fig. 7 (a-b): Mean cell delay under uniform traffic; (a) Change’s traffic and (b) Asymmetric traffic

In summary, DiaCTC(N) enhance CTC(N) significantly. DiaCTC(N) with Self-Adjusting scheduling algorithm has similar performance with iSLIP. However, without signal exchange and with fully distributed controlling property, SA scheduling algorithm on DiaCTC(N) is more simple and more scalable. Through DiaCTC(N) with SP scheduling algorithm suffers from high re-sequencing delay, it can be reduced by using sophisticated queuing management strategy and approaches to its optimal performance.

CONCLUSION

In our previous studies, we proposed an innovative agile crossbar switch architecture CTC(N) and its improved architecture DiaCTC(N), who have fully distributed control property. We call them CTC(N)-class switches. However, due to multiple path switching, CTC(N)-class switches suffer from cell out-of-sequence problem. In this study, out-of-sequence problem is discussed from two ways, i.e., re-sequence and scheduling algorithm. Re-sequence method causes high additional delay and extra hardware. We propose a new scheduling algorithm named Self-Adjusting and we show that it is able to keep cells’ original arriving sequence. Simulations show that high performance can be achieved with SA scheduling algorithm. Re-sequencing delay can be reduced by using sophisticated queuing management strategy. We will address this issue in our subsequence papers.

REFERENCES

  • Hopcroft, J.E. and R.M. Karp, 1973. An n^5/2 algorithm for maximum matchings in bipartite graphs SIAM J. Comput., 2: 225-231.
    CrossRef    


  • McKeown, N., A. Mekkittikul, V. Anantharam and J. Walrand, 1999. Achieving 100% throughput in an input-queued switch. IEEE Trans. Commun., 47: 1260-1267.
    CrossRef    


  • Tamir, Y. and H.C. Chi, 1993. Symmetric crossbar arbiters for VLSI communication switches. IEEE Trans. Parallel Distributed Syst., 4: 13-27.
    CrossRef    


  • Anderson, T.E., S.S. Owicki, J.B. Saxe and C.P. Thacker, 1993. High speed switch scheduling for local area networks. ACM Trans. Comput. Syst., 11: 319-352.
    CrossRef    Direct Link    


  • McKeown, N., P. Varaiya and J. Warland, 1993. Scheduling cells in an input-queued switch. IEE Elect. Lett., 29: 2174-2175.
    CrossRef    


  • McKeown, N., 1995. Scheduling algorithms for input-queued cell switches. Ph.D. Thesis, UC Berkeley


  • McKeown, N., 1999. The iSLIP scheduling algorithm for input-queued switches. Networking IEEE/ACM Trans., 7: 188-201.
    CrossRef    


  • Chao, J., 2000. Saturn: A terabit packet switch using dual round robin. IEEE Comm. Mag., 38: 78-84.
    CrossRef    


  • Rojas-Cessa, R., E. Oki and H.J. Chao, 2001. CIXOB-k: Combined input-crosspoint-output buffered packet switch. Proc. IEEE GLOBECOM, 4: 2654-2660.
    CrossRef    


  • Rojas-Cessa, R., E. Oki and H.J. Chao, 2005. On the combined input-crosspoint buffered switch with round-robin arbitration. IEEE Trans. Commun., 53: 1945-1951.
    CrossRef    


  • Chrysos, N. and M. Katevenis, 2003. Weighted fairness in buffered crossbar scheduling. Proceedings of the Workshop on High Performance Switching and Routing, June 24-27, Torino, Italy, pp: 17-22.


  • Chuang, S.T., S. Iyer and N. McKeown, 2005. Practical algorithms for performance guarantees in buffered crossbars. Proc. Annu. Joint Conf. IEEE Comput. Commun. Soc., 2: 981-991.
    CrossRef    


  • Lin, M. and N. McKeown, 2005. The throughput of a buffered crossbar switch. IEEE Commun. Lett., 9: 465-467.
    CrossRef    


  • Chrysos, N. and M. Katevenis, 2007. Crossbars with minimally-sized crosspoint buffers. Proceedings of the Workshop on High Performance Switching and Routing, May 30-June 1, Brooklyn, NY., pp: 1-7.


  • He, S.M., S.T. Sun, H.T. Guan, Q. Zheng, Y.J. Zhao and W. Gao, 2008. On guaranteed smooth switching for buffered crossbar switches. IEEE/ACM Trans. Network., 16: 718-731.
    CrossRef    


  • Qu, G., H.J. Chang, J. Wang, Z. Fang and S.Q. Zheng, 2011. Contention-tolerant crossbar packet switches. Int. J. Commun. Syst., 24: 168-184.
    CrossRef    


  • Qu, G., H.J. Chang, J. Wang, Z. Fang and S.Q. Zheng, 2010. Designing fully distributed scheduling algorithms for contention-tolerant crossbar switches. Proceedings of the International Conference on High Performance Switching and Routing, June 13-16, Richardson, TX., pp: 69-74.


  • Zheng, S.Q., 2011. High-speed multi-Input FIFO queues with applications. Technique Report UTDCS-04-11. 2011.

  • © Science Alert. All Rights Reserved