Identifying P2P Application with DHT Behaviors

Information Technology Journal

Year: 2011 | Volume: 10 | Issue: 3 | Page No.: 565-572
DOI: 10.3923/itj.2011.565.572

Identifying P2P Application with DHT Behaviors

Lin Ye, Hongli Zhang and Qiang Dai

Abstract: Since, the emergence of peer-to-peer applications, their traffic has gradually become the dominant component on some links, which has a significant impact on the underlying infrastructure, such as internet topology, routing systems and network strategies. It is necessary to make some adjustments in design for future, which first needs accurate identification of P2P traffic. In this study, we focus on the explicit behaviors of Distributed Hash Tables (DHT), which are deployed widely in different P2P applications. Therefore, we develop a systematic methodology to identify P2P hosts in a statistical way, i.e., based on behavior patterns of DHT functions. We analyze four representative behaviors in the view of functions, including bootstrap pattern, routing pattern, diversity pattern and short-session pattern. Actually, it is proved that these behaviors are very different from traditional applications in pattern through a series of detailed experiments. At last a novel algorithm that relies on behavior patterns of DHT is proposed. Our experiment results show that we are able to identify more than 90% of P2P hosts with at least 95% accuracy and also can deal with the silent clients, which figures out the identification in a new aspect.

Fulltext PDF Fulltext HTML

How to cite this article

Lin Ye, Hongli Zhang and Qiang Dai, 2011. Identifying P2P Application with DHT Behaviors. Information Technology Journal, 10: 565-572.

Keywords: P2P, DHT, behavior, identification and pattern

INTRODUCTION

Over the last decade, peer-to-peer (P2P) applications have drastically grown to represent a dominant component of internet traffic, which are used widely in many fields (Jiang et al., 2009; Peng and Zheng, 2010). According to the statistics report, though P2P traffic distributes unevenly over the Internet, it still contributes 50% of total traffic on average in the backbone, overwhelming some links in extreme cases. Since, P2P is organized as an overlay network regardless of the underlying topology, when more users participate into P2P networks for better download efficiency, the traffic will have a great impact on internet infrastructure, such as routing systems or flat-rate pricing business model (Piatek et al., 2009). Now there are several unfriendly battles occurring between users and ISPs. Users expect to enjoy the benefits of sharing resources. For example, the most popular use of P2P today is for file sharing in networks such as BitTorrent and eMule. But network oblivious sharing forces ISPs to invest in insufficient capacity, which increases costs for both backbone ISPs and edge ISPs. However, what is worse, many customer-facing ISPs offer flat-rate pricing and have to absorb the costs externalized by P2P applications. In response, some ISPs have elected to rate-limit or simply block these problem protocols (Marcel et al., 2008). It is necessary to adjust some strategies for future designs and deployments, such as Quality-of-Service and capacity planning, which foremost needs to identify P2P traffic.

Therefore, many researchers focus on the identification of P2P traffic, trying to look for stable characteristics or patterns, which can give a clue to show what kind of applications are roaring. Traditional approaches simply classifies P2P traffic by default well-known port numbers (Saroiu et al., 2002; Sen and Wang, 2004), such as 6881-6889 in BitTorrent and 6346-6347 in Gnutella. However, since the Recording Industry Association of American (RIAA) begins to keep their eyes on P2P networks and ISPs decide to throttle, current P2P applications have to disguise themselves in different ways (Wang and Sun, 2009.), e.g., using random ports, communicating with each other in tunnels or operating on proprietary protocols. Traffic classification must examine payload by protocol signatures to guarantee the accuracy. Studies (Sen et al., 2004; Gummadi et al., 2003) develop signature-based payload methodology in order to deal with random ports and masquerading techniques. They can have high accuracy for known or public protocols, but fail to work out encryption and unknown ones, which also involve in violating privacy and processing overhead. Further, some novel researchers stress on traffic characteristics and behaviors of P2P applications recently, which makes use of the information from the transport layer, e.g., the source and destination IP addresses or ports, the average packet size, the inter-arrival times between packets. Study (Jeffrey et al., 2006) exploits the distinctive characteristics by clustering algorithms, including total number of packets, mean packet size, mean payload size, total number of bytes transferred and mean inter-arrival time of packets. Work (Karagiannis et al., 2004, 2005) develop a novel systematic methodology to identify P2P flows at the transport layer in multiple-level views, which reveals the inherent behaviors of a P2P host at different levels: (1) social level, (2) network level and (3) application level. In summary, pattern-based methods are better fit with the mutable P2P networks, but there are still some pitfalls to overcome: (1) some patterns are not intuitive and susceptible to traffic samples. (2) CPU cycles and memory become the bottles of classification, which can not handle with high speed network.

Different from previous work, this study intends to explore the patterns of Distributed Hash Tables (DHT), which are widely used in P2P applications. We try to find several intrinsic patterns which can distinguish DHT hosts from others. As a DHT node, it usually has:

•	Bootstrap pattern: We capture the behavior pattern of a node as indicated by its bootstrapping mechanism. In order to join in DHT smoothly, each node will try to query as many nodes as it ever knew when it starts up or reconnects, which results in the burst of UDP packets in a short while

•	Routing pattern: We capture the behavior pattern of a node in terms of its routing role in DHT. For example, it often updates its routing table to keep itself fresh and the pointers to other nodes correct and meanwhile responses for routing queries from others

•	Diverse pattern: In order to overcome the churn, each node always performs parallel queries, which means many distinct nodes will be involved in one action and the number of ip that one node connects is higher than traditional applications

•

Short-session pattern: We capture the intrinsic behavior pattern of DHT protocol as control message. Compared with data transmission, control message is short-session and less than four messages are exchanged between two hosts in one interaction. Short-session pattern can help us distinguish DHT from other applications, such as live streaming

The significance of our algorithms proposed in this study lies in the intrinsic functions as a DHT node, which offers a distinct advantage over payload analysis: we can identify previously unknown P2P protocols. Also, the algorithms are deployed easily. Specially, the highlights of our study include:

•	This study explores the patterns of DHT underlying in most P2P applications and only depends on the characteristics of UDP packets, which neither inspects the payload nor uses the indirect patterns extracted from complex machine learning algorithms

•	Our method can identify more than 90% of P2P hosts with at least 95% accuracy and the method is deployed easily

•	Our method can monitor silent clients which are online but do not download any files that previous work did not ever mention

THE BEHAVIORS OF DHT

Without loss of generality, we choose KAD that is deployed in eMule as our target network. KAD is a Kademlia-based peer-to-peer routing protocol (Maymounkov and Mazieres, 2002) implemented by several of the most popular P2P applications. Similar to other DHT protocols, each KAD node is assigned a unique identifier in a 128 bit space and they are built together as a tree. Kademlia defines proximity on the basis of the XOR metric, i.e., the distance between keys X and Y is simple the integer value of X⊕Y. For example, the distance between 0100 and 0011 is 0111 (or 7). To achieve the goal of distributed routing, each node also maintains a local routing table containing a set of buckets populated with others that are closer to ranges of the key space. The routing table is constructed so that each lookup reduces the XOR distance to the target key by 1/2, providing logarithmic lookup.

Kademlia also defines several RPC operations to implement the storage and retrieval of <key, value> pairs. Kademlia performs iterative lookups until no closer nodes can be obtained: a node looks up a target key by sending parallel RPCs to several nodes whose IDs are closest to the key in its routing table. The nodes that receive a lookup RPC reply by sending back a list of nodes which are closest to the key in their ID space. Each node always tries to keep α (α≥3) parallel RPCs at the same time. A lookup terminates when some node replies with the target key, or until the last nodes whose IDs are closest to the key do not return any new closer node ID.

The whole process of P2P applications can be divided into two logical parts in the view of functions: searching/routing and downloading. In the routing process, each node needs to search the given keys or keep routing table fresh actively, which acts like a normal user. Meanwhile, they must response various requests from others to maintain distributed routing passively as a routing node. Routing messages packed in UDP from KAD are the main traffic. In the downloading process, when a node begins to share files with others using TCP as a servent, data messages become the main traffic alternatively. Because the downloading process must be triggered by users’ intention, the routing process is persistent all the time and routing behavior is stable. From the perspective of behaviors, KAD is an overlay network independent on the underlying infrastructure. The nodes have to bootstrap theirselves into KAD quickly in an efficient way. Besides, every node needs to maintain neighbors’ status and guarantee the correctness of routing table, which forms some special behavior patterns of KAD.

According to the analysis of Kademlia protocol, each node must be responsible for local routing nearby in KAD. First, each node has to reply every lookup from others and must update their status in its routing table to guarantee a better performance, which is called as routing pattern. Second, due to the dynamic of nodes (or churn), each node will send a burst of RPC packets to the nodes that it ever knew in order to bootstrap itself into KAD immediately and shows an abnormal pattern of traffic in a short while, which is called as bootstrap pattern. Third, no matter what kind of behaviors mentioned above, there is a basic rule that a node tries to communicate with a number of nodes, which is called as diverse pattern. Finally, routing message is a kind of control message functionally. KAD is responsible for routing and one interaction usually only contains one query and the corresponding reply, which is called as short-session pattern.

THE IDENTIFICATION OF BEHAVIOR PATTERNS

Before discussing the identification of behavior patterns, we first need to focus on whether the characteristics of UDP packets are observable enough as the metrics. We add up all UDP packets in a host with eMule installed for a long time and separate KAD packets from others to compute its proportion of total UDP packets. Figure 1 demonstrates the percentage of KAD packets in 8 h and tell us KAD indeed takes on the majority (more than 50%) most of the time. Figure 2 represents the CCDF of KAD percentage and half of total UDP packets belong to KAD at 75.6% of observation points, which means KAD affects the number of UDP packets markedly and the characteristics of UDP packets are proper for traffic identification.

To have a close look at the patterns of KAD behaviors in real life, in Fig. 3 the number of KAD packets are shown when one host is running eMule without any downloading task.


Fig. 1:	Percentage of KAD Packets


Fig. 2:	CCDF of KAD Packets Percentage


Fig. 3:	The No. of KAD packets number

The purpose of empty downloading list tries to obtain the traffic of a silent client that has the minimum necessary routing message, which is the worse case in our background. It will be better when some files are sharing.


Algorithm 1:	Behavior pattern identification algorithm

We can see the number scatters in three areas primarily (4000, 6000), (900, 2000) and (50, 200). The highest one is correspondence with the bootstrap behavior in the experiment and the others are active areas initiated by routing behavior. The number of UDP packets is at least 2-3 times more than usual case in both behaviors.

Combining the analysis of all previous sections yields our first algorithm based on behavior patterns for the identification of KAD nodes. Algorithm 1 presents in detail the procedure every time interval. Across time intervals, we denote a candidate node which might be a KAD node as a binary tuple {IP, Port}. A hash table called NodesTable is designed to store the information of candidate nodes. Each {IP, Port} pair that is an item of the NodesTables is equipped with sets that include: (1) an OppositeList recording the distinct nodes that it communicated with. As long as a node is considered as one KAD node, the nodes in its OppositeList also belong to KAD. (2) The number of sent packets. To save the memory, we study the common routines in KAD. We found in fact when one routine happens, the node that issues the query is related to most of packets with one NodesTable item and a long OppositeList while the rest are related to one packet with one NodesTable item and a short OppositeList. Obviously, the source node and its OppositeList can contain many other nodes naturally. Therefore, the NodesTable will be scanned periodically to delete the items with a short OppositeList.

At the end of each check interval, we analyze all {IP, Port} pairs in the NodesTable. This analysis is based on the thresholds of packet number when routing and bootstrap behaviors happen, which are respectively denoted as the symbols of PACKET_ROUTING_LIMIT and PACKET_BOOTSTRAP_LIMIT in Algorithm 1. We evaluate the thresholds for many times and choose them empirically. When the number of packets is over any of them, the nodes and their OppositeLists will be inserted into the KADNodesList as a result. Furthermore, the KADNodesList can help known KAD nodes to omit the procedure.

We set up our experiments in a medium network owing about several hundreds of hosts in March 2010, which is a subnet of our campus network. Since, the first byte in every KAD packet is pre-defined as protocol signatures, we can evaluate the accuracy of our methodology by comparing nonpayload versus payload estimates of P2P hosts. Two identification algorithms are installed into the gateway machine that is the only connection to the Internet. We choose 3000 as the threshold of bootstrap behavior and 30 as routing behavior in one minute. The corresponding true positives (TP) and false positives (FP) show in Table 1 as below. Specially, the clients in the network are divided into two classes: silent clients and downloading ones.

From Table 1, we can conclude: (1) bootstrap behavior can guarantee high accuracy with less false positives between two behaviors, which seems bootstrap behavior is more reliable than routing behavior for identification. But by comparing the times that two behaviors are detected in the experiments, we found the appearance of bootstrap behavior is much less than routing behavior, which makes bootstrap behavior need more time to be detected. (2) Bootstrap behavior is almost the same in metrics for different clients, but routing behavior is apt to point out the downloading client more accurately. (3) Silent clients are more difficult to detect than downloading ones. In summary, it is proved that the behavior patterns of DHT nodes are adequate for the identification of P2P hosts.

IMPROVEMENTS OF IDENTIFICATION ALGORITHM

As Table 1 shown, bootstrap behavior has high accuracy with less false positives, but it does not appear often because it will occur when a client starts or reconnects into KAD occasionally. Routing behavior has a short detection cycle, but it will yield many false positives due to its low threshold. Overall, Algorithm 1 still needs to develop some improvements to decrease the risk of false positives. We examined all misclassification in the experiment results and found false positives mainly come from two aspects: communication on well-known ports and data communication by applications. Applications working on well-known ports that are widely deployed send continuous packets. Data transmission can produce a large throughout. Both of them bring in additional traffic to influence the identification results of KAD nodes. Therefore, we introduce another three heuristics to filter them. In the following subsections, we describe three heuristics that augment on our basic methodology to limit the magnitude of false positives.

The well-known ports: Though the usage of UDP packet characteristics is definitely simple and efficient for the identification of KAD nodes, other applications also use UDP such as DNS and OICQ, which affect the identification accuracy of the basic algorithm greatly. To determine non-P2P applications in our network that use UDP often, we examined all packets to collect the well-known ports list. We found that besides KAD, only a few applications use UDP transport protocol: NETBIOS, NETBIOS-DS, DNS, IRC, OICQ, which collectively typically use a small set of port numbers such as 135, 137, 139, 445, 8000 etc.

Table 1:	Experiment results of Algorithm 1

Table 2:	Excluded ports for well-known UDP protocol


Fig. 4:	The No. of distinct IP

Table 2 lists all such applications found, together with their well-known ports. Excluding packets using ports presented in Table 2, over 80% of the remaining packets belong to KAD.

The pattern of ip diversity: Due to the staleness of KAD nodes with the passage of time, each node has to maintain many distinct nodes to avoid the negative effect of unpredictable nodes that will be offline at any time. Furthermore, in order to get a better response time, they also launch parallel RPCs onto remote nodes to query. Both of them make each node connect as many nodes as possible meanwhile, which might be similar to a worm. Figure 4 shows a measurement result on ip diversity when the behaviors happen and the number of distinct ip is over 50. So, the number of distinct ip is a good heuristics to filter noise traffic.

The pattern of short-session: Routing message that is used for exchanging routing information among KAD nodes is a kind of control message in nature. A normal routing interaction is usually composed of a query and its corresponding reply between one pair of nodes, which is called as short-session pattern in this study.


Algorithm 2:	Improved behavior pattern identification algorithm

Different from routing function, data communication transfers file or media data with the maximum packet size. As long as data connection is set up, the volume of traffic will be large and lasting, which means a lot of packets will be sent in both directions. The difference gives us a clue to pick out routing messages from data ones.

According to our new design, current NodesTable and OppositeList structures are not sufficient. A KnownPorts lists all well-known ports to match and filter the unnecessary packets. Every node in the OppositeList needs to record the number of packets communicated with the node that owns it in the NodesTable. The final algorithm is illustrated in Algorithm 2.

Table 3:	Experiment results of Algorithm 2

We continue our experiments in the same network environment. In Algorithm 2, two more parameters, IP_DIVERSITY and SHORT_SESSION_LIMIT, are separately 20 and 4. The corresponding true positives (TP) and false positives (FP) show in Table 3.

The improved algorithm shows us a better performance on the accuracy and correctness in both behaviors in Fig. 5 and 6. Particularly, routing behavior has an obvious promotion with the help of ip diversity pattern and short-session pattern, which means we can succeed in filtering the communication on well-known ports and data communication.


Fig. 5:	The true positives comparison of Algorithm 1 and 2


Fig. 6:	The false positives comparison of Algorithm 1 and 2

CONCLUSIONS

In this study, a novel approach is proposed for the identification of P2P applications, whose traffic is a significant and growing component of Internet traffic. We present a systematic methodology that counts on the behavior patterns of distributed hash tables embedded to identify P2P hosts indirectly. Especially our algorithm profiles the behavior patterns only based on the characteristics of UDP packets when a DHT node is running. The diverse pattern and short-session pattern are also found to improve the performance of basic algorithm in the communication. Compared to previous work, the method neither needs the payload signatures of different protocols, nor uses the indirect patterns extracted from complex machine learning algorithms. An important feature of our algorithm is the advantage to identify unknown P2P protocols and deal with a silent client.

The experiment results show that our work is able to effectively discover more than 90% of P2P hosts and the number of false positives is bound to less than 5% on average, which uses a very few of resources. Though sometimes the algorithm needs more time to detect the behavior patterns in the worse situation, it indeed provides a lightweight way to identify P2P traffic.

We have considered a number of future extensions to strengthen our algorithm. First, we wish to cut down detection time by using signal processing methods. Second, we will apply our algorithm to other P2P applications, such as Gnutella or KaZaa, where some new patterns might be developed sequently.

ACKNOWLEDGMENTS

This study is partially supported by the National Grand Fundamental Research 973 Program of China under Grant No. G2007CB311101; the National Natural Science Foundation of China under Grant No. 2009AA01Z437, National Natural Science Foundation od China (60703014) and programe for New Century Excellent Talents in University (NCET-07-0245). The authors also gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved the presentation.

REFERENCES

Jiang, X., S. Jiang and T. Peng, 2009. A multi-channel multimedia content distribution strategy using multiple description coding. Inform. Technol. J., 8: 1084-1093.
CrossRef Direct Link

Peng, T. and Q. Zheng, 2010. Resource occupation of peer-to-peer multicasting. Inform. Technol. J., 9: 438-445.
CrossRef Direct Link

Piatek, M., H.V. Madhyastha, J.P. John, A. Krishnamurthy and T. Anderson, 2009. Pitfalls for ISP-Friendly P2P Design. HotNets, New York City

Marcel, D., M. Alan, H. Andreas and P.G. Krishna, 2008. Detecting bittorrent blocking. Proceedings of the 8th ACM SIGCOMM Conference on Internet Measurement, Oct. 20-22, ACM., Vouliagmeni, Greece, pp: 3-8.

Saroiu, S., K.P. Gummadi, R.J. Dunn, S.D. Gribble and H.M. Levy, 2002. An analysis of internet content delivery systems. Proceedings of the 5th Symposium on Operating Systems Design and Implementation, (OSDI'02), ACM, New York, USA., pp: 315-328.

Sen, S. and J. Wang, 2004. Analyzing peer-to-peer traffic across large networks. IEEE/ACM Trans. Network, 12: 219-232.
Direct Link

Wang, X. and X. Sun, 2009. Fair blind signature based authentication for super peer P2P network. Inform. Technol. J., 8: 887-894.
CrossRef

Sen, S., O. Spatscheck and D. Wang, 2004. Accurate, scalable in-network identification of P2P traffic using application signatures. Proceedings of the 13th International Conference on World Wide Web, May 17-20, 2004, New York, USA., pp: 512-521.

Gummadi, K.P., R.J. Dunn, S. Saroiu, S.D. Gribble, H.M. Levy and J. Zahorjan, 2003. Measurement, modeling and analysis of a peer-to-peer file-sharing workload. ACM SIGOPS Operat. Syst. Rev., 37: 314-329.
Direct Link

Erman, J., M. Arlitt and A. Mahanti, 2006. Traffic classification using clustering algorithms. Proceedings of the SIGCOMM Workshop on Mining Network Data, September 11-15, 2006, Pisa, Italy, pp: 281-286.

Karagiannis, T., A. Broido, M. Faloutsos and K. Claffy, 2004. Transport layer identification of P2P traffic. Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement, Oct. 25-27, ACM Press, Taormina, Sicily, Italy, pp: 121-134.

Karagiannis, T., K. Papagiannaki and M. Faloutsos, 2005. BLINC: Multilevel traffic classification in the dark. Proceedings of the Conference on Applications, Technologies, Architectures and Protocols for Computer Communications, August 22-26, 2005, Philadelphia, Pennsylvania, USA., pp: 229-240.

Maymounkov, P. and D. Mazieres, 2002. Kademlia: A peer-to-peer information system based on the XOR metric. Lecture Notes Comput. Sci., 2429: 53-65.
CrossRef Direct Link

HOME JOURNALS CONTACT

Information Technology Journal

Year: 2011 | Volume: 10 | Issue: 3 | Page No.: 565-572 DOI: 10.3923/itj.2011.565.572

Identifying P2P Application with DHT Behaviors

Lin Ye, Hongli Zhang and Qiang Dai

How to cite this article

Lin Ye, Hongli Zhang and Qiang Dai, 2011. Identifying P2P Application with DHT Behaviors. Information Technology Journal, 10: 565-572.

Keywords: P2P, DHT, behavior, identification and pattern

REFERENCES

Year: 2011 | Volume: 10 | Issue: 3 | Page No.: 565-572
DOI: 10.3923/itj.2011.565.572