Structural Query Optimization in Native XML Databases: A Hybrid Approach
As XML (eXtensible Mark-up Language) is gaining its popularity in data exchange over the Web, querying XML data has become an important issue to be addressed. In native XML databases (NXD), XML documents are usually modeled as trees and XML queries are typically specified in path expression. The primitive structural relationships are Parent-Child (P-C), Ancestor-Descendant (A-D), sibling and ordered query. Thus, a suitable and compact labeling scheme is crucial to identify these relationships and henceforth to process the query efficiently. We propose a novel labeling scheme consisting of < self-level:parent> to support all these relationships efficiently. Besides, we adopt the decomposition-matching-merging approach for structural query processing and propose a hybrid query optimization technique, TwigINLAB to process and optimize the twig query evaluation. Experimental results indicate that TwigINLAB can process all types of XML queries 15% better than the TwigStack algorithm in terms of execution time in most test cases.
eXtensible Mark-up Language (XML) is primarily used as a data exchange format. However, the amount of data exchanged and transmitted often grows exponentially via the Web medium. As such, web applications such as search engines, e-commerce and e-learning portals require advanced tools for managing data. Furthermore, user demands are no longer restricted to retrieval of full-text queries but also requests for more specific data (structural queries). This drives the requirement for efficient and reliable storage and query of large-scale XML data (Haw and Rao, 2005).
This study is concerned with structural queries. There are two main approaches for structural query processing in native XML databases (NXD). The first approach is to traverse the XML database sequentially to find the matching pattern. This approach certainly poses a new challenge, because it may not meet the processing requirements under heavy access requests (Li and Moon, 2001). As a result, some researchers had complemented it with index-based techniques in order to reduce the portion to be scanned during query evaluation. In this context, many indexing techniques have been introduced to optimize the query processing speed.
The second approach is executed through a series of processes involving decomposition, matching and merging (Yao and Zhang, 2004). Firstly, a complex query pattern is decomposed into a set of basic binary structural relationship between pairs of nodes. These relationships can be either of ancestor-descendant (A-D) or parent-child (P-C) relationship. The query pattern can then be matched by matching each of the binary structural relationships against the NXD. Next, these matches are merged together to form the path solution. Another similar approach is to decompose the twig query into a set of path queries, followed by a join operation to reconstruct the matched twig pattern. However, these approaches still suffer from having large intermediate results, which may not participate in the final results. This certainly imposes serious scalability and efficiency issues.
In this study, we adopt the decomposition-matching-merging approach to decompose the twig query into a set of path queries and propose a novel hybrid query optimization technique, INLAB (combination of INdexing and LABeling techniques) which consists of createINLAB encoding and TwigINLAB algorithms. The index structures of INLAB allow us to efficiently find all elements that belong to the same parent or ancestor. The proposed labeling scheme quickly determines the A-D, P-C, sibling and ordered query relationship between elements in the NXD. Moreover, our approach can reduce the number of inspections performed on irrelevant results during the merging process and hence improves on the efficiency.
Our contribution can be summarized as follows:-
||The proposed createINLAB labeling scheme can be used for determining
the four main types of relationships: (i) A-D (ii) P-C (iii) sibling and
(iv) order-based queries efficiently.
||A query optimization algorithm, TwigINLAB to process twig
queries without traversing the whole XML tree.
XML query: Much research has been done on XML query languages. Examples of prior work on XML query languages are XML-QL, Lorel, Quilt, XPath and XQuery (Abiteboul et al., 1997; Anonymous, 2005) and XML query algebras such as XOM, XAL, YATL and Lore algebra (Frasincar et al., 2002; Zhang and Dong, 1999; Abraham et al., 2004). XML algebra typically provides a solid ground to define the semantics of a query language, to analyze its power of expression and to perform query optimizations. The Lore algebra (McHugh and Widom, 1999) uses cost-based optimization to efficiently evaluate path expressions based on an idiosyncratic set of logical and physical operators. Christophides et al. (2000) propose YATL, which introduces a set of idiosyncratic operators based on hybrid technologies of relational and object-oriented databases. XOM (Zhang and Dong, 1999) is a complete and closed algebra composed of six object operators, but do not provide any optimization support. El-Sayed et al. (2003) study the challenges related to handling order in the XML context and propose a key encoding for XML nodes to support node identity and node order. Recently, Abraham et al. (2004) propose to extend the Niagara algebra (Galanis et al., 2002; Viglas et al., 2002) by proposing an unnest-and-select operator to simplify the logical expression of query, improves processing time and to reduce the storage requirement. On the other hand, XML query languages typically provide the means to extract and manipulate data from XML documents. Although most of the query languages differ in detailed grammars and representation, they share a common feature, that is, queries usually make use of path expression for query evaluation (Yao and Zhang, 2004). Hence, we will discuss two types of queries, namely path query and twig query.
Path query defines query on one single element at a time while twig query defines
query on two or more elements. Thus, they are also known as Simple Path Expression
and Branching Path Expression, respectively. In both cases, query nodes may
be elements, attributes or texts. However, query edges for path query are either
P-C or A-D relationships, whereas query edges for twig query pattern may be
either P-C, A-D or sibling (preceding and following) relationships. In XPath
(Anonymous, 2004) notation, P-C relationship is denoted by / while
A-D relationship is denoted by //. There are two types of sibling
relationships, which also determine the ordering of the relationship; namely
preceding-sibling (denoted by preceding-sibling::*) and following-sibling (denoted
by following-sibling::*). Figure 1 shows the queries and their
corresponding path expression specified in XPath notation. For present study,
our focus is on twig query.
Indexing and labeling: As mentioned earlier, some researchers have complemented
structural queries with index-based techniques. Index structures have been introduced
to address the problem of performance degradation due to excessive traversal.
Among them are DataGuide (Goldman and Widom, 1997), T-index (Milo and Suciu,
1999), A(k)-index (Kaushik et al., 2002a), Index Fabric (Cooper et
al., 2001), APEX Index (Chung et al., 2002), D(k)-index (Chen et
al., 2003), M(k)-index (He and Yang, 2004), F&B index (Kaushik et
al., 2002b) and Structural Map (Zheng et al., 2002). DataGuide provides
general path indices that summarize all paths in the tree starting from the
root to the respective node. T-index selects paths based on specific templates,
while APEX, A(k)-index and D(k)-index select the most frequently used paths
in queries by restricting the path length to k. M(k)-index improves D(k)-index
by avoiding over-refinement of irrelevant index by targeting the data nodes
which are relevant to frequent queries only. Zheng et al. (2002) propose
two types of structural maps namely Full Structural Map and Partial Structural
Map. Full Structural Map indexes all of the path segments between any two nodes
in the guide while Partial Structural Map indexes only partial paths that are
frequently accessed. The F&B index (Kaushik et al., 2002b) uses the
Forward and Backward bi-similarity technique to answer all branching path expressions.
|| Various types of queries
However, the size of the F&B index is usually as large as the original
document. Unlike the other approaches, Index Fabric encodes its label path as
string and stores them in a balanced Patricia trie. However, all these approaches
still suffer from large index size growth.
As such, several labeling schemes have been proposed to overcome the shortcomings of structural indexing and to annotate the hierarchical relationship present in the XML tree. Among some of the labeling schemes are tree traversal order (Dietz, 1982), tree location address (Kimber, 1993), extended preorder traversal (Li and Moon, 2001), simple prefix (Cohen et al., 2002), ORDPATH (ONeil et al., 2004), prime number labeling (Wu et al., 2004) and BIRD (Weigel et al., 2005). In tree traversal order, each node is labeled with a pair of unique integer consisting of preorder and postorder traversal sequences. Thus, given any two nodes A and B, A is an ancestor of B if and only if A occurs before B in the preorder traversal of XML tree and after B in the postorder traversal. Li and Moon (2001) propose to extend Dietzs labeling scheme and to integrate with an indexing mechanism to enable efficient search by value and structure. Their labeling scheme is based on a pair of numbers <order, size>, where order is generated during the preorder traversal and size is assigned as an arbitrary integer larger than the total number of descendants for each node to accommodate future insertion gracefully. In tree location address and simple prefix, each ancestor node is a prefix of its descendant. A node id (nid) is the concatenation of the nids through the path from the root to the respective node. The concept for ORDPATH is similar to Dewey Order (Tatarinov et al., 2002), which encodes the P-C relationship by extending the parent label with a component for the child. However, ORDPATH reserves the even numbering for further node insertion. On the other hand, the BIRD labeling scheme is based on a structural summary similar to DataGuide (Goldman and Widom, 1997). BIRD labeling is compatible with document order where the nodes visited later in a pre-order traversal of the document tree have larger BIRD numbers. In addition, each node has an integer weight, which determines whether the reconstruction process is necessary. In prime number labeling, using the top-down approach, each non-leaf node will be given a unique prime number. The label of each node is the product of its parent nodes label (parent-label) and its own assigned number (self-label).
Twig query processing Navigational approach based on algebraic expression: Algebraic expressions render the possible algebraic optimization for faster evaluation in a Query Optimizer (Jagadish et al., 2001; Abraham et al., 2004; Zhang, 2006). Recently, there emerged two types of navigational approaches to speed up query processing, namely, query-driven and data-driven. In the query-driven approach such as Natix (Brantner et al., 2005), each location step in the path expression is translated into a transition of a set of nodes. Conversely, in the data-driven approach such as XNav (Josifovski et al., 2005), the query is translated into an automaton and the data tree is traversed according to the current state of the automaton.
Basically, the Natix query-processing engine translates a path query into a native XML algebraic expression. Next, each location step is translated into an Unnest-Map operator, which is then translated into a physical operator to directly retrieve the navigation in the Natix storage system. Kanne et al. (2005) optimize the Natix I/O performance by considering multiple Unnest-Map operators as a whole and schedule I/O accordingly.
XNav is a navigational processing technique based on finite state automata. This algorithm requires only one pass of the input XML data, possibly skipping some tree nodes. Therefore, it is analogous to the sequential scan operator in relational database systems. The difference is that XNav has a more complex data access pattern.
There are strengths in the algebraic approach, i.e., it facilitates iterator-based applications, support full-text and structural queries and proven performance in relational database systems (Brantner et al., 2005). However, in this study, we choose to do optimization using the decomposition-match-merge approach, because our main concern is on structural queries only as well as data streaming processing under constrained devices with limited computing and disk storage space ability. Future work may involve comparisons between the algebraic and the decomposition-match-merge approaches.
Decomposition-matching-merging approach: Twig query pattern matching typically involves decomposition-matching-merging processes with (1) decomposition of query pattern into a set of binary relationships between each pair of nodes or into a set of path queries (2) matching of each binary component of the query pattern against the NXD and (3) merging-joining matched pairs to obtain the final result.
Our INLAB approach focuses on optimizing all three sub-processes; introducing novel compact labeling scheme, optimizing the matching phase and reducing the number of inspection required in the merging phase.
In the first sub-process, most researchers build inverted indices (very popularly used in information retrieval systems) on the XML document. Text words are indexed in T-index while elements are indexed in E-index, which maps elements to inverted lists (Zhang et al., 2001; Bruno et al., 2002; Lu et al., 2004). These techniques use the labeling of (docno, begin: end, level) for an element and (docno, wordno, level) for a text word as the positional representation of XML elements and texts. However, we use <self-level: parent> as the positional representation instead.
Most literature focuses on the second sub-process: matching binary structural relationships. Zhang et al. (2001) and Al-Khalifa et al. (2002) propose MPMGJN (multi-predicate Merge Join) and Stack-Tree algorithm, respectively to match the binary structural relationship. These algorithms accept two lists of sorted individual matching nodes and structurally join pairs of nodes from both lists to produce the matching of the binary relationships. The difference between MPMGJN and Stack-Tree is that MPMGJN requires multiple scans on input lists for the matching process. In contrast, Stack-Tree algorithms are more efficient as they use stack to maintain the ancestor or parent nodes and therefore require only one time scan per input list. However, these approaches still produce large intermediate results. To address this problem, Bruno et al. (2002) propose TwigStack, a holistic twig join algorithm which uses a chain of linked stacks to compactly represent the intermediate results and subsequently join them to obtain the final results. However, this algorithm is only optimal for A-D relationships. Lu et al. (2004) extend TwigStack and propose TwigStackList, which can support both P-C and A-D relationships efficiently. Jiang et al. (2003) extended the holistic approach and propose pipelining joining multiple inverted lists at one time so that no intermediate results are generated. Besides, they also extended the holistic approach to process twig queries with an OR predicate (Jiang et al., 2004). Recently, Jiao et al. (2005) and Yu et al. (2006) propose a holistic path join algorithm for path and twig query, respectively using NOT predicates.
Another similar approach is to decompose the twig query into a set of path queries instead. Polyzotis et al. (2004) propose methods to reduce the number of intermediate results by introducing a filtration step based on some notion of synopses to facilitate query-approximate answers. They propose both TREESKETCH and TWIG-XSKETCH. Amer-Yahia et al. (2002) propose to preprocess the query patterns before the matching phase is executed. Since the efficiency of tree pattern matching depends on the size of the pattern, it is essential to identify and eliminate redundant nodes in the pattern before the matching phase takes place. On the other hand, Zezula et al. (2004) propose a novel technique, tree signature, to represent tree structures as ordered sequences of pre-order and post-order ranks of the nodes. They use tree signatures as index structure and find qualifying patterns through integration of structurally consistent path query.
Merging together the structural matches in the final process poses the problem of selecting a good join ordering. Wu et al. (2003) propose a cost-based join order selection of structural join. Kim et al. (2004) suggest partitioning all nodes in an extent into several clusters. Given two extents to be joined, Kim et al. (2004) propose filtering out unnecessary clusters in both extents prior to the joining process.
OVERVIEW OF INLAB
CreateINLAB labeling scheme: In our createINLAB labeling scheme, given an XML tree T, any label consists of <self-level:parent> representation, where (i) self is obtained by doing a pre-order traversal of the tree nodes (ii) level of a node is its distance from the root and (iii) parent is the direct node which relates to the self node.
Structural relationships between nodes can be efficiently determined from the label as follows:
P-C relationship: Node1 is the parent of node2 if and only if node1.self = node2.parent.
Sibling relationship: Node1 is the sibling of node2 if and only if node1.parent = node2.parent.
Ordered query relationship (predecessor and successor):
||Node1 is the predecessor node of node2
if and only if node1.self < node2.self.
||Node1 is the successor node of node2 if and only if node1.self > node2.self.
A-D relationship: Node1 is possible as an ancestor of node2 if and only if level difference, leveldiff = node2.level - node1.level >= 1. A multiple look-up via PCTable shown in Fig. 2b is necessary as long as the leveldiff > 1 is true to confirm the A-D relationship. Further explanation is shown in Algorithm 3.
Figure 2a and b show the fragment of streams
and index table generated based on the sample XML document during the createINLAB
Overview of TwigINLAB processing: Figure 3 shows the
overall processes involved in TwigINLAB processing. Initially, the query pattern
is being analyzed in analysisQueryPattern() function.
|| (a) Fragment of streams generated (b) fragment of index table
generated based on a sample XML document
|| Overall flow of TwigINLAB
For each query edge, if the twig is of P-C relationship, the parent and child(s)
details will be updated in the twigPC repository. Next, the partitionTwig()
function takes place. During this function, the twig pattern is decomposed into
two or more path queries. Each possible match is pushed into the stack in the
twigJoin() function. Next, these matches are merged back through the mergeTwig()
function. Finally, the final solutions are output through the outputSolution()
Some basic operations involved in these functions include operation over stack, operation over hashtable and operation over vector. Operations over a stack are empty() to examine if the stack contains no entry, pop() to remove an entry, push() to add an entry, peek() to peek on the entry at the topmost, element At (index) to retrieve the entry at position specified and size() to return the total entry in a particular stack. Operations over a hashtable includes get(key) to retrieve each value which belong to the key and put(key, value) to add an entry into the hashtable. Operation over a vector is addElement(entry) to add an entry. Function isRoot(q) and isLeaf(q) each examines whether a query node is a root or a leaf node, respectively.
The analysisQueryPattern() function: The analysisQueryPattern() function is presented in Algorithm 1. For each SAX event, if the start tag is found (lines 8-24), this function pushes the tag into tStack. At every point during computation, each entry in tStack (from bottom to top) is guaranteed to lie on a root to leaf sequence. This function also identifies each relationship between two adjacent nodes and stores it into the twigPC table as in line 18. Every entry in the twigPC contains unique parent query node information as control in line 11. However, if there are more query nodes, which belong to the same parent, they will be added as a collection of childs (line 13). If the query edge is of P-C relationship as specified by character 1, this information will be updated in the twigPC table by the function setPCRelation() as in line 32. On the other hand, for each SAX event where an end tag is found, the top entry in tStack is removed. The output of this function using the query specified in Fig. 3 is shown in Fig. 4.
The partitionTwig() function: The partitionTwig() function is depicted
in Algorithm 2. This function partitions the twig query pattern into a set of
path queries. Starting from the root of twig query pattern, for each start tag
event, it pushes the tag into twigStack as in lines 9-10. When it reaches an
end tag event (lines 11-33), it checks whether the currentNode at the top of
twigStack is a leaf node. If it is a leaf, the query node will be added one
by one to the vpathList according to the Last In First Out (LIFO) sequence of
a stack operation. Finally, if the currentNode is the root, it will firstly
be added to the vpathList and next the vpathList sequence will be output in
reverse order by the function reverse() (lines 44-50). Thus, the final output
of this function is a set of path queries in root-to-leaf order in pq hashtable.
However, if the current tag is a non-leaf node, the top entry at the top of
twigStack will be removed.
||Fragment of twigPC generated during the analysisQueryPattern()
If the SAX event is end of document, for each entry
in pq hashtable, it recursively calls the initialize() and twigJoin() functions
The next process involves finding matches for the set of path queries. This process is similar to pathINLAB() as presented in work (Haw and Rao, 2007). The difference is partitionTwig() processes a set of path queries recursively. For clarity, this function is rewritten again and depicted in Algorithm 3.
The twigJoin() function: Let q denote the query node in the path query
and e denote the positional representation of occurrence in the encoded data
streams, Tq. Function getRoot() is self-explaining; returning the root of path
Associated with each q in a path query is Tq, which contains occurrences,
eq. Each eq in the stream is sorted by their self attributes. We use Cq to point
to the current eq in Tq. Initially, Cq points to the first occurrence. Function
eof(Cq) tests whether Cq is at the end of Tq. To proceed to the next occurrence,
we use the advance(Cq) function. In addition, we also associate a stack Sq for
each query node. Each entry in the stack consists of eq that matches to the
In the twigJoin() algorithm (Algorithm 3a), it repeatedly calls the getNext()
function to get the next q to process as in line 5.
At lines 6-9, partial answers
from the stacks that cannot be extended to final answers are removed-in the
procedure cleanParentStack() and cleanSelfADStack() -given the knowledge of
the next eq to be processed. Procedure cleanParentStack() is invoked if the
returned node to be processed is other than the root (a root does not have any
parent). During this function call, if the parent stack is not empty, the entry
at the topmost in the parent stack is compared with the current node and will
be removed if it does not participate in the partial solution. However, if the
returned node to be processed is the root or its parent stack is not empty,
procedure cleanSelfADStack() is invoked. During this function call, if the stack
of the current node is not empty, the topmost entry will be compared to the
current node and will be removed if it does not participate in the partial solution.
Each potential eq, which may fulfill the matching criteria, is pushed into the
stack by the procedure moveToStack() for further processing. If qString is a
leaf node, the solution should be output as in lines 11-12. Note that path solutions
should be output in root-leaf order so that they can be easily merged together
to form final path matches (line 19). Once the eq has been processed, lines
13-15 remove the eq from the stack and advance to the next eq.
In the getNext() function (Algorithm 3b), if q is a leaf query node (checked
by procedure isLeaf()), the function directly returns to output the solution
(line 24). In line 26, we recursively invoke the getNext() function until it
is terminated by either line 24 or 27. Path query has only one child per node,
thus the procedure getChild(q) returns the immediate children of node q. In
line 27, if any returned node n is not equal to child of q, we immediately return
n. Line 28 skips nodes that do not contribute to results, by checking whether
the two nodes are of A-D relationship.
During this process (line 44 of the checkAncestor() function), the index table,
PCTable (storing P-C relationship) is hashed to retrieve each query nodes
parent for comparison. Procedures getSelf() and getLevel() return the self and
level attributes of the current eq. Lines 32-33 return the next eq to be processed.
The mergeTwig() function: In the mergeTwig() function (Algorithm 4),
all partial solutions from the twigJoin() function are merged together to generate
the final solutions. This function begins by comparing each entry in the partial
solutions of two path queries at a time. All the occurrences in the partial
solutions are in sorted order of their self-attributes. If each entry first
node is equal (line 12), or if the query edge is of P-C relationship and the
second query node is of sibling and predecessor relationship, the partial solution
will be added to the final solutions as in lines 13-18. For query edge with
A-D relationship, if the second query node is a predecessor, it will be added
as a final solution. In both cases, the inner loop begins the iteration from
the current j position. Hence, this function skips the unnecessary iteration
of non-feasible partial solutions.
|| The merging process scenario
However, if the first node in the second path query is greater than node1,
the next inner loop will begin from position j-1 (for cases where, j > 0).
Figure 5 shows the merging process.
Complexity analysis of twiginlab: Here, we discuss the correctness of the TwigINLAB algorithm and analyze its complexity.
Definition 1: Let Q be the twig query. For each node q in Q, we say that the cursor, Cq initially points to the first occurrence of element q, eq0, which is also known as head element in the stream Tq.
Definition 2: After splitting Q into several path queries, pqi, each q (except the leaf node) in pqi has only one child, tempq and has the same root.
Definition 3: We say that a node q has the descendant extension if the child, tempq has an element occurrence etempni, which is a descendant of eqi.
Lemma 1: Suppose that for an arbitrary node q in the path query, we have getNext(q) = q. Then, the following properties hold:
||q has the descendant extension
||either (a) q = q or (b) parent(q) does not have the
descendant extension because of q.
Proof: For each path query, solution is generated by function twigJoin()
which recursively calls the getNext() function to get the next q to be processed.
In the getNext() function, if q is a leaf node, it verifies all properties,
so we return it in line 24. Otherwise, we get n=getNext (tempq), where tempq
is the direct child of q in line 25. If we have that n ≠ tempq, n verifies
properties (1) and 2(b) with respect to tempq, so we return n in line 27. Next,
we check if q and n have any A-D relationship. If they are not in A-D relationship
and if q self attribute is larger than n self attribute, we return n, else we
advance q to the next occurrence, eqi.
If q and n are of A-D relationship
and the q self attribute is smaller than n self attribute, it satisfies properties
(1) and 2(a), thus we return it in line 33.
Lemma 2: Suppose getNext(q) = q returns a query node q where, q ≠ q in line 27 of algorithm 3b. If the stack is empty, then the head element does not participate in any partial solution.
Proof: If node q is other than root, procedure cleanParentStack() will be invoked as in line 7 algorithm 3a. However, this function does nothing because the parent stack is empty. Thus, it directly proceeds to line 17 to advance to the next eq. Hence, the head element does not participate in any partial solution.
Lemma 3: All node occurrences, eqi are processed based on the order of their self attributes. So, the cursor, Cq always moves forward. Consequently, each eqi is accessed once only.
Proof: In Algorithm 3b, getNext(q) always returns the node with the
smallest self attribute as shown in line 29 (if the two nodes are not in A-D
relationship, but the node n self attribute is less than node q) and line 32
(if the two nodes are in A-D relationship but the node n self attribute is less
than node q self attribute).
Lemma 4: Suppose that, after the partitionTwig() function, there are m number of path queries, pq1, pq2,
, pqm,. Function mergeTwig() merges two paths at a time to form the final solution and skip the non-potential path.
Proof: Algorithm mergeTwig() check for potential merge-able path recursively two paths at a time as in line 5. Each partial solution of the first path is compared to each partial solution in the next path. If the first node in each path is the same (based on definition 2), the paths are merge-able and added to the solution as shown in lines 13-24, else the function skips the unnecessary iteration of non-feasible partial solutions as shown in lines 25-30.
Theorem 1: Given a twig query, Q and an XML document, D, algorithm TwigINLAB correctly returns all answers for Q on D.
Proof: Based on Lemma 2, we know that when getNext() returns a query node q in line 27 of getNext(), if the stack Sparent(q) is empty, then the head element, eq0 does not participate in any partial solution. Thus, any element in the ancestor of q that use eq in the descendant extension is returned by getNext() before eq. By using Lemma 1 and 2, for each node q in the query, the element involved in the root-to-leaf path solution is in Stack Sq. If getNext(q) is a leaf node, we output all path solutions that use eq. Finally, using Lemma 4, the path solutions are merged into a final solution.
Theorem 2: The complexity of our TwigINLAB algorithm is O (|Q| |I|), where, |I| and |Q| is the size of the XML stream and query, respectively.
Experimental setup: We have implemented INLAB using Java API for XML
Processing (JAXP). We run the experiment on all types of queries as:
||Path query with single type relationship
||Path query with mixed types relationship
||Twig query pattern with single type relationship
||Twig query pattern with mixed types relationship
Our experimental tests are divided into six main test cases as:
||Measuring the execution time of Q(A), Q(B), Q(C) and Q(D)
||Comparing the execution time of Q(C) on TwigINLAB and TwigStack
||Comparing the execution time of Q(D) on TwigINLAB and TwigStack
||Measuring the execution time of Q(A) with increasing path length on TwigINLAB
||Measuring the execution time of Q(C) with increasing number of branches
on TwigINLAB and TwigStack
All our experiments were performed on 1.7 GHz Pentium IV processor with 512 MB SDRAM on Windows XP. In all test cases, we use the Treebank dataset (84 MB) obtained from the University of Washington XML repository (Anonymous, 2002). The treebank dataset is a skew structured tree. Table 1 shows the experimental setup environment.
|| Experimental setup environment
|| Results for T1
Performance results: Figure 6 shows the execution time of all types of queries on TwigINLAB. From the results obtained, queries with single type of relationship perform better as compared with queries with mixed types of relationship. Besides, queries with edges of A-D relationship are slower due to the extra time needed to determine whether the two nodes is in A-D relationship by multiple lookups on the index table until the ancestor level is reached.
Figure 7 and 8 show the execution time
of several queries with single type and mixed types of relationship over TwigINLAB
and TwigStack, respectively. From Fig. 7 and 8,
we draw several observations and conclusions:-
||When the twig query contains only a single type of relationship,
TwigINLAB performs about 15% on average better as compared to TwigStack.
||TwigINLAB performs about 3% on average better as compared
to TwigStack on mixed types of relationship.
|| Results for T2
|| Results for T3
|| Results for T4
||When the twig query contains only P-C edges, TwigINLAB performs
around 18% better as compared to TwigStack (shown in Q1 and Q2 in Fig.
7). This may be due to the createINLAB labeling scheme which is optimal
to support P-C relationships.
||When edges below branching nodes contain only P-C relationships,
TwigINLAB performs better about 7% as compared to TwigStack (shown in Q1
and Q4 in Fig. 8).
||When edges below branching nodes contain mixed relationships
(as shown in Q2 and Q3 Fig. 8), the performance of TwigINLAB
is comparable to TwigStack.
|| Results for T5
In Fig. 9, by using TwigINLAB processing, the execution time increases linearly (only slightly) with the increment of path length. This result shows that TwigINLAB is more scalable as compared to TwigStack in terms of execution time. As such, TwigINLAB can support query on large-scale datasets efficiently. In Fig. 10, by using TwigINLAB processing, the execution time increases slightly with the increment of number of branches. Thus, TwigINLAB performs better in terms of scalability as compared to TwigStack.
In this study, we have presented a hybrid query optimization technique, INLAB, comprising the createINLAB to create INLAB encoding and the TwigINLAB algorithm to optimize the twig query pattern matching process. Most of the labeling schemes are affected by the fan-out of the tree. Compared to existing labeling schemes, the size for createINLAB labeling is fixed-length labels resulting in better storage utilization. In addition, using the createINLAB labeling scheme, structural relationships can be determined easily. Moreover, the proposed labeling scheme is integer based. Integer processing is very efficient compared to that of string or bit-vector. Using the createINLAB indexing, elements that belong to the same parent or ancestor can be determined easily by multiple lookups to the index table. Another advantage to createINLAB indexing is that the index size grows linearly with the XML tree.
Experimental results show that, in terms of execution time, on average, TwigINLAB
performs about 15% better compared to TwigStack on a skew structured tree such
as the Treebank dataset. Besides, TwigINLAB outperforms TwigStack if (i) query
has a single type of relationship only and (ii) the edges below branching nodes
contain only P-C relationships. For edges with A-D relationships below the branching
nodes, the performance of TwigINLAB is comparable to TwigStack. In addition,
TwigINLAB is more scalable compared to TwigStack in terms of execution time.
As such, TwigINLAB supports large-scale query of datasets efficiently.
The study can be further extended to compare the performance of TwigINLAB and TwigStack using a flat-structured tree. Hypothetically, the performance of TwigINLAB is expected to be better as it requires less number of lookups on the index table as compared to a skew-structured tree.
1: Abiteboul, S., D. Quass, J. McHugh, J. Widom and J.L. Wiener, 1997. The Lorel query language for semistructured data. Int. J. Dig. Libraries, 1: 68-88.
2: Abraham, J., N.S. Chaudhari and E.C. Prakash, 2004. XML query algebra operators and strategies for their implementation. Proceedings of the IEEE TENCON Conference, November 21-24, 2004, Thailand pp: 286-289.
3: Chien, S.Y., Z. Vagena and D. Zhang, 2002. Structural joins: A primitive for efficient XML query pattern matching. Proceedings of the International Conference on Data Engineering, February 26-March 1, 2002, San Jose, USA., pp: 141-152.
4: Amer-Yahia, S., S. Cho, L.V.S. Lakshmanan and D. Srivastava, 2002. Tree pattern query minimization. VLDB J., 11: 315-331.
5: Brantner, M., S. Helmer, C.C. Kanne and G. Moerkotte, 2005. Full-fledged algebraic XPath processing in natix. Proceedings of the 21st Conference on Data Engineering, April 5-8, 2005, IEEE Xplore, pp: 705-716.
6: Bruno, N., D. Srivastava and N. Koudas, 2002. Holistic twig joins: Optimal XML pattern matching. Proceedings of the International Conference on Management of Data, June 3-6, 2002, Madison, Wisconsin, pp: 310-321.
7: Chen, Q., A. Lim, K. Ong and J. Tang, 2003. D(k)-index: An adaptive structural summary for graph-structured data. Proceedings of the International Conference on Management of Data, June 9-12, 2003, San Diego, California, pp: 134-144.
8: Christophides, V., S. Cluet and J. Simeon, 2000. On wrapping query languages and efficient XML integration. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 15-18, 2000, Texas, USA., pp: 141-152.
9: Chung, CW., J.K. Min and K. Shim, 2002. APEX: An adaptive path index for XML data. Procedings of the ACM SIGMOD International Conference on Management of Data, June 3-6, 2002, Madison, Wisconsin, pp: 121-132.
10: Cohen, E., H. Kaplan and T. Milo, 2002. Labeling dynamic XML trees. Proceedings of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, June 3-5, 2002, Madison, Wisconsin, pp: 271-281.
11: Cooper, B.F., N. Sample, M.J. Franklin, G.R. Hjaltason and M. Shadmon, 2001. A fast index for semistructured data. Proceedings of the 27th International Conference on Very Large Data Bases, September 11-14, 2001, IEEE Xplore, London, pp: 341-350.
12: Dietz, P.F., 1982. Maintaining order in a linked list. Acta Inform., 21: 122-127.
Direct Link |
13: El-Sayed, M., K. Dimitrova and E.A. Rundensteiner, 2003. Efficiently supporting order in XML query processing. Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, June 9-12, 2003, San Diego, California, pp: 147-154.
14: Frasincar, F., G. Houben and C. Pau, 2002. XAL: An algebra for XML query optimization. Aust. Comput. Sci. Commun., 24: 49-56.
Direct Link |
15: Galanis, L., E. Viglas, D.J. DeWitt, J.F. Naughton and D. Maier, 2002. Following the paths of XML Data: An algebraic framework for XML query evaluation. Technical Report, University of Wisconsin.
16: Goldman, R. and J. Widom, 1997. Data guides: Enabling query formulation and optimization in semistructured databases. Proceedings of the International Conference Very Large Data Bases, August 26-29, 1997, Athens, Greece, pp: 436-445.
17: Haw, S.C. and G.S.V.R.K. Rao, 2005. Query optimization techniques for XML databases. Int. J. Inform. Technol., 2: 97-104.
Direct Link |
18: Haw, S.C. and G.S.V.R.K. Rao, 2007. An efficient path query processing support for parent-child relationship in native XML databases. J. Digital Inform. Manage., 2: 82-87.
19: He, H. and J. Yang, 2004. Multiresolution indexing of XMl for frequent queries. Proceedings of the 20th International Conference on Data Engineering, March 30-April 2, 2004, IEEE Xplore, London, pp: 683-694.
20: Jagadish, H., L. Lakshmanan, D. Srivastava and K. Thompson, 2001. Tax: A tree algebra for XML. Proceedings of the International Conference on Database Programming Language, September 8-10, 2001, IEEE Xplore, pp: 149-164.
21: Jiang, H., W. Wang, H. Lu and J.X. Yu, 2003. Holistic twig joins on indexed XML documents. Proceedings of the 29th International Conference on Very Large Data Bases, September 9-12, 2003, Berlin, Germany, pp: 273-284.
22: Jiang, H., H. Lu and W. Wang, 2004. Efficient processing of twig queries with OR-predicates. Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, June 13-18, 2004, Paris, France, pp: 59-70.
23: Jiao, E., T.W. Ling, C.Y. Chan and S. Yu, 2005. PathStack: A holistic path join algorithm for path query with not-predicates on XML data. Proceedings of the 10th International Conference on Database Systems for Advanced Application, April 17-20, 2005, Beijing, China, pp: 113-124.
24: Josifovski, V., M. Fontoura and A. Barta, 2005. Querying XML streams. VLDB J., 14: 197-210.
Direct Link |
25: Kanne, C.C., M. Brantner and G. Moerkotte, 2005. Cost sensitive reordering of navigational primitives. Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, June 14-16, 2005, Baltimore, Maryland, pp: 742-753.
26: Kaushik, R., D. Shenoy, P. Bohannon and E. Gudes, 2002. Exploiting local similarity to efficiently ondex paths in graph-structured data. Proceedings of the 18th International Conference Data Enreering, February 26-March 1, 2002, San Jose, CA, USA., pp: 129-140.
27: Kaushik, R., P. Bohannon, J.F. Naughton and H.F. Korth, 2002. Covering indexes for branching path queries. Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, June 3-6, 2002, Madison, Wisconsin, pp: 133-144.
28: Kim, J., S.H. Lee and H-J. Kim, 2004. Efficient structural joins with clusters extents. Inform. Proc. Lett., 91: 69-75.
Direct Link |
29: Kimber, W.E., 1993. HyTime and SGML: Understanding the HyTime HYQ query language. Technical Report Version 1.1, IBM Corporation.
30: Li, Q. and B. Moon, 2001. Indexing and querying XML data for regular path expressions. Proceedings of the 27th International Conference on Very Large Data Bases, September 11-14, 2001, IEEE Xplore, London, pp: 361-370.
31: Lu, J., T. Chen and T.W. Ling, 2004. Efficient processing of XML twig patterns with parent child edges: A look-ahead approach. Proceedings of the 13th ACM International Conference on Information and Knowledge Management, November 8-13, 2004, Washington, DC., USA., pp: 533-542.
32: McHugh, J. and J. Widom, 1999. Query optimization for XML. Proceedings of the 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, pp: 315-326.
33: Milo, T. and D. Suciu, 1999. Index structures for path expression. Proceedings of the 7th International Conference on Database Theory, January 10-12, 1999, IEEE Xplore, London, pp: 277-295.
34: O'Neil, P., E. O'Neil, S. Pal, I. Cseri, G. Schaller and N. Westbury, 2004. ORDPATHS: Insert-friendly XML node labels. Proceedings of the 2004 ACM SIGMOD international conference on Management of Data, June 13-18, 2004, Paris, France, pp: 903-908.
35: Polyzotis, N., M. Garofalakis and Y. Ioannidis, 2004. Approximate XML query answers. Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, June 13-18, 2004, Paris, France, pp: 263-274.
36: Tatarinov, I., S. Viglas, K.S. Beyer, J. Shanmugasundaram, E.J. Shekita and C. Zhang, 2002. Storing and querying ordered XML using a relational database system. Proceedings of the International Conference on Management of Data, June 3-6, 2002, Madison, Wisconsin, pp: 204-215.
37: Viglas, S.D., L. Galanis, D.J. DeWitt, D. Maier and J.F. Naughton, 2002. Putting XML query algebras into context. Technical Report, University of Wisconsin.
38: Weigel, F., K.U. Schulz and H. Meuss, 2005. The BIRD numbering scheme for XML and tree databases- deciding and reconstructing tree relations using efficient arithmetic operations. Proceedings of the International XML Database Symposium, August 28-29, 2005, Trondheim, pp: 49-67.
39: Wu, X., M.L. Lee and W. Hsu, 2004. A prime number labeling scheme for dynamic ordered XML tree. Proceedings of the 20th International Conference on Data Engineering, March 30-April 2, 2004, IEEE Xplore, London, pp: 66-78.
40: Wu, Y., J.M. Patel and H. V.Jagadish, 2003. Structural join order selection for XML query optimization. Proceedings of the International Conference on Data Engineering, March 5-8, 2003, Michigan University, MI, USA., pp: 443-454.
41: Yao, J.T. and M. Zhang, 2004. A fast tree pattern matching algorithm for XML query. Proceeding of the International Conference Web Intelligence, September 20-24, 2004, IEEE Xplore, London, pp: 235-241.
42: Yu, T., T.W. Ling and J. Lu, 2006. TwigStackList: A holistic twig join algorithm for twig query with not-predicates on XML data. Proceedings of the 11th International conference on Database Systems for Advanced Applications, April 12-15, 2006, Springer-Verlag, London, pp: 249-263.
43: Zezula, P., F. Mandreoli and R. Martoglia, 2004. Tree signatures and unordered XML pattern matching. Proceedings of the 30th Conference on Current Trends in Theory and Practice of Computer Science, January 24-30, 2004, IEEE Xplore, pp: 122-139.
44: Zhang, D. and Y. Dong, 1999. A data model and algebra for the web. Proceedings of the Workshop on Database and Expert System Application, September 3, 1999, IEEE Xplore, pp: 711-714.
45: Zhang, C., J. Naughton, D. DeWitty, Q. Luo and G. Lohman, 2001. On supporting containment queries in relational database management systems. Proc. ACM SIGMOD, 30: 425-436.
Direct Link |
46: Zhang, N., 2006. Query processing and optimization in native XML databases. Technical Report, University of Waterloo.
47: Zheng, S., A. Zhou, J.X. Yu, L. Zhang and H. Tao, 2002. Structural map: A new index for efficient XML path expression processing. Proceedings of the 3rd International Conference on Advances in Web-Age Information Management, August 11-13, 2002, IEEE Xplore, pp: 25-36.