Abstract: Data sorting is an intriguing problem that has attracted some of the most intense research efforts in computer science, both for its theoretical importance and for its use in many applications. Quicksort is widely considered one of the most efficient sorting techniques, and its performance depends on an appropriate pivot selection technique. In this study, a sequential quicksort was implemented using six different pivot selection schemes with the aim of minimizing the execution time of the quicksort algorithm. The schemes were tested together on integer array data. From the results obtained, the median-of-five without random index selection scheme reduces the execution time of the quicksort algorithm by about 23-35% compared with the other techniques. The results also indicate that there is an overhead associated with random index selection, which can affect the performance of a particular method. Thus, this factor needs to be considered in selecting an optimal pivot selection scheme.
INTRODUCTION
Sorting is the use of a computer to put data in order. It is one of the most intensively studied problems in computer science. It continues to consume a sizable fraction of time on mainframe and personal computers, accounting for roughly one quarter to one half of their cycles (Brest et al., 2000). Sorting remains an interesting theoretical and difficult practical problem, and many sorting algorithms have been proposed (Sedgewick, 1977; Hoare, 1961).
Although there is no internal sorting method that is best for every situation, the quicksort algorithm introduced by Hoare (1961, 1962) is widely accepted as the most efficient internal sorting technique. Quicksort sorts a list of keys A[1], A[2], ..., A[m] recursively by choosing a key p in the list as a pivot key around which to rearrange the other keys. Ideally, the pivot key is near the median key value in the list, so that it is preceded by about half of the keys and followed by the other half. The keys are rearranged such that for some n, A[1], A[2], ..., A[n] contain all the keys with values less than p and A[n+1], A[n+2], ..., A[m] contain all the keys with values greater than or equal to p. The elements A[1], ..., A[n] are called the left sublist and the elements A[n+1], ..., A[m] the right sublist. Thus, the original list is partitioned into two sublists where all the keys of the left sublist precede all the keys of the right sublist. After partitioning, the original problem of sorting the entire list is reduced to sorting the left and right sublists independently. Quicksort is then applied recursively to each sublist until a sublist consists of just a single item. Not only is this algorithm simpler than many other sorting algorithms, but empirical (Sedgewick, 1977; van Emden, 1970) and analytical (Knuth, 2005) studies show that quicksort can be expected to be up to twice as fast as its nearest competitors, with an expected time complexity of O(n log n) and a worst case of O(n²).
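The recursive scheme just described can be sketched in C++ as follows (a minimal illustration with identifiers of our own choosing, using the first key as pivot; this is not the authors' code):

```cpp
#include <utility>

// Sort A[lo..hi] (inclusive bounds) by quicksort, using the first key
// A[lo] as the pivot p. After the partition loop, A[lo..n-1] holds the
// keys less than p and A[n+1..hi] the keys greater than or equal to p.
void quicksort(int A[], int lo, int hi) {
    if (lo >= hi) return;                 // a sublist of one key is sorted
    int p = A[lo];                        // pivot key
    int n = lo;                           // end of the left sublist
    for (int i = lo + 1; i <= hi; ++i)
        if (A[i] < p)
            std::swap(A[++n], A[i]);      // grow the left sublist
    std::swap(A[lo], A[n]);               // place the pivot between sublists
    quicksort(A, lo, n - 1);              // sort the left sublist
    quicksort(A, n + 1, hi);              // sort the right sublist
}
```

The pivot selection schemes discussed below differ only in how p is chosen; the partition and recursion stay the same.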
There is a continuous demand for greater computational speed from computer systems than is currently possible (Moh et al., 1999). Areas requiring great computational speed include numerical modelling and simulation of scientific and engineering problems. Such problems often need huge repetitive calculations on large amounts of data to give valid results. Thus, it is appropriate to study the effect of pivot selection schemes on the performance of the quicksort algorithm.
Overview of pivot selection techniques: There are several ways to make the worst case for a quicksort algorithm very unlikely in practical situations. Instead of using the first element in the list as the partitioning element, one may use some other fixed element, like the middle element. This helps some, but simple anomalies can still occur. Using a random partitioning element will virtually prevent the anomalous cases from happening in practical sorting situations, but random number generation can be relatively expensive and does not reduce the average running time of the rest of the algorithm (Singleton, 1969).
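The random choice described above amounts to no more than the following helper; the call to rand() on every partition is the cost that makes random selection "relatively expensive" (a sketch of ours, not code from the cited work):

```cpp
#include <cstdlib>

// Pick a pivot index uniformly at random in [lo, hi]. The rand() call
// is the per-partition overhead referred to in the text.
int random_pivot_index(int lo, int hi) {
    return lo + std::rand() % (hi - lo + 1);
}
```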
Median-of-three method: The best choice of pivot would be the median of the array; unfortunately, this is hard to calculate and would slow down quicksort considerably. The median-of-three modification (Hoare, 1962) will actually improve the average performance of the algorithm while at the same time making the worst case unlikely to occur in practice (Weiss, 2000).
In this method, samples of size three are used at each stage. Primarily, sampling provides insurance that the partitioning elements don't consistently fall near the ends of the sublists. To make the worst case unlikely, the method uses the first, middle and last elements as samples and the median of those three as the partitioning element. Using this method clearly eliminates the bad case for sorted input and actually reduces the running time of quicksort by about 5% (Hoare, 1962; Singleton, 1969).
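A minimal sketch of the median-of-three selection, assuming the sample positions named above (first, middle, last):

```cpp
#include <utility>

// Return the median of the first, middle and last keys of A[lo..hi]
// (median-of-three pivot selection).
int median_of_three(const int A[], int lo, int hi) {
    int mid = lo + (hi - lo) / 2;
    int a = A[lo], b = A[mid], c = A[hi];
    if (a > b) std::swap(a, b);           // order the three samples
    if (b > c) std::swap(b, c);           // c is now the largest
    if (a > b) std::swap(a, b);           // b is now the median
    return b;
}
```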
Median-of-five with random index selection method: As a larger sample size gives a better estimate of the median, many researchers have attempted to improve the running time of quicksort by proposing different pivot selection techniques. Thus, a median-of-five method was used in Brest et al. (2000) and Cerin (2002). The method uses a sample size of five elements, i.e., the first, middle, last and two other elements randomly picked through a random number generation function between the first and last elements. This technique gives better load balancing and reduces the execution time of quicksort by more than 5%, but there is an overhead associated with random number generation. A snapshot of this method is presented in Fig. 1.
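Based on this description, the median51 selection might be sketched as below; the specific random function and the exact interior range are our assumptions:

```cpp
#include <algorithm>
#include <cstdlib>

// Median-of-five with random index selection (median51): sample the
// first, middle and last keys plus two randomly chosen interior keys,
// then return the median of the five. Assumes hi - lo >= 2.
int median51(const int A[], int lo, int hi) {
    int mid = lo + (hi - lo) / 2;
    int r1 = lo + 1 + std::rand() % (hi - lo - 1);  // random interior index
    int r2 = lo + 1 + std::rand() % (hi - lo - 1);  // random interior index
    int v[5] = { A[lo], A[mid], A[hi], A[r1], A[r2] };
    std::sort(v, v + 5);                  // sort the five samples
    return v[2];                          // their median becomes the pivot
}
```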
Median-of-five without random index selection method: In Mohammed and Othman (2004), a new pivot selection scheme for the quicksort algorithm was proposed which does not involve any use of random index selection. This method also uses a sample size of five elements, as in the median-of-five with random index selection scheme. The five elements are selected as follows: first, middle, last and the other two are at position
Fig. 1: A snapshot of the median51 scheme
Fig. 2: A snapshot of the median50 scheme
It then returns the middle element after sorting the sample array. That middle element is the median of the five samples and is used as the partitioning element; thus, the scheme eliminates the use of random index selection.
A snapshot of the new pivot selection scheme is presented in Fig. 2. All five elements are selected explicitly, as shown in the snapshot: the five elements are picked from the unsorted array and sorted with a simple sort, and at the end their median, i.e., V[2], is returned and used as the pivot element.
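A hedged sketch of the median50 selection follows. The two interior sample positions are truncated in our copy of the text, so the quarter points used here are purely illustrative and may differ from the positions actually used by Mohammed and Othman (2004):

```cpp
#include <algorithm>

// Median-of-five without random index selection (median50). The two
// interior sample positions are ASSUMED here to be the quarter points;
// the exact positions in Mohammed and Othman (2004) may differ.
int median50(const int A[], int lo, int hi) {
    int mid = lo + (hi - lo) / 2;
    int q1  = lo + (hi - lo) / 4;         // assumed interior position
    int q3  = lo + 3 * (hi - lo) / 4;     // assumed interior position
    int V[5] = { A[lo], A[q1], A[mid], A[q3], A[hi] };
    std::sort(V, V + 5);                  // sort the five samples
    return V[2];                          // V[2], the median, is the pivot
}
```

No random number generator is called, which is the source of the scheme's lower overhead.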
Median-of-seven without random index selection method: This method selects seven elements, all without random selection. Thus, the number of elements to be considered for selecting the pivot is increased. The elements are selected as follows: low, high, middle,
Median-of-seven with random index selection method: This method, just like the previous method, selects seven elements from the original array indexes. The first five elements are selected as follows: low, high,
Median-of-nine without random index selection method: In this method, the number of elements to be selected is increased by two. It does not employ random selection; thus, all nine elements are explicitly selected as follows: low, high,
Median-of-nine with random index selection method: This scheme employs random selection, in which two elements are selected randomly using the random generating function used in the median-of-seven with random index selection method. The other seven elements are explicitly selected as follows: low, high,
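Since the exact sample positions for the seven- and nine-element schemes are truncated above, the following generic sketch uses evenly spaced, non-random positions as an assumption of ours, purely to illustrate the median-of-k idea behind the median70 and median90 variants:

```cpp
#include <algorithm>
#include <vector>

// Generic median-of-k pivot with evenly spaced, non-random sample
// positions; k = 7 or k = 9 corresponds to the median70/median90
// flavour. The even spacing is our assumption, not the paper's.
int median_of_k(const int A[], int lo, int hi, int k) {
    std::vector<int> v(k);
    for (int i = 0; i < k; ++i)           // endpoints included
        v[i] = A[lo + (long long)i * (hi - lo) / (k - 1)];
    std::sort(v.begin(), v.end());
    return v[k / 2];                      // middle sample estimates the median
}
```

A larger k gives a better median estimate per partition at the cost of more comparisons spent on the sample itself.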
Implementation: The algorithm was implemented in the C++ programming language on a PC-cluster machine. Our implementation was sequential, so the parallel capability of the PC-cluster was not utilized; the purpose of using a PC-cluster as a normal PC was to exploit its larger array size capability. The maximum array size that could be declared on the PC-cluster was two million elements. In the experiment, an integer array data type was used. The data were randomly generated and 40 independent measurements were performed in each case; thus, we operated on average values.
The implementation was based on three scenarios. In the first scenario, all six schemes were run with the same quicksort algorithm using array sizes of 100 to 330 thousand elements. Based on the first scenario, the best three schemes in terms of execution time were taken and implemented separately using array sizes of 250 to 600 thousand elements; the idea was to make further analysis by increasing the array size. From the second scenario, the best two of the three schemes were implemented separately using array sizes of 700 thousand to 1 million elements, in order to make a further informed performance analysis. In all three scenarios, the array was generated randomly by a function in the program.
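The measurement procedure can be approximated by a harness like the following; std::sort stands in for the quicksort variants being compared, and the array size and run count are parameters of our own choosing:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <ctime>
#include <vector>

// Measurement harness in the spirit of the experiment: fill an integer
// array with random data, sort it, and average the elapsed CPU time
// over several independent runs.
double average_sort_seconds(std::size_t n, int runs) {
    double total = 0.0;
    for (int r = 0; r < runs; ++r) {
        std::vector<int> a(n);
        for (std::size_t i = 0; i < n; ++i)
            a[i] = std::rand();           // randomly generated data
        std::clock_t t0 = std::clock();
        std::sort(a.begin(), a.end());
        total += double(std::clock() - t0) / CLOCKS_PER_SEC;
    }
    return total / runs;                  // average of the measurements
}
```

Averaging over independent runs, as the paper's 40 measurements do, damps the variance introduced by the random input data.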
RESULTS AND DISCUSSION
The experimental results of the implemented sequential quicksort algorithm using the six pivot selection schemes are presented. Three different scenarios are analyzed. The first scenario compares the six techniques together, the second compares the best three techniques of the first scenario and the third compares the best two techniques of the second scenario. Table 1 shows the execution time of the six partitioning methods for the first scenario. Median50, Median70 and Median90 represent the median-of-five, median-of-seven and median-of-nine partitioning methods without random index selection, respectively, while Median51, Median71 and Median91 represent the corresponding methods with random index selection. The second scenario results are presented in Table 2, which shows the execution time of the three methods, and the third scenario results in Table 3, which shows the execution time of the two methods.
The results in Table 1 show that three of the methods, i.e., median-of-seven with and without random selection and median-of-nine without random selection, have almost the same execution time for all array sizes. Median-of-nine with random selection was the worst in terms of speed, while the lowest in execution time (the fastest) was median-of-five without random selection, followed by median-of-five with random selection.
Table 1: Average execution time (sec) of the six schemes
Table 2: Average execution time (sec) of the best three schemes of the first scenario
Table 3: Average execution time (sec) of the two best schemes of the second scenario
Thus, the scheme in Mohammed and Othman (2004) was faster, minimizing the execution time of the sequential quicksort algorithm by at least 31%.
The results in Table 2 show that median-of-seven without random index selection has the worst performance, while the scheme in Mohammed and Othman (2004) has the best execution time and was about 35% faster than the other two schemes.
Table 3 clearly shows the performance of the best two methods. As in the first two scenarios, median-of-five without random index selection, presented in Mohammed and Othman (2004), performs better than the other scheme, presented in Brest et al. (2000).
CONCLUSIONS
In this study, pivot selection methods for the quicksort algorithm and an experimental study of their performance were presented. Tables 1, 2 and 3 show that median-of-five without random index selection was faster than the other five schemes presented, minimizing the execution time of the quicksort algorithm by about 23-30%. It is evident that the larger the array, the more critical the decision in selecting an appropriate or optimal median-of-N pivot selection method. Likewise, our results indicate that there is an overhead associated with random index selection, which can affect the performance of a particular pivot selection scheme.
The work is constrained by our inability to generate larger arrays, in the tens or hundreds of millions of elements. Databases and lists in real-world problems do not exist only in integer form, hence the need to compare the schemes using different data types.
ACKNOWLEDGMENTS
We are grateful to the authorities of University Putra Malaysia for providing the research facilities and Usmanu Danfodiyo University, Sokoto, Nigeria for the study fellowship.