INTRODUCTION
Sorting is one of the most important operations in database systems, and
its efficiency can drastically influence overall system performance.
Sorting is frequently used in database systems to produce ordered query
results. It is also the basis of sort-merge join, a join algorithm employed
by many existing DBMSs, and it is used in some systems for processing group-by
queries (Iyer and Dias, 1990; Surajit and Kyuseok, 1996). With the advent
of parallel processing, parallel sorting has become an important area for
algorithm research.
Sorting algorithms can generally be divided into internal and external
algorithms (Selim, 1990). An external sorting algorithm is necessary when
the data set is too large to fit in main memory. Since database data is
stored in tables (or files) and is normally very large, database sorting
is typically external sorting. Sorting algorithms have been thoroughly
analyzed for sequential architectures, and internal sorting algorithms
for parallel architectures have also been investigated (Alok and Plaxton,
1994; Dusseau et al., 1996; Manzur and Brent, 2000), but external sorting
for parallel computer architectures has not yet received adequate
consideration. The traditional approach to parallel external sorting has
been to perform a local sort in each processor in the first stage and to
carry out merging by the host, or by a pipeline hierarchy of processors,
in the second stage (Mohan et al., 1994; Selim, 1990).
SERIAL EXTERNAL SORTING: A BACKGROUND
Serial external sorting is external sorting in a uniprocessor environment.
The most common serial external sorting algorithm is based on sort-merge.
The underlying principle of the sort-merge algorithm is to break the file
up into unsorted subfiles, sort the subfiles and then merge the sorted
subfiles into larger and larger sorted subfiles until the entire file
is sorted. The first stage sorts the initial lot of subfiles, whereas the
second stage is the merging phase. In this scenario, it is important to
determine the size of the initial subfiles to be sorted. Normally, each
of these subfiles must be small enough to fit into main memory, so that
it can be sorted in memory using any internal sorting technique. We divide
a serial external sorting algorithm into two phases: sort and merge. The
sort phase, which incorporates a partitioning of the original file, proceeds
as follows. First, determine R, the number of records that can reasonably
be sorted internally. Second, determine K, the total number of disks
available. Third, sort R records at a time internally, writing the resulting
runs in turn onto each of K/2 disks with file markers at their ends.
Finally, repeat the third step, writing additional runs onto the K/2 disks.
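The sort (run-generation) phase described above can be sketched as follows. This is an illustrative simulation rather than the paper's own code: integers stand for records, R is the in-memory capacity and each of the K/2 "disks" is modeled as an in-memory list of sorted runs.

```python
# Sketch of the sort (run-generation) phase of serial external sort-merge.
# Assumptions: records are plain integers and the K/2 output "disks" are
# modeled as lists of sorted runs rather than physical devices.

def generate_runs(records, R, K):
    """Sort R records at a time internally and write the resulting
    runs in turn (round-robin) onto each of the first K/2 disks."""
    half = K // 2
    disks = [[] for _ in range(half)]
    for i in range(0, len(records), R):
        run = sorted(records[i:i + R])      # internal sort of one subfile
        disks[(i // R) % half].append(run)  # write in turn onto each disk
    return disks
```

For instance, `generate_runs([5, 3, 8, 1, 9, 2, 7, 4], 2, 4)` places the runs [3, 5] and [2, 9] on the first disk and [1, 8] and [4, 7] on the second.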
Once the sorting of subfiles is completed, the merging phase starts. The
merging process proceeds as follows. First, do a K/2-way merge using the
first run from each disk, writing the output onto one of the K/2 empty
disks. Second, repeat the first step for each of the remaining K/2 empty
disks. Finally, repeat the first and second steps, merging in rotation
onto the K/2 disks, until the original K/2 disks are empty.
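Under the same in-memory model (an assumption, since the text describes physical disks), the merging phase can be sketched with Python's standard `heapq.merge` performing each K/2-way merge:

```python
import heapq

# Sketch of the merging phase: repeatedly perform K/2-way merges of the
# next group of runs until a single fully sorted run remains. Runs are
# assumed to be sorted lists of integers, as generated by the sort phase.

def merge_runs(disks):
    runs = [run for disk in disks for run in disk]
    fan_in = max(len(disks), 2)  # merge fan-in, i.e. the K/2 source disks
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
    return runs[0] if runs else []
```

Starting from the runs [3, 5], [2, 9], [1, 8] and [4, 7] spread over two disks, one round of 2-way merges yields [2, 3, 5, 9] and [1, 4, 7, 8], and a second round yields the fully sorted sequence.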
As stated above, serial external sort is the basis for parallel external
sort: in a multiprocessor system, particularly in a shared-nothing
environment, each processor has its own data, and sorting this data locally
in each processor is done with the serial external sort explained above.
Therefore, the main concern in parallel external sort is not the local
sort itself, but whether the local sort is done first or later, and how
merging is performed.
The speedup of a parallel sort achievable on a multiprocessor depends
largely on how well the average memory latency and the overhead of
scheduling and synchronization can be minimized. Based on the general
strategies utilized, most parallel sorts suitable for multiprocessor
computers fall into one of two rough categories: merge-based sorts and
partition-based sorts. Merge-based sorts consist of multiple merge stages
across processors and perform well only with a small number of processors;
as the number of processors grows, so does the overhead of scheduling
and synchronization, which reduces the speedup. Partition-based sorts
consist of two phases: partitioning the data set into smaller subsets
such that all elements in one subset are no greater than any element in
another, and sorting each subset in parallel. The performance of
partition-based sorts primarily depends on how evenly the data can be
partitioned into smaller ordered subsets. Unfortunately, no general,
effective method is currently available, and it remains an open question
how to achieve linear speedup for parallel sorting on multiprocessors
with a large number of processors.
DESCRIPTION OF THE ALGORITHM
Parallel Partitioned Sort (PPS) is influenced by the techniques used
in parallel partitioned join, where the process is split into two stages:
partitioning and independent local work (Mishra and Eich, 1992; Wolf et
al., 1993). In parallel partitioned sort, the first phase is not a local
sort; instead, each processor scans its local records and redistributes
(repartitions) them according to a chosen range partitioning.
After partitioning is done, each processor holds an unsorted list whose
values come from various processors. Only then is the local sort carried
out; thus, the local sort follows the partitioning rather than preceding
it. Note also that no merging is needed: the results produced by the
local sorts are already the final results.

Fig. 1: Parallel partitioned sort algorithm

Each processor produces a sorted list and, because the lists follow the
order of the range partitioning, the processors taken in that order form
a globally sorted sequence. Figure 1 shows an illustration of this method.
The main benefit of parallel partitioned sort is that no merging is
necessary, and hence the merging bottleneck is avoided. It also achieves
true parallelism, as all processors are used in both phases. Most
importantly, it is a one-level tree, which avoids the unnecessary overhead
of a pipeline hierarchy.
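As a sketch, the two phases of PPS might look like the following shared-memory simulation. The thread pool and the representation of the range boundaries are our own illustrative assumptions, not part of the original algorithm description.

```python
import bisect
from concurrent.futures import ThreadPoolExecutor

def pps_sort(partitions, bounds):
    """Parallel Partitioned Sort sketch.
    partitions: the initial unsorted local lists, one per processor.
    bounds: the upper bound of each processor's range, in ascending
    order; every value is assumed to be <= bounds[-1]."""
    n = len(bounds)
    # Phase 1: each processor redistributes its values by range.
    redistributed = [[] for _ in range(n)]
    for local in partitions:
        for v in local:
            redistributed[bisect.bisect_left(bounds, v)].append(v)
    # Phase 2: every processor sorts its own range independently.
    # No merging is needed: concatenating the per-range results
    # in order yields the final sorted sequence.
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(sorted, redistributed))
```

Concatenating the returned per-processor lists in range order gives the globally sorted sequence, with no merge step.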
Example: Figure 1 gives an illustration of the PPS algorithm. The data
sequence to be sorted contains 12 entries, S = {8, 12, 11, 4, 9, 3, 7,
2, 6, 10, 1, 5}, to be sorted using four processors (N = 4). The sequence
S is initially scanned and partitioned into four 3-entry subsequences,
one per processor. Each processor then scans its records and redistributes
them according to the range partitioning, after which each processor holds
an unsorted list whose values come from various processors and sorts it
locally. No merging is needed, as the results produced by the local sorts
are already the final results.
Notice from the illustration that after the redistribution phase some
of the boxes are empty. An empty box is an empty list resulting from the
data redistribution: the corresponding processor did not receive any
values from the designated source processor. For example, the first empty
box on the left arises because processor 1 holds no values in the range
1-3.
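The redistribution in this example, including the empty lists, can be reproduced with a small script. The range boundaries 1-3, 4-6, 7-9 and 10-12 are our reading of the figure and thus an assumption.

```python
import bisect

# Reconstruct the redistribution of Fig. 1 for N = 4 processors.
initial = [[8, 12, 11], [4, 9, 3], [7, 2, 6], [10, 1, 5]]
bounds = [3, 6, 9, 12]  # assumed upper bound of each processor's range

# matrix[src][dst] holds the values processor src sends to processor dst.
matrix = [[[] for _ in bounds] for _ in initial]
for src, local in enumerate(initial):
    for v in local:
        matrix[src][bisect.bisect_left(bounds, v)].append(v)

# Processor 1 holds no value in the range 1-3, so the list it sends to
# the first processor is empty -- one of the "empty boxes" in Fig. 1.
print(matrix[0])  # [[], [], [8], [12, 11]]

# Each destination collects its lists and sorts locally; no merge needed.
final = [sorted(sum((matrix[src][dst] for src in range(4)), []))
         for dst in range(4)]
print(final)      # [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
```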
COMPUTATIONAL COMPLEXITY
This section is devoted to a detailed analysis of the computational
complexity of the proposed parallel sorting algorithm. The first phase
of our algorithm is simply a scan and partitioning of all the records,
after which the records are redistributed using range partitioning.
Finally, the records are sorted in parallel using Quicksort. Quicksort
(Hoare, 1962) is a very well-known sorting algorithm that has been
extensively studied in the literature. On average, the computational
complexity of Quicksort is Θ(n log n), which corresponds to the number
of comparisons needed to sort n values. The complexity reaches
Θ(n^{2}) in the worst case, although Quicksort is significantly faster
in practice. In the case of the PPS algorithm, Quicksort is used in
parallel to sort subsequences of size S/N, so the computational complexity
of the algorithm is Θ((S/N) log(S/N)), or Θ((S/N)^{2}) in the worst case.
COMPARISON WITH OTHER SORTING ALGORITHMS
Table 1 compares the computational complexities of some classical sorting
algorithms with that of our proposed algorithm. The computational
complexities share a first term that is the same for all of the algorithms,
corresponding to a first phase of local sorting based on Quicksort. The
remaining terms in the complexities depend on the peculiarities of each
algorithm.
Table 1: Complexities of sorting algorithms
By comparing the complexity of the PPS algorithm with the complexities
of the classical algorithms, we can claim that the complexity of our
algorithm is smaller than that of all the other classical sorting
algorithms considered (Odd-Even Mergesort, Bitonic Sort, Shellsort and
PSRS).
CONCLUSIONS
The study presents a new fast parallel sorting algorithm called PPS
(Parallel Partitioned Sort), which makes use of redistribution and
repartitioning concepts. After an initial scan of the data sequence, the
algorithm works in two phases: the first phase redistributes the data
sequence using range partitioning, and the second phase sorts all the
subsequences in parallel using Quicksort. The performance of PPS was
compared with that of other classical sorting algorithms: Odd-Even
Mergesort, Bitonic Sort, Shellsort and Parallel Sorting by Regular
Sampling. In general, the proposed PPS algorithm represents an improvement
over these classical algorithms. Its main benefit is that no merging is
necessary, and hence the merging bottleneck is avoided; it also achieves
true parallelism, as all processors are used in both phases.