
Research Article


A New Sampling Design for a Spatial Population: Path Sampling 

Mena Patummasut
and
Arthur L. Dryver



ABSTRACT

This study proposes a new cost-effective and convenient sampling
design for a spatial population, called “path sampling”, which
allows the researcher to sample all of the units along the path
traversed during the sampling. Path sampling is a design in which the researcher
selects a path or paths from start to finish, as opposed to selecting units.
Path sampling offers unbiased estimators for both mean and variance. This paper
covers the pros and cons of path sampling in comparison to simple random sampling
and cluster sampling.





Received: March 21, 2012;
Accepted: June 05, 2012;
Published: July 28, 2012


INTRODUCTION
A spatial setting can be represented as a geographical area partitioned into
single units. To estimate the population total or mean in an area, the population
study area is divided into spatial units generally of the same size and the
numbers of objects are counted on a selection of the units (Vincent,
2008). When sampling a spatial population, many designs can be used,
for example, simple random sampling, stratified sampling, cluster
sampling and systematic sampling, or adaptive sampling in the case of a rare
or clustered population. Thompson (2002) illustrated the
application of those sampling designs to spatial populations. In cluster sampling,
a primary unit, which is the sampling unit, consists of a cluster of secondary
units, usually in close proximity to each other. In the spatial setting, primary
units include spatial arrangements such as square collections of adjacent units.
A simple random sample of m primary units is taken from M primary units in the
population. Thompson (1990) introduced adaptive cluster
sampling and compared it with simple random sampling in a simulation study
on a spatial population. Dryver and Thompson (2005)
and Dryver and Chao (2007) proposed more efficient estimators
for adaptive cluster sampling and their illustrative examples were applied to
spatial populations. Thompson (2006) proposed adaptive
web sampling for sampling a population in network and spatial settings.
It tends to be more efficient when used with many spatial populations (Thompson,
2011). Borkowski (2003) proposed simple Latin square
sampling ±k designs, a new class of probability sampling designs
that ensures that the sample is well-distributed over the study region when
a spatial correlation is present.
Many factors often go into choosing a sampling strategy to implement. Such
factors often include ease of implementation, cost, efficiency, etc. (Thompson,
2002; Mier and Picquelle, 2008). For example, simple
random sampling is more efficient, given the same number of data points sampled
as in cluster sampling; often, however, cluster sampling will be implemented,
as it is easier to implement and may cost less (Lohr, 1999).
By applying simple random sampling and cluster sampling, a sample may cover
all of the regions since each sampling unit has an equal chance of selection.
Thus, traveling from place to place to observe every unit selected for sampling
can be costly, as the distance traveled can be quite long (Hansen
et al., 1953). One of the difficulties is that of collecting quantities
of data dispersed over a large area. The new sampling design, path sampling,
introduced in this paper also addresses this issue, especially when the distance
travelled is a large part of the sampling cost.
PATH SAMPLING AND TECHNICAL NOTATION
This section deals with defining all possible paths in the spatial population,
the path sampling scheme and estimation. Suppose the researcher’s goal
is to estimate the population total or mean. Initially, it will be assumed that
the study region can be partitioned into an r × c (r: rows and c: columns) grid
of rc quadrats or units. The population consists of rc spatial units. Each population
unit is labeled with 2 coordinates, say (i, j), which are the row and column
of the unit, respectively, for i = 1, 2, 3,…, r and j = 1, 2, 3,…,
c. Associated with each unit (i, j), the value of the population variable of
interest is denoted as y_{(i,j)}. The parameter of interest in this
study is the population mean:
$$\mu = \frac{1}{rc}\sum_{i=1}^{r}\sum_{j=1}^{c} y_{(i,j)}$$
Path sampling is a sampling design in which p distinct paths are selected by simple random sampling without replacement (SRSWOR) from the q paths in the population and the sample consists of all units in the selected paths. Thus, paths are chosen instead of units. In this study, we use path sampling for a spatial population.
Define all possible paths in a spatial population: A path is the route
taken from start to finish. Let q be the number of all possible
paths and let P_{k} denote path k for k = 1, 2, 3,…, q. A path is
defined to start in row 1 and column j*; that is, the unit labeled (1, j*)
is the starting unit, and the path ends at unit (1, j*+1). We begin sampling at an edge
of the region, at unit (1, j*), because this is assumed to be more convenient and
less expensive than beginning inside or in the middle of the region. Path
k begins at the starting unit, goes down column j* to row k, follows row k
to its left end, moves to row k+1 and follows it to its right end, returns to
row k and follows it back to column j*+1 and then goes up column j*+1 to the
ending unit. That is, path k runs from (1, j*) to (2, j*),…, to (k, j*),
then to (k, j*-1), (k, j*-2),…, (k, 1), then to (k+1, 1), (k+1, 2),…,
(k+1, c), then to (k, c), (k, c-1), (k, c-2),…, (k, j*+1), then to (k-1, j*+1),
(k-2, j*+1),… and finally to (1, j*+1). Thus, for a spatial population of r rows,
there are q = r-1 possible paths. In general, path k in a spatial
population of r rows and c columns can be written as: P_{k} = ((1, j*),
(2, j*), (3, j*),..., (k, j*), (k, j*-1), (k, j*-2),..., (k, 1), (k+1, 1), (k+1,
2),..., (k+1, c), (k, c), (k, c-1), (k, c-2),..., (k, j*+1), (k-1, j*+1), (k-2,
j*+1),..., (1, j*+1)) for k = 1, 2, 3,…, q = r-1.
The number of units belonging to path P_{k} is 2c+2(k-1). All possible paths are shown in Fig. 1. Notice that the paths do not all contain the same number of units. The paths overlap in columns j* and j*+1, which are the going-out and coming-back columns, respectively. Also, adjacent paths overlap in the row between them: path k-1 and path k overlap in row k for k = 2, 3,…, q = r-1. We assume that the units are sampled in a logical manner such that each unit is observed only once. Finally, the researcher can define the rows and columns arbitrarily; thus, path sampling is not limited to the starting and ending positions used in this description.
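The path definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; the function name `build_path` and its argument names are hypothetical, and the sketch assumes 1 ≤ j* ≤ c-1 so that column j*+1 exists.

```python
def build_path(k, c, j_star):
    """Return path P_k as an ordered list of (row, column) units, 1-indexed.

    Hypothetical helper: assumes 1 <= j_star <= c - 1 and 1 <= k <= r - 1.
    """
    path = [(i, j_star) for i in range(1, k + 1)]           # down column j* to row k
    path += [(k, j) for j in range(j_star - 1, 0, -1)]      # left along row k to column 1
    path += [(k + 1, j) for j in range(1, c + 1)]           # along row k+1 to column c
    path += [(k, j) for j in range(c, j_star, -1)]          # back along row k to column j*+1
    path += [(i, j_star + 1) for i in range(k - 1, 0, -1)]  # up column j*+1 to row 1
    return path


# Check the stated path length 2c + 2(k - 1) on a 4 x 6 grid with j* = 3.
r, c, j_star = 4, 6, 3
for k in range(1, r):
    path = build_path(k, c, j_star)
    print(k, len(path), path[0], path[-1])
```

Every path starts at (1, j*) and ends at (1, j*+1), and no unit is visited twice, which matches the assumption that each unit is observed only once.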
Path sampling design: The spatial population of r rows and c columns
consists of units labeled (i, j) for i = 1, 2, 3,…, r and j = 1, 2, 3,…,
c. There are q = r-1 possible paths in the population, denoted by P_{1},
P_{2}, P_{3},..., P_{q}.

Fig. 1: 
All possible paths with a starting unit (1, j*) and all units
labeled with two coordinates in a spatial population 
By SRSWOR, p paths are selected from the q possible paths in the population. Let
p_{k} denote the kth path in the sample for k = 1, 2, 3,…, p. The sample
consists of all units in the selected paths. The sample of paths is represented as p_{s}
= (p_{1}, p_{2}, p_{3},..., p_{p}). The probability
of selecting a sample is:
$$P(p_{s}) = \frac{1}{\binom{q}{p}}$$
since paths are selected by SRSWOR, and the inclusion probability of path k is:
$$P(P_{k} \in p_{s}) = \frac{p}{q}$$
Because the paths overlap, a unit may appear in more than one selected path. Although each path has an equal probability of selection, the units do not, as the same unit may be in one or more paths. The inclusion probability of a unit is the probability that the unit is included in the sample. In path sampling, the inclusion probability of unit (i, j) is denoted as π_{(i,j)}. It is defined as:
$$\pi_{(i,j)} = P\{(i,j) \in s\}$$
Since paths overlap in rows and columns, the inclusion probabilities of the
units in a path are not equal. All paths overlap in columns j* and j*+1 and
adjacent paths overlap in a row. A unit is excluded from the sample only when
all p selected paths come from the paths that do not contain it. Thus, the
inclusion probabilities can be divided into three cases according to how the
paths overlap (Eq. 2):
$$\pi_{(i,j)} = \begin{cases} 1 - \binom{i-2}{p}\Big/\binom{q}{p}, & j \in \{j^*, j^*+1\} \\ 1 - \binom{q-2}{p}\Big/\binom{q}{p}, & j \notin \{j^*, j^*+1\},\ 2 \le i \le r-1 \\ 1 - \binom{q-1}{p}\Big/\binom{q}{p}, & j \notin \{j^*, j^*+1\},\ i \in \{1, r\} \end{cases}$$
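The three cases can be checked numerically. The sketch below is illustrative only: it counts the paths that do not contain a unit, applies the SRSWOR complement rule, and compares the result with brute-force enumeration of all samples. The helper names (`path_units`, `inclusion_prob`, `inclusion_prob_bruteforce`) are hypothetical, and `path_units` is a set-based restatement of the path definition (rows k and k+1, plus columns j* and j*+1 above row k).

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def path_units(k, c, j_star):
    """Set of units on path k: rows k and k+1, plus columns j*, j*+1 above row k."""
    units = {(i, j) for i in (k, k + 1) for j in range(1, c + 1)}
    units |= {(i, j) for i in range(1, k) for j in (j_star, j_star + 1)}
    return units


def inclusion_prob(unit, r, c, j_star, p):
    """1 - C(g, p) / C(q, p), where g counts the paths that miss the unit."""
    q = r - 1
    paths = [path_units(k, c, j_star) for k in range(1, q + 1)]
    g = sum(unit not in path for path in paths)
    return 1 - Fraction(comb(g, p), comb(q, p))  # comb(g, p) is 0 when g < p


def inclusion_prob_bruteforce(unit, r, c, j_star, p):
    """Share of all C(q, p) equally likely samples that contain the unit."""
    q = r - 1
    paths = [path_units(k, c, j_star) for k in range(1, q + 1)]
    samples = list(combinations(range(q), p))
    hits = sum(any(unit in paths[k] for k in ks) for ks in samples)
    return Fraction(hits, len(samples))


# 4 x 6 grid, starting unit (1, 3), p = 2 paths selected.
for unit in [(1, 3), (4, 3), (2, 1), (1, 1)]:
    print(unit, inclusion_prob(unit, 4, 6, 3, 2))
```

Counting paths directly avoids hard-coding the three cases, so the same sketch would still apply if the paths were defined differently.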
Note: Some of the combinations in the numerator of Eq. 2 can equal 0; this happens whenever the upper number is less than p. Let the probability that both units (i, j) and (i', j') are included in the sample be denoted by π_{(i,j),(i',j')}, also called the joint inclusion probability. It is defined as:
$$\pi_{(i,j),(i',j')} = P\{(i,j) \in s \text{ and } (i',j') \in s\}$$
The probability that the sample contains neither unit (i, j) nor unit (i', j') is:
$$P\{(i,j) \notin s \text{ and } (i',j') \notin s\} = \binom{f}{p}\Big/\binom{q}{p}$$
where f is the number of paths containing neither unit (i, j) nor unit (i', j'). Thus:
$$\pi_{(i,j),(i',j')} = \pi_{(i,j)} + \pi_{(i',j')} - 1 + \binom{f}{p}\Big/\binom{q}{p}$$
The value of f can be found as follows. Let U_{1} be the set of all units in columns j* and j*+1 (unit type 1): U_{1} = {(i_{1}, j_{1}) | i_{1} = 1, 2, 3,…, r and j_{1} = j*, j*+1}. Let U_{2} be the set of all units not in columns j* and j*+1 and not in the first or last row (unit type 2): U_{2} = {(i_{2}, j_{2}) | i_{2} = 2, 3,…, r-1 and j_{2} = 1, 2, 3,…, j*-1, j*+2, j*+3,…, c}. Let U_{3} be the set of all units in the first and last rows but not in column j* or j*+1 (unit type 3): U_{3} = {(i_{3}, j_{3}) | i_{3} = 1, r and j_{3} = 1, 2, 3,…, j*-1, j*+2, j*+3,…, c}. A formula for f is given in Eq. 4. Note that if f < 0, then f is set equal to 0.
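As a hedged illustration of the joint inclusion probability, the sketch below obtains f by counting paths directly instead of using a closed-form expression, then applies the inclusion-exclusion identity. The function names are hypothetical, and `path_units` is a set-based restatement of the path definition.

```python
from fractions import Fraction
from math import comb


def path_units(k, c, j_star):
    """Set of units on path k: rows k and k+1, plus columns j*, j*+1 above row k."""
    units = {(i, j) for i in (k, k + 1) for j in range(1, c + 1)}
    units |= {(i, j) for i in range(1, k) for j in (j_star, j_star + 1)}
    return units


def joint_inclusion_prob(u, v, r, c, j_star, p):
    """P(u in s and v in s) = pi_u + pi_v - 1 + C(f, p) / C(q, p)."""
    q = r - 1
    paths = [path_units(k, c, j_star) for k in range(1, q + 1)]

    def prob_all_from(count):
        # Probability that all p selected paths come from `count` given paths.
        return Fraction(comb(count, p), comb(q, p))

    pi_u = 1 - prob_all_from(sum(u not in path for path in paths))
    pi_v = 1 - prob_all_from(sum(v not in path for path in paths))
    f = sum(u not in path and v not in path for path in paths)  # counted, not Eq. 4
    return pi_u + pi_v - 1 + prob_all_from(f)


# 4 x 6 grid, starting unit (1, 3), p = 2: a row-1 unit and a row-4 unit
# appear together only when paths 1 and 3 are both selected.
print(joint_inclusion_prob((1, 1), (4, 1), 4, 6, 3, 2))
```

Because f is obtained by counting, it can never be negative here; the rule of setting a negative f to 0 only matters when f comes from a closed-form expression.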
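Estimation: Let p_{s} = (p_{1}, p_{2}, p_{3},..., p_{p})
denote the sample of paths selected. Let s denote the set of distinct units
in the sample. By using the Horvitz-Thompson estimator (Horvitz
and Thompson, 1952), an unbiased estimator of the population mean under
path sampling is (Eq. 5):
$$\hat{\mu}_{ps} = \frac{1}{rc}\sum_{(i,j) \in s} \frac{y_{(i,j)}}{\pi_{(i,j)}}$$
Let I_{(i,j)} be the indicator function taking the value 1 if unit (i, j) is selected in the sample and 0 otherwise. It can be written as:
$$I_{(i,j)} = \begin{cases} 1, & (i,j) \in s \\ 0, & \text{otherwise} \end{cases}$$
Therefore, $\hat{\mu}_{ps}$ can be written in the alternative form:
$$\hat{\mu}_{ps} = \frac{1}{rc}\sum_{i=1}^{r}\sum_{j=1}^{c} \frac{y_{(i,j)} I_{(i,j)}}{\pi_{(i,j)}}$$
Since E(I_{(i,j)}) = π_{(i,j)}, $\hat{\mu}_{ps}$ is an unbiased estimator of the population mean μ.
The variance of $\hat{\mu}_{ps}$ is:
$$Var(\hat{\mu}_{ps}) = \frac{1}{(rc)^{2}}\left[\sum_{(i,j)} \frac{1-\pi_{(i,j)}}{\pi_{(i,j)}}\, y_{(i,j)}^{2} + \sum_{(i,j)}\sum_{(i',j') \neq (i,j)} \frac{\pi_{(i,j),(i',j')} - \pi_{(i,j)}\pi_{(i',j')}}{\pi_{(i,j)}\pi_{(i',j')}}\, y_{(i,j)}\, y_{(i',j')}\right]$$
and the estimator of this variance is (Eq. 8):
$$\widehat{Var}(\hat{\mu}_{ps}) = \frac{1}{(rc)^{2}}\left[\sum_{(i,j) \in s} \frac{1-\pi_{(i,j)}}{\pi_{(i,j)}^{2}}\, y_{(i,j)}^{2} + \sum_{(i,j) \in s}\ \sum_{\substack{(i',j') \in s \\ (i',j') \neq (i,j)}} \frac{\pi_{(i,j),(i',j')} - \pi_{(i,j)}\pi_{(i',j')}}{\pi_{(i,j)}\pi_{(i',j')}\,\pi_{(i,j),(i',j')}}\, y_{(i,j)}\, y_{(i',j')}\right]$$
The estimate of variance may be negative.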
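Unbiasedness of the Horvitz-Thompson estimator of the mean can be confirmed by enumerating every possible sample. The sketch below is illustrative: the y-values are hypothetical placeholders (y = i × j), not the data of any figure in this paper, and the helper names are assumptions. Exact rational arithmetic is used so that the check is not affected by rounding.

```python
from fractions import Fraction
from itertools import combinations


def path_units(k, c, j_star):
    """Set of units on path k: rows k and k+1, plus columns j*, j*+1 above row k."""
    units = {(i, j) for i in (k, k + 1) for j in range(1, c + 1)}
    units |= {(i, j) for i in range(1, k) for j in (j_star, j_star + 1)}
    return units


def ht_mean(sample_units, y, pi, n_units):
    """Horvitz-Thompson estimate of the mean: (1/N) * sum of y / pi over s."""
    return sum(Fraction(y[u]) / pi[u] for u in sample_units) / n_units


r, c, j_star, p = 4, 6, 3, 2
q = r - 1
units = [(i, j) for i in range(1, r + 1) for j in range(1, c + 1)]
paths = [path_units(k, c, j_star) for k in range(1, q + 1)]

# Every SRSWOR sample of p paths, reduced to its set of distinct units.
samples = [set().union(*(paths[k] for k in ks)) for ks in combinations(range(q), p)]

# Exact inclusion probabilities: share of equally likely samples holding the unit.
pi = {u: Fraction(sum(u in s for s in samples), len(samples)) for u in units}

y = {(i, j): i * j for (i, j) in units}  # hypothetical y-values, for illustration
mu = Fraction(sum(y.values()), r * c)

estimates = [ht_mean(s, y, pi, r * c) for s in samples]
expected = sum(estimates) / len(samples)  # all samples are equally likely
exact_var = sum((e - mu) ** 2 for e in estimates) / len(samples)

print([float(e) for e in estimates], float(expected), float(mu), float(exact_var))
```

The average of the estimates over all samples equals the population mean exactly, while the individual estimates differ from sample to sample.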
The spatial population of 4 rows and 6 columns shown in Fig.
2 is considered. The population mean and variance are 8.208 and 549.6, respectively.
The objective is to estimate the population mean by using path sampling. First,
all possible paths are created. The number of rows in this population is r =
4 and the number of columns is c = 6. Thus, the number of all possible paths
is q = r-1 = 4-1 = 3. In general, path k in a spatial population
of r rows and c columns with starting unit (1, j*) is written as: P_{k}
= ((1, j*), (2, j*), (3, j*),..., (k, j*), (k, j*-1), (k, j*-2),..., (k, 1),
(k+1, 1), (k+1, 2),..., (k+1, c), (k, c), (k, c-1), (k, c-2),..., (k, j*+1),
(k-1, j*+1), (k-2, j*+1),..., (1, j*+1)) for k = 1, 2, 3,…, q = r-1.
Let the starting unit be (1, 3), so j* = 3. Thus, we have all possible paths
with their labeled units as follows:
P_{1} = ((1, 3), (1, 2), (1, 1), (2, 1), (2, 2), (2, 3), (2, 4),
(2, 5), (2, 6), (1, 6), (1, 5), (1, 4))
P_{2} = ((1, 3), (2, 3), (2, 2), (2, 1), (3, 1), (3, 2), (3, 3), (3, 4), (3, 5),
(3, 6), (2, 6), (2, 5), (2, 4), (1, 4))
P_{3} = ((1, 3), (2, 3), (3, 3), (3, 2), (3, 1), (4, 1), (4, 2), (4, 3), (4, 4),
(4, 5), (4, 6), (3, 6), (3, 5), (3, 4), (2, 4), (1, 4))
Since the number of units belonging to P_{k} is 2c+2(k-1), the number
of units belonging to P_{1} is 2(6)+2(1-1) = 12 units, the number of
units belonging to P_{2} is 2(6)+2(2-1) = 14 units and the number of
units belonging to P_{3} is 2(6)+2(3-1) = 16 units. Suppose the number
of sample paths is 2, so, by using SRSWOR, p = 2 sample paths are selected.
There are 3 possible samples, which are p_{s1} = (P_{1}, P_{2}),
p_{s2} = (P_{1}, P_{3}) and p_{s3} = (P_{2},
P_{3}).
p_{s1} = (P_{1}, P_{2}) reduces to s_{1} =
{(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (2, 1), (2, 2), (2, 3),
(2, 4), (2, 5), (2, 6), (3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6)}
p_{s2} = (P_{1}, P_{3}) reduces to s_{2} = {(1, 1), (1,
2), (1, 3), (1, 4), (1, 5), (1, 6), (2, 1), (2, 2), (2, 3), (2, 4), (2,
5), (2, 6), (3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6), (4, 1), (4,
2), (4, 3), (4, 4), (4, 5), (4, 6)}
p_{s3} = (P_{2}, P_{3}) reduces to s_{3} = {(1, 3), (1,
4), (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6), (3, 1), (3, 2), (3, 3),
(3, 4), (3, 5), (3, 6), (4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6)}
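The reduction of each sample of paths to its set of distinct units can be reproduced with a short sketch. The helper name `path_units` and the dictionary `sample_sizes` are hypothetical; the grid dimensions and starting unit are those of this example.

```python
from itertools import combinations


def path_units(k, c, j_star):
    """Set of units on path k: rows k and k+1, plus columns j*, j*+1 above row k."""
    units = {(i, j) for i in (k, k + 1) for j in range(1, c + 1)}
    units |= {(i, j) for i in range(1, k) for j in (j_star, j_star + 1)}
    return units


r, c, j_star, p = 4, 6, 3, 2
paths = {k: path_units(k, c, j_star) for k in range(1, r)}

sample_sizes = {}
for ks in combinations(sorted(paths), p):
    s = set().union(*(paths[k] for k in ks))  # overlapping units are kept once
    sample_sizes[ks] = len(s)
    print(ks, len(s), sorted(s))
# sizes: (1, 2) -> 18, (1, 3) -> 24, (2, 3) -> 20
```

The three samples contain 18, 24 and 20 distinct units, which shows that the final sample size depends on which paths are selected.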

Fig. 2: 
All possible paths of the spatial population for 4 rows and
6 columns with the y-value of each unit
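Next, the inclusion probabilities are calculated using Eq. 2 with q = 3 and p = 2. First, the inclusion probabilities for units in columns 3 and 4 (unit type 1) are calculated. For i = 1, 2, 3, 4 and j = 3, 4, we have:
$$\pi_{(i,j)} = 1 - \binom{i-2}{2}\Big/\binom{3}{2}$$
Then, we get:
$$\pi_{(1,j)} = \pi_{(2,j)} = \pi_{(3,j)} = 1, \qquad \pi_{(4,j)} = 1 - \frac{1}{3} = \frac{2}{3}$$
Next, the inclusion probabilities for units not in columns 3 and 4 and not in the first or last row (unit type 2) are calculated. For i = 2, 3 and j = 1, 2, 5, 6:
$$\pi_{(i,j)} = 1 - \binom{1}{2}\Big/\binom{3}{2}$$
Then:
$$\pi_{(i,j)} = 1 - 0 = 1$$
Finally, the inclusion probabilities for units in the first and last rows but not in column 3 or 4 (unit type 3) are calculated. For i = 1, 4 and j = 1, 2, 5, 6:
$$\pi_{(i,j)} = 1 - \binom{2}{2}\Big/\binom{3}{2}$$
Then:
$$\pi_{(i,j)} = 1 - \frac{1}{3} = \frac{2}{3}$$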
The inclusion probabilities are shown in Fig. 3. Estimates of the mean for all possible samples are shown in Table 1. It can be seen that $\hat{\mu}_{ps}$ is an unbiased estimator since its bias is zero. Recall that p_{s1} = (P_{1}, P_{2}) reduces to s_{1} = {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6), (3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6)} corresponding to y = {8, 7, 30, 24, 6, 5, 0, 10, 112, 35, 5, 8, 7, 7, 32, 0, 0, 5}. The estimate of the mean for this sample is then computed by using Eq. 5.
Similarly, the estimate of variance is calculated using Eq. 8.
Table 1: 
Estimates of the mean and variance estimator for all possible
samples 


Fig. 3: 
The inclusion probabilities of the population of 4 rows and
6 columns 
SIMULATION STUDY
Rare and non-rare population data are used in a simulation to examine the performance of path sampling relative to comparable sampling designs, which in this research are SRSWOR and cluster sampling. The simulation consists of 1000 iterations. The variance of each estimator is estimated from the spread of its 1000 simulated values about their average, where each value is that of the relevant estimator for one simulated sample
(Dryver and Thompson, 2005).
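A simulation of this kind can be sketched as follows. Several parts are assumptions made for illustration rather than details taken from this study: the y-values are hypothetical (y = i × j on a 4 × 6 grid), the variance uses the divisor (iterations - 1) via `statistics.variance`, and the relative efficiency is taken as the variance under the competing design divided by the variance under path sampling, so that values above 1 favor path sampling. All function names are hypothetical.

```python
import random
import statistics
from fractions import Fraction
from itertools import combinations


def path_units(k, c, j_star):
    """Set of units on path k: rows k and k+1, plus columns j*, j*+1 above row k."""
    units = {(i, j) for i in (k, k + 1) for j in range(1, c + 1)}
    units |= {(i, j) for i in range(1, k) for j in (j_star, j_star + 1)}
    return units


def simulate_path_sampling(y, r, c, j_star, p, iterations=1000, seed=0):
    """Horvitz-Thompson estimates of the mean over repeated path samples."""
    rng = random.Random(seed)
    q = r - 1
    paths = [path_units(k, c, j_star) for k in range(1, q + 1)]
    samples = [set().union(*(paths[k] for k in ks))
               for ks in combinations(range(q), p)]
    # Exact inclusion probabilities, obtained by enumerating all samples.
    pi = {u: Fraction(sum(u in s for s in samples), len(samples)) for u in y}
    estimates = []
    for _ in range(iterations):
        chosen = rng.sample(range(q), p)  # SRSWOR of p paths
        s = set().union(*(paths[k] for k in chosen))
        estimates.append(float(sum(y[u] / pi[u] for u in s) / (r * c)))
    return estimates


def simulate_srswor(y, n, iterations=1000, seed=1):
    """Sample means over repeated simple random samples of n units."""
    rng = random.Random(seed)
    values = list(y.values())
    return [statistics.fmean(rng.sample(values, n)) for _ in range(iterations)]


y = {(i, j): i * j for i in range(1, 5) for j in range(1, 7)}  # hypothetical data
path_estimates = simulate_path_sampling(y, r=4, c=6, j_star=3, p=2)
srs_estimates = simulate_srswor(y, n=21)  # 21 = ceiling of the expected size 62/3

var_path = statistics.variance(path_estimates)  # divisor iterations - 1 (assumed)
var_srs = statistics.variance(srs_estimates)
relative_efficiency = var_srs / var_path  # orientation of the ratio is assumed
print(var_path, var_srs, relative_efficiency)
```

Fixing the seeds makes the run repeatable; a real study would use the population data and cluster definitions described in the following subsections.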
Simulation study for rare population: The authors used the blue-winged teal
data (Smith et al., 1995) in Fig.
4 for part of the simulation study, as it is a rare population. In cluster
sampling, let a cluster be an entire column, consisting of 10 units, as shown
in Fig. 4. This population has high variation among
clusters, with a CV of 4.26. The expected sample size is denoted E(υ)
and the sample size used in the other designs was set equal to the ceiling of
E(υ) for path sampling. For cluster sampling, the number of clusters sampled
was set equal to the ceiling of E(υ)/10.
In SRSWOR, the sample size was set equal to the ceiling of E(υ) in order to compare it
to path sampling.
The results from the simulations are shown in Table 2. From these results, for starting units (1, 1) and (1, 10), path sampling was more efficient than cluster sampling since the relative efficiency was greater than 1. Noticeably, the y-values in columns 17, 18 and 19 were higher than the others, so there was high variation among the clusters in this population. This made cluster sampling less efficient. However, path sampling was less efficient than SRSWOR since the relative efficiency was less than 1. Notice that when the starting unit is in a high-valued column, namely unit (1, 17), path sampling was more efficient than SRSWOR since the relative efficiency was greater than 1 and much more efficient than cluster sampling since the relative efficiency was greater than 4.
Simulation study for non-rare population: Two simulated data sets were considered. First, we used the simulated data in Fig. 5. Each unit was Poisson distributed with a mean of 50. To compare path sampling to cluster sampling, let a cluster be an entire column. In this population, the CV among the clusters is 0.04. The simulation results are shown in Table 3. From these results, it can be seen that path sampling was less efficient than both cluster sampling and SRSWOR because the relative efficiency was less than 1. Noticeably, there was little variation in the y-values, so there was low variation among clusters in this population. This makes cluster sampling more efficient.
Table 2: 
Results from the simulations on blue-winged teal data

The number in parentheses is the number of clusters selected
in cluster sampling, m_{c} is the number of units in a cluster sample,
* means that the starting unit is in a high y-value column j* or that
column j*+1 is a high y-value column. R.E.cls = relative efficiency of path sampling with respect to cluster sampling

Fig. 4: 
Clusters in bluewinged teal data 

Fig. 5: 
Simulated data, each unit is Poisson distributed with a mean
of 50 with CV among clusters of 0.04 

Fig. 6: 
Simulated data with CV among clusters of 1.46 
Table 3: 
Results from the simulation on a non-rare population with
low CV among clusters

Table 4: 
Results from the simulation on a non-rare population with high
CV among clusters

The number in parentheses is the number of clusters selected
in cluster sampling, m_{c} is the number of units in a cluster sample,
* means that the starting unit is in a high y-value column j* or that
column j*+1 is a high y-value column. R.E.cls = relative efficiency of path sampling with respect to cluster sampling
Next, the simulated data shown in Fig. 6 were used. All units were the same as in the population data in Fig. 5, except for columns 6, 10 and 15. The y-values in these 3 columns were replaced with higher values. To compare path sampling with cluster sampling, let a cluster be an entire column. This population had high variation among the clusters, with a CV among clusters of 1.46. The simulation results are shown in Table 4.
According to the simulation results in Table 4, for starting
units (1, 2) and (1, 17), path sampling was more efficient than cluster sampling
because the relative efficiency was greater than 1. Noticeably, the y-values
in columns 6, 10 and 15 were much higher than the others, so there was high variation
among clusters (CV of 1.46) in this population. This made cluster sampling less
efficient. Notice that when column j* or j*+1 is a high-valued column, as for
starting units (1, 5), (1, 10) and (1, 15), path sampling was much more efficient
than cluster sampling since the relative efficiency was greater than 2.
For starting units (1, 2) and (1, 17), path sampling was less efficient than SRSWOR since the relative efficiency was less than 1 for any p. However, when column j* or j*+1 is a high-valued column, as for starting units (1, 5), (1, 10) and (1, 15), path sampling was more efficient than SRSWOR for p = 1 since the relative efficiency was greater than 1, but it was less efficient than SRSWOR for p>2 because the relative efficiency was less than 1.
DISCUSSION
Path sampling can be very cost-effective for sampling many units. This is true
when cost is mainly a function of distance travelled, as the number of units
sampled equals the number of units travelled. In path sampling, the researcher
can sample all of the consecutive units in a path traversed during the sampling.
On the other hand, for cluster sampling, the cost of traveling between clusters
will be higher the more widespread the sample (Hansen et
al., 1953). In situations with budget constraints, it is possible that
a researcher could sample more units with path sampling, thus giving it an added
advantage in this respect. Unfortunately, for path sampling the number of units
in the final sample is random and can vary considerably because the number of
units varies from path to path. Therefore, the expense of sampling when cost is a function
of distance travelled would also be random, possibly creating budget problems.
However, the expected sample size in path sampling can be obtained as with adaptive
cluster sampling (Thompson, 1990). It is the sum of
the inclusion probabilities.
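As a small illustration of this budgeting step, the expected sample size can be computed by summing the unit inclusion probabilities. The sketch below is an assumption-labeled example: the function names are hypothetical, and the grid (4 rows, 6 columns, starting unit (1, 3)) is the one used in the illustrative example of this paper.

```python
from fractions import Fraction
from math import comb


def path_units(k, c, j_star):
    """Set of units on path k: rows k and k+1, plus columns j*, j*+1 above row k."""
    units = {(i, j) for i in (k, k + 1) for j in range(1, c + 1)}
    units |= {(i, j) for i in range(1, k) for j in (j_star, j_star + 1)}
    return units


def expected_sample_size(r, c, j_star, p):
    """Sum of the inclusion probabilities of all r * c units."""
    q = r - 1
    paths = [path_units(k, c, j_star) for k in range(1, q + 1)]
    total = Fraction(0)
    for i in range(1, r + 1):
        for j in range(1, c + 1):
            missing = sum((i, j) not in path for path in paths)
            total += 1 - Fraction(comb(missing, p), comb(q, p))
    return total


print(expected_sample_size(4, 6, 3, 2))  # 62/3, about 20.7 units
```

For p = 2 this equals the average of the three possible sample sizes (18, 24 and 20), so a budget can be planned around roughly 21 units even though the realized size varies.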
As a result of the way in which the paths were formed, path sampling is a type
of unequal probability sampling and the authors used the Horvitz-Thompson estimator
for estimation of the population mean. As in path sampling, much of the
literature has applied the Horvitz-Thompson estimator (Birnbaum
and Sirken, 1965; Thompson, 1990; Nafiu
and Adewara, 2007) because of the unequal probability of selection. For
the Horvitz-Thompson estimator, it is desirable to have the y-values proportional
to the probability of selection in order to obtain a relatively small variance,
as observed by Horvitz and Thompson (1952). This
limitation is clear when comparing path sampling to simple random sampling in
the simulation results in Tables 2, 3 and
4. If there is an auxiliary variable correlated with the variable
of interest, it is desirable, when possible, to select starting and ending
points for the paths that give high y-value units a high probability
of selection and low-valued units a low probability of selection.
In addition, when the CV from cluster to cluster in cluster sampling is high,
then path sampling may be a viable alternative to cluster sampling, as can be
seen in Tables 2 and 4. Correspondingly,
Chih (2011) mentioned that cluster sampling is less
efficient when the between-cluster variability is large.
Path sampling should be implemented when two conditions are met: when the cost
of the sampling is mainly a function of distance travelled and when it is believed
that the y-values are positively correlated with the probability of selection.
It is known that the ratio estimator is often more precise (Dryver
and Chao, 2007). Therefore, if there is an auxiliary variable known to be
correlated with the variable of interest, then perhaps a ratio estimator for
path sampling should be considered. Finally, for rare and hidden populations,
further research should be carried out that investigates combining adaptive
cluster sampling and path sampling.
CONCLUSION
In this study, sampling in a spatial population was studied. Sampling a spatial population by applying previous sampling designs, such as simple random sampling and cluster sampling, was inconvenient because the researcher had to travel from place to place to observe every unit in a sample. Thus, path sampling was proposed and compared to simple random sampling and cluster sampling. Path sampling is more convenient and cost-effective but less efficient in some circumstances. According to the simulation results for a rare population and a non-rare population with high variation of y-values among clusters, path sampling is more efficient than cluster sampling but less efficient than SRSWOR. However, for a non-rare population with a low variation of y-values among clusters, path sampling is less efficient than cluster sampling and SRSWOR. An illustrative example was offered by applying path sampling to a spatial population of 4 rows and 6 columns. The calculation of the estimate of the mean and variance was also shown. Finally, all possible paths in this study are created in a certain way, so that inclusion probabilities and joint inclusion probabilities can be obtained and the Horvitz-Thompson estimator can be applied. Another form of path could be created that is more convenient and cost-effective. Moreover, other estimators could be created to improve the precision. In a rare and clustered population, adaptive path sampling could be of interest.
ACKNOWLEDGMENT
We are grateful to the Commission on Higher Education, Thailand, for financial support through a grant under the Strategic Scholarships Fellowships Frontier Research Networks.

REFERENCES 
1: Dryver, A.L. and S.K. Thompson, 2005. Improved unbiased estimators in adaptive cluster sampling. J. R. Stat. Soc.: Ser. B (Stat. Methodol.), 67: 157-166.
2: Horvitz, D.G. and D.J. Thompson, 1952. A generalization of sampling without replacement from a finite universe. J. Am. Stat. Assoc., 47: 663-685.
3: Lohr, S.L., 1999. Sampling: Design and Analysis. 1st Edn., Duxbury Press, California, USA., ISBN-13: 9780534353612, Pages: 512.
4: Mier, K.L. and S.J. Picquelle, 2008. Estimating abundance of spatially aggregated populations: Comparing adaptive sampling with other survey designs. Can. J. Fish. Aquat. Sci., 65: 176-197.
5: Smith, D.R., M.J. Conroy and D.H. Brakhage, 1995. Efficiency of adaptive cluster sampling for estimating density of wintering waterfowl. Biometrics, 51: 777-788.
6: Thompson, S.K., 2002. Sampling. 2nd Edn., John Wiley and Sons, New York, USA., ISBN-13: 9780471291169, Pages: 367.
7: Borkowski, J.J., 2003. Simple Latin square sampling ±k designs. Commun. Stat. Theory Methods, 32: 215-237.
8: Dryver, A.L. and C.T. Chao, 2007. Ratio estimators in adaptive cluster sampling. Environmetrics, 18: 607-620.
9: Hansen, M.H., W.N. Hurwitz and W.G. Madow, 1953. Sample Survey Methods and Theory. John Wiley, New York, USA.
10: Thompson, S.K., 1990. Adaptive cluster sampling. J. Am. Stat. Assoc., 85: 1050-1059.
11: Thompson, S.K., 2006. Adaptive web sampling. Biometrics, 62: 1224-1234.
12: Thompson, S.K., 2011. Adaptive network and spatial sampling. Surv. Methodol., 37: 183-196.
13: Vincent, K.S., 2008. Design variations in adaptive web sampling. M.S. Thesis, Simon Fraser University, Burnaby, BC, Canada.
14: Birnbaum, Z.W. and M.G. Sirken, 1965. Design of sample surveys to estimate the prevalence of rare diseases: Three unbiased estimates. Vital Health Statistics Series 2.
15: Chih, C.P., 2011. The design effects of cluster sampling on the estimation of mean lengths and total mortality of reef fish. Fish. Res., 109: 295-302.
16: Nafiu, L.A. and A.A. Adewara, 2007. On the use of network sampling in diabetic J. Res. National Dev., 5: 59.



