INTRODUCTION
An increasingly popular approach for estimating the parameters of marginal models for repeated binary responses is the GEE methodology. To assess the fit of a model, it is necessary to identify the influential elements. In particular, Liang[1] and Prentice[2] have developed moment-based generalized estimating equations which only require specification of the form of the first two moments of the vector of binary responses for each individual. Instead of modeling the association between pairs of binary responses in terms of the marginal correlations, Lipsitz[3], Liang[4] and Carey[5] propose using the marginal odds ratio. Carey[5] estimate the marginal odds ratio using conditional residuals and show that their estimating equations for the odds ratio are highly efficient when compared to the optimal second-order joint estimating equations (GEE2) of Liang[4]. Carey[5] also demonstrate that their method yields substantial computational savings relative to the optimal joint estimating equations. Albert[6] proposed generalized estimating equations for estimating the parameters of both the mean and partial correlation structures and highlighted the use of this method for modeling the effect of spatial location and subject-specific covariates on spatially correlated binary data. Albert[7] describe a methodology for jointly modeling the number of events and the vector of correlated binary severity measures. They functionally linked the regression parameters for the counts and binary means and discussed a GEE approach for parameter estimation. They also discussed the conditions under which the proposed joint modeling approach provides marked gains in efficiency relative to the common procedure of simply modeling the counts.
In this study, we demonstrate that the measures of association between pairs of binary responses, e.g., the correlation parameters, can be estimated using conditional residuals, and that the usual GEE estimator can also be obtained using unconditional residuals.
Generalized estimating equations: The GEE approach provides consistent estimators of the regression parameters and requires only the correct specification of the form of the mean function of the vector of responses for each individual.
In longitudinal studies, there is an implicit ordering of the times of the observations of each individual. We assume that the ith individual is observed at times $t = 1, 2, \ldots, T_i$, where $T_i$ need not be the same for all N individuals. With a binary response obtained at each time t, we form the $T_i \times 1$ vector
$$Y_i = (Y_{i1}, Y_{i2}, \ldots, Y_{iT_i})'$$
where the binary random variable $Y_{it} = 1$ if the ith individual has response 1 (success) at time t and $Y_{it} = 0$ otherwise.
Each individual has a $J \times 1$ covariate vector $x_{it}$, measured at time t, which includes both time-stationary and time-varying covariates. Let
$$X_i = (x_{i1}, x_{i2}, \ldots, x_{iT_i})'$$
represent the $T_i \times J$ matrix of covariates for the ith individual. In the cluster data setting, $Y_i$ is the vector of binary responses for the $T_i$ units within a cluster.
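The marginal distribution of $Y_{it}$ is Bernoulli with
$$\pi_{it}(\beta) = E(Y_{it} \mid x_{it}, \beta) = \Pr(Y_{it} = 1 \mid x_{it}, \beta) = \frac{\exp(\theta_{it})}{1 + \exp(\theta_{it})}$$
where $\theta_{it} = \ln(\pi_{it}/(1-\pi_{it}))$ and β is a $J \times 1$ vector of parameters. The $\pi_{it}(\beta)$ can be grouped together to form the vector of marginal probabilities of success, $\pi_i(\beta) = E[Y_i \mid X_i, \beta] = [\pi_{i1}, \ldots, \pi_{iT_i}]'$. Since $Y_{it}$ is binary, the logistic link function, $\theta_{it} = x_{it}'\beta$, is a natural choice, although in principle any link function could be chosen.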
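As a small illustration of the logistic link described above, the following sketch computes one marginal probability from a covariate vector and a parameter vector. The function name and the numerical values are hypothetical and chosen only for illustration; they are not taken from the study data.

```python
import math

def marginal_probability(x_it, beta):
    """Marginal success probability under the logit link, theta = x_it' beta."""
    theta = sum(x * b for x, b in zip(x_it, beta))
    return 1.0 / (1.0 + math.exp(-theta))

# Hypothetical values: an intercept and one covariate
x_it = [1.0, 2.0]
beta = [-1.0, 0.5]
pi_it = marginal_probability(x_it, beta)  # theta = 0 here, so pi_it = 0.5
```

Applying the same function to each row of the covariate matrix gives the vector of marginal probabilities for one individual.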
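We are interested in making inference about β, as well as the parameters, say α, of the joint distribution of $Y_{is}$ and $Y_{it}$ (Table 1), where:
$$\pi_{ist} = \Pr(Y_{is} = 1, Y_{it} = 1 \mid X_i, \beta, \alpha) = E(Y_{is} Y_{it} \mid X_i, \beta, \alpha)$$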
This joint probability can be modeled in terms of the two marginal probabilities $\pi_{is}(\beta)$ and $\pi_{it}(\beta)$, as well as an association parameter (contained in α). Although the following methods can be used for any association parameter (e.g., marginal odds ratio, kappa coefficient, relative risk), we focus on the marginal correlation coefficient. From Table 1, the correlation between the responses at times s and t is
$$\rho_{ist} = \mathrm{corr}(Y_{is}, Y_{it}) = \frac{\pi_{ist} - \pi_{is}\pi_{it}}{\sqrt{\pi_{is}(1-\pi_{is})\,\pi_{it}(1-\pi_{it})}}$$
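In terms of the correlation coefficient, the joint probability $\pi_{ist}$ can be written as
$$\pi_{ist} = \pi_{is}\pi_{it} + \rho_{ist}\sqrt{\pi_{is}(1-\pi_{is})\,\pi_{it}(1-\pi_{it})} \qquad (2)$$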
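The relation between the correlation coefficient and the joint probability can be checked numerically. The sketch below computes the joint success probability from two marginal probabilities and a correlation, then fills in the four cells of the 2x2 cross-classification for times s and t. Function names and input values are hypothetical.

```python
import math

def joint_probability(pi_s, pi_t, rho):
    """Joint success probability from the two marginals and the correlation."""
    return pi_s * pi_t + rho * math.sqrt(pi_s * (1 - pi_s) * pi_t * (1 - pi_t))

def cross_table(pi_s, pi_t, rho):
    """Cell probabilities of the 2x2 table, keyed by (y_is, y_it)."""
    p11 = joint_probability(pi_s, pi_t, rho)
    table = {
        (1, 1): p11,
        (1, 0): pi_s - p11,
        (0, 1): pi_t - p11,
        (0, 0): 1 - pi_s - pi_t + p11,
    }
    # Not every correlation is compatible with a given pair of marginals
    if min(table.values()) < -1e-12:
        raise ValueError("correlation not feasible for these marginals")
    return table

# Hypothetical marginals and correlation
table = cross_table(0.5, 0.5, 0.2)
```

The feasibility check reflects the fact that, for binary responses, the admissible range of the correlation depends on the marginal probabilities.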
In the following, let α denote the parameters of the correlation between pairs of binary responses. Then, to estimate (β,α), we suggest modifying the estimating equations proposed by Carey[5], which were originally developed to estimate the marginal odds ratio. The estimating equations for β are given by:
$$\sum_{i=1}^{N} D_i' V_i^{-1} (Y_i - \pi_i) = 0 \qquad (3)$$
where $D_i = \partial\pi_i/\partial\beta$ and $V_i$ is the $T_i \times T_i$ working covariance matrix of $Y_i$.
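Table 1: Cross-classification probabilities for times s and t, s≠t
| | $Y_{it} = 1$ | $Y_{it} = 0$ | Total |
| $Y_{is} = 1$ | $\pi_{ist}$ | $\pi_{is} - \pi_{ist}$ | $\pi_{is}$ |
| $Y_{is} = 0$ | $\pi_{it} - \pi_{ist}$ | $1 - \pi_{is} - \pi_{it} + \pi_{ist}$ | $1 - \pi_{is}$ |
| Total | $\pi_{it}$ | $1 - \pi_{it}$ | 1 |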
The tth diagonal element of $V_i(\alpha, \beta)$ is $\mathrm{var}(Y_{it}) = \pi_{it}(1-\pi_{it})$, which is specified entirely by the marginal distributions, i.e., by β. The (s,t)th off-diagonal element of $V_i$ is $\mathrm{cov}(Y_{is}, Y_{it}) = \pi_{ist} - \pi_{is}\pi_{it}$, where $\pi_{ist}$ is specified by Eq. 2.
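To make the structure of the estimating equations for β concrete, the sketch below builds the working covariance matrix for one individual and evaluates that individual's contribution to the estimating function under a logit link. It is a minimal pure-Python illustration, not the S-Plus program used in the study; the helper names (`working_covariance`, `solve`, `beta_score`) and all input values are hypothetical.

```python
import math

def working_covariance(pi, rho):
    """Diagonal pi_t(1 - pi_t); off-diagonal rho[s][t] * sqrt(var_s * var_t)."""
    T = len(pi)
    var = [p * (1 - p) for p in pi]
    return [[var[s] if s == t else rho[s][t] * math.sqrt(var[s] * var[t])
             for t in range(T)] for s in range(T)]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[k]] for k, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def beta_score(X, y, beta, rho):
    """One individual's contribution D' V^{-1} (Y - pi) under a logit link."""
    pi = [1 / (1 + math.exp(-sum(x * b for x, b in zip(row, beta)))) for row in X]
    V = working_covariance(pi, rho)
    z = solve(V, [yt - p for yt, p in zip(y, pi)])  # V^{-1} (Y - pi)
    # Row t of D is pi_t (1 - pi_t) x_t', so D' z sums pi_t (1 - pi_t) z_t x_t
    return [sum(pi[t] * (1 - pi[t]) * z[t] * X[t][j] for t in range(len(y)))
            for j in range(len(beta))]

# Hypothetical individual: two visits, an intercept and one covariate
X = [[1.0, 0.0], [1.0, 1.0]]
y = [1, 0]
rho = [[1.0, 0.2], [0.2, 1.0]]
score = beta_score(X, y, [0.0, 0.0], rho)
```

Summing such contributions over all individuals and setting the total to zero gives the estimating equations; with a zero off-diagonal correlation the contribution reduces to the ordinary logistic regression score.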
If α is unknown (which is typically the case), then it must be estimated with a set of estimating equations similar to (3). Following Carey[5], for a pair of times s<t we form the conditional residuals $\{Y_{it} - E(Y_{it} \mid Y_{is} = y_{is}, X_i)\}$, that is, deviations about conditional expectations. These random variables can then be grouped together to form the $[T_i(T_i-1)/2] \times 1$ vector of conditional residuals, $(U_i - \eta_i)$, where:
$U_i = \{U_{i12}, U_{i13}, \ldots, U_{i(T_i-1)T_i}\}$,
$\eta_i = \{\eta_{i12}, \eta_{i13}, \ldots, \eta_{i(T_i-1)T_i}\}$,
with $U_{ist} = Y_{it}$ and $\eta_{ist} = E(Y_{it} \mid Y_{is} = y_{is}, X_i)$, for s<t.
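From Table 1,
$$\eta_{ist} = y_{is}\,\frac{\pi_{ist}}{\pi_{is}} + (1 - y_{is})\,\frac{\pi_{it} - \pi_{ist}}{1 - \pi_{is}}$$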
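The conditional expectation can be read off the 2x2 cross-classification: it is the joint success probability divided by the marginal at time s when the earlier response is 1, and the remaining mass at time t divided by one minus that marginal when the earlier response is 0. The sketch below computes these conditional means and the resulting conditional residuals for all pairs s<t. Function names and input values are hypothetical.

```python
def conditional_mean(y_s, pi_s, pi_t, pi_st):
    """E(Y_it | Y_is = y_s) from the cells of the 2x2 table."""
    if y_s == 1:
        return pi_st / pi_s
    return (pi_t - pi_st) / (1 - pi_s)

def conditional_residuals(y, pi, pi_joint):
    """List of (s, t, Y_it - eta_ist, eta_ist * (1 - eta_ist)) for all s < t.

    pi_joint maps a pair (s, t) to the joint success probability.
    """
    out = []
    T = len(y)
    for s in range(T):
        for t in range(s + 1, T):
            eta = conditional_mean(y[s], pi[s], pi[t], pi_joint[(s, t)])
            out.append((s, t, y[t] - eta, eta * (1 - eta)))
    return out

# Hypothetical individual with two visits
res = conditional_residuals([1, 0], [0.5, 0.5], {(0, 1): 0.30})
```

The last element of each tuple is the conditional variance that appears on the diagonal of the weight matrix for the conditional residuals.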
In order to form another set of moment estimating equations similar to (3), we need to take appropriate linear combinations of $[U_i - \eta_i]$. Thus a second set of (moment) estimating equations for α is given by:
$$\sum_{i=1}^{N} C_i' W_i^{-1} (U_i - \eta_i) = 0 \qquad (5)$$
where $C_i = \partial\eta_i/\partial\alpha$ and $W_i = \mathrm{diag}\{\mathrm{var}(Y_{it} \mid Y_{is} = y_{is})\}$ with $\mathrm{var}(Y_{it} \mid Y_{is} = y_{is}) = \eta_{ist}(1-\eta_{ist})$. Using these estimating equations, the estimate $(\hat\beta, \hat\alpha)$ is the solution to (3) and (5) and can be obtained using a Gauss-Seidel algorithm. Using Taylor series expansions similar to Prentice[2], and assuming that the regression for $Y_i$ and the model for the association have been correctly specified, $(\hat\beta, \hat\alpha)$ is consistent for (β,α). In addition, $N^{1/2}\{(\hat\beta - \beta)', (\hat\alpha - \alpha)'\}'$ has an asymptotic distribution which is multivariate normal with mean vector 0.
In contrast to (5), Prentice[2] forms the unconditional residuals
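$\{Y_{is}Y_{it} - E(Y_{is}Y_{it} \mid X_i)\}$, that is, the deviations about unconditional expectations. These random variables can then be grouped together to form the $[T_i(T_i-1)/2] \times 1$ vector $(P_i - \nu_i)$, where $P_i = \{P_{i12}, P_{i13}, \ldots, P_{i(T_i-1)T_i}\}$ and $\nu_i = \{\nu_{i12}, \nu_{i13}, \ldots, \nu_{i(T_i-1)T_i}\}$, with $P_{ist} = Y_{is}Y_{it}$ and $\nu_{ist} = E(Y_{is}Y_{it} \mid X_i)$, for s<t. Then, Prentice[2] proposes the following set of (moment) estimating equations for α,
$$\sum_{i=1}^{N} A_i' \mathcal{P}_i^{-1} (P_i - \nu_i) = 0$$
where $A_i = \partial\nu_i/\partial\alpha$ and $\mathcal{P}_i \approx \mathrm{cov}(P_i)$ is a working covariance matrix (written in script to distinguish it from the vector $P_i$). Prentice[2] also suggests specifying the working covariance matrix for $P_i$ as $\mathrm{diag}\{\mathrm{var}(P_{ist})\}$, where $\mathrm{var}(P_{ist}) = \pi_{ist}(1-\pi_{ist})$ since $Y_{is}Y_{it}$ is binary. Assuming that the regression for $Y_i$ and the model for the association have been correctly specified, the estimating equations proposed by Prentice[2] yield estimates $(\hat\beta, \hat\alpha)$ that are consistent for (β,α). In addition, $N^{1/2}\{(\hat\beta - \beta)', (\hat\alpha - \alpha)'\}'$ has an asymptotic distribution which is multivariate normal with mean vector zero.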
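For comparison with the conditional approach, the sketch below evaluates one individual's contribution to an estimating function based on unconditional residuals. It makes two simplifying assumptions that are not stated in the text: α consists of a single exchangeable correlation ρ, and the working covariance is the diagonal form suggested by Prentice[2]. Under these assumptions the derivative of the expected product with respect to ρ is the square-root term of Eq. 2. The function name and input values are hypothetical.

```python
import math

def unconditional_alpha_score(y, pi, rho):
    """Sum over s < t of A_ist * (P_ist - nu_ist) / var(P_ist).

    Assumes a single exchangeable correlation rho (an illustrative choice).
    """
    total = 0.0
    T = len(y)
    for s in range(T):
        for t in range(s + 1, T):
            scale = math.sqrt(pi[s] * (1 - pi[s]) * pi[t] * (1 - pi[t]))
            nu = pi[s] * pi[t] + rho * scale   # expected product, as in Eq. 2
            p_st = y[s] * y[t]                 # observed product of responses
            # d nu / d rho = scale; var(P_ist) = nu * (1 - nu)
            total += scale * (p_st - nu) / (nu * (1 - nu))
    return total

# Hypothetical individual with three visits
value = unconditional_alpha_score([1, 1, 0], [0.5, 0.5, 0.5], 0.2)
```

Summing this quantity over individuals and solving for the root in ρ, while alternating with updates of β, is one way to organize the iterative solution.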
Data and variables: In our study we have used repeated measures data on diabetes mellitus to carry out the analysis. Follow-up data on 528 patients registered at BIRDEM (Bangladesh Institute of Research and Rehabilitation in Diabetes, Endocrine and Metabolic Disorders) in 1984-94 are used to identify the risk factors responsible for the transitions from the controlled diabetic state to the confirmed diabetic state, as well as from the confirmed diabetic state to the controlled state. We have taken into account four consecutive visits of each patient, starting from registration. The response variable is defined in terms of the glucose level observed 2 h after a 75 g glucose load at each follow-up visit. The cut-off point for the blood glucose level is 11.1 mmol L⁻¹. If the observed response is less than 11.1, the patient is defined as non-diabetic (categorized as 0); if the response is greater than or equal to 11.1, the patient is said to be diabetic (categorized as 1), according to WHO (1985) criteria. We include six independent variables in this study: age, sex, education level, area of residence, family history of father and mother, and time. Age represents the age of the respondent at each visit, and time represents the length of time between consecutive visits. These two variables are continuous and are used directly in the analysis. Sex, education level, area of residence and family history of father and mother are categorical variables. Sex is a dichotomous variable, with 0 for female and 1 for male.
Education level is likewise coded 0 and 1: 0 represents patients with below secondary education and 1 represents patients with secondary education or more. Area has two categories: 0 represents rural and 1 represents urban or semi-urban. FHFM represents the genetic history of the parents. This variable has two categories: 0 when neither father nor mother is diabetic and 1 when either or both parents are diabetic.
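The coding of the binary response from the 2 h glucose measurement can be sketched as follows. The readings are invented for illustration and do not come from the BIRDEM data; the function and constant names are hypothetical.

```python
CUTOFF_MMOL_PER_L = 11.1  # cut-off quoted in the text (WHO 1985 criteria)

def diabetic_status(glucose_2h):
    """Return 1 (diabetic) if the 2 h glucose level is at or above the cut-off, else 0."""
    return 1 if glucose_2h >= CUTOFF_MMOL_PER_L else 0

# Hypothetical readings (mmol/L) for one patient over four visits
visits = [9.8, 11.1, 12.4, 10.2]
y_i = [diabetic_status(g) for g in visits]
```

The resulting list plays the role of the response vector for one individual in the estimating equations.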
RESULTS AND DISCUSSION
The logistic regression model is considered as one of the most important and widely applicable techniques in analyzing repeated outcome variables. To assess the fit of a model, it is necessary to identify the influential elements. In the logistic regression analysis for repeated binary measures we adjust for setting and the covariates. We assumed independence, exchangeable and autoregressive working correlation structures and we obtained standard errors.
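The three working correlation structures can be written down explicitly. The sketch below builds each as a nested list; it assumes that "autoregressive" means first-order autoregressive, with correlation decaying as a power of the lag, which the text does not state explicitly. The function name and values are hypothetical.

```python
def working_correlation(T, structure, alpha=0.0):
    """Return a T x T working correlation matrix as nested lists."""
    if structure == "independence":
        corr = lambda s, t: 0.0
    elif structure == "exchangeable":
        corr = lambda s, t: alpha
    elif structure == "autoregressive":
        corr = lambda s, t: alpha ** abs(s - t)  # AR(1) assumed here
    else:
        raise ValueError("unknown structure: " + structure)
    return [[1.0 if s == t else corr(s, t) for t in range(T)] for s in range(T)]

# Hypothetical example: four visits, alpha = 0.5
R_exch = working_correlation(4, "exchangeable", alpha=0.5)
R_ar = working_correlation(4, "autoregressive", alpha=0.5)
```

Under the exchangeable structure every pair of visits shares the same correlation, whereas under the autoregressive structure visits further apart are less strongly correlated.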
These analyses were carried out using a specially written S-Plus program, and the results are shown in Tables 2 and 3. We found that for the repeated binary responses, the variables education level, area, family history of father and mother (i.e., the disease status of the parents) and time are significant under the independence, exchangeable and autoregressive correlation assumptions and thus have a considerable effect on changes in disease status. We also found that under all assumptions education level and area show a negative association, whereas family history of father and mother (FHFM) and time show a positive association. Among these variables, education level, area and time are significant at the 5% level in all cases (GEE for conditional and unconditional residuals) (Tables 2 and 3). FHFM is significant only at the 10% level in all cases. The estimated coefficients of age and sex are not significant in any case. Hence it may be concluded that age and sex have no significant effect on the transition from the confirmed diabetic state to the controlled diabetic state.
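The significance statements above rest on Wald tests of individual coefficients. A minimal sketch of such a test, assuming the usual form in which the estimate divided by its standard error is referred to a standard normal distribution, is given below. The estimate and standard error are invented and are not values from Tables 2 or 3.

```python
import math

def wald_test(estimate, std_error):
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = estimate / std_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical coefficient and standard error
z, p = wald_test(-0.49, 0.25)
significant_5 = p < 0.05
significant_10 = p < 0.10
```

A coefficient with a p-value between 0.05 and 0.10 would be reported as significant at the 10% level but not at the 5% level.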
Table 2: Estimates obtained by GEE assuming the various correlation structures within repeated outcomes with associated Wald test statistic for conditional residuals
Table 3: Estimates obtained by GEE assuming the various correlation structures within repeated outcomes with associated Wald test statistic for unconditional residuals
Table 4: Asymptotic relative efficiency of the GEE estimator based on conditional and unconditional residuals relative to the MLE
Table 4 gives the asymptotic efficiency of the GEE estimator assuming exchangeable and autoregressive correlation relative to the ML method. Comparing the results, we conclude that the parameters are estimated more efficiently by the GEE estimator based on conditional residuals than by the one based on unconditional residuals. Under the autoregressive correlation structure, the asymptotic relative efficiency is higher than under the other correlation structures.
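The text does not give the formula used for the asymptotic relative efficiency. One common convention, assumed here purely for illustration, takes the ratio of the asymptotic variance of the ML estimator to that of the GEE estimator for the same parameter. The variances below are invented and are not entries of Table 4.

```python
def asymptotic_relative_efficiency(var_mle, var_gee):
    """ARE of a GEE estimator relative to the MLE, assumed to be var(MLE) / var(GEE)."""
    return var_mle / var_gee

# Hypothetical asymptotic variances for one coefficient
are = asymptotic_relative_efficiency(0.040, 0.050)
```

Under this convention, a value closer to 1 means the GEE estimator loses less efficiency relative to the MLE.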
CONCLUSIONS
From the present data set it can be seen that parameter estimates based on both conditional and unconditional residuals are more efficient than the ML estimates. We may conclude that, for analyzing data on a chronic disease (e.g., diabetes mellitus) where the response variable is binary, GEE estimates based on conditional residuals under the assumption of autoregressive correlation are more efficient than those based on unconditional residuals. Furthermore, estimating equations based on conditional residuals could be constructed to estimate the association of repeated ordinal data. We conjecture that these estimating equations will also be more efficient than the estimating equations based on unconditional residuals in the present data set.
ACKNOWLEDGMENT
We would like to express our gratitude to the Director of BIRDEM for giving us kind permission to use their data. We are indebted to the Chairman, Department of Statistics, University of Dhaka, Bangladesh for his kind cooperation throughout this research.