Journal of Applied Sciences

Year: 2005 | Volume: 5 | Issue: 7 | Page No.: 1228-1231
DOI: 10.3923/jas.2005.1228.1231
Generalized Estimating Equations for Conditional and Unconditional Residuals in Diabetes Mellitus Data
Md. Abdus Salam Akanda, Kawsar Jahan, Maksuda Khanam and M. Ataharul Islam

Abstract: This study focuses on estimating the parameters of a marginal model for repeated binary responses through the Generalized Estimating Equations (GEE) methodology. GEE were applied to examine how certain covariates relate to changes in disease status over time. In addition, we consider GEE based on conditional and unconditional residuals, together with the common correlation structures seen in longitudinal studies. Here, GEE are applied to data on four repeated binary observations of patients registered at BIRDEM. We demonstrate that the estimator of the correlation based on conditional residuals is nearly efficient when compared with maximum likelihood. This estimator also yields more efficient estimates of the correlation than the usual GEE estimator, which is based on unconditional residuals. Finally, the results of applying the methods to the data set are presented.

How to cite this article
Md. Abdus Salam Akanda, Kawsar Jahan, Maksuda Khanam and M. Ataharul Islam, 2005. Generalized Estimating Equations for Conditional and Unconditional Residuals in Diabetes Mellitus Data. Journal of Applied Sciences, 5: 1228-1231.

Keywords: Unconditional residual, conditional residual, GEE, marginal model and logistic regression

INTRODUCTION

An increasingly popular approach for estimating the parameters of marginal models for repeated binary responses is the GEE methodology. To assess the fit of a model, it is necessary to identify the influential elements. In particular, Liang and Zeger[1] and Prentice[2] developed moment-based generalized estimating equations, which require specification only of the form of the first two moments of the vector of binary responses for each individual. Instead of modeling the association between pairs of binary responses in terms of the marginal correlations, Lipsitz et al.[3], Liang et al.[4] and Carey et al.[5] propose using the marginal odds ratio. Carey et al.[5] estimate the marginal odds ratio using conditional residuals and show that their estimating equations for the odds ratio are highly efficient when compared with the optimal second-order joint estimating equations (GEE2) of Liang et al.[4]. Carey et al.[5] also demonstrate that their method offers very significant computational savings over the optimal joint estimating equations. Albert and McShane[6] proposed generalized estimating equations for estimating the parameters of both the mean and partial correlation structures. They highlighted the use of this method for modeling the effect of spatial location and subject-specific covariates on spatially correlated binary data. Albert et al.[7] describe a methodology for jointly modeling the number of events and the vector of correlated binary severity measures. They functionally linked the regression parameters for the counts and binary means and discussed a GEE approach for parameter estimation. They also discussed the conditions under which the proposed joint modeling approach provides marked gains in efficiency relative to the common procedure of simply modeling the counts.
In this study, we demonstrate that the parameters of a measure of association between pairs of binary responses (here, the marginal correlation) can be estimated using conditional residuals, and that the usual GEE estimator is obtained using unconditional residuals.

Generalized estimating equations: The GEE approach provides consistent estimators of the regression parameters and requires only correct specification of the form of the mean function of the vector of responses for each individual. In longitudinal studies, there is an implicit ordering of the times of the observations of each individual. We assume that the ith individual is observed at times t = 1, 2, ..., Ti, where Ti need not be the same for all N individuals. With a binary response obtained at each time t, we form the Ti x 1 vector

$$ Y_i = (Y_{i1}, Y_{i2}, \ldots, Y_{iT_i})' $$

where the binary random variable Yit = 1 if the ith individual has response 1 (success) at time t and Yit = 0 otherwise. Each individual has a J x 1 covariate vector xit, measured at time t, which includes both time-stationary and time-varying covariates. Let Xi = [xi1, ..., xiTi]’ represent the Ti x J matrix of covariates for the ith individual. In the clustered data setting, Yi is the vector of binary responses for the Ti units within a cluster.

The marginal distribution of Yit is Bernoulli with

$$ f(y_{it} \mid x_{it}, \beta) = \exp\left[ y_{it}\theta_{it} - \ln\{1 + \exp(\theta_{it})\} \right] \tag{1} $$

where θit = ln(πit/(1-πit)), πit = πit(β) = pr(Yit = 1 | xit, β) and β is a J x 1 vector of parameters. The πit(β) can be grouped together to form the vector πi(β) of marginal probabilities of success, πi(β) = E[Yi | Xi, β] = [πi1, ..., πiTi]’. Since Yit is binary, the logistic link function, θit = x’itβ, is a natural choice, although in principle any link function could be chosen.
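As a small illustration of the logistic marginal model, the following Python sketch evaluates πit for one subject from the linear predictor θit = x’itβ. The covariate rows and the values of β are made up for the example and are not estimates from this study.

```python
import math

def marginal_probabilities(x_rows, beta):
    """Return pi_it = 1 / (1 + exp(-x_it' beta)) for each visit t of one subject."""
    probs = []
    for x_it in x_rows:
        theta = sum(x * b for x, b in zip(x_it, beta))  # linear predictor theta_it
        probs.append(1.0 / (1.0 + math.exp(-theta)))    # inverse logit
    return probs

# Hypothetical subject: intercept plus one covariate (visit index) over four visits.
x_i = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
beta = [-0.5, 0.25]  # illustrative values only
pi_i = marginal_probabilities(x_i, beta)
```

With these illustrative values the linear predictor is zero at the third visit, so the marginal probability there is one half.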

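We are interested in making inference about β, as well as about the parameters, say α, of the joint distribution of Yis and Yit (Table 1), where

$$ \pi_{ist} = E(Y_{is}Y_{it} \mid X_i, \beta, \alpha) = \mathrm{pr}(Y_{is} = 1, Y_{it} = 1 \mid X_i, \beta, \alpha) $$

This joint probability can be modeled in terms of the two marginal probabilities πis(β) and πit(β), as well as an association parameter (contained in α). Although the following methods can be used for any association parameter (e.g., marginal odds ratio, kappa coefficient, relative risk), we focus on the marginal correlation coefficient. From Table 1, the correlation between the responses at times s and t is

$$ \rho_{ist} = \mathrm{corr}(Y_{is}, Y_{it}) = \frac{\pi_{ist} - \pi_{is}\pi_{it}}{[\pi_{is}(1-\pi_{is})\,\pi_{it}(1-\pi_{it})]^{1/2}} $$

In terms of the correlation coefficient, the joint probability πist can be written as

$$ \pi_{ist} = \pi_{is}\pi_{it} + \rho_{ist}\,[\pi_{is}(1-\pi_{is})\,\pi_{it}(1-\pi_{it})]^{1/2} \tag{2} $$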
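The relationship in Eq. 2 between the correlation and the joint probability can be checked numerically. The sketch below is illustrative only: the function names are our own, and the four cell probabilities are obtained from the two marginals and πist by subtraction.

```python
import math

def joint_probability(pi_s, pi_t, rho):
    """Eq. 2: pi_ist = pi_is*pi_it + rho*sqrt(pi_is(1-pi_is)*pi_it(1-pi_it))."""
    return pi_s * pi_t + rho * math.sqrt(pi_s * (1 - pi_s) * pi_t * (1 - pi_t))

def correlation(pi_s, pi_t, pi_st):
    """Marginal correlation of two binary responses, the inverse of Eq. 2."""
    return (pi_st - pi_s * pi_t) / math.sqrt(pi_s * (1 - pi_s) * pi_t * (1 - pi_t))

def cell_probabilities(pi_s, pi_t, rho):
    """Probabilities of the four (y_s, y_t) outcomes of the 2x2 cross-classification."""
    p11 = joint_probability(pi_s, pi_t, rho)
    return {
        (1, 1): p11,
        (1, 0): pi_s - p11,
        (0, 1): pi_t - p11,
        (0, 0): 1 - pi_s - pi_t + p11,
    }

cells = cell_probabilities(0.5, 0.5, 0.2)  # hypothetical marginals and correlation
```

Not every value of the correlation is admissible for given marginals: all four cell probabilities must stay between 0 and 1, which a fitting routine would need to respect.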

In the following, let α denote the parameters of the correlation between pairs of binary responses. Then, to estimate (β, α), we suggest modifying the estimating equations proposed by Carey et al.[5], which were originally developed to estimate the marginal odds ratio. The estimating equations for β are given by:

$$ \sum_{i=1}^{N} D_i' V_i^{-1} [Y_i - \pi_i(\beta)] = 0 \tag{3} $$

where Di = ∂πi/∂β and Vi is the Ti x Ti “working” covariance matrix of Yi.

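Table 1: Cross-classification probabilities for times s and t, s≠t

            Yit = 1          Yit = 0                    Total
Yis = 1     πist             πis - πist                 πis
Yis = 0     πit - πist       1 - πis - πit + πist       1 - πis
Total       πit              1 - πit                    1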

The tth diagonal element of Vi(α, β) is var(Yit) = πit(1-πit), which is specified entirely by the marginal distributions, i.e., by β. The (s,t)th off-diagonal element of Vi is cov(Yis, Yit) = πist - πisπit, where πist is specified by Eq. 2.
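The construction of Vi can be sketched as follows for three working correlation structures that are common in longitudinal analysis. This is a minimal illustration, not the authors' program: the function names are ours, and "autoregressive" is assumed here to mean a first-order structure with correlation α^|s-t|.

```python
import math

def working_correlation(n_times, structure, alpha=0.0):
    """Return the n_times x n_times working correlation matrix as nested lists."""
    def rho(s, t):
        if s == t:
            return 1.0
        if structure == "independence":
            return 0.0
        if structure == "exchangeable":
            return alpha
        if structure == "autoregressive":  # assumed AR(1): alpha ** |s - t|
            return alpha ** abs(s - t)
        raise ValueError(f"unknown structure: {structure}")
    return [[rho(s, t) for t in range(n_times)] for s in range(n_times)]

def working_covariance(pi_i, structure, alpha=0.0):
    """V_i: pi_it(1 - pi_it) on the diagonal, rho_st * sd_s * sd_t off the diagonal."""
    sd = [math.sqrt(p * (1 - p)) for p in pi_i]
    corr = working_correlation(len(pi_i), structure, alpha)
    n = len(pi_i)
    return [[corr[s][t] * sd[s] * sd[t] for t in range(n)] for s in range(n)]

# Hypothetical marginal probabilities for four visits.
V_ar1 = working_covariance([0.5, 0.5, 0.5, 0.5], "autoregressive", alpha=0.5)
```

The off-diagonal entries equal πist - πisπit with πist taken from Eq. 2, so the matrix depends on β through the marginals and on α through the correlation.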

If α is unknown (which is typically the case), then it must be estimated with a set of estimating equations similar to (3). Following Carey et al.[5], for a pair of times s<t we form the conditional residuals {Yit - E(Yit | Yis = yis, Xi)}, that is, deviations about conditional expectations. These random variables can then be grouped together to form the [Ti(Ti-1)/2] x 1 vector of conditional residuals, (Ui - ηi), where:
Ui = {Ui12, Ui13, ..., Ui(Ti-1)Ti}’, ηi = {ηi12, ηi13, ..., ηi(Ti-1)Ti}’, with Uist = Yit and ηist = E(Yit | Yis = yis, Xi), for s<t.

From Table 1,

$$ \eta_{ist} = \mathrm{pr}(Y_{it} = 1 \mid Y_{is} = y_{is}, X_i) = y_{is}\,\frac{\pi_{ist}}{\pi_{is}} + (1 - y_{is})\,\frac{\pi_{it} - \pi_{ist}}{1 - \pi_{is}} \tag{4} $$
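The conditional expectation in Eq. 4 and the resulting conditional residuals can be sketched as below. The helper names and the `pi_joint` callback are hypothetical; the pairs are generated in the same order as the elements of Ui, that is (1,2), (1,3), ..., (Ti-1,Ti).

```python
def conditional_mean(y_s, pi_s, pi_t, pi_st):
    """Eq. 4: eta_ist = pr(Y_it = 1 | Y_is = y_s) from the 2x2 cell probabilities."""
    if y_s == 1:
        return pi_st / pi_s
    return (pi_t - pi_st) / (1 - pi_s)

def conditional_residuals(y_i, pi_i, pi_joint):
    """Return (Y_it - eta_ist, eta_ist * (1 - eta_ist)) for every pair s < t.

    pi_joint(s, t) must return pi_ist for 0-based visit indices s < t.
    """
    out = []
    n = len(y_i)
    for s in range(n - 1):
        for t in range(s + 1, n):
            eta = conditional_mean(y_i[s], pi_i[s], pi_i[t], pi_joint(s, t))
            out.append((y_i[t] - eta, eta * (1 - eta)))  # residual and its variance
    return out

# Hypothetical subject with three visits, marginals 0.5 and pi_ist = 0.3 for all pairs.
res = conditional_residuals([1, 0, 1], [0.5, 0.5, 0.5], lambda s, t: 0.3)
```

Averaging the conditional mean over the distribution of Yis returns the marginal probability πit, which is a quick consistency check on the formula.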

In order to form another set of moment estimating equations similar to (3), we need to take appropriate linear combinations of [Ui - ηi]. Thus a second set of (moment) estimating equations for α is given by:

$$ \sum_{i=1}^{N} C_i' W_i^{-1} [U_i - \eta_i] = 0 \tag{5} $$

where Ci = ∂ηi/∂α and Wi = diag{var(Yit | Yis = yis)} with var(Yit | Yis = yis) = ηist(1-ηist). Using these estimating equations, the estimate (β̂, α̂) is the solution to (3) and (5) and can be obtained using a Gauss-Seidel algorithm. Using Taylor series expansions similar to Prentice[2], and assuming that the regression model for Yi and the model for the association have been correctly specified, (β̂, α̂) is consistent for (β, α). In addition, N^{1/2}[(β̂ - β)’, (α̂ - α)’]’ has an asymptotic distribution which is multivariate normal with mean vector 0.

In contrast to (5), Prentice[2] forms the unconditional residuals {YisYit - E(YisYit | Xi)}, that is, the deviations about unconditional expectations. These random variables can then be grouped together to form the [Ti(Ti-1)/2] x 1 vector (Pi - νi), where Pi = {Pi12, Pi13, ..., Pi(Ti-1)Ti}’, νi = {νi12, νi13, ..., νi(Ti-1)Ti}’, with Pist = YisYit and νist = E(YisYit | Xi) = πist, for s<t. Then, Prentice[2] proposes the following set of (moment) estimating equations for α,

$$ \sum_{i=1}^{N} A_i' \{\mathrm{cov}(P_i)\}^{-1} [P_i - \nu_i] = 0 \tag{6} $$

where Ai = ∂νi/∂α and cov(Pi) is replaced by a “working” approximation. Prentice[2] suggests specifying this “working” covariance matrix for Pi as diag{var(Pist)}, where var(Pist) = πist(1-πist) since Pist = YisYit is binary. Assuming that the regression model for Yi and the model for the association have been correctly specified, the estimating equations proposed by Prentice[2] yield estimates that are consistent for (β, α). In addition, N^{1/2}[(β̂ - β)’, (α̂ - α)’]’ has an asymptotic distribution which is multivariate normal with mean vector zero.
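To make the two sets of estimating equations for α concrete, the sketch below solves each of them for a single exchangeable correlation ρ while holding the marginal probabilities fixed. This is a simplified illustration under stated assumptions, not the authors' S-PLUS program: β is treated as known, the diagonal working weights described above are used, the root is found by bisection, and all function names and data are hypothetical. A full fit would presumably alternate a step like this with solving (3) for β.

```python
import math

def pair_terms(y_s, y_t, pi_s, pi_t, kind):
    """Return (response, base, slope) such that the mean is base + rho * slope."""
    c = math.sqrt(pi_s * (1 - pi_s) * pi_t * (1 - pi_t))
    if kind == "conditional":
        # eta_ist = pi_it + rho * c * (y_is - pi_is) / (pi_is (1 - pi_is))
        return y_t, pi_t, c * (y_s - pi_s) / (pi_s * (1 - pi_s))
    # unconditional: nu_ist = pi_ist = pi_is * pi_it + rho * c
    return y_s * y_t, pi_s * pi_t, c

def estimating_function(rho, pairs, kind):
    """Sum over pairs of derivative * residual / variance for scalar rho."""
    total = 0.0
    for y_s, y_t, pi_s, pi_t in pairs:
        resp, base, slope = pair_terms(y_s, y_t, pi_s, pi_t, kind)
        mean = base + rho * slope
        total += slope * (resp - mean) / (mean * (1 - mean))
    return total

def feasible_interval(pairs, kind):
    """Range of rho that keeps every mean strictly inside (0, 1)."""
    lo, hi = -1.0, 1.0
    for y_s, y_t, pi_s, pi_t in pairs:
        _, base, slope = pair_terms(y_s, y_t, pi_s, pi_t, kind)
        a, b = -base / slope, (1 - base) / slope  # rho where the mean hits 0 and 1
        lo, hi = max(lo, min(a, b)), min(hi, max(a, b))
    return lo, hi

def solve_rho(pairs, kind, tol=1e-10):
    """Bisection: each term is decreasing in rho for a binary response."""
    lo, hi = feasible_interval(pairs, kind)
    pad = 1e-9 * (hi - lo)
    lo, hi = lo + pad, hi - pad
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if estimating_function(mid, pairs, kind) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical data: ten subjects, two visits each, as (y_s, y_t, pi_s, pi_t).
pairs = ([(1, 1, 0.5, 0.5)] * 4 + [(0, 0, 0.5, 0.5)] * 3 +
         [(1, 0, 0.5, 0.5)] * 2 + [(0, 1, 0.5, 0.5)] * 1)
rho_conditional = solve_rho(pairs, "conditional")
rho_unconditional = solve_rho(pairs, "unconditional")
```

On this toy data set the two approaches give different estimates, because the conditional residuals use all four cells of the cross-classification while the unconditional residuals use only the products YisYit. The example says nothing about efficiency, which would require the variance calculations summarized in Table 4.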

Data and variables: In our study we have used repeated measures data on diabetes mellitus to carry out the analysis. Follow-up data on 528 patients registered at BIRDEM (Bangladesh Institute of Research and Rehabilitation in Diabetes, Endocrine and Metabolic Disorders) in 1984-94 are used to identify the risk factors responsible for transitions from the controlled diabetic state to the confirmed diabetic state, as well as from the confirmed diabetic state to the controlled state. We have taken into account four consecutive visits of each patient, starting from registration. The response variable is defined in terms of the blood glucose level observed 2 h after a 75 g glucose load at each follow-up visit. The cut-off point for the blood glucose level is 11.1 mmol L-1. If the observed value is less than 11.1, the patient is defined as non-diabetic (coded 0); if it is greater than or equal to 11.1, the patient is defined as diabetic (coded 1), according to the WHO (1985) criteria. We include six independent variables in this study: age, sex, education level, area of residence, family history of father and mother and time. Of these, age represents the age of the respondent at each visit and time represents the length of time between consecutive visits. These two variables are continuous and are used directly in the analysis. Sex, education level, area of residence and family history of father and mother are categorical variables. Sex is a dichotomous variable, with 0 for female and 1 for male.

Education level is likewise coded 0 and 1: 0 represents patients with below-secondary education and 1 represents patients with secondary education or more. Area has two categories: 0 represents rural and 1 represents urban or semi-urban. FHFM represents the diabetes history of the parents. This variable has two categories: 0 if neither the father nor the mother is diabetic and 1 if either or both parents are diabetic.
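The coding scheme described above can be written down as a short Python sketch. The record field names are hypothetical (the layout of the BIRDEM data is not given here); only the cut-off of 11.1 and the 0/1 codings come from the text.

```python
GLUCOSE_CUTOFF_MMOL_L = 11.1  # 2 h post-load cut-off stated in the text

def diabetic_status(glucose_mmol_l):
    """Return 1 (diabetic) if glucose >= 11.1 mmol/L, else 0 (non-diabetic)."""
    return 1 if glucose_mmol_l >= GLUCOSE_CUTOFF_MMOL_L else 0

def code_visit(record):
    """Map one raw visit record (hypothetical field names) to model variables."""
    return {
        "y": diabetic_status(record["glucose_2h"]),
        "age": record["age"],                                  # continuous
        "sex": 1 if record["sex"] == "male" else 0,            # 0 female, 1 male
        "education": 1 if record["secondary_or_more"] else 0,  # 0 below secondary
        "area": 0 if record["area"] == "rural" else 1,         # 1 urban or semi-urban
        "fhfm": 1 if (record["father_diabetic"] or record["mother_diabetic"]) else 0,
        "time": record["time"],                                # continuous
    }

# Made-up record for illustration only.
example = code_visit({
    "glucose_2h": 12.4, "age": 52, "sex": "female", "secondary_or_more": True,
    "area": "semi-urban", "father_diabetic": False, "mother_diabetic": True, "time": 6,
})
```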

RESULTS AND DISCUSSION

The logistic regression model is considered one of the most important and widely applicable techniques for analyzing repeated outcome variables. To assess the fit of a model, it is necessary to identify the influential elements. In the logistic regression analysis for repeated binary measures we adjust for setting and the covariates. We assumed independence, exchangeable and autoregressive working correlation structures and obtained standard errors under each.

These analyses were carried out using a specially written S-PLUS program and the results are shown in Tables 2 and 3. We found that, for the repeated binary responses, the variables education level, area, family history of father and mother (i.e., the disease status of the parents) and time are significant under the independence, exchangeable and autoregressive correlation assumptions and thus have a considerable effect on changes in disease status. We also found that, under all assumptions, education level and area show a negative association, whereas Family History of Father and Mother (FHFM) and time show a positive association. Among these variables, education level, area and time are significant at the 5% level in all cases (GEE for conditional and unconditional residuals) (Tables 2 and 3). FHFM is the only one of these variables that is significant at the 10% level, rather than the 5% level, in all cases. The estimated coefficients of age and sex are not significant in any case. Hence it may be concluded that age and sex have no significant effect on the transition from the confirmed diabetic state to the controlled diabetic state.
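The significance statements above rest on Wald tests of the individual coefficients. A minimal sketch of such a test is given below, assuming the statistic is the estimate divided by its standard error and referred to a standard normal distribution (the tables may equivalently report its square as a chi-square statistic). The numbers in the example are invented and are not taken from Tables 2 and 3.

```python
import math

def wald_test(estimate, std_error):
    """Return (z, two-sided p-value) for H0: coefficient = 0."""
    z = estimate / std_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # 2 * (1 - Phi(|z|))
    return z, p_value

def significance_level(p_value):
    """Label a p-value using the 5% and 10% levels referred to in the text."""
    if p_value < 0.05:
        return "significant at 5%"
    if p_value < 0.10:
        return "significant at 10%"
    return "not significant"

z, p = wald_test(-0.5, 0.25)  # hypothetical estimate and standard error
label = significance_level(p)
```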

Table 2: Estimates obtained by GEE assuming the various correlation structures within repeated outcomes, with associated Wald test statistics, for conditional residuals

Table 3: Estimates obtained by GEE assuming the various correlation structures within repeated outcomes, with associated Wald test statistics, for unconditional residuals

Table 4: Asymptotic relative efficiency of the GEE Estimator based on conditional and unconditional residuals relative to the MLE

Table 4 gives the asymptotic relative efficiency of the GEE estimators, assuming exchangeable and autoregressive correlation, relative to the ML method. Comparing the results, we conclude that the parameters are estimated more efficiently by the GEE estimator based on conditional residuals than by the one based on unconditional residuals. Under the autoregressive correlation structure, the asymptotic relative efficiency is higher than under the other correlation structures.
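For readers who want to reproduce this kind of comparison, asymptotic relative efficiency is commonly computed as a ratio of asymptotic variances. The sketch below assumes that convention (variance of the ML estimator divided by variance of the GEE estimator); the definition used for Table 4 is not spelled out here, and the variances in the example are invented.

```python
def asymptotic_relative_efficiency(var_mle, var_gee):
    """ARE of a GEE estimator relative to the MLE, assumed to be var_mle / var_gee."""
    if var_mle <= 0 or var_gee <= 0:
        raise ValueError("asymptotic variances must be positive")
    return var_mle / var_gee

# Hypothetical asymptotic variances of the correlation estimator.
are_conditional = asymptotic_relative_efficiency(var_mle=0.040, var_gee=0.042)
are_unconditional = asymptotic_relative_efficiency(var_mle=0.040, var_gee=0.050)
```

Under this convention, values close to 1 indicate little loss of efficiency relative to maximum likelihood.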

CONCLUSIONS

From the present data set it can be seen that parameter estimates based on both conditional and unconditional residuals are more efficient than the ML estimates. We may conclude that, for analyzing data on a chronic disease (e.g., diabetes mellitus) in which the response variable is binary, GEE estimates based on conditional residuals are more efficient than those based on unconditional residuals, particularly under the assumption of autoregressive correlation. Furthermore, estimating equations based on conditional residuals could be constructed to estimate the association of repeated ordinal data. We conjecture that these estimating equations will also be more efficient than the estimating equations based on unconditional residuals in the present data set.

ACKNOWLEDGMENT

We would like to express our gratitude to the Director of BIRDEM for kindly giving us permission to use their data. We are indebted to the Chairman, Department of Statistics, University of Dhaka, Bangladesh, for his kind cooperation throughout this research.

REFERENCES

  • Liang, K.Y. and S.L. Zeger, 1986. Longitudinal data analysis using generalized linear models. Biometrika, 73: 13-22.

  • Prentice, R.L., 1988. Correlated binary regression with covariates specific to each binary observation. Biometrics, 44: 1033-1048.

  • Lipsitz, S.R., N.M. Laird and D.P. Harrington, 1991. Generalized estimating equations for correlated binary data: Using the odds ratio as a measure of association. Biometrika, 78: 153-160.

  • Liang, K.Y., S.L. Zeger and B. Qaqish, 1992. Multivariate regression analysis for categorical data (with discussion). J. Royal Stat. Soc. Ser. B, 54: 3-40.

  • Carey, V., S.L. Zeger and P.J. Diggle, 1993. Modeling multivariate binary data with alternating logistic regressions. Biometrika, 80: 517-526.

  • Albert, P.S. and L.M. McShane, 1995. A generalized estimating equation approach for spatially correlated binary data: Applications to the analysis of neuroimaging data. Biometrics, 51: 627-638.

  • Albert, P.S., D.A. Follmann and H.X. Barnhart, 1997. A generalized estimating equation approach for modeling random length binary vector data. Biometrics, 53: 1116-1124.
