INTRODUCTION
Many factors may contribute to, or be associated with, the reoccurrence of breast cancer, and a significant number of studies (Freedman et al., 2009; Brawley, 2009; Eleni and Gabriel, 2008; Habibi et al., 2008) have sought to determine which factors, or attributable variables, contribute significantly to predicting the relapse time under breast cancer therapy. The purpose of this study is to identify the significant attributable variables and all possible interactions among them, and to construct a statistical model that predicts the relapse time as a function of those variables and interactions, so that, given the values of the attributable variables for a specific patient, we can estimate the time until reoccurrence of breast cancer. Once the relapse time has been obtained from the model for the single treatment of tamoxifen alone and from the model for the combined treatment of tamoxifen and radiation, it can provide important guidance on which treatment to choose to prolong the time to reoccurrence (Hancke et al., 2009; Kimmick et al., 2009; Throckmorton and Esserman, 2009; Hershman et al., 2008; Schnitt and Harris, 2008; Trent et al., 2007). Three kinds of statistical models are presented: a parametric regression model, which assumes a specific probability distribution for the error term; a semiparametric model, the Cox proportional hazards model, which assumes proportional hazards; and a cure rate model, which allows for a cure fraction, i.e., the part of the patients who will never experience reoccurrence of breast cancer, while for the patients who remain subject to reoccurrence various parametric models are constructed to predict the relapse time.
MATERIALS AND METHODS
Between December 1992 and June 2000, a total of 769 women were enrolled and randomized (Fyles et al., 2004), of whom 386 received combined radiation (Taylor and Kim, 1993) and tamoxifen (RT+Tam) and the remaining 383 were assigned to the tamoxifen-alone arm (Tam). The last follow-up was conducted in the summer of 2002. The present study includes only the 641 patients who were enrolled at the Princess Margaret Hospital: 320 were treated with radiation and tamoxifen and 321 were treated with tamoxifen only. A summary of the data used in the present study is shown in Fig. 1.
The proposed statistical models in the present study are constructed separately for patients in the RT+Tam group and in the Tam group. The potential prognostic factors (attributable variables) are pathsize (size of tumor in cm); hist (Histology: DUC = Ductal, LOB = Lobular, MED = Medullary, MIX = Mixed, OTH = Other); hrlevel (Hormone receptor level: NEG = Negative, POS = Positive); hgb (Hemoglobin, g L^{-1}); nodediss (Whether axillary node dissection was done: Y = Yes, N = No); age (Age of the patient in years).

Fig. 1: 
Patient treatment data
The dependent variable or response variable is the relapse time (in years)
of a given patient.
One important question that we address is which of these attributable variables contribute significantly to the response variable, the relapse time. In addition, we identify all possible interactions among these variables that contribute to relapse time.
ACCELERATED FAILURE TIME (AFT) MODEL AND COX PROPORTIONAL HAZARDS (COXPH) MODEL
AFT model: When covariates are considered, we assume that the relapse time has an explicit relationship with the covariates. Furthermore, when a parametric model (Lawless, 2003) is considered, we assume that the relapse time also follows a given theoretical probability distribution.
Let T denote a continuous nonnegative random variable representing the survival time (relapse time in this case), with probability density function (pdf) f(t) and cumulative distribution function (cdf) F(t) = Pr(T<t). We will focus on the survival function S(t) = Pr(T>t), the probability of being alive at t (reoccurrence free in this case). In this model, we start from a random variable W with a standard distribution in (-∞, +∞) and generate a family of survival distributions by introducing location and scale parameters to relate W to the relapse time as follows:
log T = α + σW
where, α and σ are the location and scale parameters, respectively.
Adding covariates x, with regression coefficients β, into the location parameter we have:
log T = x'β + σW
where, the error term W has a suitable probability distribution, e.g., extreme value, normal or logistic. These choices lead to the Weibull, lognormal and log-logistic models for T, respectively. Statistical models of this type are called Accelerated Failure Time (AFT) models. More information about AFT models can be found elsewhere (Yamaguchi, 1992; Shang and Jeremy, 2002; Lawless, 2003).
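To make the AFT formulation concrete, the following minimal sketch evaluates the survival function implied by the usual log-linear form log T = x'β + σW when W follows the standard extreme value (minimum) distribution, which corresponds to a Weibull model for T. It is an illustration only, not the code used in this study: the function name, the coefficient values and the two-covariate layout (intercept and tumor size) are all hypothetical.

```python
import math

def aft_weibull_survival(t, x, beta, sigma):
    """S(t | x) for log T = x'beta + sigma * W, with W standard extreme value (minimum)."""
    mu = sum(xj * bj for xj, bj in zip(x, beta))  # location x'beta
    w = (math.log(t) - mu) / sigma                # standardized log time
    return math.exp(-math.exp(w))                 # Pr(W > w) for this distribution

# Hypothetical coefficients: intercept and effect of tumor size (cm).
beta = [2.0, -0.3]
sigma = 0.8
small_tumor = aft_weibull_survival(5.0, [1.0, 1.0], beta, sigma)
large_tumor = aft_weibull_survival(5.0, [1.0, 3.0], beta, sigma)
print(small_tumor, large_tumor)
```

With a negative coefficient, a larger covariate value shifts log T downward, so the whole survival curve is compressed in time; this rescaling of the time axis is what "accelerated failure time" refers to.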
CoxPH model: An alternative approach to modeling survival data is to assume that the effect of the covariates is to increase or decrease the hazard function by a proportionate amount at all durations. Thus:
λ(t|x) = λ_{0}(t) exp(x'β)
or
log λ(t|x) = log λ_{0}(t) + x'β
where, λ_{0}(t) is the baseline hazard function, i.e., the hazard for an individual with covariate values 0, and exp(x'β) is the relative risk associated with the covariate values x. Subsequently, for the survival functions:
S(t|x) = S_{0}(t)^{exp(x'β)}
where, S_{0}(t) is the baseline survival function. Hence, the survival function for covariates x is the baseline survival function raised to a power.
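As a small illustration of the proportional hazards relationship S(t|x) = S_{0}(t)^{exp(x'β)}, the sketch below raises a baseline survival probability to the power of the relative risk. The function name and the numerical values are hypothetical and are not estimates from this study.

```python
import math

def cox_survival(s0_t, x, beta):
    """Survival probability S(t | x) = S0(t) ** exp(x'beta) under proportional hazards."""
    relative_risk = math.exp(sum(xj * bj for xj, bj in zip(x, beta)))
    return s0_t ** relative_risk

# Hypothetical example: baseline survival 0.9 at some time t and one covariate
# whose coefficient doubles the hazard (relative risk exp(log 2) = 2).
s = cox_survival(0.9, [1.0], [math.log(2.0)])
print(s)  # close to 0.81, i.e., 0.9 squared
```

A relative risk above 1 lowers the survival curve at every time point, and a relative risk below 1 raises it, without changing the shape of the baseline.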
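Parameter estimates in the CoxPH model are obtained by maximizing the partial likelihood as opposed to the full likelihood. In the absence of tied relapse times, the partial likelihood is given by:
L(β) = ∏_{i} exp(x_{i}'β) / Σ_{j∈R(t_{i})} exp(x_{j}'β)
where, the product is taken over the patients with an observed (uncensored) relapse time t_{i} and R(t_{i}) is the risk set, i.e., the patients still under observation and reoccurrence free just before t_{i}.
The log partial likelihood is given by:
log L(β) = Σ_{i} [x_{i}'β - log Σ_{j∈R(t_{i})} exp(x_{j}'β)]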
In application of the CoxPH model, we also included the interactions of the attributable variables.
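The log partial likelihood can be sketched in a few lines. The version below assumes no tied relapse times and takes the risk set at t_{i} to be every patient whose observed time is at least t_{i}; the function name, argument layout and toy data are hypothetical and are not taken from the trial data.

```python
import math

def cox_log_partial_likelihood(times, events, covariates, beta):
    """Log partial likelihood of a Cox model, assuming no tied relapse times."""
    # Linear predictor x_i'beta for each patient.
    eta = [sum(xj * bj for xj, bj in zip(x, beta)) for x in covariates]
    loglik = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i != 1:
            continue  # censored patients enter only through the risk sets
        # Risk set: patients whose observed time is at least t_i.
        risk_sum = sum(math.exp(eta[j]) for j, t_j in enumerate(times) if t_j >= t_i)
        loglik += eta[i] - math.log(risk_sum)
    return loglik

# Toy data: three patients, the last one censored (event indicator 0).
times = [1.0, 2.0, 3.0]
events = [1, 1, 0]
covariates = [[1.0], [0.0], [0.5]]
print(cox_log_partial_likelihood(times, events, covariates, [0.0]))
```

In practice the coefficients would be found by maximizing this function numerically, and an interaction is handled by adding the product of two covariates as an extra column.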
MODEL RESULTS
The major objective of applying these models is to identify which of the six attributable variables contribute significantly to the relapse time of breast cancer patients receiving different treatments. The six explanatory variables used in the models are pathsize (size of tumor in cm); hist (Histology: DUC = Ductal, LOB = Lobular, MED = Medullary, MIX = Mixed, OTH = Other); hrlevel (Hormone receptor level: NEG = Negative, POS = Positive); hgb (Hemoglobin, g L^{-1}); nodediss (Whether axillary node dissection was done: Y = Yes, N = No); age (Age of the patient in years).
The most commonly used AFT models, namely the exponential, Weibull and lognormal AFT models, and the CoxPH model are applied. After fitting the model that includes all covariates and the interactions between covariates, the number of parameters is reduced by stepwise regression based on the Akaike Information Criterion (AIC) (Akaike, 1974), which is a measure of the goodness of fit of an estimated statistical model.
Table 1: 
Significant factors in parametric regression models for RT+Tam
group 

The stars (*) in the table indicate that the variable is
significant at the 0.05 significance level
Table 2: 
Significant factors in parametric regression models for Tam
group 

The AIC trades off the complexity of an estimated model against how well the model fits the data. It is given by:
AIC = -2 log(likelihood) + 2(p+k)
where, p is the number of regression parameters and k is the number of parameters in the assumed distribution. Statistical models with lower AIC are preferred. Table 1 shows the covariates and interactions in the statistical models chosen using the AIC, together with their corresponding p-values, for the breast cancer patients who were treated with both radiation and tamoxifen.
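The AIC comparison can be illustrated with a short sketch, using the standard form AIC = -2 log(likelihood) + 2(p+k). The log-likelihood values below are hypothetical placeholders chosen for illustration, not results from this study; the only facts used are that the exponential distribution has one parameter and the Weibull distribution has two.

```python
def aic(log_likelihood, p, k):
    """AIC = -2 * log-likelihood + 2 * (p + k)."""
    return -2.0 * log_likelihood + 2.0 * (p + k)

# Hypothetical maximized log-likelihoods for two AFT fits with six regression terms.
candidates = {
    "exponential": aic(-250.0, p=6, k=1),  # one distribution parameter
    "weibull": aic(-247.5, p=6, k=2),      # two distribution parameters
}
best = min(candidates, key=candidates.get)
print(candidates, best)
```

The extra Weibull parameter is only worth keeping if it raises the log-likelihood by more than one unit, which is the trade-off between fit and complexity that the criterion encodes.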
As can be seen from Table 1, age, pathsize, nodediss, hrlevel, the interaction between age and nodediss and the interaction between nodediss and hrlevel are significant with respect to the relapse time of breast cancer patients who received radiation and tamoxifen. The interaction of pathsize and hrlevel is significant only in the Weibull AFT model.
Table 2 addresses the same aspects as Table 1 for breast cancer patients who were treated with tamoxifen only.
For patients who received tamoxifen only, nodediss and hrlevel are the only single attributable variables that are significant with respect to relapse time. It is worth noting that, although age itself does not contribute significantly to relapse time, the interaction between age and nodediss is significant. hgb is found to be significant only in the lognormal AFT model.
Within each treatment group, the three AFT models give almost the same results at the 0.05 significance level. The significant prognostic factors for the relapse time of breast cancer patients who received the combined treatment of radiation and tamoxifen are age, pathsize, nodediss, hrlevel, age:nodediss and nodediss:hrlevel, which are statistically significant in all of the lognormal, exponential and Weibull regression models; pathsize:hrlevel contributes significantly only in the Weibull regression model. For patients in the Tam arm, all three models show that nodediss, hrlevel and age:nodediss contribute significantly, and hgb is significant only in the lognormal regression model.
Furthermore, the significant prognostic factors identified using the CoxPH model confirm this conclusion: there are six significantly contributing variables, two of which are interactions, for the RT+Tam arm and three significantly contributing variables, one of which is an interaction, for the Tam arm.
Next, the predicted survival curves of the three AFT models and the CoxPH model for each arm are compared with the Kaplan-Meier survival curve and its 95% confidence band to determine the best predictive model for relapse time.
Kaplan-Meier vs. parametric survival analysis: From the above four models, we identified the significant attributable variables and the interactions between them that contribute to the relapse time of breast cancer patients in the two treatment groups. To investigate which model gives the best fit of the relapse time in those two groups, a graphical presentation is a useful tool. In this study, the Kaplan-Meier curve, a commonly used nonparametric survival curve, and its 95% confidence limits are plotted against the survival curves obtained from the four models discussed above to see which model gives the curve closest to the Kaplan-Meier survival curve.
The Kaplan-Meier estimator generalizes the empirical distribution to censored data, and its estimate of the survival function is given by:
S(t) = ∏_{t_{i}≤t} (n_{i} - d_{i}) / n_{i}
where, S(t) is the probability that an individual remains free of reoccurrence of breast cancer beyond time t and t_{1}≤t_{2}≤t_{3}≤...≤t_{n} are the observed times until reoccurrence for a sample of size n; n_{i} is the number of patients still at risk (reoccurrence free and not censored) just prior to time t_{i} and d_{i} is the number of reoccurrences at time t_{i}.
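The product-limit calculation can be sketched directly from its definition, S(t) as the product of (n_{i} - d_{i}) / n_{i} over the event times up to t. The function below is a minimal illustration on toy data (not the trial data) and does not compute the 95% confidence band; the function name and the event coding (1 = reoccurrence observed, 0 = censored) are assumptions.

```python
def kaplan_meier(times, events):
    """Return [(t_i, S(t_i))] at each distinct observed event time."""
    distinct = sorted({t for t, d in zip(times, events) if d == 1})
    surv = 1.0
    curve = []
    for t_i in distinct:
        n_i = sum(1 for t in times if t >= t_i)  # at risk just before t_i
        d_i = sum(1 for t, d in zip(times, events) if t == t_i and d == 1)
        surv *= (n_i - d_i) / n_i                # product-limit update
        curve.append((t_i, surv))
    return curve

# Toy data: five patients, two of them censored (event indicator 0).
toy_times = [1, 2, 2, 3, 4]
toy_events = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(toy_times, toy_events):
    print(t, round(s, 3))
```

Censored patients never produce a drop in the curve, but they do count in the risk set n_{i} until the time at which they are censored.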
Using the breast cancer data for patients from the RT+Tam arm, the Kaplan-Meier curve and its 95% confidence limits are plotted against the survival curve of the lognormal AFT model in Fig. 2.
As can be seen from Fig. 2, in the second year, the third year and around the sixth year, the survival curve from the lognormal AFT model falls outside the 95% confidence band of the Kaplan-Meier curve.
For the exponential AFT model, the same graphical representation is given in Fig. 3. From Fig. 3, the survival curve estimated from the exponential AFT model lies outside the 95% confidence band from year 1 to year 4 and from year 5 to year 6.
Figure 4 shows the survival curve obtained from the Weibull AFT model; it deviates from the 95% confidence band of the Kaplan-Meier curve in a pattern similar to that of the survival curve of the exponential AFT model.

Fig. 2: 
Survival curve from lognormal regression model vs. Kaplan-Meier
survival curve and its 95% confidence interval for RT+Tam

Fig. 3: 
Survival curve from exponential regression model vs. Kaplan-Meier
survival curve and its 95% confidence interval for RT+Tam
However, Fig. 5, which shows the survival curve obtained from the CoxPH model, makes it clear that this survival curve lies within the 95% confidence band of the Kaplan-Meier curve most of the time.
Thus, we can conclude from the above analysis that the CoxPH model with interactions gives a better prediction of the relapse probability of breast cancer patients in the RT+Tam arm than the three AFT models.
Similarly, we proceed to perform a survival analysis of the relapse time for the patients who were treated with tamoxifen only. Figures 6-8 show the survival curves obtained from the lognormal, exponential and Weibull AFT models. It is clear that those survival curves fall outside the 95% confidence limits of the Kaplan-Meier curve most of the time. However, in Fig. 9, which shows the survival curve obtained from the CoxPH model with interactions, the survival curve lies within the 95% confidence band.

Fig. 4: 
Survival curve from Weibull regression model vs. Kaplan-Meier
survival curve and its 95% confidence interval for RT+Tam

Fig. 5: 
Survival curve from CoxPH model vs. Kaplan-Meier survival
curve and its 95% confidence interval for RT+Tam

Fig. 6: 
Survival curve from lognormal regression model vs. Kaplan-Meier
survival curve and its 95% confidence interval for Tam

Fig. 7: 
Survival curve from exponential regression model vs. Kaplan-Meier
survival curve and its 95% confidence interval for Tam

Fig. 8: 
Survival curve from Weibull regression model vs. Kaplan-Meier
survival curve and its 95% confidence interval for Tam

Fig. 9: 
Survival curve from CoxPH model vs. Kaplan-Meier survival
curve and its 95% confidence interval for Tam
Table 3: 
Reoccurrence-free probability

Therefore, we can conclude that for patients who received tamoxifen only, the CoxPH model with interactions gives a more precise prediction of the relapse time than the AFT models.
Since the CoxPH model gives a better prediction of the relapse probability than the AFT models for both groups, we use the CoxPH model to estimate the probability of remaining reoccurrence free at 2, 5 and 8 years; the results are shown in Table 3.
Although the models are consistent in identifying the significant prognostic factors for reoccurrence of breast cancer, the six AFT graphs above (Fig. 2-4 and 6-8) show that a parametric regression model might not be a good choice for prediction. The CoxPH models with interactions have better predictive power than the regression models. It is therefore advisable to use the CoxPH model with interactions to predict the relapse time of a breast cancer patient given the values of the attributable variables. As can be seen from the reoccurrence-free probabilities in Table 3, patients who received the combined treatment have a higher probability of remaining free of reoccurrence than those who received the single treatment. More information on comparing survival models can be found elsewhere (Nardi and Schemper, 2003; Orbe et al., 2002).
CURE RATE STATISTICAL MODEL
Model introduction: Any clinical trial consists of a heterogeneous group of patients that can be divided into two groups. Those who respond favorably to the treatment and become insusceptible to the disease are called cured. The others, who do not respond to the treatment, remain uncured. It is of interest to determine the proportion of cured patients and to study the causes of the failure of the treatment or the reoccurrence of the disease. Unlike the above-mentioned parametric survival regression models and the semiparametric CoxPH model with interactions, which assume that each patient is susceptible to failure of treatment or reoccurrence, cure rate statistical models are survival models consisting of cured and uncured fractions. These models are widely used in analyzing cancer data from clinical trials. The first model to estimate the cure fraction was developed by Boag (1949) and is called the mixture model or standard cure rate model. Further developments of this model can be found elsewhere (Peng et al., 1998; Goldman, 1984; Farewell, 1982; Ghitany et al., 1992; Theodora et al., 2008; Fabien and Pierre, 2007; Uddin et al., 2006).
Let π denote the proportion of cured patients, so that 1-π is the proportion of uncured patients; then the survival function for the whole group is given by:
S(t) = π + (1-π) S_{u}(t)
where, S_{u}(t) is the survival function of the uncured group. It follows that the density function is given by:
f(t) = (1-π) f_{u}(t)
where, f_{u}(t) is the density function of the uncured group.
For uncured patients, we assume that the failure time or relapse time T follows a classical probability distribution, and we can add the effect of covariates to the model using the parametric survival regression models studied in the previous section. The cure rate π can either be assumed constant or be made dependent on the covariates x through a logistic model with coefficient vector γ, that is:
π(x) = exp(x'γ) / (1 + exp(x'γ))
Thus, covariates may be used either in cure rate or in the failure time probability distribution of the uncured patients. These different conditions will be considered in developing the modeling process.
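A minimal sketch of the mixture structure S(t) = π + (1-π) S_{u}(t) is given below. The Weibull form for the uncured survival function is only one of the distributions considered in this study and is chosen here for brevity; the function names, the coefficient name gamma and all numerical values are hypothetical.

```python
import math

def cure_probability(x, gamma):
    """Logistic cure rate: pi(x) = exp(x'gamma) / (1 + exp(x'gamma))."""
    eta = sum(xj * gj for xj, gj in zip(x, gamma))
    return 1.0 / (1.0 + math.exp(-eta))

def mixture_survival(t, pi, shape, scale):
    """S(t) = pi + (1 - pi) * S_u(t), with a Weibull S_u assumed for illustration."""
    s_u = math.exp(-((t / scale) ** shape))
    return pi + (1.0 - pi) * s_u

# Hypothetical values: constant cure rate 0.1 and a Weibull uncured distribution.
for t in (0.0, 2.0, 5.0, 50.0):
    print(t, mixture_survival(t, pi=0.1, shape=1.5, scale=3.0))
```

The overall survival curve starts at 1 and levels off at the cure rate π instead of falling to 0, which is the feature that distinguishes a cure rate model from the AFT and CoxPH models.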
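Estimates of the parameters in the model can be obtained by maximizing the overall likelihood function given by:
L = ∏_{i=1}^{n} [(1-π) f_{u}(t_{i})]^{δ_{i}} [π + (1-π) S_{u}(t_{i})]^{1-δ_{i}}
where, t_{i} is the observed relapse time, f_{u}(t) is the density function of the uncured group and δ_{i} is the censoring indicator, with δ_{i} = 1 if t_{i} is uncensored and δ_{i} = 0 otherwise (the symbol δ is used for the indicator to avoid confusion with the scale parameter σ).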
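The likelihood of the mixture cure model can be sketched as follows: an observed relapse contributes (1-π) f_{u}(t) and a censored patient contributes π + (1-π) S_{u}(t). The sketch assumes a constant cure rate and an exponential distribution for the uncured patients purely to keep the code short; this is a simplification and not one of the fitted models reported here, and the function name and toy data are hypothetical.

```python
import math

def mixture_cure_loglik(times, events, pi, rate):
    """Log-likelihood of a mixture cure model with exponential uncured times (assumed)."""
    loglik = 0.0
    for t, d in zip(times, events):
        s_u = math.exp(-rate * t)  # uncured survival function
        f_u = rate * s_u           # uncured density function
        if d == 1:
            loglik += math.log((1.0 - pi) * f_u)        # observed relapse
        else:
            loglik += math.log(pi + (1.0 - pi) * s_u)   # censored observation
    return loglik

# Toy data: relapse observed at 1 and 3 years, one patient censored at 2 years.
toy_times = [1.0, 2.0, 3.0]
toy_events = [1, 0, 1]
print(mixture_cure_loglik(toy_times, toy_events, pi=0.1, rate=0.5))
```

Setting pi to 0 recovers the ordinary exponential survival likelihood, so the cure rate model contains the standard parametric model as a special case.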
Model results for the breast cancer data: For the survival regression part, Weibull, lognormal (Lnormal), Gamma, generalized log-logistic (GLL), log-logistic (Llogistic), generalized F (GF), extended generalized gamma (EGG) and Rayleigh parametric regressions are used. The following four cases are considered.
Case 1: No covariates in π and S_{u}(t): Both the cure rate and the survival curve of the uncured group are independent of covariates. In this case we obtain very different cure rates under different distributions, which suggests that the model is very sensitive to the assumed distribution of the failure time of uncured patients.
Case 2: No covariates in π, six single covariates in S_{u}(t): When covariates are included in the survival function of the uncured group, Table 4 shows that the cure rate is fairly consistent across the different distributional assumptions.
Case 3: Six single covariates in π, six single covariates in S_{u}(t): Covariates are included in both the cure rate and the survival function of the uncured group. Although six covariates are added to the cure rate, there is not much improvement in the likelihood and sometimes the likelihood is even lower, which suggests that the cure rate might not depend on those covariates; instead, we can treat it as a constant.
Case 4: No covariates in π, six single covariates and their interactions in S_{u}(t): Since there is no significant difference in the maximized likelihood between cases 2 and 3, including covariates in the cure rate does not improve the model much. Thus, in the following analysis, we treat the cure rate as a constant, i.e., independent of those covariates. Table 5 shows that the cure rates are consistent across the different parametric regression models.
Table 4: 
Likelihood and cure rate with covariates in uncured survival
function 

Table 5: 
Likelihood and cure rate with covariates and interactions
in uncured survival function 

After computing the AIC of the above models for each group, the smallest value for RT+Tam is obtained with Gamma and the smallest value for Tam with EGG. Hence, we choose the mixture cure model with Gamma regression for the uncured RT+Tam group and with EGG regression for the uncured Tam group. Among patients who received radiation and tamoxifen, an estimated 10% are cured of breast cancer and are not subject to reoccurrence. Among those who received tamoxifen alone, however, only 7.48% are cured, which suggests that giving radiation to breast cancer patients who take tamoxifen could decrease the probability of reoccurrence of breast cancer.
Thus, the cure rate model is useful in identifying the cure rate of breast cancer in each treatment group. The lower cure rate in the tamoxifen group (7.48%) than in the combined treatment group (10%) not only provides insight into treatment selection with respect to cure rate, but also gives a credible estimate of the percentage of cured patients.
CONCLUSION
By applying the AFT and CoxPH models, the significant factors and interactions that contribute to the relapse time of breast cancer patients receiving different treatments are identified, and the AFT and CoxPH models give consistent results. With respect to predicting the survival curve, the CoxPH model gives a better fit than the AFT models. Thus, given the covariate information for a breast cancer patient, the CoxPH model with interactions can be applied to estimate the time before reoccurrence of breast cancer. From a different perspective, the cure rate model takes into consideration the fact that some of the patients are cured and will never experience reoccurrence. It is found that the cure rates for the RT+Tam and Tam groups are both independent of the covariates and differ from each other. For the RT+Tam group, the cure rate is 0.1, which is higher than that of the Tam group, 0.0748. Thus, using the cure rate statistical model, we conclude that patients who received the combined treatment of radiation and tamoxifen are more likely to be cured of breast cancer, and less susceptible to reoccurrence, than those who received the single treatment.
ACKNOWLEDGMENT
The author would like to acknowledge the useful suggestions of Dr. James Kepner, Vice President, American Cancer Society, on the subject of this study.