INTRODUCTION
Research on software reliability has established the existence of the software aging phenomenon in long-running software systems, in which system performance degrades gradually over time and may end in sudden downtime (Avritzer and Weyuker, 1997). Software aging has been observed in software systems such as operating systems, web servers and Java virtual machines (Vaidyanathan and Trivedi, 1999; Grottke et al., 2006; Cotroneo et al., 2007). The causes of software aging can be divided mainly into two categories: fatal system abnormalities that lead to instantaneous system collapse, and resource exhaustion that leads to performance degradation and loss of memory capacity.
To counteract software aging and avoid the serious losses caused by sudden system failures, Huang et al. (1995) introduced a proactive, preventive fault-tolerance technique called software rejuvenation. It involves periodically rolling the running software back to a clean state in order to prevent more serious failures in the future.
Current research on software rejuvenation follows two kinds of policies: the model-based policy and the measurement-based policy. In the model-based policy, a software rejuvenation model is built on an assumed distribution of system failures and used to determine the optimal rejuvenation interval. Several studies have addressed this issue through analytical modeling. Huang et al. (1995) used a continuous-time Markov process to build a two-phase software rejuvenation model that takes downtime costs into consideration. Garg et al. (1998) presented a software rejuvenation model of a transaction processing system and obtained the optimal rejuvenation interval by minimizing the probability of transaction loss and maximizing the system availability. Xie et al. (2005) used a semi-Markov process to build a model that estimates the software rejuvenation schedule by maximizing the system availability. Avritzer et al. (2010) introduced a software rejuvenation model for mission-critical systems subject to worm infection, which tracks both the state of the mission and a customer-affecting metric. Dohi et al. (2012) considered a two-step failure process model with periodic rejuvenation and derived the optimal rejuvenation policy that maximizes the system reliability. In general, the model-based policy works well when the runtime state changes little, but it is inflexible and makes timely decisions difficult.
In the measurement-based policy (Grottke et al., 2006; Cotroneo et al., 2007; Matias et al., 2010; Kourai and Chiba, 2011), the runtime state and performance parameters of the software system are monitored continuously and the software rejuvenation interval is estimated from the collection and statistical analysis of system data. This kind of policy is highly flexible in real-time decision-making and suits operating scenarios that change substantially. However, it requires frequent monitoring, which consumes additional system resources, and its computation is relatively complex and expensive.
Because system failures occur randomly, it may take some time before a failure is discovered. A periodic inspection mechanism is therefore introduced to discover system failures. However, frequent inspection of a running software system incurs overhead and increases system losses, so it is important to determine when and how often software rejuvenation and system inspection should be initiated.
In this study, a software rejuvenation model with a periodic inspection policy is built from a mathematical modeling point of view, using probability and statistics. To improve software reliability, the system unavailability and cost rate functions are given and the optimal system inspection interval and software rejuvenation interval are derived. Finally, numerical results are presented to validate the proposed model.
MODELLING AND ANALYSIS OF PERIODICALLY INSPECTED SOFTWARE REJUVENATION POLICY
Most previous studies of the model-based software rejuvenation policy put forward different algorithms or models to derive the optimal rejuvenation interval. As shown in Fig. 1, the system begins operation with the best performance value Pmax. By time δ΄, system performance has fallen to Pmin and software rejuvenation is performed to bring the system back to its initial healthy state. However, because system failures occur randomly, system performance may already have declined to the minimum at time δ΄΄ (δ΄΄<δ΄), so that the system crashes within the software rejuvenation interval δ΄. In this case a reactive maintenance method such as system recovery is used to return the system to its initial state.
Basic idea of periodically inspected software rejuvenation policy: Because system failures occur randomly, it is hard to discover them immediately. It is therefore necessary to adopt the periodically inspected software rejuvenation policy, in which a system inspection is performed at intervals of δ and the error log records whether a system failure has occurred.
Fig. 1: Periodical rejuvenation model of the software system. δ΄ is the software rejuvenation interval, δ΄΄ is the time at which the system crashes, Pmax is the best performance value and Pmin is the worst performance value
Fig. 2: Failures are detected after k (k≤n) inspections. δ is the inspection interval, τr is the execution time to recover from system failures and n is the number of system inspections
Fig. 3: No failure is detected after n inspections. δ is the inspection interval, τR is the execution time to perform software rejuvenation and n is the number of system inspections
The system completes a renewal cycle in one of two ways. In the first case, shown in Fig. 2, a system failure is discovered at the k-th (k≤n) inspection and system recovery is carried out to return the system to its initial state. In the second case, shown in Fig. 3, no failure is detected in n inspections and software rejuvenation is performed to return the system to its initial state.
The basic assumptions of the periodically inspected software rejuvenation policy are:
• The values of τR and τr are chosen on a practical basis, with τR<τr
• After system recovery or software rejuvenation, the software system returns to a new initial state
• The failure rate is monotonically increasing and the failure time follows a general distribution
• Inspections discover system failures and have no effect on the operation of the software system
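The two ways a renewal cycle can end lend themselves to a small simulation. The following is a minimal sketch, not the author's implementation: it assumes that a failure is found at the first inspection at or after the moment it occurs, and it draws failure times from a standard two-parameter Weibull law. The function names `simulate_cycle` and `estimate` and the parameters `shape` and `scale` are hypothetical.

```python
import math
import random


def simulate_cycle(failure_time, delta, n, tau_r, tau_R):
    """Return (cycle_length, uptime, inspections, ended_by) for one renewal cycle."""
    horizon = n * delta
    if failure_time <= horizon:
        # Assumption: the failure is found at the first inspection at or after it.
        k = max(1, math.ceil(failure_time / delta))
        return k * delta + tau_r, failure_time, k, "recovery"
    # No failure within n inspections: rejuvenate at time n * delta.
    return horizon + tau_R, horizon, n, "rejuvenation"


def estimate(delta, n, tau_r, tau_R, shape, scale, runs=20000, seed=1):
    """Monte Carlo estimate of the mean cycle length and the unavailability."""
    rng = random.Random(seed)
    total_len = total_up = 0.0
    for _ in range(runs):
        # Hypothetical failure law; weibullvariate takes (scale, shape).
        t_fail = rng.weibullvariate(scale, shape)
        length, up, _, _ = simulate_cycle(t_fail, delta, n, tau_r, tau_R)
        total_len += length
        total_up += up
    return total_len / runs, 1.0 - total_up / total_len
```

For instance, `simulate_cycle(7.0, 5.0, 9, 2.0, 0.5)` describes a failure at 7 h that is found at the second inspection (10 h) and followed by a 2 h recovery.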
Periodically inspected software rejuvenation modeling and performance analysis: Under the periodically inspected software rejuvenation policy above, each renewal cycle begins with a renewal and ends either with software rejuvenation after time nδ (with probability F̄(nδ)) or with system recovery (with probability F(nδ)). Therefore, the average length of a renewal cycle L(δ,n) is expressed as follows:
where F(t) is the distribution function of the failure time, F̄(t) is the survival function and F̄(t) = 1-F(t).
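As a sketch of how the two cases combine, the average cycle length can be written in the form below. This is a reconstruction from the verbal description, assuming that a failure occurring in the k-th inspection interval is found at time kδ; it is not necessarily the exact expression intended by the author.

```latex
\begin{aligned}
L(\delta,n) &= \sum_{k=1}^{n} (k\delta + \tau_r)\,\bigl[F(k\delta) - F((k-1)\delta)\bigr]
              + (n\delta + \tau_R)\,\bar{F}(n\delta) \\
            &= \delta \sum_{k=0}^{n-1} \bar{F}(k\delta) + \tau_r\,F(n\delta) + \tau_R\,\bar{F}(n\delta)
\end{aligned}
```

The second line follows from the first by summation by parts, using F(kδ) - F((k-1)δ) = F̄((k-1)δ) - F̄(kδ).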
The expected uptime for software system in a cycle is:
The software system is down while system recovery or software rejuvenation is in progress. Thus, the expected downtime per cycle is:
In addition, the expected number of system inspection intervals in a cycle is:
Let cI be the expected cost per inspection and cf the expected cost per unit of system downtime, where downtime comprises the time spent in system recovery, software rejuvenation and system failure. Then the average total cost per cycle is:
The cost rate is defined as the ratio of the average total cost per cycle to the average length of a renewal cycle. Then the cost rate function r(δ,n) is:
The system availability is defined as the ratio of the expected uptime per cycle to the average length of a renewal cycle. Then the system availability function is:
According to formula (7), the system unavailability function is derived as:
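The per-cycle quantities defined in this section can be evaluated numerically. The sketch below is one possible reading of the model, under these assumptions: a failure is found at the first inspection after it occurs, every inspection in a cycle is counted (including the last), downtime is the cycle length minus the uptime, and cf is a cost per unit of downtime. The names `weibull_cdf` and `cycle_metrics` and the default shape and scale values are hypothetical.

```python
import math


def weibull_cdf(t, shape=2.0, scale=200.0):
    """Standard two-parameter Weibull CDF; shape and scale are hypothetical."""
    return 1.0 - math.exp(-((t / scale) ** shape)) if t > 0 else 0.0


def cycle_metrics(delta, n, tau_r, tau_R, c_I, c_f, cdf=weibull_cdf, steps=2000):
    """Return the per-cycle quantities of the model as a dictionary."""
    surv = lambda t: 1.0 - cdf(t)
    # Expected number of inspections: sum of survival probabilities at k * delta.
    inspections = sum(surv(k * delta) for k in range(n))
    p_fail = cdf(n * delta)
    length = delta * inspections + tau_r * p_fail + tau_R * (1.0 - p_fail)
    # Expected uptime: integral of the survival function over [0, n * delta],
    # approximated with the trapezoidal rule.
    h = n * delta / steps
    uptime = h * (0.5 * (surv(0.0) + surv(n * delta))
                  + sum(surv(i * h) for i in range(1, steps)))
    downtime = length - uptime
    cost = c_I * inspections + c_f * downtime
    return {
        "length": length,
        "uptime": uptime,
        "downtime": downtime,
        "inspections": inspections,
        "cost_rate": cost / length,
        "availability": uptime / length,
        "unavailability": downtime / length,
    }
```

A call such as `cycle_metrics(5.0, 9, tau_r=2.0, tau_R=0.5, c_I=1.0, c_f=50.0)` returns all quantities for one (δ, n) pair; the cost and time values here are placeholders, not the entries of Table 1.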
A practical task is to minimize the cost rate through a proper selection of δ and n. For a fixed n, a reasonable approach is to differentiate the cost rate function r(δ,n) or the system unavailability function U(δ,n) with respect to δ; the optimal δ then satisfies r'(δ,n) = 0 or U'(δ,n) = 0, respectively.
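When the derivative is awkward to handle analytically, the minimization can also be carried out by a direct search over candidate values of δ and n. The sketch below illustrates this with a plain grid search; it is not the procedure used in the paper. It assumes the same reading of the model as the verbal description (failure found at the next inspection, downtime equal to cycle length minus uptime), a standard two-parameter Weibull failure law, and placeholder parameter values. The names `cost_rate` and `grid_search` are hypothetical.

```python
import math


def cost_rate(delta, n, tau_r=2.0, tau_R=0.5, c_I=1.0, c_f=50.0,
              shape=2.0, scale=200.0, steps=400):
    """Cost rate r(delta, n); all default parameter values are hypothetical."""
    surv = lambda t: math.exp(-((t / scale) ** shape))
    inspections = sum(surv(k * delta) for k in range(n))
    p_fail = 1.0 - surv(n * delta)
    length = delta * inspections + tau_r * p_fail + tau_R * (1.0 - p_fail)
    # Trapezoidal approximation of the expected uptime.
    h = n * delta / steps
    uptime = h * (0.5 * (1.0 + surv(n * delta))
                  + sum(surv(i * h) for i in range(1, steps)))
    return (c_I * inspections + c_f * (length - uptime)) / length


def grid_search(deltas, ns):
    """Return (r, delta, n) with the smallest cost rate on the grid."""
    return min((cost_rate(d, n), d, n) for d in deltas for n in ns)


deltas = [0.5 * i for i in range(1, 121)]  # candidate intervals, 0.5 h to 60 h
best_r, best_delta, best_n = grid_search(deltas, range(1, 13))
print(f"delta* = {best_delta} h, n = {best_n}, cost rate = {best_r:.4f}")
```

A grid search only locates the optimum to within the grid spacing; a finer grid around the best point, or a one-dimensional minimizer applied for each n, would refine it.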
Boundary conditions: By analyzing the asymptotic behavior of the cost rate function r(δ,n) for increasing δ and any fixed n ≥ 1, two boundary conditions are obtained:
• Theorem 1: The cost rate of the software system is not more than the expected cost during system downtime, i.e. r(δ,n) ≤ cf
Proof: From formula (6):
Consider the mean time to system failure which is given by:
It follows that:
Consider that
.
Thus:
Formula (6) yields:
Furthermore, note that:
Hence:
Therefore r(δ,n) ≤ cf, which proves the assertion.
• Theorem 2: The optimal system inspection interval is not less than the ratio of the expected inspection cost to the expected cost during system downtime, i.e. δ* ≥ cI/cf
Proof: According to Theorem 1:
Formula (6) yields:
Since the survival function F̄(t) is non-increasing:
Hence, it follows that:
Based on the triangle inequality, Eq. (9-11)
yield:
That is:
Therefore, the assertion is proven.
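The bound of Theorem 2 can be checked numerically. The sketch below is an illustration under assumptions, not part of the proof: it uses the reading of the model in which a failure is found at the next inspection and downtime is the cycle length minus the uptime, a standard two-parameter Weibull failure law, and hypothetical cost values chosen so that cI/cf = 0.5 h. The function name `cost_rate` is hypothetical.

```python
import math


def cost_rate(delta, n, c_I, c_f, tau_r=2.0, tau_R=0.5,
              shape=2.0, scale=200.0, steps=400):
    """Cost rate r(delta, n); default times and Weibull values are hypothetical."""
    surv = lambda t: math.exp(-((t / scale) ** shape))
    inspections = sum(surv(k * delta) for k in range(n))
    p_fail = 1.0 - surv(n * delta)
    length = delta * inspections + tau_r * p_fail + tau_R * (1.0 - p_fail)
    # Trapezoidal approximation of the expected uptime.
    h = n * delta / steps
    uptime = h * (0.5 * (1.0 + surv(n * delta))
                  + sum(surv(i * h) for i in range(1, steps)))
    return (c_I * inspections + c_f * (length - uptime)) / length


c_I, c_f, n = 5.0, 10.0, 4          # hypothetical costs, so c_I / c_f = 0.5 h
bound = c_I / c_f
# Intervals shorter than the bound: the cost rate should exceed c_f.
below = [cost_rate(bound * f, n, c_I, c_f) for f in (0.2, 0.5, 0.9)]
# An interval well above the bound, for comparison.
above = cost_rate(10 * bound, n, c_I, c_f)
print("below bound:", [round(r, 3) for r in below])
print("above bound:", round(above, 3), "c_f:", c_f)
```

Inspecting more often than every cI/cf hours costs more per unit time than leaving the system down, so such intervals cannot be optimal.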
NUMERICAL RESULTS AND ANALYSIS
Based on the periodically inspected software rejuvenation model above, numerical experiments are performed with system unavailability and cost rate as the indicators of system reliability. The model is solved for multiple values of δ and n and the optimal value of δ is determined. The effects of software rejuvenation and periodic inspection on system reliability are also investigated. By minimizing system unavailability and cost rate, the optimal software rejuvenation interval and system inspection interval are obtained.
Numerical values of the system parameters are given in Table 1. All parameter values were chosen from experience for demonstration purposes.
Figure 4 shows the relationship between the system inspection interval and the cost rate for different values of n when the failure time follows the Weibull distribution F(t), where:
Here, c = 3×10⁻³, α = 2.8×10⁻³ and b = 4.
Table 1: System parameters
cf: Expected cost during system downtime, cI: Expected inspection cost, τR: Execution time to perform software rejuvenation, τr: Execution time to recover from system failures
It can be observed from Fig. 4 that, for a fixed number of inspections n, when the system inspection interval δ is very small, software rejuvenation is triggered very frequently, so the system is almost unavailable and the cost rate is considerable. As δ increases, the cost rate falls rapidly and reaches its minimum at the optimal inspection interval δ*. As δ increases beyond δ*, the probability of failure rises steadily and the cost rate grows. In addition, on the whole, the larger the number of inspections n, the lower both the cost rate and the optimal inspection interval δ*.
Table 2 shows the cost rate and system unavailability at the optimal inspection interval for varying numbers of inspections n. The optimal inspection interval δ* is the value at which the cost rate function r(δ,n) and the system unavailability function U(δ,n) reach their minimum. It can be seen from Table 2 that the overall optimal combination, which minimizes both the cost rate and the unavailability, is n = 9 and δ* = 5 h (i.e. the optimal software rejuvenation interval is 45 h).
Fig. 4: Cost rate versus system inspection interval, n is the number of system inspections
Correspondingly, the system unavailability U(δ*,n) is 0.01641 and the cost rate r(δ*,n) is 1.0872. It can also be seen that reducing the number of inspections leads to a larger inspection interval and software rejuvenation interval and increases both the cost rate and the system unavailability.
Finally, the periodically inspected software rejuvenation model is compared with the general software rejuvenation model without a periodic inspection policy given by Huang et al. (1995). For the latter model, Fig. 5 illustrates the relationship between the software rejuvenation interval and the system unavailability when the failure time follows the Weibull and the exponential distribution. It can be observed that system unavailability decreases as the software rejuvenation interval δ΄ increases, attains a minimum at δ΄ = 15 h and δ΄ = 21 h, respectively, and then gradually increases. The corresponding minimum values of system unavailability are 0.02604 and 0.02025, respectively, both greater than the value of 0.01641 obtained from the periodically inspected software rejuvenation model above. This indicates that the periodically inspected software rejuvenation model achieves higher system availability than the general software rejuvenation model.
Fig. 5: System unavailability versus software rejuvenation interval
Table 2: Comparison between optimal inspection interval and cost rate/unavailability
δ* is the optimal inspection interval, r(δ*,n) is the cost rate, U(δ*,n) is the system unavailability
CONCLUSION AND IMPLICATIONS
Software aging is an important factor affecting software reliability. As a proactive and preventive software fault-tolerance technique, software rejuvenation is a main method for counteracting software aging. Considering that system failures occur randomly, and analyzing the runtime state of the software system, this study presents a software rejuvenation model based on a periodic inspection policy. The optimal inspection interval and the optimal software rejuvenation interval are selected by minimizing system unavailability and cost rate. Boundary conditions on the cost rate and the system inspection interval are then deduced. Finally, quantitative analysis and numerical experiments show that selecting the optimal system inspection interval and scheduling software rejuvenation optimally can greatly reduce the average cost and improve system reliability.
Future analytic work may focus on a fine-grained software rejuvenation model in which inspections cover only part of the software system. Dynamically selecting the software rejuvenation policy is another avenue for development.
ACKNOWLEDGMENTS
The author would like to thank the sponsors of the National Natural Science Foundation of China under Grant No. 61100173, Scientific Research Plan Project of Shaanxi Education Department of China under Grant No. 09JK642, Doctoral Fund No. 116-210912 and Scientific Research Plan Project of Xian Technology University under Grant No. 116-210907.