Commercial-Off-The-Shelf (COTS) hardware and software components are increasingly
employed in spacecraft. A fault tolerance mechanism for satellite processing
applications has been implemented through redundant COTS components (McLoughlin
et al., 2003) and the Remote Exploration and Experimentation (REE) project
was launched to bring commercial cluster technology into space (Some
and Ngo, 1999). In the future, more applications and projects like these will
run in space to carry out scientific exploration. Generally, the software and hardware
of these COTS components are not radiation hardened; for example, Galactic Cosmic
Rays (GCR) and solar protons cause Single Event Effects (SEEs)
(Some et al., 2001) in registers and memory.
To reduce the influence of SEEs, fault-tolerant mechanisms have been designed
and implemented in both hardware and software.
Recently, software-implemented fault tolerance has become a focus of research. Oh
et al. (2002) proposed control-flow checking. Torres-Pomales
(2000) discussed single-version and multi-version software fault tolerance
techniques in detail. Gunnels et al. (2001) developed the theory and practice
of fault-tolerant high-performance matrix multiplication in the REE Project of
NASA's High Performance Computing and Communication Program. Before a
fault-tolerant system is deployed, it must be tested and validated to check
whether its fault-tolerant mechanisms take effect. These tests are even more
important in systems composed of COTS components, since no radiation-hardened
chips are used. Usually these tests are implemented with fault injection techniques
(Some et al., 2001). These techniques are basically
divided into four categories:
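Hardware-implemented fault injection
Simulation-implemented fault injection
Software-implemented fault injection
Hybrid fault injection, a combination of the former three methods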
Hardware-implemented fault injection techniques need special hardware and may
cause permanent damage to the target hardware. Simulation-implemented fault injection
techniques require models that are difficult to create, are computationally
expensive and can be inaccurate. Software-implemented fault injection (SWIFI)
techniques emulate hardware faults in software in order to make
the system behave as if real hardware faults had occurred. Compared to the other
techniques, SWIFI techniques have some special advantages: (1) no complex
models, (2) less development effort, (3) lower cost and (4) increased portability
(Some et al., 2001). There are quite a few SWIFI-based
fault injectors, such as FIAT (Han et al., 1995),
FERRARI (Kanawati et al., 1995), NFTAPE (Stott
et al., 2002), DOCTOR, ORCHESTRA (Dawson et
al., 1996), BOND (Baldini et al., 2000)
and Xception (Carreira et al., 1998). They run
on Unix, Linux and Windows NT platforms. The injector presented in this study,
MOFI, resembles BOND and Holodeck in that all of them run on the Windows NT
platform, but their injection methods differ fundamentally.
BOND and Holodeck focus on injecting faults into API (Application Programming Interface) parameters: faults are injected as these parameters pass through the interfaces between software layers and components. Both therefore need to intercept an API call, modify its parameters and pass the call on. This approach can only inject predetermined faults at specific locations; it cannot inject faults at arbitrary positions in registers and memory. MOFI is different: it can inject faults into registers and into any memory location. The difference comes from the aim of injection. In BOND and Holodeck, SWIFI techniques are an aid to normal software testing. In MOFI, the aim is to emulate SEEs occurring in registers and memory in order to evaluate the reliability of software that will run in spacecraft and in ground systems that work in high-radiation environments.
Xception is functionally similar to MOFI. Xception operates at the exception handler level and needs the debugging mechanisms of the CPU. Both MOFI and Xception minimize intrusion into the application, but MOFI does not depend on a specific hardware architecture.
This study describes the theory of the MOFI fault injector and the methods used to compute reliability parameters, and analyzes the influence of different distributions of injection locations on the results of injection experiments.
MOFI can inject faults into registers and memory while target applications are
running. It can inject one-bit or multi-bit faults that emulate the effects of
SEEs into registers and memory with the same method. The x86 architecture, on
which our experiment platform Windows XP runs, has only a limited number of
registers. Detailed techniques for injecting faults into 75 registers
of the PowerPC have been studied elsewhere (Some et al., 2001).
The x86 has fewer registers than the PowerPC, so injecting faults into them is
easier. In this study, we therefore focus mainly on injecting faults into memory
and all computations and experiments concern it. First it is necessary to
study the memory image of an application when it is running as a process.
A process is laid out in memory in regions such as the code region, data region, heap region and stack region. When MOFI injects faults into an application, it injects them into the regions of the application's process. Figure 1 is a conceptual view of a process memory image; it corresponds to almost any popular operating system that supports multitasking and virtual memory. The number on the left is the address of the memory cell, ranging from 0x00000000 to 0xFFFFFFFF. Theoretically, MOFI can inject faults into all memory cells of an application's process.
Fig. 1: Process memory image
Figure 2 illustrates an SEE, which MOFI can emulate effectively. In this example MOFI injects a one-bit SEE fault at address 0x00001000 by changing the fourth bit from 1 to 0, so that after injection the content of address 0x00001000 becomes 01001010. MOFI can inject one-bit or multi-bit faults into the memory space of a running process and this is its main injection method.
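The bit flip in this example can be sketched in a few lines of Python. This is a minimal illustration, not MOFI's implementation: the function name `flip_bits` is hypothetical, bits are numbered from 0 at the least significant end, and the original content 01011010 is inferred from the stated result 01001010 on the assumption that the fourth bit is counted from the left (bit 4).

```python
def flip_bits(value: int, positions, width: int = 8) -> int:
    """Return value with the bits at the given positions inverted.

    Positions count from 0 at the least significant bit. One position
    emulates a one-bit upset, several positions a multi-bit upset.
    """
    mask = 0
    for pos in positions:
        if not 0 <= pos < width:
            raise ValueError(f"bit position {pos} outside a {width}-bit cell")
        mask |= 1 << pos
    return (value ^ mask) & ((1 << width) - 1)


original = 0b01011010             # assumed content before injection
faulty = flip_bits(original, [4])
print(format(faulty, "08b"))      # 01001010
```

Passing several positions, for example `flip_bits(original, [0, 4])`, would emulate a multi-bit fault with the same routine.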
THEORY OF INJECTION OPERATION AND ALGORITHM
MOFI is an application-level software-implemented fault injector that provides
an easy-to-use environment for fault injection experiments, including massive
fault injection campaigns that gather statistical data on fault behavior and precise
fault injection into a chosen location for a specific purpose. The user
only needs to give MOFI the name of the target application or the PID (process
identifier) of its process. Although MOFI can inject faults
into any application that runs on Windows, with or without source code, all
applications in our injection experiments have source code. This is important
because, if a campaign comprises 1000 injection experiments and the
injected applications do not handle exceptions silently, the human operator might
have to close the Windows Application Error dialog hundreds of times. This manual
work is tedious and degrades the precision of the times recorded in the experiments.
A call to the SetErrorMode function (MSDN, 2007; all API functions referred to
in this study come from this source) can therefore be added to the source code;
after recompilation the application fails silently on exceptions. With a
configuration file and Tcl execution scripts, a campaign of a thousand fault
injection experiments can be run automatically, with results collected and
evaluated and statistical results obtained without human intervention.
The injected application is created by the injector using the CreateProcess API. Once created, however, the application's process is not a subprocess of the injector; no relationship remains between the application and the injector.
Some et al. (2001) classified the result of
each injection experiment as Correct, Incorrect, Crash, Hang or Invalid. So many
classes were needed because in that research faults were injected into the code,
data, heap and stack regions with a uniform distribution.
In the present research the main injected region is the code region and almost any
fault injected into this region of a running application causes it to hang
or crash. The number of result classes therefore declines and the criterion for
judging a result is simpler.
The injector observes the injected application after a fault has been injected
into its memory space. The application is checked through FindWindow for a
prescribed time value. If FindWindow returns NULL within that time, the
injection has made the application fail. Otherwise, if FindWindow can still find
the window handle of the application, the injection has not made the
application fail. If SetErrorMode is used, then when the Windows operating system
detects a failure its usual response is to close the application
and reclaim all resources allocated to it. The visible behavior is that the
windows (or dialogs) opened by the application disappear, or that the PID of the
application disappears from the task manager. Having discussed these techniques
in detail, we now give the injection algorithm.
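The observation step can be approximated in a platform-neutral way. The sketch below is an analogue, not MOFI's mechanism: instead of calling FindWindow, it checks whether a child process has exited within a prescribed time, using Python's standard `subprocess` module. The names `spawn` and `observe` are hypothetical, and any exit within the window, normal or not, is counted as a failure.

```python
import subprocess
import sys
import time


def spawn(code):
    """Start a child Python process that stands in for the injected application."""
    return subprocess.Popen([sys.executable, "-c", code])


def observe(proc, T):
    """Watch proc for at most T seconds.

    Returns (failed, elapsed): failed is True if the process exited
    within T, False if it was still alive when T ran out.
    """
    start = time.monotonic()
    try:
        proc.wait(timeout=T)
        failed = True
    except subprocess.TimeoutExpired:
        failed = False
    return failed, time.monotonic() - start


if __name__ == "__main__":
    crashing = spawn("import sys; sys.exit(1)")      # exits at once
    print(observe(crashing, T=2.0)[0])

    healthy = spawn("import time; time.sleep(30)")   # outlives the window
    print(observe(healthy, T=0.5)[0])
    healthy.kill()
    healthy.wait()
```

The elapsed time returned by `observe` plays the role of the surviving time of the application after injection.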
Injecting Method Algorithm (IMA): Let N be the number of injection experiments and T a prescribed time within which the injector checks whether the injected application has failed. In each experiment only one fault is injected into the target application. After injecting a fault, the injector observes the injected application for up to T. If the injector finds that the application has failed within T, it finishes the experiment, records the results and begins the next experiment (if the experiment number is less than or equal to N). If the application does not fail within T, the injector considers that this injection did not affect the application and begins the next experiment. Let fn be the number of application failures over all executed experiments.
Step 1: (Initiate) Set N and T and produce a list X(N) that contains all locations
to be injected. Each experiment chooses one location to inject. Set
i = 0, fn = 0, m = 1
Step 2: (Inject) If i < N, inject a fault at location
X(i) and go to Step 3. Else finish the experiments and exit
Step 3: (Observe) If the application fails within T, then fn = fn+1 and go
to Step 4. Else also go to Step 4
Step 4: (Record) Record the results of this experiment, set i = i+1 and go to
Step 2
Table 1: Experiment record format of IMA
In Step 1, the way X(N) is chosen deserves discussion, since different choices lead to different experiment results. This is discussed at the end of the study.
In Step 4, the results of each experiment are recorded. The record format is described in Table 1.
Table 1 gives all fields of one record. Under IMA each experiment produces one record, so N experiments produce N records.
#Num stands for the number of a data field and the description after it for the field's name. #1 records the injection experiment id; for the ith (i ≤ N) experiment the id is i. #2 records the starting time of the experiment. #3 records the location to be injected in this experiment. #4 is a flag indicating whether the content of the location could be read: if the content cannot be read, the Read-flag is set to 0, otherwise to 1. #5 stores the content that is read; if #4 is 0, #5 is set to FFFFFFFF. #6 and #7 are analogous to #4 and #5: #6 indicates whether the injector could inject a fault into the memory cell pointed to by #3 and #7 stores the fault that is injected. #8 and #9 are the finishing time and the executing time of the experiment, respectively; #9 is also the surviving time of the application after injection. #10 is a failure flag indicating whether this injection experiment caused the application to fail. #11 stores the accumulated number of failures since the first experiment and equals the value of fn in IMA.
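The loop of IMA and the eleven record fields can be put together in a short sketch. It is an illustration under stated assumptions rather than MOFI's code: the field names in `Record` other than the roles described above are invented here, and `read_cell`, `inject` and `observe` are hypothetical caller-supplied callables standing in for the real memory access and window check. Writing FFFFFFFF into #7 when injection fails is assumed by analogy with #5.

```python
import time
from dataclasses import dataclass

UNREADABLE = 0xFFFFFFFF  # placeholder when a cell cannot be read or written


@dataclass
class Record:
    exp_id: int         # 1: experiment id
    start_time: float   # 2: starting time
    inject_addr: int    # 3: location to inject
    read_flag: int      # 4: 1 if the content could be read, else 0
    content: int        # 5: content read (UNREADABLE if read_flag is 0)
    inject_flag: int    # 6: 1 if the fault could be injected, else 0
    fault: int          # 7: value written (UNREADABLE if inject_flag is 0)
    finish_time: float  # 8: finishing time
    e_time: float       # 9: executing (surviving) time
    f_flag: int         # 10: 1 if the application failed, else 0
    fn: int             # 11: accumulated number of failures


def run_ima(locations, T, read_cell, inject, observe, clock=time.monotonic):
    """Run one experiment per location and return the list of records.

    read_cell(addr) -> int or None, inject(addr, content) -> int or None
    and observe(addr, T) -> bool are stand-ins supplied by the caller.
    """
    records, fn = [], 0
    for exp_id, addr in enumerate(locations, start=1):
        start = clock()
        content = read_cell(addr)
        fault = inject(addr, content) if content is not None else None
        failed = observe(addr, T)           # True if a failure is seen within T
        finish = clock()
        fn += int(failed)
        records.append(Record(
            exp_id, start, addr,
            int(content is not None), UNREADABLE if content is None else content,
            int(fault is not None), UNREADABLE if fault is None else fault,
            finish, finish - start, int(failed), fn))
    return records


# Toy run: a dict plays the role of process memory, address 0x2000 is unmapped.
memory = {0x1000: 0x5A, 0x1004: 0x3C}
recs = run_ima([0x1000, 0x1004, 0x2000], 2.0,
               read_cell=memory.get,
               inject=lambda addr, content: content ^ 0x10,
               observe=lambda addr, T: addr == 0x1000)
print([(r.exp_id, r.read_flag, r.f_flag, r.fn) for r in recs])
# [(1, 1, 1, 1), (2, 1, 0, 1), (3, 0, 0, 1)]
```

Keeping the injector's side effects behind three callables lets the bookkeeping be exercised without touching a real process.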
In algorithm IMA the experiment time is mainly consumed in Step 3 and for each experiment it is O(1), since the observation lasts at most T. The total time needed to execute N injection experiments, Φ_IMA(N), is therefore the sum of the N per-experiment times.
It is then easy to see that max(Φ_IMA(N)) = TN.
The next section discusses how to compute parameters of interest from the experiment results.
ANALYSIS AND COMPUTATION OF PARAMETERS
Parameters of interest can be computed from the records of experiment results.
To demonstrate how to compute parameters from the record format, a fault injection
campaign is run. Injecting strength, MTTF (Mean Time To Failure) and fault
sensitivity are chosen as the parameters to be computed. Any application
can be used as the injection target; in this campaign a GUI sine wave generator is
chosen. The campaign is executed by IMA with 50 experiments, so
N = 50, and, given the size of the application's code region, the prescribed time
value is set to T = 2 sec. For each parameter we first give its definition and
its computing formula in terms of the record format, then present the computed
results and analyze their meaning.
Definition 1: (injecting strength λ): The average number of application failures caused by the injector per unit time. It can be computed by Eq. 2. All items on the right of Eq. 2 are described in Table 1.
Here λ (j, n) represents the injecting strength of the injection experiments from the jth time to the nth time. Clearly λ (j, n) is an average measure of the injection effect. We can extract #8 and #11 from the experiment records and plot them in Fig. 3.
In Fig. 3, the x-axis stands for the time of the whole campaign. In this campaign 50 experiments are run and the experiment time is about 120 sec. By Eq. 2 and Fig. 3, λ can be computed on any time interval within (0, 120). If λ = 0, the injector did not cause an application failure during the corresponding time interval; for example, during (0, 10) the accumulated failure count is 0, so λ(0, 10) = 0.
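A small sketch shows how an injecting strength could be read off the records. It assumes, in line with Definition 1, that λ over a time interval is the increase in the accumulated failure count (#11) divided by the length of the interval, with failures attributed to the finishing time (#8) of their experiment. The function name `injecting_strength` and the sample data are hypothetical.

```python
from bisect import bisect_right


def injecting_strength(finish_times, fn_values, t_start, t_end):
    """Average number of failures per unit time on (t_start, t_end].

    finish_times: field #8 of each record, in increasing order.
    fn_values:    field #11 of each record (accumulated failure count).
    """
    if t_end <= t_start:
        raise ValueError("t_end must be greater than t_start")

    def failures_up_to(t):
        k = bisect_right(finish_times, t)    # records finished by time t
        return fn_values[k - 1] if k else 0

    return (failures_up_to(t_end) - failures_up_to(t_start)) / (t_end - t_start)


# Illustrative data, not the campaign's measurements.
finish = [3.1, 6.2, 9.3, 9.5, 12.6, 12.8]
fn = [0, 0, 0, 1, 1, 2]
print(injecting_strength(finish, fn, 0, 9.3))   # 0.0
print(injecting_strength(finish, fn, 0, 16))    # 0.125
```

An interval with no new failures gives λ = 0, which matches the reading of the interval (0, 10) in Fig. 3.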
Definition 2: (failure time): The time interval from the moment the application is injected to the moment the failure happens.
Usually the average failure time, i.e., MTTF (Mean Time To Failure), is the parameter of interest. From Definition 2 we directly give Eq. 3 to compute MTTF.
Here f-flag (i) is field #10 in Table 1 and e-time (i)
is field #9. The i in Eq. 3 is the sequence number of the experiment among the N experiments.
We can extract #1 and #9 from the experiment records and plot them in Fig. 4.
Fig. 3: Compute injecting strength
In Fig. 4, the x-axis stands for the experiment number, up to
N = 50. The y-axis is the application run time in each experiment. There are 37
experiments in which the run time is greater than 3.1 sec and the others have a
run time of about 0.2 sec. Because T = 2 sec and, by IMA, when t
= 2 the injector concludes that the injection did not make the application fail,
only 13 experiments in Fig. 4 are considered to
have made the application fail. We can compute MTTF from the experiment records through
Eq. 3. In this campaign MTTF = 0.058 sec. This implies that
when the sine wave application is running in space, if an SEE happens in its code
region it will fail within 0.058 sec on average. So, a fault-tolerant mechanism
deployed to avoid the failure should take effect within 0.058 sec.
In Fig. 4 the run times of the 37 experiments after injection are all greater than 3.1 sec although T = 2 sec; this discrepancy comes from the task scheduling of the operating system. It does not affect our understanding and analysis of the experiment results.
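The MTTF computation from fields #9 (e-time) and #10 (f-flag) can be sketched as follows. Eq. 3 is not shown in this text, so its normalization is an assumption here: the sketch gives both candidates, dividing the summed surviving time of the failed experiments either by the number of failures or by the total number of experiments N. The function names and the rounded sample values (13 failures at about 0.2 sec, 37 survivors at about 3.1 sec) are illustrative.

```python
def mttf_over_failures(e_times, f_flags):
    """Mean surviving time of the experiments that ended in failure."""
    failures = sum(f_flags)
    if failures == 0:
        raise ValueError("no failures recorded")
    return sum(f * e for f, e in zip(f_flags, e_times)) / failures


def mttf_over_all(e_times, f_flags):
    """Same numerator, averaged over all N experiments."""
    if not e_times:
        raise ValueError("no experiments recorded")
    return sum(f * e for f, e in zip(f_flags, e_times)) / len(e_times)


# Rounded values taken from the description of Fig. 4.
e_times = [0.2] * 13 + [3.1] * 37
f_flags = [1] * 13 + [0] * 37
print(round(mttf_over_failures(e_times, f_flags), 3))   # 0.2
print(round(mttf_over_all(e_times, f_flags), 3))        # 0.052
```

Because f-flag is 0 for surviving experiments, their run times never enter the numerator in either variant.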
Sensitivity: Figure 1 shows that an application is laid out in its memory image in different kinds of regions, for example the code region, data region, heap region and stack region. The present research focuses on the code region. The characteristic of faults injected into this region is that they almost always cause an application exception immediately. We can extract #3 and #10 from the experiment records and plot them in Fig. 5.
In Fig. 5, the value on the y-axis comes from #10 of each experiment record. It has only two values, 0 and 1: 1 means application failure and 0 means no failure. The x-axis comes from #3 of each experiment record. The failures clearly concentrate in the top third of the injected code region, so we can conclude that this part of the region is more sensitive to SEE faults. The value of this parameter is that, once an application fails, it is easier to locate the region where the failure occurred. Although this study focuses on injecting faults into the code region, MOFI can inject faults into any region; next we discuss how to compute the sensitivity of different regions from the records. The sizes of the regions allocated differ between applications, which makes sensitivity hard to define in general, but for each region the corresponding sensitivity can be computed in the same way as for the code region above. In every case, the key step is to compute the number of failures occurring in the region concerned. For example, code-num stands for the number of failures occurring in the code region. Its value can be computed by Eq. 4.
Here Inject-addr is #3 of each record and f-flag (i) is #10. With Eq. 4, other numbers such as data-num, heap-num and stack-num can be computed in the same way. When a uniform distribution is used to produce X (N) over all four regions, the four numbers can be compared to analyze the sensitivity.
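Counting failures per region from Inject-addr (#3) and f-flag (#10) can be sketched as below. This is one plausible reading of the per-region counts (code-num, data-num, heap-num, stack-num), in which a failure is credited to the region containing its injected address. The function name `failures_per_region` and all address ranges are hypothetical and chosen only for the example.

```python
def failures_per_region(records, regions):
    """Count failures whose injected address falls in each region.

    records: iterable of (inject_addr, f_flag) pairs (fields #3 and #10).
    regions: dict mapping a region name to (start, end), end exclusive.
    """
    counts = {name: 0 for name in regions}
    for addr, f_flag in records:
        for name, (start, end) in regions.items():
            if start <= addr < end:
                counts[name] += f_flag
                break
    return counts


# Hypothetical layout and records, for illustration only.
regions = {
    "code": (0x00401000, 0x00410000),
    "data": (0x00410000, 0x00420000),
    "heap": (0x00140000, 0x00240000),
    "stack": (0x00030000, 0x00140000),
}
records = [
    (0x00401010, 1), (0x00402000, 0), (0x00411000, 1),
    (0x00130000, 1), (0x00401F00, 1),
]
print(failures_per_region(records, regions))
# {'code': 2, 'data': 1, 'heap': 0, 'stack': 1}
```

Dividing each count by the number of injections made into that region would give a rate that can be compared between regions of different sizes.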
EXPERIMENTS BASED ON DIFFERENT X (N)
In the injection experiment implemented by IMA above, a uniform random distribution was used to produce the location list X (N) over the whole code region. What about other distributions? Here we discuss the influence of different distributions on the experiment results.
We choose the uniform, normal and exponential distributions.
Three different lists are produced by the corresponding distributions
over the same area of the code region (note that in this section the injected area
differs from the one used above to demonstrate how to compute parameters;
the difference lies in the starting location and the size of the injected
area within the code region). Then three campaigns are run. Each
campaign includes N = 50 experiments as before and uses its own random distribution
to generate the list X (N). We use X-Exponential, X-Normal and X-Uniform to label
the corresponding experiment curves. In all three campaigns T is set to 2 sec.
Figure 6 shows the experiment time and failure number of the three campaigns.
Fig. 6: Experiment time-failure number
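Generating the three location lists can be sketched with Python's standard `random` module. The distribution parameters are not given in this study, so the ones below are assumptions made for illustration: the normal distribution is centred on the injected area with a standard deviation of one sixth of its size, the exponential distribution has its mean at one quarter of the size, and samples falling outside the area are redrawn. The function name `make_locations` and the address range are hypothetical.

```python
import random


def make_locations(n, start, size, dist, rng):
    """Draw n addresses in [start, start + size) from the named distribution."""
    locations = []
    while len(locations) < n:
        if dist == "uniform":
            offset = rng.uniform(0, size)
        elif dist == "normal":
            offset = rng.gauss(size / 2, size / 6)   # centred on the area
        elif dist == "exponential":
            offset = rng.expovariate(4 / size)       # mean at size / 4
        else:
            raise ValueError(f"unknown distribution: {dist}")
        if 0 <= offset < size:                       # redraw out-of-range samples
            locations.append(start + int(offset))
    return locations


rng = random.Random(1)   # fixed seed so a campaign can be repeated
for dist in ("uniform", "normal", "exponential"):
    xs = make_locations(50, 0x00401000, 0x1000, dist, rng)
    print(dist, len(xs), hex(min(xs)), hex(max(xs)))
```

Redrawing out-of-range samples truncates the normal and exponential distributions to the injected area, which slightly changes their shape near the boundaries.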
In Fig. 6, the x-axis stands for the elapsed experiment time and the y-axis for the failure number over the 50 experiments. There are three curves in this figure, one per distribution. When the exponential random distribution is used to generate the location list X (N) (curve X-Exponential), the 50 experiments take the longest time; when the normal random distribution is used (curve X-Normal), they take the shortest.
Figure 7 shows the experiment number and failure number of the three campaigns. The x-axis stands for the experiment number, which corresponds to #1 of each experiment record. The y-axis is the same as in Fig. 6.
Fig. 7: Experiment number-failure number
In Fig. 7 it can be seen that the normal random distribution
causes the largest number of failures and the exponential random distribution
the smallest. Comparing Fig. 6 and 7 leads to a conclusion:
the larger the number of failures, the shorter the experiment time.
Let us analyze the reason. By IMA, let t_all be the time needed to finish N
experiments. Let t_obser = T×(N-fn) be the time the injector needs to observe the
application in the experiments where the injection does not make it fail.
Let t_fail = MTTF×fn be the total time needed to observe the application in the
experiments where it fails. So,
$$t_{all} = t_{obser} + t_{fail} = T\,(N - \mathit{fn}) + \mathit{MTTF} \cdot \mathit{fn} \qquad (5)$$
where, T, N and MTTF are constant, then
$$\frac{d\,t_{all}}{d\,\mathit{fn}} = \mathit{MTTF} - T \qquad (6)$$
Since T ≥ MTTF, dt_all/dfn ≤ 0. So, the smaller fn is, the longer t_all is.
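The dependence of the campaign time on the failure count can be checked numerically. The sketch assumes the decomposition described above, in which each non-failing experiment costs T and each failing one costs MTTF on average. The function name `total_time` and the value MTTF = 0.25 sec are illustrative and are not the campaign's measurements.

```python
def total_time(T, N, fn, mttf):
    """Campaign time when fn of the N experiments fail.

    Non-failing experiments are observed for the full T; failing ones
    last mttf on average.
    """
    if not 0 <= fn <= N:
        raise ValueError("fn must lie between 0 and N")
    return T * (N - fn) + mttf * fn


T, N, mttf = 2.0, 50, 0.25
for fn in (5, 13, 25):
    print(fn, total_time(T, N, fn, mttf))
# 5 91.25
# 13 77.25
# 25 56.25
```

Each additional failure changes the total by MTTF - T, here -1.75 sec, so the campaign with the most failures finishes first.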
Many other distributions could be used to generate X(N). Although we did not
test all of them, from Eq. 5 and 6 and Fig. 6 and 7
we can conclude that there is always a distribution
that makes fn smallest and t_all longest.
If one distribution used to generate X(N) causes more failures than the others with the same number of experiments, it is considered better, because more failures expose more problems. These problems can serve as clues for deploying fault-tolerant mechanisms that improve the reliability of the application and the system. In Fig. 6 and 7, X-Normal is the best of the three distributions tested.
A fault injection tool, MOFI, has been developed for injection experiments on the Windows platform. Using this tool, reliability parameters of an application were computed and analyzed. In the course of the experiments, location lists X(N) generated by different random distributions were used to execute injection experiments. The analysis of the experiment results shows that, among the distributions compared, one is always better than the others for running injection experiments with MOFI.
This research is supported by the Aeronautical Innovation Fund, China, under grant No. CASC0409.