What is Software Reliability?
The topic of software reliability is concerned with all life cycle activities
that prevent, detect, remove, and/or mitigate software faults, and that
verify/validate the degree to which the operating software will not cause
system failures. Software reliability is (quantitatively) defined as the
probability of failure-free operation of a software program for a specified
time under specified operating conditions. However, having a "mean
time to failure (MTTF)" with "margins" and "confidence
limits", even with the appropriate supporting data, is
not generally sufficient to convince customers, regulatory authorities,
or even the system/software suppliers that the software satisfies its
requirements. Thus, software reliability is also (qualitatively) demonstrated
by the process followed to develop the software. The dimensions of a good
development process are: use of best-practice engineering methods, implementation
of fault-tolerant design, and application of specialized and procedural
methods to ensure mistake-proof loading and/or operation. These tools
and techniques provide evidence that improves confidence that the software
will not cause a system failure. These reliability assurance techniques
can be extended, e.g., with Fault Tree Analysis or Fault Seeding, to ensure
that safety- and/or security-critical requirements are met.
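The quantitative definition above can be illustrated with a small sketch. It assumes a constant failure rate (the exponential model), which is a simplifying assumption and not something the text prescribes; the function names are hypothetical. The second function shows one way a one-sided lower confidence limit on MTTF might be derived from failure-free test time under that same assumption.

```python
import math


def reliability(t_hours: float, mttf_hours: float) -> float:
    """Probability of failure-free operation for t_hours.

    Assumes a constant failure rate (exponential model): R(t) = exp(-t / MTTF).
    """
    if t_hours < 0 or mttf_hours <= 0:
        raise ValueError("t_hours must be >= 0 and mttf_hours must be > 0")
    return math.exp(-t_hours / mttf_hours)


def mttf_lower_limit_zero_failures(total_test_hours: float, confidence: float) -> float:
    """One-sided lower confidence limit on MTTF after failure-free testing.

    Assumes the exponential model and zero observed failures:
    MTTF_lower = T / -ln(1 - confidence).
    """
    if total_test_hours <= 0 or not 0.0 < confidence < 1.0:
        raise ValueError("total_test_hours must be > 0 and 0 < confidence < 1")
    return total_test_hours / -math.log(1.0 - confidence)


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the text.
    print(reliability(t_hours=100, mttf_hours=1000))
    print(mttf_lower_limit_zero_failures(total_test_hours=3000, confidence=0.90))
```

Even under these assumptions, the result is only as credible as the claim that the test conditions match the specified operating conditions, which is why the text treats such figures as necessary but not sufficient.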
What does Software Reliability Encompass?
Today, software is typically the major concern for reliability
assurance in most important system applications. Because the software
component typically manages the overall system functions, faults in the
software may cause critical system outages. Such system failures due to
software faults are classified as "software failures". Faults
can cause the system to hang, crash, or not perform a task that the customer
wishes it to perform. Thus, it is important to use methods and techniques
that provide evidence that the software component has been designed, implemented,
tested, installed, and, as necessary, updated without faults that might
result in undesirable system failures.
There are similarities between hardware and software failures
and also differences. Software failures are primarily the result of design
defects (during development or maintenance). Other failure sources include
use-induced degradation as well as inadequate operational procedures and
logistics operations documentation that is considered part of the "software
data package". Hardware failures are primarily the result of physical
wear-out. Other failure sources include design defects, manufacturing
quality deficiencies, and maintenance or operating errors. Some system failures
are the result of a combination of hardware and software faults. It is
generally easier to implement changes to software than to hardware, although
any component change must be part of a system support concept that includes
continued reliability analysis. Hardware is generally repaired to an original
state, unless there is a reason to modify it. Software can frequently
be returned to its original state by re-initializing. This cleans up the
software environment (e.g., emptying queues and reclaiming leaked memory).
The software structure can also be changed to correct, enhance, and adapt
the code, producing a new version, that is, a new product.
Both hardware and software must be managed as an integrated
system. The reliability of the system will depend on the reliability of
the hardware and software as an integrated whole. Some techniques for managing
system reliability apply similarly to hardware and software components,
whereas others are unique to hardware or to software.
In addition, the application of a given technique may be different for
software than for hardware.
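As a rough illustration of how system reliability depends on both parts, the sketch below treats hardware and software as a series system: the system works only if every component works. It assumes component failures are independent, which is a simplification (the text notes that some failures arise from combined hardware and software faults), and the function name is hypothetical.

```python
from math import prod
from typing import Iterable


def series_reliability(component_reliabilities: Iterable[float]) -> float:
    """Reliability of a series system, assuming independent component failures.

    The system survives only if every component survives, so the
    reliabilities multiply.
    """
    values = list(component_reliabilities)
    if not values:
        raise ValueError("at least one component reliability is required")
    for r in values:
        if not 0.0 <= r <= 1.0:
            raise ValueError("each reliability must lie in [0, 1]")
    return prod(values)


if __name__ == "__main__":
    # Illustrative values only: hardware 0.99, software 0.95 over the same mission time.
    print(series_reliability([0.99, 0.95]))
```

Under this model the system can never be more reliable than its weakest component, so improving hardware alone does little if the software dominates the failure probability.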
No existing method guarantees that delivered software
has no faults. That is, there is always some likelihood that under certain
environmental conditions and system operational use, faults in software
will be encountered that result in failures of the system. In short, software
reliability is not "1.0". There are, however, methods and techniques
whose use correlates with delivering software that has fewer faults and failures.
It is desirable to provide sufficient quantitative and qualitative evidence
that appropriate development and support activities have been properly
conducted to prevent, detect, remove, and/or mitigate possible software
faults, particularly those faults that might result in critical system
failures.
How might faults be prevented, detected, removed, and/or
mitigated in the software development and/or support activities? What
techniques might be used to provide quantitative or qualitative evidence
that faults capable of causing a system failure do not exist in the software
component? Given limited resources and time, which combination of techniques
provides the "optimum" cost/benefit results - and what measures
are appropriate and "accurate enough"? How are decisions made
to select such techniques and how is the evidence from the use of such
techniques collected and presented? It is these concerns for which research
and practical experience provide some guidelines - both for management
of a software reliability program and conduct of life cycle activities
using appropriate software engineering and reliability-specific techniques.
It is from these types of guidelines, typically constrained to specific
application domains and operational scenarios, that software reliability
standards can be established.
When high levels of reliability need to be assured, it will
be necessary to use several sources of evidence to support reliability
claims. Combining such disparate evidence to aid decision making is itself
a difficult task and a topic of current research. Four areas of evidence
are important in terms of benefits and limitations:
- evidence from software components and structure;
- evidence from static analysis of the software product;
- evidence from testing of software under operational conditions; and
- evidence of process quality.
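To give a feel for the limits of evidence from testing under operational conditions, the sketch below estimates how many failure-free test runs would be needed to support a reliability claim at a given confidence. It assumes independent runs drawn from the operational profile, each with the same failure probability; these assumptions and the function names are illustrative and not part of the text.

```python
import math


def failure_free_runs_needed(target_reliability: float, confidence: float) -> int:
    """Smallest n such that n failure-free runs support the reliability claim.

    Assumes independent runs with identical per-run success probability R.
    If true reliability were below R, n straight successes would occur with
    probability less than R**n, so we require R**n <= 1 - confidence.
    """
    if not 0.0 < target_reliability < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("target_reliability and confidence must lie in (0, 1)")
    return math.ceil(math.log(1.0 - confidence) / math.log(target_reliability))


def failure_probability_upper_bound(failure_free_runs: int, confidence: float) -> float:
    """Upper confidence bound on per-run failure probability after n clean runs."""
    if failure_free_runs <= 0 or not 0.0 < confidence < 1.0:
        raise ValueError("failure_free_runs must be > 0 and 0 < confidence < 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / failure_free_runs)


if __name__ == "__main__":
    # Illustrative targets only; they are not taken from the text.
    print(failure_free_runs_needed(target_reliability=0.999, confidence=0.95))
    print(failure_probability_upper_bound(failure_free_runs=1000, confidence=0.95))
```

The required number of runs grows quickly as the target approaches 1, which is one reason high reliability claims usually need the other sources of evidence listed above in addition to testing.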
The challenges of providing software reliability assurance include
cultural issues as well as hard technical research questions
that remain to be investigated.
Reliability Confidence Limits
A comprehensive list of software reliability references can be found here.
A list of software reliability standards can be found here.