Software System Safety

Introduction
In the past, industry in general considered increased productivity the most important aspect of Software Engineering. Little consideration was given to the reliability or safety of the software product.

In recent years, software and hardware have taken on the command and control of complex and costly systems upon which human lives may depend. This role has compelled the Department of the Army and industry to establish goals of highly reliable, productive, and safe software in which hazard-causing faults or errors are unacceptable. These new goals require the support of professionals who have attained some level of expertise in the various aspects of software and firmware. System Safety Engineers are no exception. The System Safety Engineer can apply system safety methods and techniques to the analysis of software systems with a high degree of confidence, in order to certify the safety of the system that the software controls.

Because Software Safety is a field in its infancy, all of the usual growing pains have to be experienced. Recent cases of software whose use was unsafe are strongly suggestive of the risks involved. Engineers must recognize that software is just another system component, and that this component can contain errors or defects which can cause undesired events in the hardware system it is controlling. System Safety Engineers should work with Software Engineers to identify those errors which can cause hazards or produce undesired events.

Overview
Software System Safety, an element of the total safety and software development program, cannot be allowed to function independently of the total effort. Both simple and highly integrated multiple systems are experiencing an extraordinary growth in the use of computers and software to monitor and/or control safety-critical subsystems or functions. A software specification error, design flaw, or the lack of generic safety-critical requirements can contribute to or cause a system failure or erroneous human decision. To achieve an acceptable level of safety for software used in critical applications, Software System Safety engineering must be given primary emphasis early in the requirements definition and system conceptual design process. Safety-critical software must then receive continuous management emphasis and engineering analysis throughout the development and operational lifecycles of the system.

Definition
Software System Safety optimizes system safety in the design, development, use, and maintenance of software systems and their integration with safety critical hardware systems in an operational environment.

Goals

  1. Safety, consistent with mission requirements, is designed into the software in a timely, cost-effective manner.
  2. Hazards associated with the system and its software are identified, evaluated, and eliminated, or the associated risk is reduced to an acceptable level, throughout the lifecycle.
  3. Reliance on administrative procedures for hazard control is minimized.
  4. The number and complexity of safety critical interfaces is minimized.
  5. The number and complexity of safety critical computer software components is minimized.
  6. Sound human engineering principles are applied to the design of the software-user interface to minimize the probability of human error.
  7. Failure modes, including hardware, software, human, and system failure modes, are addressed in the design of the software.
  8. Sound software engineering practices and documentation are used in the development of the software.
  9. Safety issues are addressed as part of the software testing effort at all levels of testing.
  10. Software is designed for ease of maintenance and modification or enhancement.

Questionnaires

  1. Does the system contain software/firmware?
  2. Does the software/firmware have complete control over hardware and its subsystems or components without the ability of operator intervention?
  3. Does the software/firmware have control over a hazardous system and does it allow for operator intervention?
  4. Does the software/firmware generate information which will be used in the making of critical decisions?
  5. Do adequate controls exist in the design of the software to minimize the risk of potentially critical hazards?
  6. Will credible failures of single hardware input or output devices result in the occurrence of a catastrophic or critical hazard?
  7. Does the system power up to a safe state and revert to a safe state in the event of the failure of critical components, such as a primary computer failure or a power failure?

    How do you establish the safety of existing (old) software?

  1. Get all information on field history such as:
    1. hours of use
    2. type of actual use
    3. environmental conditions
    4. experienced personnel
    5. existing hardware configuration (sensors, computer, I/O)

    Analyze all changes to the above, including new missions, use and time.

  2. Determine the true "delta" of the software package. What are the new functions vs. the new hazards? Are the old hazards corrected in the new version?
  3. Review the process that produced the new version. Does it look like the developers are producing good code?
  4. Review the S/W audit minutes. If SSWG minutes are available, obtain and review them for S/W hazards.
  5. If the code is already written, you will have to rely on testing. If the code is not yet written, give the S/W developer code and design checklists. Tailor the checklists with the known hazards and with the error set known for the language, CPU, and architecture. If testing is the only means available, develop tests derived from the hazards (and the resulting requirements specification). Use the error sets for the language and CPU, the checklists, and examples.
  6. Ensure that all hazards have been addressed by all tests, including all field history-derived hazards.
  7. Review test procedures/cases and problem reports for safety critical applications during formal qualification and regression testing on the target hardware.
  8. Recommend and implement corrections to all safety critical issues that resulted from formal testing.

Requirements
1. Safety Critical Software Development
A structured development environment and an organization committed to ensuring the safety of the soldier and using state of the art methods are prerequisites to developing dependable safety critical software.

The following requirements and guidelines are intended to carry out the cardinal safety rule and its corollary: no single event or action shall be allowed to initiate a potentially hazardous event, and the system, upon detection of an unsafe condition or command, shall inhibit the potentially hazardous event sequence and originate procedures/functions to bring the system to a predetermined "safe" state.
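
As an illustration of the corollary above, the following minimal C sketch shows one way a control loop can inhibit a hazardous sequence and drive the system to its predetermined safe state when an unsafe condition or command is detected. The state names and monitor functions are assumptions made for this sketch only; a real system would take them from its own specification.

    #include <stdbool.h>

    /* Hypothetical system states and monitor interfaces (assumed names). */
    typedef enum { STATE_SAFE, STATE_OPERATE } system_state_t;

    static system_state_t state = STATE_OPERATE;

    static bool detect_unsafe_condition(void) { return false; } /* stub monitor          */
    static void inhibit_hazardous_outputs(void) { /* remove power from actuators */ }
    static void log_safing_event(void)          { /* record the safing action    */ }

    /* Called every control cycle.  On detection of an unsafe condition or
       command, the hazardous event sequence is inhibited and the system is
       driven to the predetermined safe state before any further processing. */
    void control_cycle(void)
    {
        if (detect_unsafe_condition()) {
            inhibit_hazardous_outputs();
            state = STATE_SAFE;
            log_safing_event();
            return;                  /* no hazardous processing this cycle */
        }
        if (state == STATE_SAFE) {
            return;                  /* remain safe until explicitly reset */
        }
        /* ... normal command processing for the current state ... */
    }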

The purpose of this section is to describe the software safety activities which should be incorporated into the software development phases of project development. The software safety information which should be included in the documents produced during these phases is also discussed.

If government standards or guidelines exist which define the format and/or content of a specific document (e.g., Data Item Descriptions (DIDs)), they are referenced and should be followed. The term "software components" is used in a general sense to represent important software development products such as software requirements, software designs, software code or program sets, software tests, etc.

1.1 Software Concept and Initiation Phase
For most government projects this lifecycle phase involves system level requirements and design development.

Although most project work during this phase is concentrated on the subsystem level, software development has several tasks that must be initiated. These include the creation of important software documents and plans which will determine how, what, and when important software products will be produced or activities will be conducted. Each of the following documents should address software safety issues:
 
Document: Software Safety Section
System Safety Plan: Include software as a subsystem; identify tasks.
Software Concepts Document: Identify safety critical processes.
Software Management Plan and Software Configuration Management Plan: Coordination with system safety tasks; flowdown and incorporation of safety requirements; applicability to safety critical software.
Software Security Plan: Security of safety critical software.
Software Quality Assurance Plan: Support to software safety; verification of software safety requirements; safety participation in software reviews and inspections.

1.2 Software Requirements Phase
The cost of correcting software faults and errors escalates dramatically as the development life cycle progresses, making it important to correct errors and implement correct software requirements from the very beginning. Unfortunately, it is generally impossible to eliminate all errors.

Software developers must therefore work toward two goals:
    (1) to develop complete and correct requirements and correct code, and
    (2) to develop fault-tolerant designs, which will detect and compensate for software faults "on the fly".
NOTE: (2) is required because (1) is usually impossible.
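
A minimal C sketch of goal (2), assuming a hypothetical temperature input with known plausible limits: the software detects an implausible (faulted) reading at run time and compensates "on the fly" by substituting the last known-good value instead of propagating the fault. The range limits and default value are assumptions for this sketch.

    #include <stdbool.h>

    #define TEMP_MIN_C  (-40.0)   /* assumed plausible sensor range */
    #define TEMP_MAX_C  (125.0)

    static double last_good_temp = 20.0;   /* assumed safe default */

    /* Returns a temperature the rest of the software can trust.  Out-of-range
       readings are detected and compensated for by falling back to the last
       good value; the fault flag lets FDIR or logging act on the failure.   */
    double read_temperature_tolerant(double raw_reading, bool *fault_flag)
    {
        if (raw_reading < TEMP_MIN_C || raw_reading > TEMP_MAX_C) {
            *fault_flag = true;          /* detected: report for FDIR/logging */
            return last_good_temp;       /* compensated: use last good value  */
        }
        *fault_flag = false;
        last_good_temp = raw_reading;
        return raw_reading;
    }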

This section describes safety involvement in developing safety requirements for software. The software safety requirements can be top-down (flowed down from system requirements) and/or bottom-up (derived from hazard analyses). In some organizations, top-down flow is the only permitted route for requirements into software, and in those cases, newly derived bottom-up safety requirements must be flowed back into the system specification.

Software safety requirements are derived from the system and subsystem safety requirements developed to mitigate hazards identified in the Preliminary, System, and Subsystems Hazard Analyses.

Also, system safety flows requirements to systems engineering. The systems engineering group and the software development group (i.e., Integrated Product Teams (IPTs)) have a responsibility to coordinate and negotiate the requirements flow down so that it is consistent with the software safety requirements flowdown.

The software system safety organization should flow requirements into the following documents:

Software Requirements Document (SRD)
Safety-related requirements must be clearly identified in the SRD.

Software Interface Specification (SIS) or Interface Control Document (ICD)

SIS activities identify, define, and document interface requirements internal to the [sub] system in which software resides, and between system (including hardware and operator interfaces), subsystem, and program set components and operation procedures.

Note that the SIS is sometimes effectively contained in the SRD, or within an ICD which defines all system interfaces, including hardware-to-hardware, hardware-to-software, and software-to-software.

1.2.1 Development of Software System Safety Requirements
Software System Safety requirements are obtained from several sources, and are of two types: generic and specific.

The generic category of software safety requirements is derived from sets of requirements that can be used in different programs and environments to solve common software safety problems. Examples of generic software safety requirements and their sources are given in paragraph 1.2.2, Generic Software System Safety Requirements. Specific software safety requirements are system-unique functional capabilities or constraints, which are identified in three ways:

1) Through top-down analysis of system design requirements (from specifications): The system requirements may identify system hazards up front and specify which system functions are safety critical. The (software) system safety organization participates in or leads the mapping of these requirements to software.

2) From the Preliminary Hazard Analysis (PHA): The PHA looks down into the system from the point of view of system hazards. Preliminary hazard causes are mapped to, or shown to interact with, software. Software hazard control features are identified and specified as requirements.

3) Through bottom-up analysis of design data (e.g., flow diagrams, FMEAs, fault trees, etc.): Design implementations allowed, but not anticipated, by the system requirements are analyzed and new hazard causes are identified. Software hazard controls are specified via requirements when the hazard can be mapped to, or interacts with, software.

1.2.1.1 Safety Requirements Flowdown
Generic safety requirements are established up front and placed into the system specification and/or overall project design specifications. From there, they are flowed down into lower level unit and module specifications.

Other safety requirements, derived from bottom-up analysis, are flowed up from subsystems and components to the system level requirements. These new system level requirements are then flowed down across all affected subsystems. During the System Requirements Phase, subsystems and components may not be well defined. In this case, bottom-up analysis might not be possible until the Architectural Design Phase or even later.

1.2.2 Generic Software System Safety Requirements
The generic category of software safety requirements is derived from sets of requirements and best practices used in different programs and environments to solve common software safety problems. Similar processors/platforms and/or software can suffer from similar or identical design problems. Generic software safety requirements capture these lessons learned and provide a valuable resource for developers.

Generic requirements prevent costly duplication of effort by taking advantage of existing proven techniques and lessons learned rather than reinventing techniques or repeating mistakes. Most development programs should be able to make use of some generic requirement; however, they should be used with care.

As technology evolves, or as new applications are implemented, new "generic" requirements will likely arise, and other sources of generic requirements might become available. A partial listing of generic requirement sources is shown below:

NSTS 19943 Command Requirements and Guidelines for NSTS Customers.
STANAG 4404 (Draft) NATO Standardization Agreement (STANAG), Safety Design Requirements and Guidelines for Munitions Related Safety Critical Computing Systems.
WSMCR 127-1 Range Safety Requirements - Western Space and Missile Center, Attachment-3 Software System Design Requirements. This document is being replaced by EWRR (Eastern and Western Range Regulation) 127-1, Section 3.16.4, Safety Critical Computing System Software Design Requirements.
AFISC SSH 1-1 System Safety Handbook - Software System Safety, Headquarters Air Force Inspection and Safety Center.
EIA Bulletin SEB6-A System Safety Engineering in Software Development (Electronic Industries Association).
UL 1998 Standard for Safety - Safety Related Software, Underwriters Laboratories, January 4, 1994.
NUREG/CR-6263 MTR 94W0000114, High Integrity Software for Nuclear Power Plants, The MITRE Corporation, for the U.S. Nuclear Regulatory Commission.

Generic Software Safety Requirements

  1. The failure of safety critical software functions shall be detected, isolated, and recovered from, such that catastrophic and critical hazardous events are prevented from occurring.
  2. Software shall perform automatic Failure Detection, Isolation, and Recovery (FDIR) for identified safety critical functions, with a time-to-criticality under 24 hours.
  3. Automatic recovery actions taken shall be reported to the crew, operator, or controlling executive. There shall be no necessary response from crew or ground operators to proceed with the recovery action.
  4. The FDIR switchover software shall be resident on an available, non-failed control platform which is different from the one with the function being monitored.
  5. Override commands shall require multiple operator actions.
  6. Software shall process the necessary commands within the time-to-criticality of a hazardous event.
  7. Hazardous commands shall only be issued by the controlling application, or by the crew, ground, or controlling executive.
  8. Software that executes hazardous commands shall notify the initiating crew, ground operator, or controlling executive upon execution or provide the reason for failure to execute a hazardous command.
  9. Prerequisite conditions (e.g., correct mode, correct configuration, component availability, proper sequence, and parameters in range) for the safe execution of an identified hazardous command shall be met before execution.
  10. In the event that prerequisite conditions have not been met, the software shall reject the command and alert the crew, ground operators, or the controlling executive.
  11. Software shall make available status of all software controllable inhibits to the crew, ground operators, or the controlling executive.
  12. Software shall accept and process crew, ground operator, or controlling executive commands to activate/deactivate software controllable inhibits.
  13. Software shall provide an independent and unique command to control each software controllable inhibit.
  14. Software shall incorporate the capability to identify and status each software inhibit associated with hazardous commands.
  15. Software shall make available current status on software inhibits associated with hazardous commands to the crew, ground operators, or controlling executive.
  16. All software inhibits associated with a hazardous command shall have a unique identifier.
  17. Each software inhibit command associated with a hazardous command shall be consistently identified using the rules and legal values.
  18. If an automated sequence is already running when a software inhibit associated with a hazardous command is activated, the sequence shall complete before the software inhibit is executed.
  19. Software shall have the ability to resume control of an inhibited operation after deactivation of a software inhibit associated with a hazardous command.
  20. The state of software inhibits shall remain unchanged after the execution of an override.
  21. Software shall provide error handling to support safety critical functions.
  22. Software shall provide caution and warning status to the crew, ground operators, or the controlling executive.
  23. Software shall provide for crew/ground forced execution of any automatic safing, isolation, or switchover functions.
  24. Software shall provide for crew/ground forced termination of any automatic safing, isolation, or switchover functions.
  25. Software shall provide for crew/ground commanded return to the previous mode or configuration for any automatic safing, isolation, or switchover function.
  26. Software shall provide for crew/ground forced override of any automatic safing, isolation, or switchover functions.
  27. Software shall provide fault containment mechanisms to prevent error propagation across replaceable unit interfaces.
  28. Hazardous payloads shall provide failure status and data to core software systems. Core software systems shall process hazardous payload status and data to provide status monitoring and failure annunciation.
  29. Software (including firmware) Power On Self Test (POST) utilized within any replaceable unit or component shall be confined to that single system process controlled by the replaceable unit or component.
  30. Software (including firmware) POST utilized within any replaceable unit or component shall terminate in a safe state.
  31. Software shall initialize, start, and restart replaceable units to a safe state.
  32. For systems solely using software for hazard risk mitigation, software shall require two independent command messages for a commanded system action that could result in a critical or catastrophic hazard.
  33. Software shall require two independent operator actions to initiate or terminate a system function that could result in a critical hazard.
  34. Software shall require three independent operator actions to initiate or terminate a system function that could result in a catastrophic hazard (a minimal arm/fire interlock sketch follows this list).
  35. Operational software functions shall allow only authorized access.
  36. Software shall provide proper sequencing (including timing) of safety critical commands.
  37. Software termination shall result in a safe system state.
  38. In the event of hardware failure, software faults that lead to system failures, or when the software detects a configuration inconsistent with the current mode of operation, the software shall have the capability to place the system into a safe state.
  39. When the software is notified of or detects hardware failures, software faults that lead to system failures, or a configuration inconsistent with the current mode of operation, the software shall notify the crew, ground operators, or the controlling executive.
  40. Hazardous processes and safing processes with a time-to-criticality such that timely human intervention may not be available, shall be automated (i.e., not require crew intervention to begin or complete).
  41. The software shall notify crew, ground, or the controlling executive during or immediately after execution of an automated hazardous or safing process.

  42. Unused or undocumented codes shall be incapable of producing a critical or catastrophic hazard.
  43. All safety critical elements (requirements, design elements, code modules, and interfaces) shall be identified as "safety critical."
  44. An application software set shall ensure proper configuration of inhibits, interlocks, safing logic, and exception limits at initialization.
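
Requirements 32 through 34 above call for multiple independent command messages or operator actions before a hazardous function may proceed. The C sketch below shows one common way to implement a two-step arm/fire interlock so that no single action, message, or stuck input can initiate the hazardous event. The command codes, timing window, and function names are assumptions made for this sketch, not values from any cited standard.

    #include <stdbool.h>
    #include <stdint.h>

    #define CMD_ARM        0x41u   /* hypothetical command identifiers       */
    #define CMD_FIRE       0x46u
    #define ARM_WINDOW_MS  5000u   /* arming expires if FIRE is not timely   */

    static bool     armed = false;
    static uint32_t arm_time_ms = 0;
    static uint32_t fake_clock_ms = 0;

    static uint32_t now_ms(void)      { return fake_clock_ms; } /* stub clock */
    static void     fire_output(void) { /* drive the hazardous output */ }

    /* Processes one operator command.  FIRE is honored only if a separate,
       earlier ARM command was received within the arming window, so two
       independent actions are required to initiate the hazardous function. */
    void process_command(uint8_t cmd)
    {
        switch (cmd) {
        case CMD_ARM:
            armed = true;
            arm_time_ms = now_ms();
            break;
        case CMD_FIRE:
            if (armed && (now_ms() - arm_time_ms) <= ARM_WINDOW_MS) {
                fire_output();
            }
            armed = false;           /* single use: re-arm required each time   */
            break;
        default:
            armed = false;           /* any unexpected command drops the arming */
            break;
        }
    }

For catastrophic hazards (requirement 34), a third independent consent step would be added in the same style.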

Hazard Analysis
Preliminary Hazard Analysis (PHA)
The purpose of performing a Preliminary Hazard Analysis (PHA) is to identify safety critical areas, evaluate hazards, and identify the safety design criteria to be used. The system should be examined shortly after the concept definition effort begins in order to provide a list of hazards that may require special safety design emphasis or hazardous areas where in-depth analyses need to be done. The PHA effort must start during the concept exploration phase so that safety considerations are included in tradeoff studies and design alternatives. Based on the best available data, including mishap data from similar systems and other lessons learned, hazards associated with the proposed design or function must be evaluated for hazard severity, hazard probability, and operational constraints. As a minimum, the PHA should consider the following for identification and evaluation of hazards:

  1. Hazardous components (e.g., fuels, lasers, toxic substances).

  2. Safety design criteria to control safety-critical software commands and responses (e.g., inadvertent command, failure to command, untimely command or responses) must be identified and appropriate action taken to incorporate them into the software specifications.

  3. Environmental constraints including the operating environments (e.g., temperatures, fire, lightning, and radiation).

  4. Safety related equipment, safeguards, and possible alternate approaches.

  5. Identification of the safety requirements, standards and other regulations pertaining to personnel safety, environmental hazards, and toxic substances with which the system will have to comply.

Software Requirement Hazard Analysis (SRHA)
The SRHA effort begins at the time that the system requirements allocation is being made. A safety evaluation of the requirements will identify requirements that are missing, not well defined (bounds not specified, not clear enough to be evaluated as having been implemented, etc.), or that could result in hazards. The effort generally includes the following:

Review of software requirement specifications (SRS) to ensure that hazards from the system PHA have been identified in the SRS. The SRS should explicitly state known hazards that are relevant to the software operation. Analysis of functional flow diagrams (or their functional equivalent), finite state machine diagrams, data flow diagrams, storage and timing allocation charts and other software documentation are needed to assure that specification and safety requirements are met.

Design Specification Hazard Analysis
The outdated code-and-test-only software development paradigm is inadequate for an effective safety program. A more disciplined engineering approach is needed, one that focuses on design rather than coding. This approach should allow for teamwork in the design process and should recognize that:

  1. Software design specifications must be detailed and communicative.

  2. Software design architecture and code construction must be modular.

  3. Use of modern software design methods, such as object-oriented technology, must be encouraged.

Software design specifications must describe in detail how each and every functional requirement is to be met. There must be a clear and obvious path showing how each requirement is implemented in the design specification [8]. Analysis includes relating the hazards identified in the PHA and SRHA to specific software components, and identifying the Safety-Critical Computer Software Components (SCCSCs). Software design and its resulting code must be modular, with all SCCSCs separated from the non-critical sections of the software. Software modularity is proven to improve quality. The separation of SCCSCs increases software cohesion, which allows potential errors to be isolated.

    SCCSCs are those computer software components (processes, functions, values or computer program states) whose errors (e.g., inadvertent or unauthorized occurrence, failure to occur when required, occurrence out of sequence, occurrence in combination with other functions, or erroneous value) result in a potential hazard, or in loss of control of a system.

Subsystem Hazard Analysis (SSHA)
As soon as the subsystems are defined, the SSHA can begin. It should be updated as the design matures. This analysis examines each subsystem or component and identifies the hazards associated with it. It determines how operation or failure of components affects the overall safety of the system. This analysis should identify necessary actions to eliminate or reduce the risk of identified hazards. For software, the objectives of this analysis are described as follows:

  1. Examine all SCCSCs (algorithms, components, modules, routines, and calculations) in the subsystem for correctness (input/output, timing, multiple event, wrong event, out-of-sequence, adverse environment, deadlocking, inappropriate magnitude, and any other hazard conditions).

  2. Identify all software components whose performance, performance degradation, functional failure, or inadvertent functioning could result in a hazard or whose design does not satisfy contractual safety requirements.

  3. Determine potential contribution of software events, faults, and occurrences (such as improper timing) on the safety of the subsystem.

  4. Ensure that the safety design criteria identified in the software requirement specifications have been satisfied.

  5. Ensure that the implementation of the software requirements, design, and corrective actions does not impair or decrease the safety of the subsystem and has not introduced any new hazards.

System Hazard Analysis (SHA)
The SHA determines how system operation and failure modes affect the safety of the system and its subsystems. Specifically, the SHA examines all subsystem interfaces for:

  1. Possible combinations of independent, dependent, or interdependent hardware or software failures, unintended program jumps, single or multiple events, or out-of-order events that could cause the system to operate in a hazardous manner. Failures of controls and safety devices must be considered.

  2. How normal operations of systems and subsystems could degrade the safety of the system.

  3. Compliance with safety criteria called out in the applicable system/subsystem requirements documents.

  4. Design changes to systems, subsystems, interfaces, logic, or software that could create new hazards to equipment and personnel.

Safety Analysis Techniques
Current analysis techniques and methodologies available for conducting software safety analysis are:

  1. Software fault tree analysis (illustrated in the sketch at the end of this subsection).

  2. Failure modes and effects analysis.

  3. Design walk-through.

  4. Code walk-through.

  5. Petri net analysis.

  6. Software/hardware integrated critical path analysis.

  7. Safety cross-check analysis.

  8. Cross reference listing analysis.

Each technique has its strengths and weaknesses. A thorough software hazard analysis may require application of more than one of these techniques on any software element.
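
As a small illustration of the first technique, a software fault tree can be represented and evaluated mechanically. The C sketch below is a toy model, not the data format of any particular tool: it encodes basic events and AND/OR gates and computes whether an assumed top-level hazard occurs for a given assignment of basic-event states.

    #include <stdbool.h>

    typedef enum { GATE_BASIC, GATE_AND, GATE_OR } gate_type_t;

    typedef struct node {
        gate_type_t        type;
        bool               occurred;      /* meaningful only for basic events */
        const struct node *left, *right;  /* children, used only for gates    */
    } node_t;

    /* Recursively evaluates whether the event represented by a node occurs,
       given the states assigned to the basic events. */
    static bool event_occurs(const node_t *n)
    {
        switch (n->type) {
        case GATE_BASIC: return n->occurred;
        case GATE_AND:   return event_occurs(n->left) && event_occurs(n->right);
        case GATE_OR:    return event_occurs(n->left) || event_occurs(n->right);
        }
        return false;
    }

    /* Example tree: the top hazard occurs if (sensor fault AND stale-data
       check disabled) occurs, OR if an unintended operator override occurs. */
    bool top_hazard_example(void)
    {
        node_t sensor_fault   = { GATE_BASIC, true,  0, 0 };
        node_t check_disabled = { GATE_BASIC, false, 0, 0 };
        node_t override       = { GATE_BASIC, false, 0, 0 };
        node_t both_faults    = { GATE_AND, false, &sensor_fault, &check_disabled };
        node_t top            = { GATE_OR,  false, &both_faults,  &override };
        return event_occurs(&top);   /* false for this assignment of events   */
    }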

Test Requirements

  1. Software testing shall include NO-GO path testing.

  2. Software testing shall include hardware and software input failure mode testing.

  3. Software testing shall include boundary, out-of-bounds, and boundary crossing test conditions (an illustrative data-driven test table follows this list).

  4. Software testing shall include input values of zero, zero crossing, and approaching zero from either direction.

  5. Software testing shall include minimum and maximum input data rates in worst case configurations to determine the system's response and capabilities to these environments.

  6. Safety Critical Computer Software Components (SCCSCs) in which changes have been made shall be subjected to complete regression testing.

  7. Operator interface testing shall include operator errors during safety critical operations to verify safe system response to these errors.

  8. Conduct a test readiness review.

  9. Correlate all tests to the Software Requirements Specification (SRS) and the System/Segment Specification (SSS).

  10. Verify that the object code contains no extraneous code or patches.
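
For test requirements 3 and 4, a simple data-driven table of boundary, out-of-bounds, boundary-crossing, and zero/near-zero inputs is often sufficient at the unit level. The C sketch below assumes a hypothetical unit under test, scale_input(), with a legal input range of -100 to +100; the function, range, and expected values exist only for this illustration.

    #include <stdio.h>

    /* Hypothetical unit under test: legal input range is [-100, +100]. */
    static int scale_input(int x)
    {
        if (x < -100 || x > 100) return 0;   /* out-of-range inputs map to 0 */
        return x * 2;
    }

    /* Boundary, out-of-bounds, boundary-crossing, zero, and near-zero cases. */
    static const struct { int input; int expected; } cases[] = {
        { -101,    0 },   /* just below the lower bound (out of bounds) */
        { -100, -200 },   /* lower boundary                             */
        {   -1,   -2 },   /* approaching zero from below                */
        {    0,    0 },   /* zero                                       */
        {    1,    2 },   /* approaching zero from above                */
        {  100,  200 },   /* upper boundary                             */
        {  101,    0 },   /* just above the upper bound (out of bounds) */
    };

    int main(void)
    {
        int failures = 0;
        for (unsigned i = 0; i < sizeof cases / sizeof cases[0]; i++) {
            int got = scale_input(cases[i].input);
            if (got != cases[i].expected) {
                printf("FAIL: input %d, expected %d, got %d\n",
                       cases[i].input, cases[i].expected, got);
                failures++;
            }
        }
        printf("%d failure(s)\n", failures);
        return failures != 0;
    }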

Verification & Validation
Verification tasks include reviews, configuration audits, and quality audits. Validation tasks are performed to guarantee that all software configuration components fulfill their intended objectives. The entire software configuration must be checked, and the final product must be validated against its initial requirements. Special attention must be given to verifying the traceability of safety requirements in the software configuration. Safety requirements, as specified in system specifications, requirements documents, etc., will need to be verified by analysis, inspection, demonstration, or test. All design changes require verification. As with configuration management, this activity should be started when the project begins and carried out throughout the entire software life cycle. Much of the safety validation is outlined in system/subsystem test plans and procedures. Regression testing is highly recommended. Testing must be conducted at the unit level first, then move up to the system level. The objectives of verification and testing are to:

  1. Ensure that the identified safety hazards have been eliminated or reduced to an acceptable level of risk.

  2. Provide appropriate test procedures, cases, and inputs to test personnel to test the SCCSCs for safe and proper operation. Test cases are to be developed based on the risk analysis, fault tree analysis, and program logic analysis.

  3. Ensure that all of the Safety Critical Computer Software Components (SCCSCs) are tested in accordance with the approved test procedures, and that test results are accurately recorded.

  4. Test the software under abnormal environmental and input conditions, as well as normal conditions, to ensure that it performs properly and safely under these conditions.

  5. Subject the software to stress testing to ensure that it performs properly and safely under stress conditions.

  6. Ensure that safety hazards and other deficiencies and discrepancies, discovered during system integration and system acceptance testing, are corrected and re-tested to be sure that they are no longer a problem.

Lessons Learned
1.0 THERAC Radiation Therapy Machine Fatalities
1.1 Summary
Eleven Therac-25 therapy machines were installed, 5 in the US and 6 in Canada. They were manufactured by AECL, a Canadian Crown (government-owned) company. The Therac-25 was an advance over the earlier -6 and -20 models (the numbers correspond to energy delivery capacity), with more energy and more automation features. Although all models had some software control, the Therac-25 had many new features and had replaced most of the hardware interlocks with software versions. There was no record of any malfunction resulting in patient injury from any of the models earlier than the Therac-25. The software control was implemented on a DEC PDP-11 processor using a custom executive and assembly language. A single programmer implemented virtually all of the software. He had an uncertain level of formal education and produced very little, if any, documentation on the software.

Between 6/85 and 1/87 there were six known accidents involving massive radiation overdoses from the Therac-25. Three of the six resulted in fatalities. The company did not respond effectively to early reports suggesting that the software could have contributed to the failures. Records show that software was deliberately left out of an otherwise thorough safety analysis performed in 1983, which used fault-tree methods. Software was excluded because "software errors have been eliminated because of extensive simulation and field testing. (Also) software does not degrade due to wear, fatigue or reproduction process." Other types of software failures were assigned very low failure rates with no apparent justification. After a large number of lawsuits and extensive negative publicity, the company decided to withdraw from the medical instrument business and concentrate on its main business of nuclear reactor control systems.

The accidents were due to many design deficiencies involving a combination of software design defects and system operational interaction errors. There were no apparent review mechanisms for software design or quality control. The accidents continued to recur before effective corrective action was taken because of management's faith in the correctness of the software, a faith without any apparent evidence to support it. The errors were not discovered earlier because the policy was to fix the symptoms without investigating the underlying causes, of which there were many.

1.2 Key Facts

  1. The software was assumed to be fail-safe and was excluded from normal safety analysis review.
  2. The software design and implementation had no effective review or quality control practices.
  3. The software testing at all levels was obviously insufficient, given the results.
  4. Hardware interlocks were replaced by software without supporting safety analysis.
  5. There was no effective reporting mechanism for field problems involving software.
  6. Software design practices (contributing to the accidents) did not include the basic shared-data and contention management mechanisms normal in multi-tasking software. The investigation concluded that the programmer was not fully qualified for the task.
  7. It was determined that the overall design was unnecessarily complex. For instance, there were more parallel tasks than necessary, which was a direct cause of some of the accidents.

1.3 Lessons Learned

  1. Changeover from a hardware to a software implementation must include a review of assumptions, physics and rules.
  2. Testing should include possible abuse or bypassing of expected procedures.
  3. Design and implementation of software must be subject to the same safety analysis, review and quality control as other parts of the system.
  4. Hardware interlocks should not be completely eliminated when incorporating software interlocks.

  5. Programmer qualifications are as important as qualifications for any other member of the engineering team.

2.0 Missile Launch Timing Causes Hang Fire
2.1 Summary  
An aircraft was modified from a hardware controlled missile launcher to a software controlled launcher. The aircraft was properly modified according to standards and the software was fully tested at all levels before delivery to operational test. The normal weapons rack interface and safety overrides were fully tested and documented. The aircraft was loaded with a live missile (with an inert warhead) and sent out onto the range for a test firing.

The aircraft was commanded to fire the weapon, and it did as designed. Unfortunately, the design did not specify the amount of time to unlock the holdback, and the timing was coded according to the programmer's assumption. In this case, the assumed unlock time was insufficient and the holdback locked before the weapon left the rack. Because the weapon was powered, its engine drove the weapon while it was still attached to the aircraft. This resulted in a loss of altitude and a wild ride. The aircraft landed safely with a burned-out weapon.

2.2 Key Facts

  1. Proper process and procedures were followed as far as specified.
  2. The product specification was re-used without considering differences in the software implementation, i.e., the timing issues. Hence, the initiating event was a specification error.
  3. While the acquirer and user had experience in the weapons system, neither had experience in software. The programmer did not have experience in the details of the weapons system. The result was that the interaction between the two parts of the system was not understood by any of the parties.

2.3 Lessons Learned

  1. Because the software controlled implementation was not fully understood, the result was flawed specifications and incomplete tests. Therefore, even though the software and subsystem were thoroughly tested against the specifications, the system design was in error, and a mishap occurred.
  2. Changeover from hardware to software requires review of design assumptions by all relevant specialists, acting jointly. This joint review must include all product specifications, interface documentation and testing.
  3. The test, verification and review processes must each include end-to-end event review and test.

3.0 Reused Software Causes Flight Controls To Shut Down
3.1 Summary

A research vehicle was designed with fly-by-wire digital control and, for research and weight considerations, had no hardware backup systems installed. The normal safety and testing practices were minimized or eliminated by citing many arguments, such as the use of experienced test pilots, limited flight and exposure times, a minimum number of flights, controlled airspace, use of monitors and telemetry, etc. It was also argued that the approach was safer because the system reused software from similar vehicles currently in operation.

The aircraft flight controls went through every level of test, including "iron bird" laboratory tests that allow direct measurement of the response of the flight components. The failure occurred on the flight line the day before actual flight was to begin after the system had successfully completed all testing. The flight computer was operating for the first time unrestricted by test routines and controls. A reused portion of the software was inhibited during earlier testing as it conflicted with certain computer functions. This was part of the reused software taken from a proven and safe platform because of its functional similarity. This portion was now enabled and running in the background.

Unfortunately, the reused software shared computer data locations with certain safety-critical functions, and it was neither partitioned nor checked for valid memory address ranges. The result was that, as the flight computer functioned for the first time, it used data locations where this reused software had stored out-of-range data on top of safety-critical parameters. The flight computer then performed according to its design when detecting invalid data and reset itself. This happened sequentially in each of the available flight control channels until there were no functioning flight controls. Since the system had no hardware backup, the aircraft would have stopped flying had it been airborne. The software was quickly corrected and was fully operational in the following flights.

3.2 Key Facts

  1. Proper process and procedures were minimized for apparently valid reasons, i.e., the (offending) reused software was considered proven by its use in other similar systems.
  2. Reuse of the software components did not include review and testing of the integrated components in the new operating environment. In particular, memory addressing was not validated with the new programs that shared the computer resources.

3.3 Lessons Learned

  1. Safety-critical, real-time flight controls must include full integration testing of end-to-end events. In this case, the reused software should have been functioning within the full software system.
  2. Arguments to bypass software safety, especially in software containing functions capable of a Kill/Catastrophic event, must be reviewed at each phase. Several of the arguments to minimize software safety provisions were compromised before the detection of the defect.

4.0 Flight Controls Fail At Supersonic Transition
4.1 Summary
A front-line aircraft was rigorously developed, thoroughly tested by the manufacturer, and again exhaustively tested by the government and finally by the using service. Dozens of aircraft had been accepted and were operational worldwide when the service asked for an upgrade to the weapons systems. One particular weapon test required significant telemetry. The aircraft change was again developed and tested to the same high standards including nuclear weapons carriage clearance. This additional testing data uncovered a detail missed in all of the previous testing.

The telemetry showed that the aircraft computers all failed (ceased to function and then restarted) at a certain airspeed (Mach 1). The aircraft had sufficient momentum and mechanical control of other systems so that it effectively "coasted" through this anomaly, and the pilot did not notice.

The cause of this failure originated in the complex equations from the aerodynamicist. His specialty assumes the knowledge that this particular equation will asymptotically approach infinity at Mach 1. The software engineer does not inherently understand the physical science involved in the transition to supersonic speed at Mach 1. The system engineer who interfaced between these two engineering specialists was not aware of this assumption and, after receiving the aerodynamicist's equation for flight, forwarded the equation to software engineering for coding. The software engineer did not plot the equation and merely encoded it in the flight control program.

4.2 Key Facts

  1. Proper process and procedures were followed to the stated requirements.
  2. The software specification did not include the limitations of the equation describing a physical science event.
  3. The computer hardware accuracy was not considered in the limitations of the equation.
  4. The various levels of testing did not validate the computational results for the Mach 1 portion of the flight envelope.

4.3 Lessons Learned

  1. Specified equations describing physical-world phenomena must be thoroughly defined, with assumptions as to the accuracy, ranges, use, environment, and limitations of the computation.
  2. When dealing with requirements that interface between disciplines, it must be assumed that each discipline knows little or nothing about the other and therefore must include basic assumptions.
  3. Boundary assumptions should be used to generate test cases, because the more subtle failures caused by assumptions (division by zero, boundary crossing, singularities, etc.) are not usually covered by ordinary test cases. A guarded-evaluation sketch follows this list.
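
The actual flight-control equations involved are not reproduced here. As an illustration of lesson 3, the C sketch below uses a Prandtl-Glauert-style correction factor, 1/sqrt(1 - M^2), which grows without bound as the Mach number M approaches 1; encoding such an equation verbatim divides by (nearly) zero at the transonic boundary. The guard band value is an assumption made for this sketch.

    #include <math.h>
    #include <stdbool.h>

    /* Returns false when the Mach number is inside the excluded region around
       the singularity at M = 1, so the caller must handle that flight regime
       explicitly instead of computing with a meaningless, unbounded value.   */
    bool correction_factor(double mach, double *out)
    {
        const double denom_sq = 1.0 - mach * mach;
        const double eps = 1e-6;             /* assumed guard band near M = 1 */

        if (fabs(denom_sq) < eps) {
            return false;                    /* singular region: no valid result */
        }
        *out = 1.0 / sqrt(fabs(denom_sq));
        return true;
    }

Test cases taken from the boundary assumption (M slightly below, at, and above 1) are exactly the kind of cases lesson 3 calls for.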

5.0 Incorrect Missile Firing From Invalid Setup Sequence
5.1 Summary

A battle command center with a network controlling several missile batteries was operating in a field game exercise. As the game advanced, an order to reposition was issued to an active missile battery. This missile battery disconnected from the network, broke down its equipment, and repositioned to a new location in the grid.

The repositioned missile battery arrived at the new location and commenced setting up. A final step was connecting the battery into the network; the setup steps were allowed to occur in any order. The battery personnel were still occupying the erector/launcher when the connection that attached the battery into the network was made elsewhere on the site. This cable connection immediately allowed communication between the battery and the battle command center.

The battle command center, meanwhile, had prosecuted an incoming "hostile" and designated the battery to "fire," but targeted using the old location of the battery. As the battery was off-line, the message was buffered. Once the battery crew connected the cabling, the battle command center computer sent the last valid commands from the buffer, and the command was immediately executed. Personnel on the erector/launcher were thrown clear as it activated on the old slew-and-acquire command. Personnel injury was slight, as no one was pinned or impaled when the erector/launcher slewed.

5.2 Key Facts

  1. Proper process and procedures were followed as specified.
  2. Subsystems were developed separately with interface control documents. Messages containing safety-critical commands were not "aged" and reassessed once buffered.
  3. Battery activation was not inhibited until personnel had completed the setup procedure.

5.3 Lessons Learned

  1. System engineering must define the sequencing of the various states (dismantling, reactivating, shutdown, etc.) of all subsystems with human confirmations and reinitialization of state variables (e.g., site location) at critical points.
  2. System integration testing should include the buffering of messages (particularly safety-critical messages) and demonstration of disconnect and restart of individual subsystems to verify that the system always transitions between states safely.
  3. Operating procedures must clearly describe (and require) a safe and comprehensive sequence in dismantling and reactivating the battery subsystems with particular attention to the interaction with the network.

6.0 Operator's Choice Of Weapon Release Overridden By Software
6.1 Summary

During field practice exercises, a missile weapon system was carrying both practice and live missiles to a remote site and was using the transit time for sluing practice. Practice and live missiles were located on opposite sides of the vehicle. The acquisition and tracking radar was located between the two sides causing a known obstruction to the missiles' field of view.

While correctly following command-approved procedures, the operator acquired the willing target, tracked it through various maneuvers, and pressed the weapons release button to simulate firing the practice missile. Without the knowledge of the operator, the software was programmed to override his missile selection in order to present the best target to the best weapon. The software noted that the current maneuver placed the radar obstruction in front of the practice missile seeker, while the live missile had acquired a positive lock on the target and was unobstructed. The software therefore "optimized" the engagement, deselected the practice missile, and selected the live missile. When the release command was sent, it went to the live missile, and "missile away" was observed from the active missile side of the vehicle, where no launch was expected. The "friendly" target aircraft had been observing the maneuvers of the incident vehicle and noted the unexpected live launch. Fortunately, the target pilot was experienced and began evasive maneuvers. The missile still tracked and detonated in close proximity to the target aircraft.

6.2 Key Facts

  1. Proper procedures were followed as specified and all operations were authorized.
  2. All operators were thoroughly trained in the latest versions of software.
  3. The software had been given authority to select "best" weapon but this characteristic was not communicated to the operator as part of the training.
  4. The indication that another weapon had been substituted (live vs. practice) by the software was displayed in a manner not easily noticed among other dynamic displays.

6.3 Lessons Learned

  1. The versatility (and resulting complexity) demanded by the requirement was provided exactly as specified. This complexity, combined with the possibility that the vehicle would carry a mix of practice and live missiles, was not considered. This mix of missiles is common practice, and system testing must include known scenarios such as this example to find operationally based hazards.
  2. Training must describe the safety-related software functions such as the possibility of software overrides to operator commands. This must also be included in operating procedures available to all users of the system.

Design Criteria
Design For Minimum Risk
Eliminate identified hazards or reduce the associated risk through design. Greater complexity in the system increases the possibility that design faults will occur and persist into the final product. The complexity of safety-related software should be measured and kept under management control. Interface design should emphasize user safety rather than user friendliness; user-friendly functions in software can increase the level of risk. When designing operator interfaces for safety critical operations, human factors should be considered to minimize the risk created by human error. The system must be designed to be testable during development. Priorities and responses must be analyzed such that the more critical the risk, the higher the response priority in the software.

Tolerate The Hazard
The design needs to be fault tolerant; that is, in the presence of a hardware/software fault, the software still provides continuous correct execution. Consider hazardous conditions imposed on the software logic by equipment wear and tear or by unexpected failures. Consider alternate approaches to minimize risk from hazards that cannot be eliminated. Such approaches include interlocks, redundancy, fail-safe design, system protection, and procedures.
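
One common form of the redundancy mentioned above is 2-out-of-3 voting on redundant input channels; the short C sketch below is illustrative only. Because the majority value is returned, a single faulted channel cannot by itself drive a hazardous output.

    /* 2-out-of-3 majority voter: any single faulted channel is out-voted by
       the two agreeing channels.  If no two channels agree, the caller must
       treat the result as a detected fault and fail safe.                  */
    int vote_2oo3(int a, int b, int c, int *fault)
    {
        *fault = 0;
        if (a == b || a == c) return a;
        if (b == c)           return b;
        *fault = 1;           /* total disagreement: no trustworthy value   */
        return a;
    }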

Fail-Safe Degradation
The software should be designed to limit the effects of failures on normal operation. This kind of containment prevents the development of hazardous conditions. Hardware and software faults should be detected by the software, and effective fail-safe exits should then be designed into the system. When applicable, provisions should be made for periodic functional checks of safety devices. This results in the need for start-up Built-In-Test (BIT), continuous BIT, redundancy checks, and other design approaches intended to help ensure the correct functioning of critical components and their handling in a degraded mode should a failure occur. The degraded operation mode must be well thought out, since it encompasses many interacting system components.
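
A minimal sketch of a start-up Built-In-Test in C, assuming hypothetical check routines for memory, the watchdog, and a safety interlock: the system proceeds to normal operation only if every critical check passes, and otherwise remains in (or degrades to) the safe state. All routine names here are placeholders for this sketch.

    #include <stdbool.h>

    /* Assumed low-level check routines (stubs standing in for real tests). */
    static bool ram_pattern_test(void)        { return true; }
    static bool rom_checksum_test(void)       { return true; }
    static bool watchdog_test(void)           { return true; }
    static bool interlock_readback_test(void) { return true; }

    static void enter_safe_state(void)        { /* hold all outputs safe */ }
    static void enter_normal_operation(void)  { /* begin normal mode     */ }

    /* Start-up BIT: every safety-critical check must pass before normal
       operation is allowed; any failure leaves the system in the safe state
       and should be reported for maintenance action.                        */
    void power_on_self_test(void)
    {
        bool ok = ram_pattern_test()
               && rom_checksum_test()
               && watchdog_test()
               && interlock_readback_test();

        if (ok) {
            enter_normal_operation();
        } else {
            enter_safe_state();
        }
    }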

Provide Warning Devices
When neither software design nor safety devices can effectively eliminate identified hazards or adequately reduce associated risk, devices should be used to detect the condition and to produce an adequate warning signal to alert personnel of the hazard.

Develop Procedures And User Training
Not all hazards can be controlled within the software subsystem. Providing adequate hazard controls is a system level issue that includes the control of physical hazards from all areas, and minimizing risk resulting from environmental conditions. Where it is impractical to eliminate hazards through design, procedures and training must be used. All software-related functions that affect service should be documented, and users should be fully trained.

Specific Development Activities
Software life cycle activities related to System Safety Engineering include the following:

  1. Preliminary Hazard Analysis (PHA)

  2. Software Requirement Hazard Analysis (SRHA)

  3. Design Specification Hazard Analysis (DSHA)

  4. Subsystem Hazard Analysis (SSHA)

  5. System Hazard Analysis (SHA)

References & Standards
CECOM TR 92-2, Software System Safety Guide, May 92
CECOM TR 92-02 was written to augment other existing documentation/regulations and develop guidelines on how to implement a Software System Safety Program within the Department of the Army (DA).

CECOM TR 94-10, Identification, Integration and Tracking of Software System Safety Requirements, Aug 94
The information in this report includes the CECOM Hazard and Accident Tracking System (HATS) requirements for Software Safety.

CECOM Regulation 385-21, Software System Safety
This regulation establishes policies and identifies responsibilities for implementing a Software System Safety program for all Communications-Electronics (C-E) systems managed or supported by CECOM.

MIL-STD-2167A - Military Standard, Defense System Software Development, 29 Feb 88
Although the standard has been replaced by MIL-STD-498, it remains in effect on numerous contracts. It establishes the basis for government insight into contractor’s software development, testing, and evaluation efforts. Specific requirements are contained in this standard which establishes the system safety interface.

MIL-STD-498 - Military Standard, Software Development and Documentation, 5 Dec 94
This standard establishes an interface with system safety engineering and defines the safety activities required for incorporation into software development throughout the acquisition lifecycle.

DOD 5000.1 - Department of Defense, Defense Acquisition, 23 Feb 91
This document established the requirement and need for a risk management program, to include safety, for acquiring quality products.

DOD 5000.2 - Department of Defense, Defense Acquisition Management Policies and Procedures, 26 Feb 93
This document establishes the interface between system safety engineering and software development.

MIL-STD-882B - Military Standard, System Safety Program Requirements, 30 Mar 84
This standard provides guidance and specific tasks for the development team to address the software, hardware, system, and human interfaces. These include the 300 series tasks.

MIL-STD-882C - Military Standard, System Safety Program Requirements, 19 Jan 93
This standard establishes the requirements for detailed system safety engineering and management activities on all system procurements within DoD, including the integration of software system safety within the context of the system safety program. Although MIL-STD-882B remains on older contracts, MIL-STD-882C is the current system safety standard.

IEEE 1228 - Institute of Electrical and Electronic Engineers, titled, Standard for Software Safety Plans, 1994
This document describes the minimum acceptable requirements for the content of a software safety plan. This document closely follows MIL-STD-882B, Change Notice 1.

UL 1998 - Underwriters Laboratory, Standard for Safety Related Software, 4 Jan 94
The requirements contained in this standard apply to software whose failure could result in a risk of injury to persons or loss of property.

EIA 6B - Electronic Industry Association, G-48 System Safety Engineering Bulletin No 6B titled, System Safety Engineering in Software Development, 1990
The purpose of this document is "to provide guidelines on how a system safety analysis and evaluation program should be conducted for systems which include computer-controlled or monitored functions".

IEC-1508 (DRAFT) - International Electrotechnical Commission, Safety lifecycle and Safety Integrity Levels
This international standard is primarily concerned with safety-related control systems incorporating electrical/electronic/programmable electronic devices.

IEC/65A/WG9, Draft Standard IEC 1508 Software for Computers in the Application of Industry Safety-related Systems, Draft Ver 1, Sept 26, 1991
This standard is primarily concerned with Safety related Control Systems incorporating electronic/electrical/programmable devices and their generic approach to Safety Lifecycle Activities.

RTCA/DO-178B - FAA, Software Considerations in Airborne Systems and Equipment Certification
The purpose of this document is to provide guidelines for the production of software for airborne systems and equipment that performs its intended function with a level of confidence in safety that complies with airworthiness requirements.

NSS 1740.13 - NASA, Interim, Software Safety Standard, June 94
This document describes the activities necessary to ensure that safety is designed into software that is acquired or developed by NASA and that safety is maintained throughout the software lifecycle.

Neumann, P G., Risks to the Public in Computers and Related Systems
A regular column in the quarterly publication of The Special Interest Group for Software Engineering of the Association for Computing Machinery (ACM).

Leveson, N. G., and Turner, C. S., An Investigation of the Therac-25 Accidents, IEEE Computer, July 1993
Description and evaluation of a software related accident involving the Therac-25.

Littlewood, B., and Strigini, L., The Risks of Software, Scientific American, Nov. 1992
Delfino, A. B., and Chen, B., Evotech Technical Report: Future Directions For The Practicing Engineer and Software Manager, ETR-92-11, Evotech, Inc., Burlingame, CA, 1992
Yang, L., and Chen, B., Procedures for Management Software Processes, ETR-93-07, Evotech, Inc., Burlingame, CA, 1993
Yang, L., and Chen, B., Software Quality Measurement, ETR-93-06, Evotech, Inc., Burlingame, CA, 1993
Hoes, C. R., Memo: Safety Assessment Process for "Logic Control Systems", August 24, 1993
Leveson, N. G., Cha, S. S., and Shimeall, T. J., Safety Verification of Ada Programs Using Software Fault Trees, IEEE Software, July 1991
Bass, L. and Hoes, C., System Safety Analysis of Software Controlled Robotic Devices, IM, 1987
Delfino, A. B., and Chen, B., Software Quality Assurance and Real-Time Systems Development Using The Hierarchical Software State Machine Method, ETR-92-09, Evotech, Inc., Burlingame, CA, 1992
MacKinley, A., Software Safety, Merging/ Emerging Standards, American Society of Safety Engineers (ASSE) Conference, July 1993
Pliakos, M., Software Safety and System Safety, Hazard Prevention, Third Quarter 1992
Chen, B and Yang L, Design, Testing and Verification of Safety Critical Software, Hazard Prevention, Fourth Quarter 1995
