Short Course: Validating High-Availability Systems
COURSE DESCRIPTION:
This course provides a comprehensive introduction to methods for validating high-availability systems and networks, applicable from the early design stage through the testing of a prototype. After introducing basic concepts related to reliability, availability, and performability, the course presents combinatorial modeling methods, which are most useful in the early design stage. Markov process theory and the numerical issues that arise in solving Markov models are then presented as a way to solve system models during the early and middle design phases, and discrete-event simulation is presented as a way to solve detailed models built in the final design phase. In both cases, stochastic activity networks are introduced as a high-level, easy-to-use method of describing system models. Finally, the use of fault injection to validate a prototype implementation is discussed.
PREREQUISITE: An undergraduate course in probability and statistics.
HIGH-LEVEL COURSE OUTLINE AND TIMETABLE:
- Introduction to the Validation of High-Availability Systems
  - Review of Design Techniques for High-Availability and Dependable Systems
  - Validation of High-Availability Systems is Critical
  - Fault Types: Stand-Alone and Networked Systems
  - Definitions and Measures of Success: Dependability and Performability
  - Approaches to Validation
  - Synergistic Relationship Between Validation Methods
  - Validation Methods not Covered in this Course
  - When Validation Should Take Place
  - Course Overview
- First-Cut, Rapid Validation by Combinatorial Methods
  - Reliability Assessment: The Need for Probability
  - Combinatoric Methods: Independent Failure Assumption
  - Reliability Formalisms: Fault Trees, Reliability Block Diagrams, Reliability Graphs
  - Reliability Block Diagram Examples
- Validation Using Classical State-Based Methods
  - Availability: One Need for State-Based Methods
  - Random Processes
  - Discrete Time Markov Chains
  - Continuous Time Markov Chains
  - Markov Chain Solution Techniques: Transient and Steady State
  - CTMC Model Examples
- Specifying High-Availability System Designs Using Stochastic Activity Networks
  - The Need for High-Level Specification Methods
  - Stochastic Petri Nets: Basic Definitions and Examples
  - Generating Markov Models from Stochastic Petri Nets: Example
  - Stochastic Activity Networks (SANs)
  - Execution of SANs
  - Specification of Reliability, Availability, and Performability Variables
  - Simple Example: Multiprocessor Failure/Repair Model
  - Building Larger Models from SAN Components: Composed Models
- Simulation-Based Validation Techniques
  - Simulation as Model Experimentation - Basic Algorithms and Assumptions
  - Random Number and Random Variable Generators
  - Types of Simulation: Transient and Steady-State
  - Confidence Intervals about Estimators - Statistical Issues and Pitfalls
  - Parallel Simulation: Study-Level, Experiment-Level, and Trajectory-Level Parallelism
  - Variance Reduction Techniques: Importance Sampling
- Fault Injection Methods and Mechanisms
  - Fault Injection in the Development and Validation Process
  - Fault Injection on Simulated Systems
  - Hardware-Implemented Fault Injection
  - Software-Implemented Fault Injection - Stand-Alone Systems
  - Software-Implemented Fault Injection - Networks / Distributed Systems
  - Comparison of Three Hardware Fault Injection Techniques on the MARS Architecture
- Removal, Benchmarking, and Dependability Assessment Using Fault Injection
  - Fault Removal
  - Dependability Benchmarks
  - Dependability Assessment
- Summary and Concluding Remarks
  - Review of Previous Lectures
  - The "Art" of Dependability Validation
  - Validating Validation Models and Measurements
  - Next Steps
INSTRUCTOR BIOGRAPHY:
William H. Sanders is a Donald Biggar Willett Professor of Engineering and the Director of the Information Trust Institute at the University of Illinois. He is a professor in the Department of Electrical and Computer Engineering and the Coordinated Science Laboratory. He is a Fellow of the IEEE and the ACM. He serves as the Vice-Chair of IFIP Working Group 10.4 on Dependable Computing. In addition, he serves on the editorial board of Performance Evaluation and is the Area Editor for Simulation and Modeling of Computer Systems for the ACM Transactions on Modeling and Computer Simulation. He is a past Chair of the IEEE Technical Committee on Fault-Tolerant Computing. Dr. Sanders's research interests include performance/dependability evaluation, dependable computing, and reliable distributed systems. He has published more than 160 technical papers in these areas. He is a co-developer of three tools for assessing the performability of systems represented as stochastic activity networks: METASAN, UltraSAN, and Möbius. Möbius and UltraSAN have been distributed widely to industry and academia; more than 300 licenses for the tools have been issued to universities, companies, and NASA for evaluating the performance, dependability, security, and performability of a variety of systems. He is also a co-developer of the Loki distributed system fault injector and the AQuA/ITUA middleware for providing dependability and security to distributed and networked applications.