Information Trust Institute: University of Illinois at Urbana-Champaign

Short Course: Design of High-Availability Systems and Networks

COURSE DESCRIPTION:

This course takes a system-level (hardware and software) view of design issues in high-availability computing. The material covers a broad spectrum of hardware and software error detection and recovery techniques and discusses how these techniques interact: which can be provided in COTS hardware, which can be embedded in the operating system and communication layers, and which can be provided in the software and the application itself. After introducing basic concepts and terms, including reliability, availability, and hardware and software fault models, the course continues with hardware redundancy, coding techniques, signature-based error checking (e.g., software-based control-flow checking), checkpointing and recovery (in single-process and distributed environments), software fault tolerance techniques (e.g., process pairs, robust data structures, recovery blocks, and N-version programming), and finally network-specific issues (e.g., ways to provide consistent data and reliable communications). Examples include the TMR (Triple Modular Redundancy) system, a Dynamic Host Configuration Protocol application, the design of a failure-resilient node controller, a distributed database, checkpoint and recovery in an operating system, and database audit in a mobile switching environment. The course also introduces a software-based high-availability middleware to demonstrate how the various detection and recovery techniques can be incorporated into applications. (2 days, 16 hours of class time)
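As a flavor of the TMR example mentioned above, the following is a minimal sketch of majority voting over three replicated modules. It is an illustration only, not course material: the names `majority_vote`, `tmr`, `good`, and `faulty` are hypothetical, and the modules are modeled as plain Python functions rather than hardware units.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value agreed on by a strict majority of replicas.

    Raises ValueError when no value has a strict majority, which
    corresponds to a detected but uncorrectable disagreement.
    """
    if not outputs:
        raise ValueError("no replica outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among replica outputs")
    return value


def tmr(replicas: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every replica on the same input and vote on the results."""
    return majority_vote([replica(x) for replica in replicas])


def good(x: int) -> int:
    # Hypothetical fault-free module.
    return x * x


def faulty(x: int) -> int:
    # Hypothetical module with a fault that corrupts its output.
    return x * x + 1


# One faulty module out of three is masked by the voter.
result = tmr([good, faulty, good], 7)
```

With two fault-free modules the voter masks the single faulty output; if all three outputs differ, the sketch raises an error instead of guessing, since voting alone cannot identify the correct value in that case.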

PREREQUISITE: Basic knowledge of design of computer systems.

HIGH-LEVEL COURSE OUTLINE AND TIMETABLE:

DAY 1:

  1. Introduction
    1. Course overview
    2. Discussion of expectations of class members
    3. Motivation for high-availability computing
    4. System view of high-availability design
    5. Fault models, classification, and failure data
    6. Tandem example
    7. High-availability network system example
  2. Hardware redundancy
    1. Basic approaches to hardware redundancy
    2. Static and dynamic redundancy
    3. Voting
    4. Hardware voter example
  3. Open discussion
  4. Information redundancy
    1. Error-detecting and error-correcting codes
      1. Basic definitions
      2. Parity prediction, Hamming codes
      3. Codes for storage and communication, codes for arithmetic operations
    2. Non-coding techniques
    3. Application 1: DHCP server
    4. Application 2: Fault-Resilient Node Controller
    5. Checkpointing and recovery techniques
      1. Recovery basics
        • Forward error recovery
        • Backward error recovery
        • Libft example
  5. Checkpoint and recovery in networked systems
    1. Global consistent state
    2. Recovery line
  6. Open discussion

DAY 2:

  1. Open discussion
  2. Checkpoint and recovery in networked systems (cont.)
    1. Synchronous checkpointing and recovery
    2. Asynchronous checkpointing and recovery
    3. Checkpointing on distributed databases example
    4. Micro-checkpointing
    5. IRIX operating system checkpoint and restart
  3. Software fault tolerance
    1. Process pairs
    2. Robust data structures
    3. IDEN MicroLite example
    4. N-version programming
    5. Recovery blocks
    6. Software fault tolerance in IBM MVS operating system
  4. Network-specific issues
    1. Specific issues in design and implementation of networked/distributed systems
    2. Broadcast protocols
      1. Reliable broadcast
      2. FIFO broadcast
      3. Causal broadcast
      4. Atomic broadcast
    3. Agreement protocols
      1. Byzantine agreement
      2. Consensus
      3. Interactive consistency
      4. Application of agreement algorithms
    4. Commit protocols
      1. Two-phase commit protocol
      2. Three-phase commit protocol
    5. DHCP example revisited
  5. Practice of high-availability system design
    1. IBM mainframe, hardware-supported fault tolerance
  6. Open discussion
  7. Conclusions
    1. Review of techniques
    2. Integrated system, hardware, and software fault tolerance

Instructor Biographies

Ravishankar K. Iyer is the George and Ann Fisher Distinguished Professor of Electrical and Computer Engineering and holds appointments in the Department of Computer Science and the Coordinated Science Laboratory at the University of Illinois at Urbana-Champaign, where he is Director of the Coordinated Science Laboratory. Professor Iyer's research interests are in the areas of reliable computing, measurement and evaluation, and automated design. He was general chair of the 19th Annual IEEE International Symposium on Fault-Tolerant Computing (FTCS-19) and program co-chair for FTCS-25, the Silver Jubilee Symposium. Professor Iyer is an IEEE Computer Society Distinguished Visitor, an Associate Fellow of the American Institute of Aeronautics and Astronautics (AIAA), a fellow of the IEEE Computer Society, and a member of the ACM, Sigma Xi, and the IFIP technical committee (WG 10.4) on fault-tolerant computing. In 1991, he received the Senior Humboldt Foundation award for excellence in research and teaching. In 1993, he received the AIAA Information Systems award and medal.

Zbigniew T. Kalbarczyk is currently a Principal Research Scientist at the Center for Reliable and High-Performance Computing in the Coordinated Science Laboratory of the University of Illinois at Urbana-Champaign. He holds a Ph.D. in Computer Science from the Technical University of Sofia, Bulgaria. After receiving his doctorate, he worked as an Assistant Professor in the Laboratory for Dependable Computing at Chalmers University of Technology in Gothenburg, Sweden. Dr. Kalbarczyk's research interests are in the area of reliable and secure networked systems. His research also involves the development of automated techniques for validation and benchmarking of dependable computing systems. He served as a program co-chair of the Performance and Dependability Symposium (PDS) track of the Conference on Dependable Systems and Networks (DSN 2002) and is regularly invited to serve on the program committees of major conferences on the design of fault-tolerant systems. He is a member of the IEEE and the IEEE Computer Society.