Fault-Tolerant Systems (CS449/549)
Welcome to Fault-Tolerant Systems CS449/549. This course is offered in the
Fall Semester 2003 at the University of Idaho in Moscow and is also available
though Engineering Outreach for
off-campus students. The course is taught by Dr. Axel Krings.
This web-page contains information about the course, e.g. syllabus, class
notes, pointers to interesting places etc. Material can be down-loaded in pdf
(or postscript) format, and will be made available in the updated form as the
class goes on. To get an idea of what this class is about, take a look at last semesters
page. However, materials and topics constantly change, and this class will
be no exception. If you have comments, please let me know.
Engineering Outreach students, there are several things you should know.
First of all, if you are trying to contact me, you can call 800-824-2889 ext.
4078 (toll free). Please download the class material from the web page. This
speeds up the distribution process and avoids shipping delays. If you do not
have a pdf viewer, you can get it free at adobe, if you need a postscript viewer, check
out the aladin viewer. If for some reason you are not able to download the
material, please contact Engineering Outreach. There are several assignments
that require access to local simulation tools. Engineering Outreach students
need to have web access with telnet capability in order to use this software.
Accounts on local workstations will be made available.
Course description: this course addresses design, modeling, analysis, and
integration of hardware and software to achieve dependable computing systems
employing on-line fault-tolerance. It covers the concepts and terminologies of
Fault-Tolerant System Design including: Reliability, Dependability,
Maintainability, Redundancy, Error Detection, Damage Confinement, Error
Recovery, Fault Treatment, Redundancy Management, Voting, Information
Redundancy, Random Variables, cdf, pdf, Expectation, Bathtub Curve, MTTF,
Reliability of Series/Parallel Systems, Stand-by Redundancy, M-of-N System,
Reliability Block Diagrams, Fault Trees, Markov Process, Petri Nets, General
Stochastic Petri Nets, Recovery Strategies, Roll-back Recovery, Agreement and
Consensus, Byzantine Clock Synchronization, RAID, Fail-Stop Processes, Systems
Diagnosis, Case studies. I always change the material slightly to account for
interesting changes in the field.
Note: This class has a prerequisite of Computer Organization and
Architecture (CS245) or permission of the instructor. In a 400/500 level
computer science class I expect working knowledge of unix and MS operating
systems.
- Contact information:
- Axel Krings (PhD), JEB 320,
- Phone: 208-885-4078, fax: 208-885-9052.
- Engineering outreach students: dial toll free 800-824-2889 ext 4078
- Mailing address: Engineering Outreach, PO Box 441014, Moscow, Idaho
83844-1014.
- Email: krings@cs.uidaho.edu (see comments in syllabus on email
procedures)
- Office Hours: (see here)
- Live-taped: MWF 9:30-10:20 room JEB 025.
- News Group
- CS449/549 has a news group as a forum for questions/answers related to
the material covered. The news server is: news.uidaho.edu, the news group
is: uidaho.class.cs.449-ak. Note, this is a standard news group accessible
using any news reader, e.g. netscape or tin, it is not a "chat room". If you
have problems accessing the group, make sure that the news server
news.uidaho.edu is in the list of servers. For example, in Netscape 4.7, you
can do this by going to
"Edit->Preferences->Mail&Newsgroups->Newsgroup Server" and add
the server.
- Spring 2002 Term Class Handouts:
- The handout numbers refer to the lecture in which the handout was made
available. This does not necessarily mean that this material was covered in
this particular lecture. (Most likely there is some overlap).
- If there are any problems with accessing the handouts, please let me
know (email, phone, smoke signs, drums, ...)!
- Corrections: some slides may contain formatting errors, typos etc. which
have been addressed in class, but have not been reflected in the notes
posted here.
- WARNING LOCAL STUDENTS: Do not send pdf files (i.e. files in pdf format)
to the printer! Pdf files are binary files and printing them "directly" will
result in a big printer mess!!! There are 2 ways to look at or print the pdf
notes:
- Save the file and use acroread (usr/local/bin/acroread) to open it.
Then from within acroread use the print option.
- Better: update netscape to use pdf files. To do this go to "edit -
preferences", then expand "Navigator" and click "Application". Next click
"New" and fill in the following: Description: acroread, MIMEType:
application/pdf, Suffixes: pdf. Then click on "application" and enter:
/usr/local/bin/acroread %s
- Now "OK" out of it and it should work.
- Syllabus.
- Lecture Notes
- lecture 1 (08/25/03): (pdf)
Reading Assignment 1). Introduction, discussion: Shipping product on
schedule, Reducing Unavailability, Human Fault-tolerance
- lecture 2 (08/27/03): (pdf)
Reading Assignment 2). Definitions, Dependability...Maintainability,
Fault-Error-Failure,
- lecture 3 (08/29/03): (pdf)
Reading Assignment 3). Redundancy, Error Detection, Damage Confinement,
Error Recovery, Fault Treatment, Passive HW Redundancy, Voting.
- lecture 4 (09/03/03): (pdf)
Active/Hybrid HW Redundancy, Information Redundancy, Parity, Checksum,
CRC.
- lecture 5 (09/05/03): (pdf)
Random Variables, cdf, pdf, Expectation, Reliability, Bathtub Curve, MTTF,
Reliability of Series System, Reliability of Parallel System
- lecture 6 (09/08/03): (pdf)
Stand-by Redundancy, M-of-N System, Reliability Block Diagram, Example
Bus-Guardian
- lecture 7 (09/10/03): (pdf)
Example Bus-Guardian, Fault Trees
- lecture 8 (09/12/03): (pdf).
SHARPE material: help save paper and try avoid printing these manuals. You
can use them on-line when you need them. (SHARPE Quick starter, pdf, ps )
(SHARPE Language Description (63 pages), pdf ),
(SHARPE Intro Manual (74 pages), pdf
)
- lecture 9 (09/15/03): no new handouts
- lecture 10 (09/17/03): (pdf)
Markov Process
- lecture 11 (09/19/03): (pdf)
Steady State and Transient Solution
- lecture 12 (09/22/03): (pdf)
Markov Model of Typical Systems
- lecture 13 (09/24/03): (pdf)
Petri Nets
- lecture 14 (09/26/03): (pdf)
General Stochastic Petri Nets (GSPN)
- lecture 15 (09/29/03): (pdf)
Distributed Systems, Ordering-Synchronizing
- lecture 16 (10/01/03): (pdf)
Recovery Strategies, Reading assignment 5)
- lecture 17 (10/03/03): (pdf)
Roll-back Recovery
- lecture 18 (10/06/03): catching up
- lecture 19 (10/08/03): Exam I
- lecture 20 (10/10/03): (pdf)
Introduction to Fault-Tolerant Agreement, Reading assignment 6)
- lecture 21 (10/13/03): (pdf)
cont. Fault-Tolerant Agreement, OM(m) and SM(m) Algorithms
- lecture 22 (10/15/03): cont. Fault-Tolerant Agreement, (no handouts),
Reading assignment 7.
- lecture 23 (10/17/03): 500-level project
- lecture 24 (10/20/03): catching up
- lecture 25 (10/22/03): (pdf)
agreement examples, SM(m) Algorithm, Reading assignment 8,
- lecture 26 (10/24/03): (pdf)
HW solution -- Davis Wakerly Approach, Reading assignment 9,
- lecture 27 (10/27/03): (pdf)
Fault Models, Thamb.& Park, Clock Synchronization
- lecture 28 (10/29/03): (pdf)
Clock Synchronization
- lecture 29 (10/31/03): (pdf)
Reading assignment 10, Understanding Protocols for Byzantine Clock
Synchronization, by Fred Schneider
- lecture 30 (11/03/03): Reading Assignment 11, caching up
- lecture 31 (11/05/03): (pdf)
Reading Remote Clocks (Cri89a)
- lecture 32 (11/07/03): Reading Assignment 12, caching up
- lecture 33 (11/10/03): (pdf)
RAID (Reading Assignments 12, 13)
- lecture 34 (11/12/03): (pdf)
Fail-Stop Processes, Reading Assignment 14)
- lecture 35 (11/14/03): (pdf)
Systems Diagnosis, Reading Assignment 15)
- lecture 36 (11/17/03): Reading Assignment 16), catching up
- lecture 37 (11/19/03): discussion of HW3, (pdf)
Fault-Tolerant Architectures, Tandem, Stratus, SIFT
- lecture 38 (11/21/03): Exam discussion, RAID (pdf),
Space Shuttle, Reading Assignment 17
- Break (11/24 - 11/28/03): Thanksgiving Break
- lecture 39 (12/01/03): Exam II (in-class, open book/notes) (pdf)
Boeing 777, Reading Assignment 18
- lecture 40 (12/03/03): (pdf)
Boeing 777 ADIRU, Reading Assignment 19
- lecture 41 (12/05/03): (pdf)
SIFT
- lecture 42 (12/08/03): (pdf)
Tandem, NonStop System Cyclone, Himalaya
- lecture 43 (12/10/03): (pdf)
MAFT
- lecture 44 (12/12/03): catching up, exam discussion
- Final Exam: Tuesday, Dec. 16th, 10:00 in JEB25
- Reading Assignments (so far):
- You need to locate the paper, unless I specify "(copy)", in which case I
will supply a hardcopy to the UI Copy Center in the Commons. EO students:
the paper will be send to you.
- 1) (copy) R. Chillarege, "Top Five Challenges Facing the Practice of
Fault-tolerance"
- 2) (copy) V. Nelson, "Fault-Tolerant Computing: Fundamental Concepts"
- 3) (copy) D. Harvey, "Is it Safe to Fly-by-Wire?"
- 4) SHARPE documentation, see lecture 8
- 5) Avi Mendelson and Neeray Suri, "Cache Based Fault Recovery for
Distributed Systems", 1997.
- 6) L. Lamport, R. Shostak, and M Pease, "The Byzantine Generals
Problem", 1982.
- 7) There are many general papers on agreement, e.g. thesis "Classes Of
Byzantine Fault-Tolerant Algorithms For Dependable Distributed Systems", by
Andre Postma
- 8) (copy) Davies, Daniel, and J.F. Wakerly, "Synchronization and
Matching in Redundant Systems"
- 9) (copy) Thambidurai, P., and You-Keun Park, "Interactive Consistency
with Multiple Failure Modes". There is an interesting followup paper
"Verification of Hybrid Byzantine Agreement Under Link Faults" by P. Lincoln
and J. Rushby that addresses a problem in the algorithm of Thambidurai and
Park
- 10) Fred Schneider, "Understanding Protocols for Byzantine Clock
Synchronization" ( pdf )
- 11) (copy)Flaviu Cristian, "Probabilistic clock synchronization". Try to
search for the paper online, e.g. google, and find the "ResearchIndex". This
is a good way to find related papers, e.g.
http://citeseer.nj.nec.com/cristian94probabilistic.html
- 12) A Case for Redundant Arrays of Inexpensive Disks (RAID), by D.A.
Patterson,
- 13) RAID: High-Performance, Reliabile Secondary Storage, Peter M. Chen,
Edward Lee, Garth Gibson, Randy Katz, and David Patterson, ACM Computing
Surveys, 1994.
- 14) Byzantine Generals in Action: Implementing Fail-Stop Processors,
Fred B. Schneider, ACM Transactions on Computer Systems, Vol. 2, No..2, pp.
145-154, May 1984.
- 15) (copy) On the Connection Assignment Problem of Diagnosable Systems,
by F. Preparata, G. Metze and R. Chien.
- 16) Implementation of On-Line Distributed System-Level Diagnosis Theory,
by Ronald Bianchini and Richard Buskens, Trans. Computers, Vol. 41, No. 5,
May 1992.
- 17) (copy) Redundancy Management Technique for Space Shuttle Computers,
by Sklaroff, J., R., IBM Journal on Research and Development, Vol. 20, No.
1, pp. 20-28, January 1976.
- 18) Triple-Triple Redundant 777 Primary Flight Computer, Y.C. Yeh, 1996
IEEE Aerospace Applications Conference, pg 293-307, 1996.
- 19) A Fault-Tolerant Air Data/Inertial Reference Unit, Michael L.
Sheffels, IEEE AES Systems Magazine, March 1993.
- Fall 2003 Homeworks/Exams:
- Expectations: Homeworks are expected to look professional! They do not
have to be typed in order to look good, but be aware that I will not accept
scribbles etc. Use a new page for each problem and staple the final
submission.
- HW1 ( pdf
) due 09/24/2003 (video 10/01)
- HW2 ( pdf
) due 09/29/2003 (video 10/13) You have only a few days to do this!!!
- HW3 ( pdf
) due 11/14/2003 (video 11/20)
- HW4 ( pdf )
due at the time of the final exam. (See comments in the news group).
- 500-level Project A (pdf
) due: tbd in class.
- 500-level Project B (pdf
) due: 12/16/2003, 10:00 (time of the final exam). Video students, ask you
proctor to include the report, or send it separate.
- Old Exams:
- Interesting
Links
- A special thanks to Dr. Roger Kieckhafer (MTU) for the contributions to
the material used in this class.
Back