13.4 Redundancy and Fault-Tolerant Architectures

7 min read • July 30, 2024

Redundancy and fault-tolerant architectures are crucial for keeping computer systems running when components fail. They use extra hardware, information, time, and software to detect and correct errors before they cause system failures.

These techniques are like backup plans for computers. They help systems detect issues, isolate them, and recover quickly. By using smart design principles, engineers can make systems that keep working even when parts fail, ensuring reliability in critical applications.

Redundancy Types in Fault Tolerance

Hardware Redundancy Techniques

Hardware redundancy involves replicating critical components or subsystems to ensure continued operation in case of failures
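Common techniques include dual modular redundancy (DMR), triple modular redundancy (TMR), and N-modular redundancy (NMR)
- DMR uses two identical components and compares their outputs to detect faults; a mismatch reveals a fault but not which component is wrong
- TMR employs three identical components and uses majority voting to determine the correct output, masking a fault in any single component
- NMR extends the concept to N components (usually an odd number), so the voter can outvote any minority of faulty components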
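
The comparison and voting steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a hardware design: the function names `dmr_compare` and `majority_vote` are hypothetical, and replica outputs are assumed to be simple hashable values.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def dmr_compare(a: Hashable, b: Hashable) -> bool:
    """DMR check: return True when the two replica outputs agree.

    A mismatch shows that a fault occurred, but not which replica is wrong.
    """
    return a == b


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """NMR voter: return the value produced by a strict majority of replicas.

    Returns None when no value has more than half of the votes.
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


if __name__ == "__main__":
    print(dmr_compare(42, 41))          # replicas disagree: fault detected
    print(majority_vote([42, 41, 42]))  # TMR: one bad replica is outvoted
    print(majority_vote([1, 2, 3]))     # no majority: the voter cannot decide
```

With three replicas the voter masks one faulty output; with two replicas the system can only detect the disagreement and must rely on some other mechanism to recover.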

Information and Time Redundancy Techniques

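Information redundancy adds extra bits or data to the original information to detect and correct errors
- Examples include parity bits and error-correcting codes (ECC)
- A parity bit makes the total number of 1s even (or odd), so any single-bit error, and more generally any odd number of bit errors, is detected but cannot be located or corrected
- ECC adds redundant bits computed by a mathematical code so that errors can be corrected as well as detected; how many bit errors can be handled depends on the code
Time redundancy repeats computations or operations multiple times to detect and mitigate transient faults
- Techniques include re-execution, checkpointing, and rollback recovery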
- Re-execution repeats the computation and compares the results to detect transient faults
- Checkpointing periodically saves the system state to enable recovery from faults
- Rollback recovery restores the system to a previous checkpoint when a fault is detected
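
As a small illustration of information redundancy, the following sketch implements even parity over a list of bits. The helper names `add_even_parity` and `parity_ok` are hypothetical, and the example assumes one parity bit per data word.

```python
from typing import List


def add_even_parity(bits: List[int]) -> List[int]:
    """Append one parity bit so the total number of 1s is even."""
    return bits + [sum(bits) % 2]


def parity_ok(word: List[int]) -> bool:
    """Return True when the word (data plus parity bit) has an even number of 1s."""
    return sum(word) % 2 == 0


if __name__ == "__main__":
    word = add_even_parity([1, 0, 1, 1])
    print(parity_ok(word))        # stored word is consistent

    one_flip = list(word)
    one_flip[2] ^= 1              # a single-bit error
    print(parity_ok(one_flip))    # parity check fails: error detected

    two_flips = list(one_flip)
    two_flips[0] ^= 1             # a second bit error
    print(parity_ok(two_flips))   # parity check passes: error goes unnoticed
```

The last case shows the limit of a single parity bit: an even number of bit errors leaves the parity unchanged, which is one reason stronger codes are used where multi-bit errors are a concern.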

Software Redundancy Approaches

Software redundancy employs multiple instances of software components or diverse implementations to detect and recover from software faults
Approaches include N-version programming, recovery blocks, and self-checking software
- N-version programming uses independently developed software versions and compares their outputs
- Recovery blocks execute alternate software versions when an acceptance test fails
- Self-checking software incorporates error detection and recovery mechanisms within the software itself
Software redundancy techniques aim to mitigate the impact of software bugs, design flaws, and other software-related faults
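
The recovery block idea can be sketched as a loop that tries alternates in order until one passes the acceptance test. Everything named here (`recovery_block`, `buggy_sqrt`, `newton_sqrt`, `sqrt_acceptance`) is a hypothetical example chosen for illustration, with a square root standing in for the protected computation.

```python
import math
from typing import Callable, Sequence


def recovery_block(
    x: float,
    variants: Sequence[Callable[[float], float]],
    acceptance_test: Callable[[float, float], bool],
) -> float:
    """Try each variant in order; return the first result that passes the test."""
    for variant in variants:
        try:
            result = variant(x)
        except Exception:
            continue  # a crash counts as a failed variant
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all variants failed the acceptance test")


def buggy_sqrt(x: float) -> float:
    """Hypothetical primary variant containing a deliberate bug."""
    return x / 2


def newton_sqrt(x: float) -> float:
    """Hypothetical alternate: Newton's iteration for the square root."""
    if x == 0:
        return 0.0
    guess = x
    for _ in range(60):
        guess = 0.5 * (guess + x / guess)
    return guess


def sqrt_acceptance(x: float, result: float) -> bool:
    """Accept the result when squaring it reproduces the input."""
    return math.isclose(result * result, x, rel_tol=1e-9, abs_tol=1e-12)


if __name__ == "__main__":
    # The primary fails the acceptance test, so the alternate is used.
    print(recovery_block(9.0, [buggy_sqrt, newton_sqrt], sqrt_acceptance))
```

The quality of the acceptance test determines how much protection this gives: a test that is too weak lets wrong results through, while one that is too strict rejects correct ones.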

Fault-Tolerant Architecture Principles

Fault Detection and Isolation Mechanisms

Fault-tolerant architectures aim to maintain system functionality and prevent failures in the presence of faults
Fault detection mechanisms identify the occurrence of faults in the system
- Techniques include error detection codes, watchdog timers, and built-in self-test (BIST)
- Error detection codes (e.g., parity, ECC) detect data corruption during storage or transmission
- Watchdog timers monitor the system's behavior and trigger an alarm if expected actions do not occur within a specified time
Fault isolation mechanisms prevent the propagation of faults to other parts of the system
- Approaches include circuit-level isolation and module-level isolation
- Circuit-level isolation uses physical barriers or electrical isolation to contain faults within a specific circuit
- Module-level isolation employs well-defined interfaces and error containment boundaries to prevent fault propagation between modules
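
A watchdog timer can be modeled in software as a deadline that the monitored task must keep pushing forward. The sketch below is a simplified model, not a driver for a real hardware watchdog; the class name `WatchdogTimer` and its `kick`/`expired` methods are assumptions, and the clock is injectable so the behavior can be shown without waiting.

```python
import time
from typing import Callable


class WatchdogTimer:
    """Software watchdog: reports expiry if it is not kicked within the timeout."""

    def __init__(self, timeout_s: float, clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_kick = clock()

    def kick(self) -> None:
        """Called by the monitored task to signal that it is still making progress."""
        self._last_kick = self._clock()

    def expired(self) -> bool:
        """Return True when the task has been silent for longer than the timeout."""
        return self._clock() - self._last_kick > self.timeout_s


if __name__ == "__main__":
    now = [0.0]  # simulated clock, in seconds
    wd = WatchdogTimer(timeout_s=1.0, clock=lambda: now[0])

    now[0] = 0.5
    wd.kick()             # the task reports progress at t = 0.5 s
    now[0] = 1.2
    print(wd.expired())   # 0.7 s of silence, within the timeout
    now[0] = 2.0
    print(wd.expired())   # 1.5 s of silence, timeout exceeded
```

In a real system the expiry would typically trigger a reset or a switch to a backup unit rather than just returning a flag.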

Fault Recovery and Masking Techniques

Fault recovery mechanisms restore the system to a correct state after a fault occurs
- Techniques include checkpointing, rollback recovery, and forward error correction
- Checkpointing periodically saves the system state to enable recovery from faults
- Rollback recovery restores the system to a previous checkpoint when a fault is detected
- Forward error correction uses redundant information to correct errors without requiring retransmission or rollback
Fault masking techniques hide the effects of faults from the system's outputs, ensuring uninterrupted operation
- Examples include majority voting, redundant data storage, and error-correcting memory
- Majority voting compares the outputs of redundant components and selects the majority result
- Redundant data storage maintains multiple copies of data to ensure availability and integrity
- Error-correcting memory automatically corrects bit errors in memory using ECC techniques
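
To make forward error correction concrete, here is a sketch of the classic Hamming(7,4) code, which stores 4 data bits in a 7-bit codeword and corrects any single flipped bit without retransmission or rollback. The choice of this particular code and the function names are assumptions for illustration; ECC memory in practice generally uses wider codes.

```python
from typing import List


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits as a 7-bit codeword laid out as p1 p2 d1 p4 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword: List[int]) -> List[int]:
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the bad bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


if __name__ == "__main__":
    data = [1, 0, 1, 1]
    sent = hamming74_encode(data)
    received = list(sent)
    received[4] ^= 1  # a single bit is corrupted
    print(hamming74_decode(received) == data)
```

The syndrome directly names the position of the flipped bit, which is what lets the decoder repair the word instead of merely flagging it.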

Redundancy Effectiveness for Reliability

Reliability Metrics and Evaluation Tools

Reliability metrics, such as mean time between failures (MTBF), mean time to repair (MTTR), and availability, are used to assess the effectiveness of redundancy techniques in improving system reliability
- MTBF represents the average time between failures in a system
- MTTR indicates the average time required to repair a failed component or system
- Availability is the proportion of time a system is operational and available for use
Reliability block diagrams (RBDs) and Markov models are analytical tools used to evaluate the reliability of fault-tolerant systems with different redundancy configurations
- RBDs represent the system as blocks connected in series or in parallel, each representing a component or subsystem, and analyze the overall system reliability based on the reliability of individual blocks
- Markov models use state transitions to represent the system's behavior and calculate reliability metrics based on the probabilities of moving between states
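
These metrics lend themselves to quick calculations. A commonly used steady-state estimate is availability = MTBF / (MTBF + MTTR). For an RBD, blocks in series have reliability equal to the product of the block reliabilities, and blocks in parallel have reliability 1 minus the product of the block failure probabilities. The sketch below assumes independent block failures; the function names are hypothetical.

```python
import math
from typing import Sequence


def availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability from MTBF and MTTR (same time units)."""
    return mtbf / (mtbf + mttr)


def series_reliability(block_reliabilities: Sequence[float]) -> float:
    """Series RBD: the system works only if every block works."""
    return math.prod(block_reliabilities)


def parallel_reliability(block_reliabilities: Sequence[float]) -> float:
    """Parallel RBD: the system works if at least one block works."""
    return 1.0 - math.prod(1.0 - r for r in block_reliabilities)


if __name__ == "__main__":
    print(availability(mtbf=990.0, mttr=10.0))  # fraction of time in service
    print(series_reliability([0.9, 0.9]))       # lower than either block alone
    print(parallel_reliability([0.9, 0.9]))     # higher than either block alone
```

Putting two 0.9-reliable blocks in series lowers reliability to about 0.81, while putting them in parallel raises it to about 0.99, which is the basic argument for redundancy.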

Factors Affecting Redundancy Effectiveness

The effectiveness of hardware redundancy techniques depends on factors such as the level of redundancy (e.g., DMR, TMR), the reliability of individual components, and the voting or comparison mechanisms employed
- Higher levels of redundancy (e.g., TMR vs. DMR) provide better fault tolerance but increase cost and complexity
- The reliability of individual components directly impacts the overall system reliability
- Voting or comparison mechanisms must be reliable and correctly identify and handle faults
The effectiveness of information redundancy techniques is determined by the error detection and correction capabilities of the chosen codes (e.g., Hamming codes, Reed-Solomon codes) and the overhead introduced by the additional bits
- More powerful error-correcting codes can handle a greater number of errors but may introduce more overhead
- The trade-off between error correction capability and overhead must be considered based on the system's requirements
The effectiveness of time redundancy techniques depends on the number of repetitions, the detection and recovery mechanisms employed, and the trade-off between fault coverage and performance overhead
- More repetitions increase fault coverage but may impact system performance
- Detection and recovery mechanisms must be reliable and efficiently handle faults
The effectiveness of software redundancy techniques is influenced by the diversity of implementations, the error detection and recovery mechanisms, and the coordination among software versions
- Greater diversity among software versions reduces the likelihood of common-mode failures
- Robust error detection and recovery mechanisms are essential for effective software redundancy
- Coordination mechanisms must ensure consistent and correct behavior across software versions
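
The dependence of hardware redundancy on component and voter reliability can be quantified for TMR. Assuming three identical components that fail independently, each with reliability R, and a voter with reliability R_v, the system works when at least two components work: R_TMR = R_v (3R^2 - 2R^3). The function name `tmr_reliability` below is hypothetical.

```python
def tmr_reliability(r: float, voter_r: float = 1.0) -> float:
    """Reliability of a TMR system with independent, identical components.

    r is the reliability of one component and voter_r that of the voter.
    """
    return voter_r * (3 * r**2 - 2 * r**3)


if __name__ == "__main__":
    for r in (0.99, 0.9, 0.5, 0.4):
        print(r, tmr_reliability(r))
    # An imperfect voter caps what the redundancy can achieve.
    print(tmr_reliability(0.9, voter_r=0.95))
```

Under these assumptions TMR only helps when each component is more reliable than 0.5; below that point the triplicated system is less reliable than a single component, and a weak voter can cancel the benefit entirely.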

Designing Fault-Tolerant Systems

Identifying Critical Components and Selecting Redundancy Techniques

Identifying the critical components and subsystems that require fault tolerance based on the system's reliability requirements and failure modes and effects analysis (FMEA)
- FMEA systematically analyzes potential failure modes, their effects, and their criticality to prioritize fault tolerance efforts
- Reliability requirements, such as target MTBF or availability, guide the selection of critical components for redundancy
Selecting appropriate hardware redundancy techniques (e.g., DMR, TMR) for critical components, considering factors such as reliability, cost, and power consumption
- The chosen redundancy technique should provide the required level of fault tolerance while balancing cost and power constraints
- Reliability analysis and trade-off studies help determine the most suitable redundancy technique for each critical component
Incorporating information redundancy techniques (e.g., ECC, CRC) for data storage, transmission, and processing to detect and correct errors
- ECC is commonly used in memory systems to protect against bit errors
- CRC is often employed in data transmission to detect errors in the received data, typically so that the corrupted data can be retransmitted
Applying time redundancy techniques (e.g., re-execution, checkpointing) for critical computations or operations to detect and recover from transient faults
- Re-execution can be used for critical computations where the results can be quickly verified
- Checkpointing is useful for long-running or complex operations to minimize the amount of lost work in case of a fault
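
Checkpointing and rollback for a long-running operation can be sketched as follows. The example is a toy: the class `CheckpointedState`, the function `run_with_checkpoints`, and the way a transient fault is simulated are all assumptions made for illustration, with a running sum standing in for the real computation.

```python
import copy
from typing import Any, Dict, Iterable, List, Optional


class CheckpointedState:
    """Holds mutable job state plus the most recent saved copy of it."""

    def __init__(self, state: Dict[str, Any]):
        self.state = state
        self._checkpoint: Optional[Dict[str, Any]] = None

    def checkpoint(self) -> None:
        """Save a deep copy of the current state."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self) -> None:
        """Restore the state saved by the last checkpoint."""
        if self._checkpoint is None:
            raise RuntimeError("no checkpoint available")
        self.state = copy.deepcopy(self._checkpoint)


def run_with_checkpoints(
    items: List[int], checkpoint_every: int, faulty_steps: Iterable[int]
) -> int:
    """Sum items, checkpointing periodically and rolling back on simulated faults."""
    job = CheckpointedState({"index": 0, "total": 0})
    job.checkpoint()
    pending_faults = set(faulty_steps)
    while job.state["index"] < len(items):
        i = job.state["index"]
        if i in pending_faults:
            pending_faults.discard(i)  # transient: it does not recur on retry
            job.state["total"] = -1    # simulate corrupted state
            job.rollback()             # return to the last known good state
            continue
        job.state["total"] += items[i]
        job.state["index"] = i + 1
        if job.state["index"] % checkpoint_every == 0:
            job.checkpoint()
    return job.state["total"]


if __name__ == "__main__":
    # A fault at step 3 costs only the work done since the last checkpoint.
    print(run_with_checkpoints([1, 2, 3, 4, 5], checkpoint_every=2, faulty_steps=[3]))
```

A shorter checkpoint interval reduces the work lost per fault but increases the time spent saving state, which is the performance trade-off behind choosing the interval.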

Implementing Fault Detection, Recovery, and Masking Mechanisms

Employing software redundancy techniques (e.g., N-version programming, recovery blocks) for critical software components to improve fault tolerance
- N-version programming is suitable for software components with well-defined inputs and outputs
- Recovery blocks are useful for software components with clear acceptance criteria for the results
Designing fault detection and isolation mechanisms (e.g., watchdog timers, BIST) to identify and contain faults within specific components or subsystems
- Watchdog timers can detect software or hardware faults that cause the system to become unresponsive
- BIST mechanisms enable self-testing of components to identify faults during system startup or periodic checks
Implementing fault recovery mechanisms (e.g., checkpointing, rollback recovery) to restore the system to a correct state after a fault occurs
- Checkpointing saves the system state at regular intervals to enable recovery from faults
- Rollback recovery uses the saved checkpoints to restore the system to a known good state
Incorporating fault masking techniques (e.g., majority voting, error-correcting memory) to maintain uninterrupted system operation in the presence of faults
- Majority voting can be used in systems with redundant components to determine the correct output
- Error-correcting memory automatically corrects bit errors, preventing them from affecting the system's operation
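
N-version programming combines diverse implementations with majority voting. In the sketch below, three hypothetical versions of a "maximum of a list" routine stand in for independently developed software, one of them containing a deliberate bug; the names `n_version_execute`, `max_builtin`, `max_sorted`, and `max_loop_buggy` are illustrative only.

```python
from collections import Counter
from typing import Callable, List, Sequence


def max_builtin(values: List[int]) -> int:
    """Version A: relies on the built-in max."""
    return max(values)


def max_sorted(values: List[int]) -> int:
    """Version B: sorts and takes the last element."""
    return sorted(values)[-1]


def max_loop_buggy(values: List[int]) -> int:
    """Version C: deliberate bug, wrong starting value for all-negative input."""
    best = 0
    for v in values:
        if v > best:
            best = v
    return best


def n_version_execute(
    versions: Sequence[Callable[[List[int]], int]], values: List[int]
) -> int:
    """Run every version and return the result agreed on by a strict majority."""
    results = []
    for version in versions:
        try:
            results.append(version(values))
        except Exception:
            pass  # a crashed version casts no vote
    if results:
        value, count = Counter(results).most_common(1)[0]
        if count > len(versions) // 2:
            return value
    raise RuntimeError("no majority among versions")


if __name__ == "__main__":
    versions = [max_builtin, max_sorted, max_loop_buggy]
    # Version C disagrees on this input and is outvoted by the other two.
    print(n_version_execute(versions, [-5, -2, -9]))
```

The approach only masks faults when the versions fail differently, which is why well-defined inputs and outputs and genuine diversity between implementations matter.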

© 2024 Fiveable Inc. All rights reserved.

AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.
