Cyber Resilience and Redundancy
84 cards · 8 sections
High Availability (OBJ 3.4)
High Availability
The design goal of keeping systems and services operational continuously by eliminating single points of failure through redundancy, load balancing, and failover mechanisms.
Five Nines (99.999%)
Availability standard permitting no more than 5.26 minutes of downtime per year; six nines (99.9999%) permits approximately 31.5 seconds per year.
Load Balancing
Distributes incoming traffic or workloads across multiple servers so no single server becomes a bottleneck, increasing both availability and throughput.
Clustering
Groups of servers that work together so that if one node fails, another automatically takes over, maintaining continuous service availability.
Redundancy
Duplication of critical components or functions so that a single failure does not cause a system outage; applies to hardware, network paths, power, and data.
Multi-Cloud
Strategy of spreading workloads across two or more cloud providers to avoid vendor lock-in and eliminate a single provider as a point of failure.
Availability Formula
Availability (%) = Uptime / (Uptime + Downtime) × 100. High availability requires engineering out single points of failure at every tier: compute, storage, network, and power.
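As a worked example of the formula, here is a minimal Python sketch. The function name and the sample month (718 hours up, 2 hours down) are hypothetical and only illustrate the arithmetic.

```python
def availability_percent(uptime_hours: float, downtime_hours: float) -> float:
    """Availability (%) = Uptime / (Uptime + Downtime) x 100."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_hours / total * 100


# Hypothetical example: a 30-day month (720 hours) with 2 hours of outage.
monthly = availability_percent(uptime_hours=718, downtime_hours=2)
print(f"{monthly:.3f}%")  # 99.722%
```

Two hours of outage in one month already falls short of three nines (99.9%), which shows how little downtime the higher targets allow.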
Redundancy vs. Clustering
Redundancy ensures a spare exists; clustering ensures active-active or active-passive failover so workloads shift automatically without manual intervention.
Scenario: maximize uptime with automatic failover
Scenario: maximize uptime with automatic failover → Answer: clustering or load balancing. Scenario: eliminate dependence on a single cloud provider → Answer: multi-cloud strategy.
Five nines vs. six nines downtime
Five nines = 5.26 min/year; six nines = 31.5 sec/year. These exact numbers are testable. Higher nines = less acceptable downtime = higher cost.
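The downtime figures can be derived rather than memorized. This is a rough sketch assuming a 365-day year; the constant and function names are illustrative, not from any standard.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: 365-day year


def downtime_seconds_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of N nines."""
    unavailability = 10 ** (-nines)  # five nines -> 0.00001
    return SECONDS_PER_YEAR * unavailability


for n in (3, 4, 5, 6):
    secs = downtime_seconds_per_year(n)
    print(f"{n} nines: {secs / 60:.2f} min/year ({secs:.1f} s/year)")
```

Under that assumption five nines works out to about 5.26 minutes per year and six nines to about 31.5 seconds per year, matching the card.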
Data Redundancy and RAID (OBJ 3.4)
RAID (Redundant Array of Independent Disks)
Combines multiple physical drives into one logical unit for redundancy, performance, or both. Different RAID levels trade capacity, speed, and fault tolerance differently.
RAID 0 — Striping
Data split across two or more disks for maximum performance and capacity; no redundancy. One disk failure = total data loss. Keywords: striping, performance, no fault tolerance.
RAID 1 — Mirroring
Data written identically to two disks simultaneously. One disk can fail with no data loss. Keywords: mirroring, 1:1 duplication, half usable capacity.
RAID 5 — Striping with Parity
Data and parity information striped across three or more disks; survives one disk failure. Keywords: parity, minimum 3 disks, one drive failure tolerance.
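Parity in RAID 5 is computed with XOR, which is why any single missing block in a stripe can be rebuilt from the others. The sketch below models one stripe on a three-disk array with made-up block contents. It is a simplification: a real array rotates the parity block across disks and works on much larger blocks.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: two data blocks on two disks (hypothetical contents).
data = [b"AAAA", b"BBBB"]
parity = xor_blocks(data)  # parity block stored on the third disk

# Simulate losing the first disk, then rebuild its block
# from the surviving data block and the parity block.
rebuilt = xor_blocks([data[1], parity])
print(rebuilt == data[0])  # True
```

Losing two blocks of the same stripe leaves too little information to rebuild either, which is the one-failure limit of RAID 5.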
RAID 6 — Striping with Double Parity
Like RAID 5 but with two parity blocks; survives two simultaneous disk failures. Keywords: double parity, minimum 4 disks, two drive failure tolerance.
RAID 10 — Mirroring + Striping
Combines RAID 1 (mirroring) and RAID 0 (striping); requires minimum 4 disks. Provides both high performance and fault tolerance. Keywords: nested RAID, mirror + stripe, 50% usable capacity.
Failure-Resistant
System designed to withstand the failure of one or more components without losing data or availability. RAID 1, 5, 6, and 10 are failure-resistant configurations.
Fault-Tolerant
System that continues operating correctly even when one or more components fail, typically through automatic failover or redundancy mechanisms.
Disaster-Tolerant
System architecture capable of surviving a complete site failure, typically through geographic redundancy such as offsite data replication to a separate physical location.
RAID level quick-select
- RAID 0 → performance only, no redundancy
- RAID 1 → mirroring, simplest redundancy, 2 disks
- RAID 5 → parity, 3+ disks, one drive failure
- RAID 6 → double parity, 4+ disks, two drive failures
- RAID 10 → mirror + stripe, 4+ disks, high performance + redundancy
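The capacity and fault-tolerance trade-offs in the quick-select list can be sketched as a small calculator. The function name and the 4 TB disk size are hypothetical. It assumes equal-size disks and reports only the guaranteed number of survivable failures: RAID 10 can survive more than one failure if the failed disks sit in different mirror pairs.

```python
def raid_summary(level: int, disks: int, disk_tb: float) -> tuple[float, int]:
    """Return (usable capacity in TB, guaranteed disk failures survived)."""
    minimum = {0: 2, 1: 2, 5: 3, 6: 4, 10: 4}
    if level not in minimum:
        raise ValueError(f"unsupported RAID level: {level}")
    if disks < minimum[level]:
        raise ValueError(f"RAID {level} needs at least {minimum[level]} disks")
    if level == 10 and disks % 2:
        raise ValueError("RAID 10 needs an even number of disks")

    if level == 0:
        return disks * disk_tb, 0        # striping only, no redundancy
    if level == 1:
        return disk_tb, disks - 1        # every disk holds a full copy
    if level == 5:
        return (disks - 1) * disk_tb, 1  # one disk's worth of parity
    if level == 6:
        return (disks - 2) * disk_tb, 2  # two disks' worth of parity
    return disks / 2 * disk_tb, 1        # RAID 10: half the raw capacity


for level, disks in [(0, 2), (1, 2), (5, 3), (6, 4), (10, 4)]:
    usable, tolerated = raid_summary(level, disks, disk_tb=4)
    print(f"RAID {level} with {disks} disks: {usable} TB usable, survives {tolerated} failure(s)")
```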
Failure-resistant vs. fault-tolerant vs. disaster-tolerant
Failure-resistant = survives component failure (RAID). Fault-tolerant = continues operating during failure (clustering). Disaster-tolerant = survives site loss (geographic replication). Each tier is a higher level of resilience than the one before it.
Capacity Planning (OBJ 3.4)
Capacity Planning
The process of determining the resources an organization needs to meet current and future demand, covering people, technology, infrastructure, and processes.
People Capacity
Ensuring sufficient staffing to operate systems, respond to incidents, and support users; includes cross-training and succession planning for critical roles.
Technology Capacity
Scaling compute, storage, and network resources to handle peak workloads, including licensing and software capacity headroom.
Infrastructure Capacity
Physical space, power, and cooling in data centers or co-location facilities needed to support technology growth without bottlenecks.
Process Capacity
Ensuring workflows, procedures, and automation can scale with organizational demand, preventing process bottlenecks that limit operational effectiveness.
Four capacity planning dimensions
Scenario: organization cannot respond to growing demand → Answer: assess capacity across all four dimensions: people, technology, infrastructure, and processes. Missing any one dimension creates a single point of failure in scaling.
Powering the Data Center (OBJ 3.4)
Power Surge
A sudden, brief increase in voltage above normal levels, generally smaller in magnitude but longer in duration than a spike; can damage equipment. Mitigated by surge protectors and line conditioners.
Power Spike
Extreme, very short-duration overvoltage event (milliseconds), often caused by lightning strikes or switching transients.
Power Sag
A short-duration drop in voltage below normal levels; also called a voltage sag or dip. Can cause equipment instability or resets.
Undervoltage (Brownout)
A sustained, intentional or unintentional reduction in voltage, lasting minutes to hours; can damage motors and power supplies over time.
Power Loss (Blackout)
Complete loss of electrical power, either momentary or extended; requires UPS or generator backup to maintain operations.
Line Conditioner
Device that filters and regulates power supply to remove noise, surges, and sags before they reach connected equipment.
UPS (Uninterruptible Power Supply)
Battery-backed device that provides immediate power during outages, bridging the gap between power loss and generator startup.
Generator — Portable
Fuel-powered generator that can be moved to a location; used for temporary power when permanent power is unavailable.
Generator — Standby
Permanently installed generator that automatically starts when utility power fails; typically used for building-level or data center backup power.
Generator — Permanent
Fixed, high-capacity generator integrated into facility infrastructure; provides long-duration backup power for critical operations.
PDC (Power Distribution Center)
Centralized panel that routes and distributes electrical power from the source to all circuits and devices within a data center or facility.
Five power conditions
- Surge — brief overvoltage
- Spike — extreme short-duration overvoltage (lightning)
- Sag — brief undervoltage dip
- Undervoltage (Brownout) — sustained low voltage
- Power Loss (Blackout) — complete outage
UPS vs. generator layering
Scenario: maintain power during utility failure → Answer: UPS provides immediate bridging (seconds to minutes), then generator takes over for extended coverage. Both together are the correct layered answer.
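The layering only works if the UPS can carry the load for longer than the generator takes to start. Below is a rough sizing sketch; the battery capacity, load, efficiency, start-up time, and safety factor are all hypothetical values, and real UPS runtime depends on battery age, temperature, and the vendor's discharge curves.

```python
def ups_runtime_minutes(battery_wh: float, load_w: float, efficiency: float = 0.9) -> float:
    """Rough UPS runtime estimate: usable stored energy divided by the load."""
    if load_w <= 0:
        raise ValueError("load must be positive")
    return battery_wh * efficiency / load_w * 60


def ups_bridges_generator(runtime_min: float, generator_start_min: float,
                          safety_factor: float = 2.0) -> bool:
    """True if the UPS can carry the load until the generator is online,
    with a margin given by safety_factor."""
    return runtime_min >= generator_start_min * safety_factor


# Hypothetical numbers: 1000 Wh of battery feeding a 3000 W load.
runtime = ups_runtime_minutes(battery_wh=1000, load_w=3000)  # about 18 minutes
print(ups_bridges_generator(runtime, generator_start_min=2))  # True
```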
Data Backups (OBJ 3.4)
Onsite Backup
Backup copies stored at the same physical location as the primary data; provides fast recovery but no protection against site-wide disasters.
Offsite Backup
Backup copies stored at a geographically separate location from the primary site; protects against site-level disasters such as fire or flood, and against ransomware that encrypts local backups.
RPO (Recovery Point Objective)
The maximum acceptable amount of data loss measured in time; defines how old a backup can be when used for recovery. Lower RPO = more frequent backups required.
Snapshot
Point-in-time copy of a system or data set captured quickly, often used with virtual machines and storage systems; enables rapid rollback.
Replication
Continuous or near-continuous copying of data from a primary system to a secondary system, enabling near-zero data loss recovery.
Journaling
Records every change made to data in a sequential log (journal), enabling granular recovery to any point in time by replaying the journal entries.
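To illustrate point-in-time recovery by journal replay, here is a minimal sketch. The journal format (timestamp, key, new value) and the sample entries are made up; real journaling systems log at the block, file, or transaction level.

```python
from datetime import datetime

# Hypothetical journal: (timestamp, key, new_value), appended in order.
journal = [
    (datetime(2024, 1, 1, 9, 0), "balance", 100),
    (datetime(2024, 1, 1, 10, 0), "balance", 150),
    (datetime(2024, 1, 1, 11, 0), "owner", "alice"),
    (datetime(2024, 1, 1, 12, 0), "balance", 0),  # unwanted change
]


def restore_to(base: dict, entries: list, point_in_time: datetime) -> dict:
    """Replay journal entries onto a copy of base, stopping at point_in_time."""
    state = dict(base)
    for timestamp, key, value in entries:
        if timestamp > point_in_time:
            break  # entries are sequential, so later ones can be skipped
        state[key] = value
    return state


print(restore_to({}, journal, datetime(2024, 1, 1, 11, 30)))
# {'balance': 150, 'owner': 'alice'}
```

Choosing 11:30 as the recovery point keeps every change up to 11:00 and discards the unwanted 12:00 entry.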
Backup Encryption
Applies encryption to backup copies both at rest (storage) and in transit (transfer); prevents unauthorized access to backup data if physical media is lost or stolen.
3-2-1 Backup Strategy
Best practice: keep 3 copies of data, on 2 different media types, with 1 stored offsite. Ensures recovery even if onsite copies are destroyed.
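The rule is easy to check mechanically. In this sketch the `BackupCopy` class and the sample copies are hypothetical, and it assumes every copy of the data, including the production copy, is passed in.

```python
from dataclasses import dataclass


@dataclass
class BackupCopy:
    name: str
    media: str     # e.g. "disk", "tape", "cloud"
    offsite: bool


def meets_3_2_1(copies: list[BackupCopy]) -> bool:
    """3 copies, on 2 different media types, with at least 1 offsite."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


copies = [
    BackupCopy("production", media="disk", offsite=False),
    BackupCopy("local NAS", media="disk", offsite=False),
    BackupCopy("vaulted tape", media="tape", offsite=True),
]
print(meets_3_2_1(copies))      # True
print(meets_3_2_1(copies[:2]))  # False: two copies, one media type, none offsite
```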
Backup tiers by recovery speed
- Replication — near real-time, lowest RPO
- Snapshot — minutes/hours granularity
- Onsite backup — local restore, fast
- Offsite backup — slowest restore, site-failure protection
- Journaling — granular point-in-time recovery
RPO determines backup frequency
Scenario: data loss tolerance is 4 hours → Answer: backup interval must be ≤ 4 hours (RPO). RPO is about data loss tolerance, not recovery speed — that is RTO.
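The reasoning is that a failure just before the next backup loses everything since the last one, so worst-case data loss equals one full backup interval. A minimal sketch with a hypothetical function name:

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst-case data loss is the time since the last backup,
    which can be as long as one full backup interval."""
    return backup_interval <= rpo


rpo = timedelta(hours=4)
print(meets_rpo(timedelta(hours=1), rpo))   # True
print(meets_rpo(timedelta(hours=4), rpo))   # True: exactly at the limit
print(meets_rpo(timedelta(hours=24), rpo))  # False: nightly backups risk up to 24 hours of loss
```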
Ransomware-resistant backup design
Scenario: ransomware encrypts all local backups → Answer: offsite backups are the recovery path. Air-gapped or immutable offsite backups prevent ransomware from reaching backup copies; encryption at rest protects their confidentiality but does not by itself stop ransomware.
Business Continuity and Disaster Recovery (OBJ 3.4)
BCP (Business Continuity Plan)
A plan covering preventive and recovery actions to maintain business operations during and after any disruptive event, including non-technical disruptions.
DRP (Disaster Recovery Plan)
A subset of the BCP focused specifically on restoring IT systems and operations after a declared disaster. BCP = any disruption; DRP = disaster specifically.
BCDR (Business Continuity and Disaster Recovery)
Combined term encompassing both the BCP and DRP; used when referring to the full continuity and recovery program together.
Business Continuity Committee
Cross-functional team representing technology, legal, security, communications, and other departments; develops and maintains the BCP under direction of senior management.
Senior Management Role in BCP
Senior management bears ultimate responsibility for BCP development; it sets risk appetite and scope and appoints the BC coordinator. Tasks can be delegated, but this responsibility cannot.
BCP vs. DRP distinction
BCP covers any disruptive event (incident, strike, supply chain failure) and includes non-technical functions. DRP is narrower: IT recovery from a declared disaster. DRP is a component of BCP.
BCP scope and senior management
Scope must be defined by senior management to prevent scope creep; committee represents all business functions, not just IT, because continuity covers the entire organization.
BCP vs. DRP scenario differentiation
Scenario: hurricane destroys data center, restore IT systems → Answer: DRP. Scenario: key supplier fails, maintain all operations → Answer: BCP. DRP is always about IT recovery from disaster; BCP is broader.
Who owns the BCP?
Scenario: who is ultimately responsible for the BCP? → Answer: senior management. The BC coordinator leads the committee, but senior management sets goals, scope, and risk appetite.
Redundant Sites (OBJ 3.4)
Hot Site
A fully equipped and operational backup facility that can take over immediately with near-zero downtime; highest cost, fastest recovery.
Warm Site
A partially equipped backup facility with power, connectivity, and basic infrastructure; requires days to become fully operational. Balances cost and recovery time.
Cold Site
A backup facility with minimal infrastructure (building, power, basic utilities) but no installed equipment or network; takes weeks to months to activate. Lowest cost.
Mobile Site
A portable recovery facility (trailers, tents) that can be transported to any location; can be configured as hot, warm, or cold depending on equipment carried.
Virtual Site
A cloud-based redundant environment providing virtual hot, warm, or cold site capabilities; offers rapid scalability, cost efficiency, and geographic flexibility without physical facility costs.
Platform Diversity
Using different operating systems, hardware vendors, or cloud providers across primary and redundant sites to ensure a vulnerability or outage affecting one platform does not also compromise the backup.
Geographic Dispersion
Distributing systems, data, and personnel across multiple physical locations to reduce the impact of a regional disaster on overall operations.
Site type trade-offs
- Hot site — instant failover, highest cost (duplicate infrastructure)
- Warm site — days to activate, moderate cost
- Cold site — weeks to months, lowest cost
- Mobile site — portable, deployable anywhere
- Virtual site — cloud-based, scalable, pay-as-needed
Platform diversity trade-off
Different platforms reduce shared-vulnerability risk but increase support complexity and training costs. Same-platform redundant sites are easier to manage but share vulnerabilities.
Site type scenario selection
Scenario: need immediate failover, cost is secondary → Answer: hot site. Scenario: minimize cost, accept weeks of downtime → Answer: cold site. Scenario: balance cost and recovery speed → Answer: warm site.
Virtual site vs. cloud redundancy
Scenario: organization wants cloud-based disaster recovery without physical facility → Answer: virtual site. Virtual sites can replicate hot/warm/cold characteristics in cloud infrastructure.
Platform diversity purpose
Scenario: a critical vulnerability affects the network vendor used at the primary site → Answer: platform diversity at the redundant site, so the backup does not share the same vulnerability.
Resilience and Recovery Testing (OBJ 3.4)
Resilience Testing
Assessment of a system's ability to withstand and adapt to disruptive events without losing critical functionality.
Recovery Testing
Evaluation of a system's capacity to restore normal operations after a disruptive event, verifying that recovery plans and procedures work as expected.
Tabletop Exercise
A simulated scenario-based discussion among key stakeholders to assess and improve organizational preparedness for an emergency or crisis without deploying real resources.
Exercise Inject
A specific scenario or situation introduced during a tabletop exercise to prompt stakeholder response; simulates an event such as a detected breach or natural disaster.
Failover Test
A controlled experiment that verifies seamless transition from a primary system to a backup or secondary system, confirming that uninterrupted operations can be maintained during a real disaster.
Simulation
A computer-generated or artificial representation of a real-world system or scenario used to test how systems and personnel respond to failures or attacks in a virtualized environment.
Parallel Processing (resilience context)
Replicating data and system processes onto a secondary system and running both simultaneously to verify the secondary can handle the load without disrupting primary operations.
Four resilience and recovery testing methods
- Tabletop exercise — discussion-based, low cost, no real deployment
- Failover test — actual cutover to backup, verifies real-world transition
- Simulation — virtualized environment, real-time responder actions
- Parallel processing — primary and secondary run simultaneously, validates secondary capacity
Testing must be continuous
Resilience and recovery testing is not a one-time activity; plans must be regularly tested and updated as the threat environment and infrastructure evolve.
Testing method scenario selection
Scenario: low-cost, low-disruption rehearsal with stakeholders → Answer: tabletop exercise. Scenario: verify actual failover to hot site works → Answer: failover test. Scenario: test defenders against live attack in virtualized environment → Answer: simulation.
Tabletop vs. simulation distinction
Tabletop = paper-based discussion, no system actions taken. Simulation = actual technical actions in a virtualized or test environment. Failover test = real cutover to production backup systems.
Parallel processing purpose
Scenario: validate secondary system handles real production load → Answer: parallel processing. Both systems run concurrently; confirms secondary stability without stopping primary operations.