Day 13 of 7 · Domain 3 · 18%

Cyber Resilience and Redundancy

High Availability · Data Redundancy and RAID · Capacity Planning · Powering the Data Center · Data Backups · Business Continuity and Disaster Recovery · Redundant Sites · Resilience and Recovery Testing

84 cards · 8 sections

High Availability (OBJ 3.4)

Terms & Definitions(6)

High Availability

The design goal of keeping systems and services operational continuously by eliminating single points of failure through redundancy, load balancing, and failover mechanisms.

Five Nines (99.999%)

Availability standard permitting no more than 5.26 minutes of downtime per year; six nines (99.9999%) permits approximately 31.5 seconds per year.

Load Balancing

Distributes incoming traffic or workloads across multiple servers so no single server becomes a bottleneck, increasing both availability and throughput.
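
The distribution idea above can be sketched as a simple round-robin rotation that skips failed nodes. This is only an illustration: the RoundRobinBalancer class, its method names, and the server names are hypothetical, and real load balancers typically also consider health checks, connection counts, and session affinity.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out servers in rotation, skipping any marked unhealthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._rotation = cycle(self.servers)

    def mark_down(self, server):
        # Remove a failed server from the healthy pool.
        self.healthy.discard(server)

    def mark_up(self, server):
        # Return a recovered server to the healthy pool.
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        if not self.healthy:
            raise RuntimeError("no healthy servers available")
        # One full pass over the rotation is enough to find a healthy node.
        for _ in range(len(self.servers)):
            candidate = next(self._rotation)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


lb = RoundRobinBalancer(["web-1", "web-2", "web-3"])
first_pass = [lb.next_server() for _ in range(3)]    # every server gets one request
lb.mark_down("web-2")                                 # simulate a node failure
after_failure = [lb.next_server() for _ in range(4)]  # traffic flows around web-2
```

After web-2 is marked down, requests alternate between the two remaining servers, which is the availability benefit the definition describes.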

Clustering

Groups of servers that work together so that if one node fails, another automatically takes over, maintaining continuous service availability.

Redundancy

Duplication of critical components or functions so that a single failure does not cause a system outage; applies to hardware, network paths, power, and data.

Multi-Cloud

Strategy of spreading workloads across two or more cloud providers to avoid vendor lock-in and eliminate a single provider as a point of failure.

Key Concepts(2)

Availability Formula

Availability (%) = Uptime / (Uptime + Downtime) × 100. High availability requires engineering out single points of failure at every tier: compute, storage, network, and power.
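
As a worked instance of the formula above, here is a short Python sketch. The function name and the sample month (720 observed hours, 3 of them down) are assumptions made up for illustration.

```python
def availability_percent(uptime_hours: float, downtime_hours: float) -> float:
    """Availability (%) = Uptime / (Uptime + Downtime) x 100."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_hours / total * 100


# Hypothetical month: 717 hours up, 3 hours down.
monthly = availability_percent(uptime_hours=717, downtime_hours=3)
print(f"{monthly:.2f}%")
```

Three hours of downtime in a 720-hour month already puts the service below "three nines", which shows how little slack the higher tiers leave.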

Redundancy vs. Clustering

Redundancy ensures a spare exists; clustering ensures active-active or active-passive failover so workloads shift automatically without manual intervention.

Exam Tips(2)

Scenario: maximize uptime with automatic failover

Scenario: maximize uptime with automatic failover → Answer: clustering or load balancing. Scenario: eliminate dependence on a single cloud provider → Answer: multi-cloud strategy.

Five nines vs. six nines downtime

Five nines = 5.26 min/year; six nines = 31.5 sec/year. These exact numbers are testable. Higher nines = less acceptable downtime = higher cost.
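
The downtime figures above follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function and constant names are illustrative):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: 365-day year


def allowed_downtime_seconds(nines: int) -> float:
    """Yearly downtime budget for N nines of availability (5 -> 99.999%)."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10 ** -nines  # e.g. 0.00001 for five nines
    return SECONDS_PER_YEAR * unavailability


five = allowed_downtime_seconds(5) / 60  # minutes per year
six = allowed_downtime_seconds(6)        # seconds per year
print(f"five nines: {five:.2f} min/year, six nines: {six:.1f} sec/year")
```

Each additional nine divides the downtime budget by ten, which is why each step up costs disproportionately more to engineer.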

Data Redundancy and RAID (OBJ 3.4)

Terms & Definitions(9)

RAID (Redundant Array of Independent Disks)

Combines multiple physical drives into one logical unit for redundancy, performance, or both. Different RAID levels trade capacity, speed, and fault tolerance differently.

RAID 0 — Striping

Data split across two or more disks for maximum performance and capacity; no redundancy. One disk failure = total data loss. Keywords: striping, performance, no fault tolerance.

RAID 1 — Mirroring

Data written identically to two disks simultaneously. One disk can fail with no data loss. Keywords: mirroring, 1:1 duplication, half usable capacity.

RAID 5 — Striping with Parity

Data and parity information striped across three or more disks; survives one disk failure. Keywords: parity, minimum 3 disks, one drive failure tolerance.
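
To see why parity lets an array survive one disk failure, the sketch below uses XOR parity over a single stripe. It is a simplification: the block contents and helper name are made up, and real RAID 5 implementations rotate parity across the disks and work on much larger blocks.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe across three disks: two data blocks plus one parity block.
data_a = b"ABCD"
data_b = b"wxyz"
parity = xor_blocks([data_a, data_b])

# Simulate losing the disk that held data_b, then rebuild it from the survivors.
rebuilt_b = xor_blocks([data_a, parity])
```

Because XOR is its own inverse, any single missing block equals the XOR of all the remaining blocks in the stripe. Losing two blocks at once leaves too little information, which is the gap RAID 6 closes with a second parity block.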

RAID 6 — Striping with Double Parity

Like RAID 5 but with two parity blocks; survives two simultaneous disk failures. Keywords: double parity, minimum 4 disks, two drive failure tolerance.

RAID 10 — Mirroring + Striping

Combines RAID 1 (mirroring) and RAID 0 (striping); requires minimum 4 disks. Provides both high performance and fault tolerance. Keywords: nested RAID, mirror + stripe, 50% usable capacity.

Failure-Resistant

System designed to withstand the failure of one or more components without losing data or availability. RAID 1, 5, 6, and 10 are failure-resistant configurations.

Fault-Tolerant

System that continues operating correctly even when one or more components fail, typically through automatic failover or redundancy mechanisms.

Disaster-Tolerant

System architecture capable of surviving a complete site failure, typically through geographic redundancy such as offsite data replication to a separate physical location.

Exam Tips(2)

RAID level quick-select

RAID 0 → performance only, no redundancy
RAID 1 → mirroring, simplest redundancy, 2 disks
RAID 5 → parity, 3+ disks, one drive failure
RAID 6 → double parity, 4+ disks, two drive failures
RAID 10 → mirror + stripe, 4+ disks, high performance + redundancy
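
The quick-select list above can be turned into a small capacity calculator. This is a sketch under stated assumptions: equal-size disks, a two-disk RAID 1 mirror, and an even disk count for RAID 10. The function name and return format are illustrative only.

```python
def raid_summary(level: int, disks: int, disk_tb: float) -> dict:
    """Return usable capacity (TB) and guaranteed disk-failure tolerance."""
    minimum = {0: 2, 1: 2, 5: 3, 6: 4, 10: 4}
    if level not in minimum:
        raise ValueError(f"unsupported RAID level: {level}")
    if disks < minimum[level]:
        raise ValueError(f"RAID {level} needs at least {minimum[level]} disks")
    if level == 1 and disks != 2:
        raise ValueError("this sketch models RAID 1 as a two-disk mirror")
    if level == 10 and disks % 2:
        raise ValueError("RAID 10 needs an even number of disks")

    if level == 0:
        usable, tolerates = disks * disk_tb, 0        # striping only
    elif level == 1:
        usable, tolerates = disk_tb, 1                # one disk's worth, mirrored
    elif level == 5:
        usable, tolerates = (disks - 1) * disk_tb, 1  # one disk's worth of parity
    elif level == 6:
        usable, tolerates = (disks - 2) * disk_tb, 2  # two disks' worth of parity
    else:
        # RAID 10: half the raw capacity. One failure is always survivable;
        # more may be, if the failures land in different mirror pairs.
        usable, tolerates = disks / 2 * disk_tb, 1
    return {"usable_tb": usable, "tolerates": tolerates}


print(raid_summary(5, disks=4, disk_tb=2.0))
print(raid_summary(10, disks=4, disk_tb=2.0))
```

With four 2 TB disks, RAID 5 yields more usable space than RAID 10, while RAID 10 trades that space for mirrored performance.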

Failure-resistant vs. fault-tolerant vs. disaster-tolerant

Failure-resistant = survives component failure (RAID). Fault-tolerant = continues operating during failure (clustering). Disaster-tolerant = survives site loss (geographic replication). Each term in this sequence represents a higher level of resilience than the one before it.

Capacity Planning (OBJ 3.4)

Terms & Definitions(5)

Capacity Planning

The process of determining the resources an organization needs to meet current and future demand, covering people, technology, infrastructure, and processes.

People Capacity

Ensuring sufficient staffing to operate systems, respond to incidents, and support users; includes cross-training and succession planning for critical roles.

Technology Capacity

Scaling compute, storage, and network resources to handle peak workloads, including licensing and software capacity headroom.

Infrastructure Capacity

Physical space, power, and cooling in data centers or co-location facilities needed to support technology growth without bottlenecks.

Process Capacity

Ensuring workflows, procedures, and automation can scale with organizational demand, preventing process bottlenecks that limit operational effectiveness.

Exam Tips(1)

Four capacity planning dimensions

Scenario: organization cannot respond to growing demand → Answer: assess capacity across all four dimensions: people, technology, infrastructure, and processes. Neglecting any one dimension creates a bottleneck that limits how far the others can scale.

Powering the Data Center (OBJ 3.4)

Terms & Definitions(11)

Power Surge

A sudden, brief spike in voltage above normal levels; can damage equipment. Mitigated by surge protectors and line conditioners.

Power Spike

Extreme, very short-duration overvoltage event (milliseconds), often caused by lightning strikes or switching transients.

Power Sag

A short-duration drop in voltage below normal levels; also called a voltage sag or dip. Can cause equipment instability or resets.

Undervoltage (Brownout)

A sustained, intentional or unintentional reduction in voltage, lasting minutes to hours; can damage motors and power supplies over time.

Power Loss (Blackout)

Complete loss of electrical power, either momentary or extended; requires UPS or generator backup to maintain operations.

Line Conditioner

Device that filters and regulates power supply to remove noise, surges, and sags before they reach connected equipment.

UPS (Uninterruptible Power Supply)

Battery-backed device that provides immediate power during outages, bridging the gap between power loss and generator startup.

Generator — Portable

Fuel-powered generator that can be moved to a location; used for temporary power when permanent power is unavailable.

Generator — Standby

Permanently installed generator that automatically starts when utility power fails; typically used for building-level or data center backup power.

Generator — Permanent

Fixed, high-capacity generator integrated into facility infrastructure; provides long-duration backup power for critical operations.

PDC (Power Distribution Center)

Centralized panel that routes and distributes electrical power from the source to all circuits and devices within a data center or facility.

Exam Tips(2)

Five power conditions

Surge — brief overvoltage spike
Spike — extreme short-duration overvoltage (lightning)
Sag — brief undervoltage dip
Undervoltage (Brownout) — sustained low voltage
Power Loss (Blackout) — complete outage

UPS vs. generator layering

Scenario: maintain power during utility failure → Answer: UPS provides immediate bridging (seconds to minutes), then generator takes over for extended coverage. Both together are the correct layered answer.
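
The layering above only works if the UPS battery outlasts the generator's startup time. A minimal check of that condition, where the function name, the sample durations, and the 2x safety margin are all hypothetical values chosen for illustration:

```python
def power_gap_covered(ups_runtime_s: float, generator_start_s: float,
                      margin: float = 2.0) -> bool:
    """True if UPS runtime covers generator startup with a safety margin.

    The default 2x margin is an illustrative assumption, not a standard.
    """
    return ups_runtime_s >= generator_start_s * margin


# Hypothetical figures: 10 minutes of battery, 30-second generator start.
print(power_gap_covered(ups_runtime_s=600, generator_start_s=30))
# A 45-second battery is too thin for the same generator.
print(power_gap_covered(ups_runtime_s=45, generator_start_s=30))
```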

Data Backups (OBJ 3.4)

Terms & Definitions(7)

Onsite Backup

Backup copies stored at the same physical location as the primary data; provides fast recovery but no protection against site-wide disasters.

Offsite Backup

Backup copies stored at a geographically separate location from the primary site; protects against site-level disasters such as fire, flood, or ransomware that encrypts local backups.

RPO (Recovery Point Objective)

The maximum acceptable amount of data loss measured in time; defines how old a backup can be when used for recovery. Lower RPO = more frequent backups required.

Snapshot

Point-in-time copy of a system or data set captured quickly, often used with virtual machines and storage systems; enables rapid rollback.

Replication

Continuous or near-continuous copying of data from a primary system to a secondary system, enabling near-zero data loss recovery.

Journaling

Records every change made to data in a sequential log (journal), enabling granular recovery to any point in time by replaying the journal entries.
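
Point-in-time recovery by journal replay can be sketched in a few lines. The entry format here (timestamp, key, value, with None marking a deletion) is an assumption for illustration; real journaling systems define their own record layouts.

```python
def replay(journal, until):
    """Rebuild state by applying journal entries with timestamp <= until."""
    state = {}
    for timestamp, key, value in sorted(journal, key=lambda entry: entry[0]):
        if timestamp > until:
            break  # stop at the requested recovery point
        if value is None:
            state.pop(key, None)  # None marks a deletion
        else:
            state[key] = value
    return state


journal = [
    (1, "balance", 100),
    (2, "owner", "alice"),
    (3, "balance", 250),
    (4, "owner", None),
]
state_at_2 = replay(journal, until=2)  # state just before the later changes
state_at_4 = replay(journal, until=4)  # latest state
```

Choosing a different until value recovers the data as it stood at that moment, which is the granularity a periodic backup alone cannot offer.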

Backup Encryption

Applies encryption to backup copies both at rest (storage) and in transit (transfer); prevents unauthorized access to backup data if physical media is lost or stolen.

Key Concepts(2)

3-2-1 Backup Strategy

Best practice: keep 3 copies of data, on 2 different media types, with 1 stored offsite. Ensures recovery even if onsite copies are destroyed.
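
The rule above is easy to express as a check over an inventory of copies. This sketch assumes each copy is described by a (media type, is offsite) pair and counts the production copy as one of the three, which is the common reading of the rule. The function name and sample inventories are illustrative.

```python
def meets_3_2_1(copies) -> bool:
    """copies: list of (media_type, is_offsite) tuples, one per copy of the data."""
    media_types = {media for media, _ in copies}
    offsite = sum(1 for _, is_offsite in copies if is_offsite)
    return len(copies) >= 3 and len(media_types) >= 2 and offsite >= 1


compliant = [("disk", False), ("disk", False), ("tape", True)]
single_media = [("disk", False), ("disk", False), ("disk", True)]
all_onsite = [("disk", False), ("tape", False), ("tape", False)]

print(meets_3_2_1(compliant), meets_3_2_1(single_media), meets_3_2_1(all_onsite))
```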

Backup tiers by recovery speed

Replication — near real-time, lowest RPO
Snapshot — minutes/hours granularity
Onsite backup — local restore, fast
Offsite backup — slowest restore, site-failure protection
Journaling — granular point-in-time recovery

Exam Tips(2)

RPO determines backup frequency

Scenario: data loss tolerance is 4 hours → Answer: backup interval must be ≤ 4 hours (RPO). RPO is about data loss tolerance, not recovery speed — that is RTO.
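
The relationship between RPO and backup frequency reduces to one comparison. In this sketch the worst-case data loss is taken to equal the backup interval, which ignores the time a backup takes to complete; the function name is illustrative.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst-case loss is one full interval, so the interval must not exceed the RPO."""
    return backup_interval <= rpo


rpo = timedelta(hours=4)
print(meets_rpo(timedelta(hours=4), rpo))  # every 4 hours: just meets the RPO
print(meets_rpo(timedelta(hours=6), rpo))  # every 6 hours: could lose 6 hours of data
```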

Ransomware-resistant backup design

Scenario: ransomware encrypts all local backups → Answer: restore from offsite backups. Air-gapped or immutable offsite copies keep ransomware from reaching the backups; encryption at rest protects their confidentiality but does not by itself stop ransomware.

Business Continuity and Disaster Recovery (OBJ 3.4)

Terms & Definitions(5)

BCP (Business Continuity Plan)

A plan covering preventive and recovery actions to maintain business operations during and after any disruptive event, including non-technical disruptions.

DRP (Disaster Recovery Plan)

A subset of the BCP focused specifically on restoring IT systems and operations after a declared disaster. BCP = any disruption; DRP = disaster specifically.

BCDR (Business Continuity and Disaster Recovery)

Combined term encompassing both the BCP and DRP; used when referring to the full continuity and recovery program together.

Business Continuity Committee

Cross-functional team representing technology, legal, security, communications, and other departments; develops and maintains the BCP under direction of senior management.

Senior Management Role in BCP

Senior management bears ultimate responsibility for BCP development; it sets risk appetite and scope, appoints the BC coordinator, and cannot delegate that ultimate responsibility.

Key Concepts(2)

BCP vs. DRP distinction

BCP covers any disruptive event (incident, strike, supply chain failure) and includes non-technical functions. DRP is narrower: IT recovery from a declared disaster. DRP is a component of BCP.

BCP scope and senior management

Scope must be defined by senior management to prevent scope creep; committee represents all business functions, not just IT, because continuity covers the entire organization.

Exam Tips(2)

BCP vs. DRP scenario differentiation

Scenario: hurricane destroys data center, restore IT systems → Answer: DRP. Scenario: key supplier fails, maintain all operations → Answer: BCP. DRP is always about IT recovery from disaster; BCP is broader.

Who owns the BCP?

Scenario: who is ultimately responsible for the BCP? → Answer: senior management. The BC coordinator leads the committee, but senior management sets goals, scope, and risk appetite.

Redundant Sites (OBJ 3.4)

Terms & Definitions(7)

Hot Site

A fully equipped and operational backup facility that can take over immediately with near-zero downtime; highest cost, fastest recovery.

Warm Site

A partially equipped backup facility with power, connectivity, and basic infrastructure; requires days to become fully operational. Balances cost and recovery time.

Cold Site

A backup facility with minimal infrastructure (building, power, basic utilities) but no installed equipment or network; takes weeks to months to activate. Lowest cost.

Mobile Site

A portable recovery facility (trailers, tents) that can be transported to any location; can be configured as hot, warm, or cold depending on equipment carried.

Virtual Site

A cloud-based redundant environment providing virtual hot, warm, or cold site capabilities; offers rapid scalability, cost efficiency, and geographic flexibility without physical facility costs.

Platform Diversity

Using different operating systems, hardware vendors, or cloud providers across primary and redundant sites to ensure a vulnerability or outage affecting one platform does not also compromise the backup.

Geographic Dispersion

Distributing systems, data, and personnel across multiple physical locations to reduce the impact of a regional disaster on overall operations.

Key Concepts(2)

Site type trade-offs

Hot site — instant failover, highest cost (duplicate infrastructure)
Warm site — days to activate, moderate cost
Cold site — weeks to months, lowest cost
Mobile site — portable, deployable anywhere
Virtual site — cloud-based, scalable, pay-as-needed

Platform diversity trade-off

Different platforms reduce shared-vulnerability risk but increase support complexity and training costs. Same-platform redundant sites are easier to manage but share vulnerabilities.

Exam Tips(3)

Site type scenario selection

Scenario: need immediate failover, cost is secondary → Answer: hot site. Scenario: minimize cost, accept weeks of downtime → Answer: cold site. Scenario: balance cost and recovery speed → Answer: warm site.
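
The scenario logic above maps tolerable downtime to a site type. The sketch below encodes it with hypothetical cut-offs (under 1 day, under 14 days, otherwise) that stand in for "immediate", "days", and "weeks to months". The thresholds are assumptions for illustration; a real decision would also weigh cost.

```python
def pick_site(max_downtime_days: float) -> str:
    """Suggest a recovery site type from the tolerable downtime.

    The thresholds (1 day, 14 days) are illustrative assumptions only.
    """
    if max_downtime_days < 0:
        raise ValueError("downtime tolerance cannot be negative")
    if max_downtime_days < 1:
        return "hot"   # needs near-immediate failover
    if max_downtime_days < 14:
        return "warm"  # can wait days for activation
    return "cold"      # can accept weeks of downtime


for tolerance in (0, 3, 30):
    print(tolerance, "days ->", pick_site(tolerance))
```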

Virtual site vs. cloud redundancy

Scenario: organization wants cloud-based disaster recovery without physical facility → Answer: virtual site. Virtual sites can replicate hot/warm/cold characteristics in cloud infrastructure.

Platform diversity purpose

Scenario: a critical vulnerability affects the network vendor used at the primary site → Answer: platform diversity at the redundant site, so the backup runs a different platform and is not exposed to the same vulnerability.

Resilience and Recovery Testing (OBJ 3.4)

Terms & Definitions(7)

Resilience Testing

Assessment of a system's ability to withstand and adapt to disruptive events without losing critical functionality.

Recovery Testing

Evaluation of a system's capacity to restore normal operations after a disruptive event, verifying that recovery plans and procedures work as expected.

Tabletop Exercise

A simulated scenario-based discussion among key stakeholders to assess and improve organizational preparedness for an emergency or crisis without deploying real resources.

Exercise Inject

A specific scenario or situation introduced during a tabletop exercise to prompt stakeholder response; simulates an event such as a detected breach or natural disaster.

Failover Test

A controlled experiment that verifies seamless transition from a primary system to a backup or secondary system, confirming that uninterrupted operations can be maintained during a real disaster.

Simulation

A computer-generated or artificial representation of a real-world system or scenario used to test how systems and personnel respond to failures or attacks in a virtualized environment.

Parallel Processing (resilience context)

Replicating data and system processes onto a secondary system and running both simultaneously to verify the secondary can handle the load without disrupting primary operations.

Key Concepts(2)

Four resilience and recovery testing methods

Tabletop exercise — discussion-based, low cost, no real deployment
Failover test — actual cutover to backup, verifies real-world transition
Simulation — virtualized environment, real-time responder actions
Parallel processing — primary and secondary run simultaneously, validates secondary capacity

Testing must be continuous

Resilience and recovery testing is not a one-time activity; plans must be regularly tested and updated as the threat environment and infrastructure evolve.

Exam Tips(3)

Testing method scenario selection

Scenario: low-cost, low-disruption rehearsal with stakeholders → Answer: tabletop exercise. Scenario: verify actual failover to hot site works → Answer: failover test. Scenario: test defenders against live attack in virtualized environment → Answer: simulation.

Tabletop vs. simulation distinction

Tabletop = paper-based discussion, no system actions taken. Simulation = actual technical actions in a virtualized or test environment. Failover test = real cutover to production backup systems.

Parallel processing purpose

Scenario: validate secondary system handles real production load → Answer: parallel processing. Both systems run concurrently; confirms secondary stability without stopping primary operations.
