Emergency SOP: What Custody Teams Should Do During a Major Cloud Provider Outage
When Cloudflare, AWS, or X goes dark, custody teams have minutes, not hours, to protect funds, preserve evidence, and keep clients informed.
Major cloud outages in late 2025 and early 2026 exposed a hard truth: custody operations that relied on a single public DNS, edge provider, or centralized signing service lost the ability to transact, communicate, or prove what happened. This runbook gives custody teams a pragmatic, step-by-step emergency SOP to execute during a Cloudflare/AWS/X outage. It combines operational controls, compliance steps, and client communication templates so you can move rapidly and defensibly.
Who should use this runbook
This document is written for custody operations teams at custodial and hybrid custody providers, security engineers, compliance officers, and incident commanders. It assumes you already have standard BCP and incident response policies but need a focused, actionable playbook for large-scale cloud/edge outages that affect connectivity, DNS, or third-party signing services.
Executive summary (do these first)
- Declare the outage and assign an Incident Commander within 5 minutes of detection.
- Protect wallets by enabling conservative transaction controls: impose temporary withdrawal throttles or pause high-risk operations if telemetry is incomplete.
- Failover communications to out-of-band channels and publish client notifications within 15 minutes.
- Preserve evidence by snapshotting system state and collecting logs immediately.
- Execute signing and routing failovers per predefined failover paths for HSM/MPC/cold wallets.
Detection and first 0-15 minutes
The first minutes set the tone for control and credibility. Speed and discipline matter more than perfect information.
- Alert validation: Confirm outage with at least two independent signals: internal monitoring (synthetic transactions), third-party outage trackers, and direct testing from a separate network (cellular or a different cloud region).
- Declare incident: Incident Commander (IC) activates an incident room and records the incident start time. Use a dedicated incident channel that does not rely on the affected provider. Make sure your on-call workflows include out-of-band comms and portable-power contingencies.
- Initial severity: Assign severity level (P1/P2) based on whether signing or withdrawal flows are affected. If signing services are unreachable, treat as P1.
- Immediate protective action: If telemetry or third-party confirmations are incomplete, automatically throttle withdrawals and high-value transactions using an automated circuit-breaker.
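The throttle decision above can be sketched as a small circuit-breaker function. The policy names, thresholds, and the `circuit_breaker` signature here are illustrative assumptions for this runbook, not a prescribed implementation:

```python
from dataclasses import dataclass

@dataclass
class WithdrawalPolicy:
    mode: str        # "normal", "throttled", or "paused"
    hourly_cap: int  # max withdrawals per hour under this mode

def circuit_breaker(signer_reachable: bool, telemetry_complete: bool,
                    outage_confirmations: int) -> WithdrawalPolicy:
    """Degrade conservatively: missing or alarming signals push toward
    'paused'. Thresholds and caps are placeholders, not prescribed values."""
    if not signer_reachable:
        # Signing path down: stop withdrawals entirely (see decision matrix).
        return WithdrawalPolicy("paused", 0)
    if not telemetry_complete or outage_confirmations >= 2:
        # Incomplete telemetry or a confirmed outage: throttle hard.
        return WithdrawalPolicy("throttled", 10)
    return WithdrawalPolicy("normal", 10_000)
```

A breaker like this should run automatically on detection, so the conservative posture is in place before the Incident Commander makes a considered decision.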
Triage and containment (15-60 minutes)
Focus on containment and early decisions that minimize exposure without disrupting essential services for unaffected clients.
- Scope the impact: Which regions, services, customers, signing services, DNS, or APIs are affected? Create a quick matrix listing wallet clusters, signer types (HSM, MPC, cold), and the UI/API endpoints impacted.
- Switch to alternate routes: Use preconfigured secondary DNS, direct RPC endpoints, or peer-to-peer messaging for critical on-chain activity. Only use alternate routes that are tested and logged.
- Engage third-party suppliers: Notify custodial partners, MPC providers, and HSM vendors. Activate vendor emergency SLAs and failover procedures. Get estimated recovery times and redundant signing availability.
- Pause risky processes: Temporarily halt queued automatic sweeps, rebalancing algorithms, and smart-contract interactions that require multi-hop workflows and could fail mid-flight.
Decision matrix: When to pause withdrawals
Use this conservative matrix in the first hour.
- If signer connectivity is degraded or signing audit trails are unavailable: Pause withdrawals.
- If only client UI or public site is down but signing and ledger reconciliation are intact: Throttled withdrawals with manual approval for high-value ops.
- If only analytics or dashboarding is affected: No pause, but notify clients and continuously verify settlement state with internal wallets.
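The matrix above can be encoded directly so it is applied the same way under pressure every time. This is a minimal sketch; the condition names and the returned action labels are assumptions chosen for readability:

```python
def withdrawal_decision(signer_degraded: bool, audit_trail_missing: bool,
                        ui_or_site_down: bool, analytics_down: bool) -> str:
    """First-hour decision matrix; checks run in order of severity,
    so the most conservative applicable action always wins."""
    if signer_degraded or audit_trail_missing:
        return "pause-withdrawals"
    if ui_or_site_down:
        return "throttle-with-manual-approval"
    if analytics_down:
        return "no-pause-notify-and-verify"
    return "normal-operations"
```

Encoding the matrix also makes it testable in tabletop drills: feed it the scenario inputs and confirm the team's intuition matches the written policy.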
Failover procedures for signing and wallet access
Custody providers use several signing architectures. The runbook below provides failover steps for the three most common setups.
HSM-backed signing
- Attempt connection to secondary HSM endpoint in a different cloud region or on-premises appliance.
- Enable read-only mode on primary HSM to preserve logs; do not force a primary takeover unless your vendor supports safe failover.
- If the primary HSM is unreachable, initiate emergency signing on an approved backup HSM. Log the key identifiers used, operator IDs, and time range for all emergency signing events.
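The endpoint-by-endpoint failover above, with an attempt log for the audit trail, might look like the following sketch. The `connect` probe and the endpoint names are hypothetical placeholders for whatever your HSM client exposes:

```python
from typing import Callable, Optional

def hsm_failover(endpoints: list[str],
                 connect: Callable[[str], bool]
                 ) -> tuple[Optional[str], list[tuple[str, bool]]]:
    """Try HSM endpoints in preference order (primary, secondary region,
    on-prem appliance) and record every attempt for the audit trail."""
    attempts: list[tuple[str, bool]] = []
    for endpoint in endpoints:
        ok = connect(endpoint)
        attempts.append((endpoint, ok))
        if ok:
            return endpoint, attempts  # first healthy endpoint wins
    return None, attempts              # nothing reachable: escalate to the IC
```

Returning the attempt log alongside the chosen endpoint means the evidence-preservation step gets a complete record even when failover succeeds on the first try.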
MPC providers
- Contact MPC provider emergency desk and confirm shard availability. Some MPC vendors support emergency quorum reconfiguration; follow their signed playbook.
- Use delegated fallback signers only if pre-approved by legal and insurance teams.
- Record shard identifiers and operator actions, and produce signed attestations for later audit.
Cold and air-gapped wallets
- If hot infrastructure is compromised, coordinate secure transfer of signing responsibilities to cold signers under approved custody workflows.
- Perform manual signing under dual-control procedures and record video/timestamped audit trail where legally permissible.
Communications and client notifications
Transparent, timely, and measured communications maintain client trust and reduce support load. Use tiered messages and out-of-band channels.
Channels
- Primary: Email to account owners, with encrypted attachments for sensitive data.
- Secondary: SMS and authenticated push notifications for urgent account-level actions.
- Public status: A status page hosted with an unaffected provider, or a static page anchored to a public blockchain or IPFS for censorship resistance.
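A tamper-evident status anchor starts with a canonical hash of each update; the digest can then be published to a ledger or IPFS (the publishing step is out of scope here). This sketch assumes status updates are simple JSON-serializable dictionaries:

```python
import hashlib
import json

def anchor_status(update: dict) -> str:
    """Canonicalize a status update (sorted keys, compact separators) and
    return its SHA-256 digest for external anchoring."""
    canonical = json.dumps(update, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Canonicalization matters: two updates with the same fields in different order must produce the same anchor, or the public timeline becomes unverifiable.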
Notification cadence and templates
Send short, factual updates. Do not speculate. Use the templates below and include a link to the status page.
Initial notice (within 15 minutes): "We are experiencing a service disruption affecting account view and transaction processing due to a third-party cloud/edge outage. Wallet custody is intact. We are executing failover procedures and will provide updates hourly. For urgent issues, contact our incident desk."
Hourly update: "Update: impact remains limited to web access and API endpoints. Signing systems are currently operating on backup paths. Withdrawals are (paused/throttled). Next update in 60 minutes. Status page: [status page link]."
Resolution note: "Service has been restored. We are reconciling transactions and will publish a detailed incident report within 72 hours. No unauthorized access to client assets has been detected. Contact our support desk with further questions."
Evidence preservation and audit trail
Regulators and auditors will want a verifiable timeline and untouched logs. Preserve everything immediately.
- Immutable snapshots: Take write-protected snapshots of VMs, containers, database backups, and ledger states. Use offline storage where possible.
- Collect logs: Export application logs, HSM/MPC audit trails, Cloudflare edge logs, DNS records, and network flow logs to a secure, append-only repository.
- Chain of custody: For every exported artifact, record who exported it, command used, timestamps, and verification hashes. Store signatures from at least two senior operators.
- Time synchronization: Record NTP sources and ensure all logs are correlated to a single trusted timebase for forensic analysis.
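The chain-of-custody step above can be automated with a manifest builder that hashes each exported artifact. This is a sketch under stated assumptions: the function name and manifest schema are invented for illustration, and real deployments would attach cryptographic signatures rather than just recording operator names:

```python
import hashlib
import json
import time
from pathlib import Path

def build_manifest(artifacts: list[Path], exported_by: list[str]) -> str:
    """Produce a chain-of-custody manifest: file name, SHA-256 digest,
    export timestamp (UTC), and the operators responsible for the export."""
    entries = [{
        "file": p.name,
        "sha256": hashlib.sha256(p.read_bytes()).hexdigest(),
        "exported_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    } for p in artifacts]
    return json.dumps({"exported_by": exported_by, "artifacts": entries},
                      indent=2)
```

The manifest itself should be written to the same append-only repository as the artifacts, and its own hash recorded by the two senior operators who sign off.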
Regulatory and legal checklist
In 2026 regulators expect prompt, documented reporting from custody providers. Follow your jurisdictional obligations and keep legal in the loop early.
- Notify compliance and legal teams within 15 minutes of incident declaration.
- Determine notification thresholds for regulators and law enforcement. Prepare required forms and draft reports.
- Preserve customer PII and encrypted keys per data protection obligations.
- Coordinate with insurance carriers if there's potential financial loss or coverage trigger.
Forensics and root cause analysis (RCA)
After containment, transition to deeper analysis and remediation.
- Run integrity checks on signing keys and perform end-to-end reconciliation of on-chain balances.
- Engage external forensics providers for cryptographic validation if needed.
- Document the RCA and map every finding to remediation and control improvements.
Post-incident actions and continuous improvement
- Incident report: Publish a customer-facing incident report within the SLA window agreed with clients and regulators, and an internal full RCA with remediation timelines.
- SOP updates: Incorporate lessons learned into playbooks, run additional tabletop exercises, and update the trusted provider list based on performance during the outage.
- Testing: Validate failover paths at least quarterly. Include live failover drills for HSM/MPC and DNS switchover tests that do not endanger keys.
Advanced strategies and 2026 trends you must adopt
Late 2025 outages accelerated several trends in custody architecture. Implementing these reduces single points of failure and prepares you for stricter regulator expectations in 2026.
- Multi-edge and multi-cloud: Distribute critical services across multiple edge providers and cloud regions with automated health checks and DNS failover policies.
- Decentralized status anchors: Publish tamper-evident status updates on immutable platforms (blockchain or IPFS) to preserve a public timeline if centralized status pages are unreachable.
- On-device MPC and hardware enforcement: Shift toward cryptographic designs that reduce dependence on a single signing endpoint; prefer MPC with geographically separated shards and enforced legal controls.
- Regulatory collaboration: Maintain pre-established reporting templates for major jurisdictions and simulate regulator reporting in drills.
- Insurer-aligned controls: Ensure your failover procedures and audit trails meet insurance policy conditions for outage and theft coverage.
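The multi-edge health checks above work best when "healthy" requires agreement from independent probes, echoing the two-signal rule from the detection phase. This is a minimal sketch; the probe names, quorum of 2, and endpoint labels are assumptions:

```python
from typing import Optional

def quorum_health(probe_results: dict[str, bool], quorum: int = 2) -> bool:
    """An endpoint counts as up only if at least `quorum` independent probes
    (e.g. internal synthetic, external tracker, cellular path) agree."""
    return sum(probe_results.values()) >= quorum

def choose_endpoint(candidates: dict[str, dict[str, bool]],
                    quorum: int = 2) -> Optional[str]:
    """Return the first candidate edge/cloud endpoint, in preference order,
    that passes quorum health; None means all paths are down."""
    for name, probes in candidates.items():
        if quorum_health(probes, quorum):
            return name
    return None
```

A `None` result is itself actionable: it means every preconfigured path failed its checks, which should trigger the P1 incident path rather than a silent retry loop.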
Checklist: Emergency runbook at a glance
- Incident declared and IC assigned: 0-5 minutes
- Initial client notification sent: 0-15 minutes
- Withdrawal decision made: 15-30 minutes
- Failover signing executed (if needed): 15-60 minutes
- Evidence snapshot and log export: 0-60 minutes
- Regulatory/legal notified: 15-60 minutes
- Hourly client updates until resolution
- Full incident report and RCA published: within SLA window (e.g., 72 hours)
Templates and one-click artifacts to prepare now
- Pre-approved client notification templates for initial, hourly, and resolution updates.
- Pre-signed delegation letters for emergency signing and transfer, stored in a vault with dual-control access.
- Automated scripts to export logs and create immutable snapshots to an out-of-band storage location, preserving provenance and verification hashes during export.
- Failover runbooks for HSM, MPC, and cold-wallet manual signing with operator checklists.
Practice the process, don’t perfect the panic. Teams that rehearse realistic cloud outage scenarios recover faster, provide clearer communications, and reduce client churn.
Final notes and call-to-action
Cloud/edge outages are now a predictable part of the custody risk landscape in 2026. The difference between a reputation-damaging event and a minor operational hiccup is how quickly and transparently you act. Use this runbook to codify decisions, preserve auditability, and protect client assets under pressure.
If you need hardened incident templates, audited failover scripts, or a tabletop exercise tailored to your HSM/MPC architecture, contact vaults.top for expert custody runbook engineering and hands-on drills that meet 2026 regulatory expectations.