AWS · Disaster Recovery · EBS · Incident Response

Recovering EBS Snapshots with ColdSnap: A Real DR Story from the Field

21 snapshots, ~5 TB of data, a degraded region, and every conventional recovery path failing one by one. Here's the tool that finally worked — and exactly how we used it.

Vishnu Rachapudi
Cloud & AI Engineer · AWS Community Builder (Security)
April 2026
10 min read

During a regional outage we tried snapshot copy — failed. Cross-region AMI copy — failed. VM Export — failed. Then we tried ColdSnap. It worked. This post covers both scenarios it solved: emergency snapshot recovery during an active outage, and exporting encrypted AMIs to on-premises when the conventional VM Export path was blocked.

In this post
  1. The incident — what failed and why
  2. What ColdSnap is and how it works
  3. Setting up the recovery environment
  4. Downloading and uploading 21 snapshots
  5. VM export — bypassing the AWS managed key restriction
  6. Data validation, resilience, and lessons learned
Section 01

The Incident

A production platform running across multiple EBS volumes — application data, database, analytics — snapshotted regularly into a single region. The region degraded. Volume creation from snapshots started hanging indefinitely. We had 21 snapshots and no reliable way to restore from them.

We tried the obvious paths in order: snapshot copy timed out, cross-region AMI copy failed, VM Export was blocked due to encryption. The AWS console was intermittently inaccessible. That's when we reached for ColdSnap.

What failed
- CreateVolume from snapshot — hanging
- Cross-region snapshot copy — timed out
- Cross-region AMI copy — failed
- VM Export to S3 — blocked by KMS key type

Why ColdSnap worked
- Uses EBS Direct APIs — a different service path
- No volume creation needed at all
- Checkpoint/resume on failure
- Works from any healthy region

EBS volume operations (CreateVolume, AttachVolume) and EBS Direct APIs (GetSnapshotBlock, ListSnapshotBlocks) are different internal service paths. When one degrades, the other often stays up. ColdSnap exclusively uses Direct APIs.
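One way to check this split during an incident (a sketch — the snapshot ID is a placeholder, and the `DRY_RUN` guard is ours so the script can be rehearsed without credentials): if `list-snapshot-blocks` answers while `CreateVolume` hangs, the Direct API path ColdSnap depends on is still healthy.

```shell
# Health probe for the EBS Direct API path (snapshot ID is hypothetical).
# DRY_RUN=1 prints the command; set DRY_RUN=0 to actually call AWS.
SNAP=snap-0123456789abcdef0
DRY_RUN=${DRY_RUN:-1}
probe="aws ebs list-snapshot-blocks --snapshot-id $SNAP --max-results 100"
if [ "$DRY_RUN" = 1 ]; then
  echo "would run: $probe"
else
  $probe >/dev/null && echo "Direct API path healthy"
fi
```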

Section 02

What is ColdSnap?

ColdSnap is an open-source CLI tool from AWS Labs, written in Rust. It uses EBS Direct APIs to read snapshot blocks directly and reconstruct them as a raw disk image on local disk — no volume, no attachment, no dependency on the EBS control plane.

🔗 github.com/awslabs/coldsnap — open source, AWS Labs maintained

How a download works:

1. 📋 ListSnapshotBlocks — gets all block indices in the snapshot via the EBS Direct API.
2. ⬇️ GetSnapshotBlock (parallel) — downloads 512 KiB blocks in parallel, decrypting transparently regardless of KMS key.
3. 💾 Checkpoint written per block — progress is saved locally, so any failure resumes from the last checkpoint with zero re-download.
4. 🗂️ Raw .img reconstructed on disk — an unencrypted disk image, ready to mount, upload, or convert to VMDK.

What that unlocks:

🔓 Decrypt KMS snapshots — output is always unencrypted raw data regardless of the source key type.
🖥️ VM Export / AMI rebuild — bypasses the AWS managed key restriction that blocks native VM Export.
🌍 Cross-region recovery — download locally, re-upload as a new snapshot when the copy APIs are unavailable.
Section 03

Setting Up the Recovery Environment

Launch an EC2 instance in a healthy region with at least 2× the size of the largest batch of snapshots you plan to process at once. We used a 2TB gp3 root volume and worked through the ~5TB of snapshots in batches.
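The sizing can be sanity-checked with shell arithmetic (the GiB figures below are illustrative, not our actual volume list):

```shell
# Rough sizing for the recovery instance: budget ~2x the batch you
# download at once — the raw .img files plus headroom for staging
# conversions and re-uploads.
batch_gib=0
for size_gib in 500 500 1000; do   # one example batch of three snapshots
  batch_gib=$((batch_gib + size_gib))
done
need_gib=$((batch_gib * 2))
echo "batch=${batch_gib} GiB, provision >= ${need_gib} GiB"
```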

Grow the partition and install ColdSnap

# Grow partition to full EBS volume capacity
sudo yum install cloud-utils-growpart -y
sudo growpart /dev/nvme0n1 1 && sudo xfs_growfs -d /
sudo mkdir -p /data

# Install dependencies
sudo yum install git curl gcc make screen -y
sudo yum groupinstall "Development Tools" -y
sudo yum install gcc openssl-devel cmake -y

# Install Rust toolchain
curl https://sh.rustup.rs -sSf | sh -s -- -y
source $HOME/.cargo/env

# Add cargo to PATH permanently
echo 'export PATH=$PATH:$HOME/.cargo/bin' >> ~/.bashrc
source ~/.bashrc

# Build and install ColdSnap from source
cargo install --git https://github.com/awslabs/coldsnap --branch develop --locked

# Set source region
export AWS_REGION="ap-south-1"

IAM policy — attach to EC2 role, no access keys

📄 iam-policy.json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "ebs:List*", "ebs:Get*", "ebs:Start*",
      "ebs:PutSnapshotBlock", "ebs:CompleteSnapshot"
    ],
    "Resource": "*"
  }]
}

Always use screen — downloads take hours

screen -S coldsnap          # create session
# Ctrl+A then D             # detach, keep running
screen -ls                  # list sessions
screen -r coldsnap          # reconnect by name
screen -rd 41725            # reconnect by ID
Section 04

Downloading and Uploading 21 Snapshots

The download command

coldsnap download \
  --checkpoint \
  --keep-checkpoint \
  --force \
  snap-09eaa6e71a2896666 \
  /data/app-data-vol3.img
--checkpoint — save progress block-by-block; resume from the exact position on any failure
--keep-checkpoint — retain the checkpoint after completion for re-runs and verification
--force — overwrite an existing output file; clean resume without manual cleanup
root@ip-10-110-20-183:/data# coldsnap download --checkpoint --keep-checkpoint --force snap-08d24744706ef4847 /data/app-data-vol3.img
  Downloading [=>                    ] 1777/1048576 (2h)
Download in progress — 1777/1048576 blocks · ~2h remaining for this 512 GiB snapshot
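The numbers on the progress bar map directly to bytes — blocks here are 512 KiB, so the totals can be cross-checked with shell arithmetic:

```shell
# 1048576 blocks x 512 KiB = 512 GiB: the bar's denominator is the
# snapshot size, its numerator the blocks already on disk.
done_blocks=1777
total_blocks=1048576
block_kib=512
total_gib=$((total_blocks * block_kib / 1024 / 1024))
done_mib=$((done_blocks * block_kib / 1024))
echo "snapshot=${total_gib} GiB, downloaded=${done_mib} MiB"
```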

Handling failed blocks — just re-run

This error appeared multiple times during peak degradation. It is transient, not data corruption. Re-running the same command resumes from checkpoint and retries only the failed blocks.

Failed to download snapshot: Failed to get 166 blocks for snapshot
'snap-08d24744706ef4847': blocks [3915, 6625, 21231, 21524...]

# Fix: re-run the exact same command — checkpoint skips completed blocks
coldsnap download --checkpoint --keep-checkpoint --force \
  snap-08d24744706ef4847 /data/app-data-vol3.img
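Re-running by hand works, but during a long outage a retry loop saves babysitting. A minimal sketch — the `flaky` stub below stands in for the real coldsnap command (it fails twice, then succeeds) so the loop is runnable anywhere:

```shell
# Retry until success — checkpoint/resume makes blind re-runs safe.
retry() {
  n=0
  until "$@"; do
    n=$((n + 1))
    echo "attempt $n failed — resuming from checkpoint" >&2
    sleep 1
  done
}

# Stub in place of: coldsnap download --checkpoint --keep-checkpoint --force snap-... out.img
state=$(mktemp)
echo 0 > "$state"
flaky() { c=$(cat "$state"); c=$((c + 1)); echo "$c" > "$state"; [ "$c" -ge 3 ]; }

retry flaky && echo "download complete"
```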

Script for parallel downloads

#!/bin/bash
SNAPSHOTS=(
  "snap-09eaa6e71a2896666:app-data-vol3"
  "snap-0606fc81f5eca3697:db-main-vol3"
  # ... all 21 snapshot IDs
)
for entry in "${SNAPSHOTS[@]}"; do
  snap_id="${entry%%:*}" ; vol_name="${entry##*:}"
  screen -dmS "dl-${vol_name}" bash -c \
    "coldsnap download --checkpoint --keep-checkpoint --force ${snap_id} /data/${vol_name}.img; exec bash"
done

watch -n 1 df -h  # monitor disk continuously

Upload to healthy region

root@ip-10-110-20-183:/data# coldsnap upload /data/app-data-vol3.img
  Uploading [=>                    ] 1777/1048576 (2h)

root@ip-10-110-20-183:/data# coldsnap upload /data/db-main-vol3.img
  Uploading [=========>           ] 97830/524288 (31m)
Two parallel uploads — app-data-vol3 and db-main-vol3 running in separate screen sessions
export AWS_REGION="eu-west-1"  # switch to target region
coldsnap upload /data/app-data-vol3.img
# Prints new snapshot ID on completion — log all 21
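With 21 snapshots you want that ID logging automated. A sketch — the `coldsnap` shell function below is a stub standing in for the real binary (which prints the new snapshot ID on completion), so the loop runs anywhere; delete the stub to call the actual tool:

```shell
# Batch upload with an ID log (sketch; stub replaces the real binary).
data=$(mktemp -d); log="$data/upload-log.txt"
touch "$data/app-data-vol3.img" "$data/db-main-vol3.img"   # stand-in images
coldsnap() { echo "snap-stub-$(basename "$2" .img)"; }     # stub: prints a fake snapshot ID

for img in "$data"/*.img; do
  printf '%s -> ' "$(basename "$img")" >> "$log"
  coldsnap upload "$img" >> "$log"
done
cat "$log"
```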
Section 05

VM Export — Bypassing the AWS Managed Key Restriction

A separate but related scenario: a customer needed to move 3 VMs (2 snapshots each) from AWS to on-premises as local DR copies. AWS VM Export was the obvious path — export AMI to S3 as VMDK, download, done.

It kept failing. Root cause: the AMIs were encrypted with the AWS managed EBS key (aws/ebs). VM Export only supports customer-managed KMS keys. We raised a support ticket — AWS confirmed it and suggested manually re-encrypting every snapshot with a CMK first. We used ColdSnap instead.

💡 ColdSnap downloads raw decrypted blocks regardless of what KMS key the source was encrypted with. The .img output is always unencrypted on disk. Upload it back with --kms-key-id to re-encrypt with a CMK — then VM Export works.

The full flow — 3 VMs, 6 snapshots, on-premises delivery

1. 🔐 Six snapshots encrypted with the aws/ebs key — VM Export blocked, but ColdSnap download proceeds without issue.
2. ⬇️ coldsnap download → six raw .img files — a script loops over all six; decrypted transparently, unencrypted output on disk.
3. ⬆️ coldsnap upload --kms-key-id → six new CMK snapshots — re-encrypted with the customer-managed key and tagged per VM.
4. 🖼️ Reconstruct AMIs from the new CMK snapshots — register each root snapshot as an AMI and attach the data volumes.
5. 📦 AWS VM Export → S3 as VMDK — now works, since a CMK-encrypted source is supported.
6. The customer downloads the VMDKs from S3 and restores them directly on the local hypervisor.

Upload options reference

# Re-encrypt with CMK on upload
coldsnap upload disk.img \
  --kms-key-id arn:aws:kms:ap-south-1:111111111111:key/1a2b3c4d-5e6f-1a2b-3c4d-5e6f1a2b3c4d

# Tag at upload time — no separate API call needed
coldsnap upload disk.img \
  --tag "Key=VM,Value=vm1" \
  --tag "Key=Volume,Value=root"

# Skip zero blocks — smaller upload, but avoid for encrypted targets
coldsnap upload disk.img --omit-zero-blocks

⚠️ Avoid --omit-zero-blocks when the target snapshot will be encrypted. Apps that write zeroes expect to read them back — missing zero blocks cause unexpected read behaviour.

The Windows instance quirk

Both scenarios shared this gotcha: when you spin up an instance from a ColdSnap-recovered volume, AWS defaults the platform to Linux in the console — even if the disk is Windows. SSM connects and you see a bash prompt, but you're actually in PowerShell.

Workaround: launch a fresh Windows instance from an official AWS Windows AMI → stop it → detach its root volume → attach your recovered volume in its place → start. AWS retains the Windows platform metadata and everything behaves correctly.
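The workaround as AWS CLI calls (a sketch — all IDs are hypothetical, and the `DRY_RUN` guard is ours so the sequence can be rehearsed without touching an account):

```shell
# Volume swap to preserve Windows platform metadata.
# DRY_RUN=1 prints the plan; set DRY_RUN=0 to execute for real.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "aws $*"; else aws "$@"; fi; }

DONOR=i-0aaaaaaaaaaaaaaaa        # fresh instance from an official Windows AMI
DONOR_ROOT=vol-0bbbbbbbbbbbbbbbb # its root volume (look up via describe-volumes)
RECOVERED=vol-0cccccccccccccccc  # volume created from the ColdSnap snapshot

run ec2 stop-instances --instance-ids "$DONOR"
run ec2 wait instance-stopped --instance-ids "$DONOR"
run ec2 detach-volume --volume-id "$DONOR_ROOT"
run ec2 attach-volume --volume-id "$RECOVERED" --instance-id "$DONOR" --device /dev/sda1
run ec2 start-instances --instance-ids "$DONOR"
```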

Section 06

Data Validation, Resilience, and Lessons

Understanding the checkpoint files

When you run ColdSnap with --checkpoint, it creates two files alongside your download — not just the final .img. Knowing what each file means tells you immediately whether the download is complete or not.

-rw-r--r-- 1 root root      32437471 Apr 11 03:52 app-data-vol3.img.coldsnap-progress
-rw-r--r-- 1 root root 2199023255552 Apr 11 03:52 app-data-vol3.img.partial
app-data-vol3.img.partial
  Download still in progress or interrupted — some blocks are missing, so the file is a partial write. Do not use as-is.
  Action: re-run the same download command; ColdSnap resumes from checkpoint.
app-data-vol3.img.coldsnap-progress
  The checkpoint index file — records which blocks have been successfully written, used by ColdSnap on resume to skip completed blocks.
  Action: do not delete while a download is running. Safe to inspect — it's a binary index.
app-data-vol3.img
  Download completed successfully — all blocks written, and the .partial file has been renamed to the final output name.
  Action: proceed to validation. Still not production-ready without verification.

⚠️ If you see .img.partial — there are missing blocks. Do not mount, upload, or restore from a .partial file. Re-run the download command first. The checkpoint will handle the rest.
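Auditing a download directory for stragglers is a one-liner pair — demonstrated here on stand-in files so the check is runnable anywhere:

```shell
# Count completed images vs interrupted ones — anything still *.partial
# needs its download command re-run before use.
data=$(mktemp -d)
touch "$data/a.img" "$data/b.img" "$data/c.img.partial"   # stand-in files
completed=$(find "$data" -name '*.img' | wc -l)
pending=$(find "$data" -name '*.img.partial' | wc -l)
echo "completed=$completed pending=$pending"
```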

Tracking progress with the .coldsnap-progress file

The .coldsnap-progress file also lets you track how many blocks have been downloaded so far during a running download. You can monitor it alongside df -h to understand both how far along you are and how much disk space is being consumed.

# The progress file grows as blocks are recorded — track its size in bytes
wc -c app-data-vol3.img.coldsnap-progress

# Watch file size growing (confirms blocks are being written)
watch -n 5 'ls -lh /data/*.partial /data/*.img 2>/dev/null'

This is how the block-level copy actually works: ColdSnap writes 512KB blocks one by one, recording each in the progress file. The .partial file grows as blocks land on disk. When all blocks are accounted for, the rename happens and you get a clean .img.

Why a completed .img still needs validation

A clean .img file — no .partial, no missing blocks — is not the same as a verified copy. During an active outage, some blocks may have been retried multiple times before succeeding, and nothing in the filename tells you that every retried block landed intact. In most cases the data is fine. But for critical workloads you need to verify before restoring.

⚠️ Never treat a ColdSnap-recovered .img as production-ready without validation — even if the filename shows no .partial. This is especially critical for databases and transactional systems.
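What validation can look like in practice, sketched on a stand-in file: checksum the image so every later copy can be verified bit-for-bit (for real images, also consider a read-only `fsck` via a loop device, or mounting read-only and spot-checking files):

```shell
# Record a checksum of the recovered image, then verify it — repeat the
# verify step after every copy or transfer.
img=$(mktemp)
dd if=/dev/zero of="$img" bs=1024 count=64 2>/dev/null   # stand-in for the real .img
sha256sum "$img" > "$img.sha256"
sha256sum -c "$img.sha256" && echo "image checksum verified"
```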

The bottom line

ColdSnap is a last resort that genuinely works when everything else fails. We tried snapshot copy, cross-region AMI copy, and VM Export before landing here — all three failed. ColdSnap succeeded, but at scale (~5TB, 21 snapshots) it took significant time and required multiple resumes per snapshot. Add it to your DR runbook before you need it. Know it's slow. Check your .partial files. Validate everything it produces.

References

ColdSnap GitHub — https://github.com/awslabs/coldsnap
AWS EBS Direct APIs — https://docs.aws.amazon.com/ebs/latest/userguide/ebs-accessing-snapshot.html
AWS EBS Snapshots — https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSSnapshots.html