21 snapshots, ~5 TB of data, a degraded region, and every conventional recovery path failing one by one. Here's the tool that finally worked — and exactly how we used it.
During a regional outage we tried snapshot copy — failed. Cross-region AMI copy — failed. VM Export — failed. Then we tried ColdSnap. It worked. This post covers both scenarios it solved: emergency snapshot recovery during an active outage, and exporting encrypted AMIs to on-premises when the conventional VM Export path was blocked.
A production platform running across multiple EBS volumes — application data, database, analytics — snapshotted regularly into a single region. The region degraded. Volume creation from snapshots started hanging indefinitely. We had 21 snapshots and no reliable way to restore from them.
We tried the obvious paths in order: snapshot copy timed out, cross-region AMI copy failed, VM Export was blocked due to encryption. The AWS console was intermittently inaccessible. That's when we reached for ColdSnap.
EBS volume operations (CreateVolume, AttachVolume) and EBS Direct APIs (GetSnapshotBlock, ListSnapshotBlocks) are different internal service paths. When one degrades, the other often stays up. ColdSnap exclusively uses Direct APIs.
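You can confirm that split during an incident by probing the data path directly. A minimal sketch, assuming the AWS CLI v2 (which exposes the `ebs` service commands) and `ebs:ListSnapshotBlocks` permission on the snapshot:

```shell
# direct_api_reachable: probe the EBS Direct API data path for a snapshot,
# independent of the volume control plane (sketch; assumes aws CLI v2)
direct_api_reachable() {
  local snap_id="$1"
  aws ebs list-snapshot-blocks --snapshot-id "${snap_id}" \
    --max-results 100 > /dev/null
}

# usage:
# direct_api_reachable snap-09eaa6e71a2896666 && echo "Direct APIs responding"
```

If this call returns while CreateVolume hangs, ColdSnap has a path to your data.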
ColdSnap is an open-source CLI tool from AWS Labs, written in Rust. It uses EBS Direct APIs to read snapshot blocks directly and reconstruct them as a raw disk image on local disk — no volume, no attachment, no dependency on the EBS control plane.
🔗 github.com/awslabs/coldsnap — open source, AWS Labs maintained
Launch an EC2 instance in a healthy region with at least 2× the snapshot size in disk capacity. We used a 2TB gp3 root volume for ~5TB of snapshots processed in batches.
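With less local disk than total snapshot data, batching is unavoidable. A quick way to sanity-check how many full-size images fit per batch before you must upload and clear space (a sketch; the sizes and the 2× headroom rule reflect our setup, not a ColdSnap requirement):

```shell
# fits_in_batch: given local disk capacity (GiB) and a list of per-snapshot
# sizes (GiB), count how many full-size raw images fit in one batch
fits_in_batch() {
  local capacity_gib="$1"; shift
  local total=0 count=0
  for size in "$@"; do
    total=$(( total + size ))
    [ "${total}" -gt "${capacity_gib}" ] && break
    count=$(( count + 1 ))
  done
  echo "${count}"
}

# e.g. five 500 GiB snapshots against a 2000 GiB data disk:
# fits_in_batch 2000 500 500 500 500 500   # prints 4
```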
# Grow partition to full EBS volume capacity
sudo yum install cloud-utils-growpart -y
sudo growpart /dev/nvme0n1 1 && sudo xfs_growfs -d /
sudo mkdir -p /data
# Install build dependencies (gcc and make come with Development Tools)
sudo yum groupinstall "Development Tools" -y
sudo yum install git curl screen openssl-devel cmake -y
# Install Rust toolchain
curl https://sh.rustup.rs -sSf | sh -s -- -y
source $HOME/.cargo/env
# Add cargo to PATH permanently
echo 'export PATH=$PATH:$HOME/.cargo/bin' >> ~/.bashrc
source ~/.bashrc
# Build and install ColdSnap from source
cargo install --git https://github.com/awslabs/coldsnap --branch develop --locked
# Set source region
export AWS_REGION="ap-south-1"
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "ebs:List*", "ebs:Get*", "ebs:Start*",
      "ebs:PutSnapshotBlock", "ebs:CompleteSnapshot"
    ],
    "Resource": "*"
  }]
}
screen -S coldsnap # create session
# Ctrl+A then D # detach, keep running
screen -ls # list sessions
screen -r coldsnap # reconnect by name
screen -rd 41725 # reconnect by ID
coldsnap download \
--checkpoint \
--keep-checkpoint \
--force \
snap-09eaa6e71a2896666 \
/data/app-data-vol3.img
| Flag | Purpose |
|---|---|
| --checkpoint | Save progress block-by-block — resume from exact position on any failure |
| --keep-checkpoint | Retain checkpoint after completion for re-runs and verification |
| --force | Overwrite existing output file — clean resume without manual cleanup |
This error appeared multiple times during peak degradation. It is transient, not data corruption. Re-running the same command resumes from checkpoint and retries only the failed blocks.
Failed to download snapshot: Failed to get 166 blocks for snapshot
'snap-08d24744706ef4847': blocks [3915, 6625, 21231, 21524...]
# Fix: re-run the exact same command — checkpoint skips completed blocks
coldsnap download --checkpoint --keep-checkpoint --force \
snap-08d24744706ef4847 /data/app-data-vol3.img
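Since the fix for transient block failures is always "run it again", the retry can be automated. A sketch that wraps any checkpoint-resumable command (the 30-second default delay is our choice, not ColdSnap's):

```shell
# retry_until_success: re-run a checkpoint-resumable command until it
# exits 0, pausing between attempts (sketch; delay is configurable)
RETRY_DELAY="${RETRY_DELAY:-30}"

retry_until_success() {
  local attempt=1
  until "$@"; do
    echo "attempt ${attempt} failed; resuming from checkpoint in ${RETRY_DELAY}s" >&2
    sleep "${RETRY_DELAY}"
    attempt=$(( attempt + 1 ))
  done
}

# usage:
# retry_until_success coldsnap download --checkpoint --keep-checkpoint --force \
#   snap-08d24744706ef4847 /data/app-data-vol3.img
```

During peak degradation some of our snapshots needed several resumes; a wrapper like this keeps the retries going unattended inside a screen session.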
#!/bin/bash
SNAPSHOTS=(
"snap-09eaa6e71a2896666:app-data-vol3"
"snap-0606fc81f5eca3697:db-main-vol3"
# ... all 21 snapshot IDs
)
for entry in "${SNAPSHOTS[@]}"; do
snap_id="${entry%%:*}" ; vol_name="${entry##*:}"
screen -dmS "dl-${vol_name}" bash -c \
"coldsnap download --checkpoint --keep-checkpoint --force ${snap_id} /data/${vol_name}.img; exec bash"
done
watch -n 1 df -h # monitor disk continuously
export AWS_REGION="eu-west-1" # switch to target region
coldsnap upload /data/app-data-vol3.img
# Prints new snapshot ID on completion — log all 21
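At 21 snapshots, recording the printed IDs by hand gets error-prone. A small wrapper that uploads every image in a directory and logs the image-to-snapshot mapping (a sketch; the `upload-map.log` name and failure handling are our own choices):

```shell
# upload_all: upload every recovered .img in a directory and record the
# new snapshot ID each upload prints (sketch; assumes `coldsnap` on PATH
# and AWS_REGION already exported for the target region)
upload_all() {
  local dir="$1" log="$2"
  for img in "${dir}"/*.img; do
    # coldsnap prints the new snapshot ID on stdout when the upload completes
    snap_id=$(coldsnap upload "${img}") || { echo "FAILED: ${img}" >> "${log}"; continue; }
    echo "${img} -> ${snap_id}" >> "${log}"
  done
}

# usage: upload_all /data /data/upload-map.log
```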
A separate but related scenario: a customer needed to move 3 VMs (2 snapshots each) from AWS to on-premises as local DR copies. AWS VM Export was the obvious path — export AMI to S3 as VMDK, download, done.
It kept failing. Root cause: the AMIs were encrypted with the AWS managed EBS key (aws/ebs). VM Export only supports customer-managed KMS keys. We raised a support ticket — AWS confirmed it and suggested manually re-encrypting every snapshot with a CMK first. We used ColdSnap instead.
💡 ColdSnap downloads raw decrypted blocks regardless of what KMS key the source was encrypted with. The .img output is always unencrypted on disk. Upload it back with --kms-key-id to re-encrypt with a CMK — then VM Export works.
# Re-encrypt with CMK on upload
coldsnap upload disk.img \
--kms-key-id arn:aws:kms:ap-south-1:111111111111:key/1a2b3c4d-5e6f-1a2b-3c4d-5e6f1a2b3c4d
# Tag at upload time — no separate API call needed
coldsnap upload disk.img \
--tag "Key=VM,Value=vm1" \
--tag "Key=Volume,Value=root"
# Skip zero blocks — smaller upload, but avoid for encrypted targets
coldsnap upload disk.img --omit-zero-blocks
⚠️ Avoid --omit-zero-blocks when the target snapshot will be encrypted. Apps that write zeroes expect to read them back — missing zero blocks cause unexpected read behaviour.
Both scenarios shared this gotcha: when you spin up an instance from a ColdSnap-recovered volume, AWS defaults the platform metadata to Linux in the console, even if the disk is Windows. SSM connects and shows what looks like a bash prompt, but the session is actually PowerShell.
Workaround: launch a fresh Windows instance from an official AWS Windows AMI → stop it → detach its root volume → attach your recovered volume in its place → start. AWS retains the Windows platform metadata and everything behaves correctly.
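That console flow can be scripted once the helper Windows instance exists. A sketch of the swap using the aws CLI; the stop/wait/detach/attach/start subcommands are real `aws ec2` commands, but all IDs are placeholders you supply:

```shell
# swap_root_volume: replace a helper instance's root volume with a
# recovered one so AWS keeps the Windows platform metadata (sketch;
# assumes the recovered volume is already detached and in the same AZ)
swap_root_volume() {
  local instance_id="$1" old_vol="$2" new_vol="$3"
  aws ec2 stop-instances --instance-ids "${instance_id}"
  aws ec2 wait instance-stopped --instance-ids "${instance_id}"
  aws ec2 detach-volume --volume-id "${old_vol}"
  aws ec2 attach-volume --volume-id "${new_vol}" \
    --instance-id "${instance_id}" --device /dev/sda1
  aws ec2 start-instances --instance-ids "${instance_id}"
}

# usage: swap_root_volume i-0abc12345 vol-0old12345 vol-0recovered123
```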
When you run ColdSnap with --checkpoint, it creates two files alongside your download — not just the final .img. Knowing what each file means tells you immediately whether the download is complete or not.
-rw-r--r-- 1 root root 32437471 Apr 11 03:52 app-data-vol3.img.coldsnap-progress
-rw-r--r-- 1 root root 2199023255552 Apr 11 03:52 app-data-vol3.img.partial
| File | What it means | Action |
|---|---|---|
| app-data-vol3.img.partial | Download is still in progress or was interrupted. Some blocks are missing. The file exists as a partial write — do not use as-is. | Re-run the same download command. ColdSnap resumes from checkpoint. |
| app-data-vol3.img.coldsnap-progress | The checkpoint index file. Records which blocks have been successfully written. Used by ColdSnap on resume to skip completed blocks. | Do not delete while download is running. Safe to inspect — it's a binary index. |
| app-data-vol3.img | Download completed successfully. All blocks written. The .partial file is renamed to the final output name. | Proceed to validation. Still not production-ready without verification. |
⚠️ If you see .img.partial — there are missing blocks. Do not mount, upload, or restore from a .partial file. Re-run the download command first. The checkpoint will handle the rest.
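Because the rename is the done signal, scripting "is this image safe to touch yet?" comes down to a pair of file tests. A sketch:

```shell
# download_complete: true only when the final .img exists and no .partial
# is left beside it (the rename from .partial is ColdSnap's done signal)
download_complete() {
  local img="$1"
  [ -f "${img}" ] && [ ! -f "${img}.partial" ]
}

# usage:
# download_complete /data/app-data-vol3.img && echo "ready for validation"
```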
The .coldsnap-progress file also gives a rough sense of progress during a running download: it grows as blocks are recorded. You can monitor it alongside df -h to understand both how far along you are and how much disk space is being consumed.
# Progress-file size grows as blocks are recorded (bytes, not an exact block count)
wc -c app-data-vol3.img.coldsnap-progress
# Watch file size growing (confirms blocks are being written)
watch -n 5 'ls -lh /data/*.partial /data/*.img 2>/dev/null'
This is how the block-level copy actually works: ColdSnap writes 512KB blocks one by one, recording each in the progress file. The .partial file grows as blocks land on disk. When all blocks are accounted for, the rename happens and you get a clean .img.
A clean .img file — no .partial, no missing blocks — is not the same as a verified complete copy. During an active outage, some blocks may have been retried multiple times before succeeding. We cannot tell you exactly what was in the blocks that failed initially. In most cases the data is intact. But for critical workloads you need to verify before restoring.
⚠️ Never treat a ColdSnap-recovered .img as production-ready without validation — even if the filename shows no .partial. This is especially critical for databases and transactional systems.
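One cheap validation layer is to fingerprint every recovered image immediately, so any later copy (cross-region upload, on-prem transfer) can be verified bit-for-bit. A sketch; note this guards against corruption after recovery, not against blocks that were wrong at the source, so filesystem and application-level checks still apply:

```shell
# checksum_images: record a SHA-256 for every recovered image so any
# later copy can be verified with `sha256sum -c` (sketch; detects
# post-download corruption, not source-side damage)
checksum_images() {
  local dir="$1"
  ( cd "${dir}" && sha256sum ./*.img > SHA256SUMS )
}

# verify a copy later:
# cd /data && sha256sum -c SHA256SUMS
```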
ColdSnap is a last resort that genuinely works when everything else fails. We tried snapshot copy, cross-region AMI copy, and VM Export before landing here — all three failed. ColdSnap succeeded, but at scale (~5TB, 21 snapshots) it took significant time and required multiple resumes per snapshot. Add it to your DR runbook before you need it. Know it's slow. Check your .partial files. Validate everything it produces.
ColdSnap GitHub — https://github.com/awslabs/coldsnap
AWS EBS Direct APIs — https://docs.aws.amazon.com/ebs/latest/userguide/ebs-accessing-snapshot.html
AWS EBS Snapshots — https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSSnapshots.html