AWS Kubernetes Velero Disaster Recovery

Velero-Based Disaster Recovery for Amazon EKS: A Complete Hands-On Guide

Setting up multi-region active–passive DR for EKS using Velero, S3 Cross-Region Replication, Helm, and IRSA — with every command, every error, and every fix documented.

Vishnu Rachapudi
Cloud & AI Engineer · AWS Community Builder (Security)
April 2026
15 min read

Disaster Recovery isn't optional anymore — it's table stakes. In this guide, I walk through building a production-grade, multi-region DR setup for Amazon EKS using Velero. Everything is installed via Helm, secured with IRSA (no static credentials), and backed by S3 with Cross-Region Replication. I've documented every step including the real errors I hit and how I fixed them.

In this post
  1. Architecture & DR Flow
  2. Creating EKS Clusters
  3. Velero Installation via Helm (GitOps)
  4. S3 Buckets & Cross-Region Replication
  5. Backup, Restore & DR Simulation
  6. Troubleshooting & Lessons Learned
Section 01

Architecture & DR Flow

The setup follows an Active–Passive pattern. The primary cluster in us-east-1 runs all workloads. Velero backs up cluster state to S3, which replicates to a DR bucket in ap-south-1. A standby EKS cluster in Mumbai reads from the replicated bucket and restores workloads on demand.

[Architecture diagram] Primary region (us-east-1): EKS primary-cluster runs the apps plus Velero, which backs up to S3 velero-primary-bucket (with EBS snapshots). S3 CRR replicates the bucket to velero-dr-bucket in the DR region (ap-south-1), where EKS dr-cluster restores the workloads and PVCs on demand.

DR Failover Flow
  1. Velero backup on the primary cluster
  2. S3 CRR replicates to the DR bucket
  3. Velero restore on the DR cluster
  4. Verify workloads in DR
  5. Route 53 failover → traffic switches to the DR region
Velero Handles
  - Kubernetes resource backup
  - PV snapshot metadata
  - Cross-cluster migration
  - Scheduled backups (cron)

You Must Handle
  - Traffic failover (Route 53)
  - Database-aware dumps
  - Real-time replication
  - DNS / LoadBalancer IPs
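For the "you must handle" side, here is a minimal sketch of the Route 53 DNS failover that step 5 of the flow relies on. The hosted zone ID, domain, health check ID, and ELB hostnames are placeholders, not values from this setup:

```shell
# Sketch: Route 53 failover routing. While the health check on the
# primary ELB passes, queries resolve to the PRIMARY record; when it
# fails, Route 53 answers with the SECONDARY (DR) record instead.
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "CNAME",
        "SetIdentifier": "primary",
        "Failover": "PRIMARY",
        "TTL": 60,
        "HealthCheckId": "<PRIMARY_HEALTH_CHECK_ID>",
        "ResourceRecords": [{"Value": "<PRIMARY_ELB_DNS>"}]
      }
    }]
  }'
# A matching record with "Failover": "SECONDARY" points at the DR ELB.
```

Keep the TTL low (60s here) so clients pick up the failover quickly.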
Section 02

Creating EKS Clusters

Two clusters — primary in US East, DR in Mumbai. eksctl handles VPC, subnets, IAM roles, and node groups automatically.

# Primary Cluster
eksctl create cluster --name primary-cluster --region us-east-1 \
  --version 1.35 --nodegroup-name primary-nodes \
  --node-type t3.medium --nodes 1 --managed
Primary cluster creation
eksctl provisioning primary-cluster in us-east-1 via CloudFormation
# DR Cluster
eksctl create cluster --name dr-cluster --region ap-south-1 \
  --version 1.35 --nodegroup-name dr-nodes \
  --node-type t3.medium --nodes 1 --managed
DR cluster creation
DR cluster creation in ap-south-1 (Mumbai)

Cost: Two EKS clusters with t3.medium ~ $5–6/day. Tear down after practicing.
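With both clusters up, you will be switching kubectl contexts constantly. A quick sketch of registering both in kubeconfig; the --alias names are my own shorthand (the screenshots later in this guide use the long eksctl-generated context names):

```shell
# Register both clusters in kubeconfig under short aliases
aws eks update-kubeconfig --name primary-cluster --region us-east-1 --alias primary
aws eks update-kubeconfig --name dr-cluster --region ap-south-1 --alias dr

# List contexts and switch to the primary cluster
kubectl config get-contexts
kubectl config use-context primary
```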

Section 03

Velero Installation via Helm (GitOps)

Helm values stored in GitHub for version-controlled, repeatable deployments.

Helm values files
values-primary.yaml and values-dr.yaml created and stored in Git
📄 values-primary.yaml
configuration:
  backupStorageLocation:
    - bucket: velero-primary-bucket
      provider: aws
      config:
        region: us-east-1
  volumeSnapshotLocation:
    - provider: aws
      config:
        region: us-east-1
serviceAccount:
  server:
    name: velero
    create: false
initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.9.0
    volumeMounts:
      - mountPath: /target
        name: plugins
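values-dr.yaml isn't reproduced above. Assuming it simply mirrors values-primary.yaml with the DR bucket and region swapped in, it would look roughly like:

```yaml
# Sketch of values-dr.yaml: same layout as values-primary.yaml,
# pointed at the replicated bucket in ap-south-1 (assumption).
configuration:
  backupStorageLocation:
    - bucket: velero-dr-bucket
      provider: aws
      config:
        region: ap-south-1
  volumeSnapshotLocation:
    - provider: aws
      config:
        region: ap-south-1
serviceAccount:
  server:
    name: velero
    create: false
initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.9.0
    volumeMounts:
      - mountPath: /target
        name: plugins
```

Optionally, Velero lets you mark a backup storage location with accessMode: ReadOnly; that would stop the standby cluster from ever writing to the bucket, though this guide's restore does write its metadata there (Section 05).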

IRSA + Helm Install

IRSA primary
IRSA creation on primary cluster
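The exact IRSA commands aren't shown in the screenshot. With eksctl they would look roughly like this; the VeleroAccessPolicy name and account ID are placeholders, and the policy must grant the S3 permissions on velero-primary-bucket plus the EC2 snapshot permissions Velero needs:

```shell
# Associate the cluster's OIDC provider with IAM (one-time per cluster)
eksctl utils associate-iam-oidc-provider \
  --cluster primary-cluster --region us-east-1 --approve

# Create the velero ServiceAccount bound to an IAM role (IRSA)
eksctl create iamserviceaccount \
  --cluster primary-cluster --region us-east-1 \
  --namespace velero --name velero \
  --attach-policy-arn arn:aws:iam::<ACCOUNT_ID>:policy/VeleroAccessPolicy \
  --approve
```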
Helm install
helm install velero — STATUS: deployed

Fix: IRSA Credential Issue

Gotcha: by default the Velero Helm chart mounts a credentials Secret and points Velero at /credentials/cloud. With IRSA there is no Secret to mount, so set credentials.useSecret=false.

helm upgrade velero vmware-tanzu/velero --namespace velero \
  -f values-primary.yaml \
  --set credentials.useSecret=false \
  --set serviceAccount.server.create=false \
  --set serviceAccount.server.name=velero \
  --set serviceAccount.server.annotations."eks\.amazonaws\.com/role-arn"="<ROLE_ARN>"
BSL Available
After helm upgrade: BSL shows Available — Velero reaches S3 via IRSA
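To confirm the fix yourself, the Velero CLI can report the backup storage location's phase, and you can check that the IRSA role annotation actually reached the pod:

```shell
# The BSL should show PHASE "Available" once Velero can reach S3
velero backup-location get

# The webhook-injected role ARN should appear in the server pod's env
kubectl -n velero get pod -l app.kubernetes.io/name=velero \
  -o jsonpath='{.items[0].spec.containers[0].env[?(@.name=="AWS_ROLE_ARN")].value}'
```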

DR Cluster: Same Process

DR IRSA
IRSA + Velero Running on DR cluster
DR Helm
Helm upgrade on DR — deployed, revision 2
Section 04

S3 Buckets & Cross-Region Replication

CRR is the heart of DR. Every backup Velero writes to the primary bucket gets automatically replicated to Mumbai.

aws s3 mb s3://velero-primary-bucket --region us-east-1
aws s3 mb s3://velero-dr-bucket --region ap-south-1
aws s3api put-bucket-versioning --bucket velero-primary-bucket \
  --versioning-configuration Status=Enabled
aws s3api put-bucket-versioning --bucket velero-dr-bucket \
  --versioning-configuration Status=Enabled
CRR config
Replication rule: primary → dr-bucket (ap-south-1)
CRR rules
CRR rule active — entire bucket, enabled
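The screenshots configure replication in the console; the equivalent CLI call is roughly the following, where s3-crr-role is a placeholder for an IAM role that S3 can assume, with replication permissions on both buckets:

```shell
# Sketch: whole-bucket replication rule, primary -> DR
cat > replication.json <<'EOF'
{
  "Role": "arn:aws:iam::<ACCOUNT_ID>:role/s3-crr-role",
  "Rules": [{
    "ID": "velero-crr",
    "Status": "Enabled",
    "Priority": 1,
    "Filter": {},
    "DeleteMarkerReplication": {"Status": "Disabled"},
    "Destination": {"Bucket": "arn:aws:s3:::velero-dr-bucket"}
  }]
}
EOF

aws s3api put-bucket-replication \
  --bucket velero-primary-bucket \
  --replication-configuration file://replication.json
```

Versioning must already be enabled on both buckets (done above) or this call is rejected.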
Important

CRR only replicates new objects after enabling. Take a fresh backup after setting up CRR.

Section 05

Backup, Restore & DR Simulation

Deploy App & Backup

App deployed
Velero Running + nginx deployment created in demo namespace
velero backup create test-backup --include-namespaces demo
Backup completed
test-backup: Completed, 0 errors, 0 warnings
S3 objects
9 backup objects in velero-primary-bucket/backups/test-backup/
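Two quick ways to inspect what the backup actually wrote:

```shell
# Per-resource breakdown of what Velero captured
velero backup describe test-backup --details

# The raw objects in the bucket (manifests, logs, metadata)
aws s3 ls s3://velero-primary-bucket/backups/test-backup/
```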

Scheduled Backups

Daily backup schedule
daily-backup schedule: 2 AM daily, 30-day retention (720h TTL)
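The schedule in the screenshot can also be created from the CLI; scoping it to the demo namespace is my assumption, the screenshot doesn't show the scope:

```shell
# Daily backup at 02:00 (cron is evaluated in the Velero server's
# timezone, UTC by default), kept for 30 days
velero schedule create daily-backup \
  --schedule "0 2 * * *" \
  --ttl 720h \
  --include-namespaces demo

velero schedule get
```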

Backup for DR

Take a fresh backup on primary (after CRR is enabled) so it replicates to the DR bucket:

# On primary cluster
velero backup create dr-backup --include-namespaces demo
velero backup get
dr-backup in primary S3
velero-primary-bucket — dr-backup objects (9 files)
dr-backup replicated to DR S3
velero-dr-bucket — same dr-backup replicated via CRR

Restore in DR Region

# Switch to DR cluster
kubectl config use-context iam-root-account@dr-cluster.ap-south-1.eksctl.io

# Verify backup visible via CRR
velero backup get

# Restore
velero restore create --from-backup dr-backup
DR restore command
Switched to DR context → dr-backup visible → restore submitted
Restore completed
Restore Phase: Completed — 9 items restored in 2 seconds

Verify Workloads in DR

DR pods and services
nginx pod Running + LoadBalancer service with ap-south-1 ELB endpoint
nginx welcome page on DR
nginx welcome page served from DR region (ap-south-1 ELB) — DR is live!
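The verification in the screenshots boils down to a few commands; the ELB hostname is a placeholder for whatever your service reports:

```shell
# On the DR cluster context: pod should be Running, service should
# carry an ap-south-1 ELB hostname
kubectl get pods,svc -n demo

# Extract the LoadBalancer hostname and hit it
kubectl get svc -n demo \
  -o jsonpath='{.items[0].status.loadBalancer.ingress[0].hostname}'
curl -s http://<DR_ELB_HOSTNAME> | grep -i "welcome to nginx"
```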

Restore Metadata in DR Bucket

Restore objects in DR S3
velero-dr-bucket/restores/ — restore logs, resource list, and results

DR fully validated! nginx deployment + LoadBalancer service restored in Mumbai from a US-East-1 backup. The app is accessible via the DR region's ELB endpoint.

Section 06

Troubleshooting & Lessons Learned

Issue 1: ServiceAccount Not Found

SA error
FailedCreate: serviceaccount "velero" not found

Fix: Create the IRSA service account before the Helm install, or re-run eksctl create iamserviceaccount with --override-existing-serviceaccounts and restart the Velero deployment.

Deployment fix
Velero recovers: 0/1 → 1/1 after IRSA fix

Issue 2: Credentials File Missing

Velero failed to start with: credentialsFile does not exist: stat /credentials/cloud. Fix: set credentials.useSecret=false on the Helm release, as in Section 03.

Issue 3: Backup FailedValidation

Stale backup from when BSL was unavailable. Fix: velero backup delete test-backup --confirm then create fresh.

Key Takeaways

  1. IRSA before Helm: create the service account first, then install Velero.
  2. Disable the credential Secret: set credentials.useSecret=false when using IRSA.
  3. CRR is forward-only: only new objects replicate, so always take a fresh backup after enabling it.
  4. Context discipline: always check kubectl config current-context.
Coming Next
Argo CD for GitOps DR Setup
Source Code

All Helm values and configs: github.com/aquavis12/velero-config