Disaster Recovery Strategies for Cloud-Native Systems
Why Disaster Recovery Cannot Be an Afterthought
Every organization says it cares about disaster recovery until it sees the cost estimate. Then DR becomes a "future sprint" item, until a region goes down, a database corrupts, or a ransomware attack encrypts production data. We have seen clients lose days of data and weeks of revenue because their DR strategy existed only in a Confluence page, never tested, never automated.
At H1Cloud, disaster recovery is a first-class infrastructure concern, architected from day one and tested monthly. This post covers the strategies we implement for our managed hosting clients, from single-region redundancy to multi-region active-active deployments.
Understanding RPO and RTO
Every DR strategy starts with two numbers:
- Recovery Point Objective (RPO): How much data can you afford to lose? An RPO of 1 hour means you accept up to 1 hour of data loss. An RPO of 0 means you need synchronous replication — every write must be confirmed in at least two locations before the application acknowledges it.
- Recovery Time Objective (RTO): How long can your system be down? An RTO of 4 hours means your DR plan must restore service within 4 hours of an incident. An RTO of 5 minutes requires hot standby infrastructure that can assume traffic almost instantly.
These two numbers drive every architectural decision. The tighter they are, the more expensive and complex the solution. A typical SaaS application might target RPO of 15 minutes and RTO of 1 hour, while a financial platform might need RPO of 0 and RTO of 30 seconds.
Tier 1: Automated Backups with Cross-Region Replication
The baseline DR strategy — and the bare minimum for any production system — is automated backups stored in a different region. For PostgreSQL databases, this means continuous WAL archiving to S3 with cross-region replication enabled:
```
# postgresql.conf - continuous archiving
archive_mode = on
archive_command = 'aws s3 cp %p s3://backup-bucket/wal/%f --region us-west-2'
wal_level = replica
```

```shell
# Automated base backup (run via cron or a Kubernetes CronJob)
pg_basebackup -h localhost -U replicator -D /backup/base \
  --checkpoint=fast --wal-method=stream
aws s3 sync /backup/base s3://backup-bucket/base/$(date +%Y%m%d) \
  --region us-west-2
```
This approach provides an RPO of minutes (limited by WAL shipping frequency) and an RTO of 1-4 hours (the time to provision infrastructure and restore from backup). It is cost-effective, since cross-region S3 storage and replication cost only a few cents per GB, but recovery involves manual steps that should be automated via runbooks.
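The restore side of this setup can be scripted ahead of time. A minimal sketch, assuming the bucket layout from the backup example above and PostgreSQL 12+ (where a `recovery.signal` file replaces the old `recovery.conf`); the bucket name and data directory are illustrative placeholders:

```shell
#!/usr/bin/env bash
# Restore runbook sketch: sync the newest base backup, configure WAL replay,
# and signal PostgreSQL to enter archive recovery on next startup.
restore_from_backup() {
  local pgdata="${1:-/var/lib/postgresql/data}"
  local bucket="s3://backup-bucket"   # placeholder bucket from the example above

  # Find the most recent dated base backup (YYYYMMDD prefixes, as written
  # by the backup script above)
  local latest
  latest=$(aws s3 ls "${bucket}/base/" --region us-west-2 \
             | awk '{print $2}' | sort | tail -n 1)

  aws s3 sync "${bucket}/base/${latest}" "${pgdata}" --region us-west-2

  # Fetch archived WAL during replay, mirroring archive_command in reverse
  echo "restore_command = 'aws s3 cp ${bucket}/wal/%f %p --region us-west-2'" \
    >> "${pgdata}/postgresql.auto.conf"

  # recovery.signal puts the server into archive recovery at next start
  touch "${pgdata}/recovery.signal"
}
```

Rehearsing this script during drills is what turns the "manual steps" above into an automated runbook.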
Tier 2: Warm Standby with Streaming Replication
For tighter RPO and RTO, we deploy a warm standby in a secondary region. The standby receives streaming replication from the primary and stays seconds behind. Kubernetes deployments in the standby region are pre-provisioned but scaled to zero replicas.
During failover, we promote the standby database to primary, scale up the Kubernetes deployments, and update DNS. With automation, this achieves an RPO of seconds and an RTO of 5-15 minutes. The cost is approximately 40-60% of the primary region's, since the standby database runs on smaller instances and the application tier consumes no compute until activated.
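With that automation in place, the failover itself is a short runbook. A hedged sketch of the three steps, using Route 53 as the example DNS provider; the kubectl context, deployment names, and hosted-zone ID are placeholders, not real client values:

```shell
#!/usr/bin/env bash
# Warm-standby failover sketch: promote the replica, scale up the app tier,
# and repoint DNS. All names and IDs below are illustrative placeholders.
set -euo pipefail

failover_to_standby() {
  # 1. Promote the streaming replica to primary (PostgreSQL pg_ctl promote)
  kubectl --context standby-region exec postgres-0 -- \
    pg_ctl promote -D /var/lib/postgresql/data

  # 2. Scale the pre-provisioned deployments up from zero replicas
  kubectl --context standby-region scale deployment/api deployment/worker \
    --replicas=3

  # 3. Update DNS to point at the standby region's load balancer
  aws route53 change-resource-record-sets \
    --hosted-zone-id Z0000000000 \
    --change-batch file://failover-dns.json
}
```

Keeping these steps in one idempotent script (rather than three separate runbook pages) is what makes the 5-15 minute RTO achievable under pressure.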
Tier 3: Active-Active Multi-Region
For mission-critical systems that cannot tolerate any downtime, we implement active-active multi-region deployments. Both regions serve production traffic simultaneously, with data replicated bidirectionally. This is the most complex and expensive approach, but it provides near-zero RPO and near-zero RTO.
The key challenges in active-active are conflict resolution (what happens when both regions write to the same record simultaneously) and latency (cross-region replication adds 30-80ms depending on geography). We typically use CockroachDB or YugabyteDB for the database layer, as they handle distributed consensus natively, and deploy application instances in both regions behind a global load balancer like Cloudflare or AWS Global Accelerator.
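With CockroachDB, much of the conflict-resolution burden shifts into the database: its multi-region SQL lets you declare a primary region and pin each row's consensus quorum near its home region, so concurrent writes to the same row serialize through one Raft group rather than conflicting. A sketch under assumed names (`app` database, `orders` table, and the connection URL are placeholders):

```shell
#!/usr/bin/env bash
# Multi-region setup sketch for CockroachDB. Cluster URL, database, and
# table names are illustrative placeholders, not a real deployment.
setup_multiregion() {
  cockroach sql --url "postgresql://root@cockroach.example.internal:26257" -e "
    ALTER DATABASE app SET PRIMARY REGION 'us-east1';
    ALTER DATABASE app ADD REGION 'us-west1';
    ALTER TABLE app.orders SET LOCALITY REGIONAL BY ROW;
  "
}
```

REGIONAL BY ROW keeps each row's writes fast in its home region while still surviving a full region loss, which is exactly the trade the active-active tier is buying.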
Testing Your DR Plan
A disaster recovery plan that has not been tested is not a plan; it is a hope. We run DR drills on a rolling schedule for all managed hosting clients:
- Quarterly full failover: Simulate a complete region failure. Redirect all traffic to the DR region. Verify data consistency and application functionality. Measure actual RTO against the target.
- Monthly backup restore: Pick a random backup from the past 30 days. Restore it to an isolated environment. Verify data integrity and run application smoke tests.
- Weekly chaos engineering: Use tools like Litmus or Chaos Monkey to inject failures — kill pods, corrupt network routes, fill disks. Verify that self-healing mechanisms work and alerts fire correctly.
After each drill, we produce a report documenting the actual RPO/RTO achieved, any issues discovered, and remediation actions taken. This continuous testing ensures that when a real disaster strikes, recovery is a well-rehearsed procedure, not a panicked improvisation.
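The monthly backup-restore drill lends itself well to automation. A hedged sketch (bucket, namespace, and smoke-test image are placeholders) that picks a random base backup from the last 30 days and restores it into an isolated environment:

```shell
#!/usr/bin/env bash
# Backup-restore drill sketch: pick a random recent base backup, restore it
# into an isolated scratch environment, verify it, and run smoke tests.
# Bucket, namespace, and image names are illustrative placeholders.
set -euo pipefail

run_restore_drill() {
  local bucket="s3://backup-bucket"

  # Pick one dated backup prefix at random from the most recent 30
  local candidate
  candidate=$(aws s3 ls "${bucket}/base/" --region us-west-2 \
                | awk '{print $2}' | sort | tail -n 30 | shuf -n 1)

  echo "Drill: restoring ${candidate%/} into namespace dr-drill"
  kubectl create namespace dr-drill --dry-run=client -o yaml | kubectl apply -f -
  aws s3 sync "${bucket}/base/${candidate}" /restore/scratch --region us-west-2

  # Verify backup integrity against its manifest (PostgreSQL 13+), then
  # run application smoke tests against the restored copy
  pg_verifybackup /restore/scratch
  kubectl -n dr-drill run smoke --image=registry.example/smoke-tests \
    --restart=Never
}
```

Logging the elapsed time of each run gives you the measured RPO/RTO figures that feed the post-drill report.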
Want help implementing these practices?