⚠️ This article is still in beta — content is being updated as the setup evolves.
Repo: github.com/iqbalhakims/Disaster-Recovery-HA-Demo
What Is DR Testing and Why It Matters
The main objective of DR testing is to confirm that the infrastructure keeps serving traffic when something fails. Typical scenarios are nodes going down, lost DB connections, or pods being killed unintentionally. For this part I kill nodes on purpose to check that the webapp stays up through a disaster. We also need to make sure users aren't affected by mistakes on the developer side.
Make sure your team knows you're doing this, and stay within the company's SLO.
I'm keeping the nodes in the same AZ for now because I need to build my production setup first and it's cheaper for testing. Previously, when I did DR testing across zones, it took around 40 minutes for pods to move from Zone A to Zone B. When I test in the same zone, it takes less than 10 seconds, probably because it's the same zone and I'm just using a simple Go app. My take is to use Datadog rather than Grafana for this, since it lets you watch pods move from Zone A to Zone B live, but this one isn't so critical haha.
How the Workload Should Be Distributed
The ideal setup is pods spread across 3 nodes, with one node in a different AZ. Zone A is usually the primary zone carrying all the workload. The first time I did DR testing by shutting down the subnet in one AZ, the pods didn't move to the other nodes and production went down. I turned the subnet back on, and in the post-mortem we realized there was no topology spread constraint in place to make sure pods land on another node.
Bro, that was my P0 hahahaha.
DR Testing Demo
Canary Deployment with Argo Rollouts
One deployment strategy that keeps releases stable is canary deployment, which rolls a new version out to the cluster gradually. For example, a canary can put the new version on 20% of the workload first so the feature can be tested there. Argo Rollouts supports this and gives us a dashboard to manually continue the rollout after the 20% phase. We can set up alerting in case any error is detected. If something goes wrong, we roll back to the previous version so the bug doesn't reach all users.
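To make the 20% phase concrete, here is a minimal sketch of an Argo Rollouts canary. The app name `webapp`, the image, and the replica count are placeholders I made up for illustration, not values from my repo:

```yaml
# Sketch only: names, image, and replica count are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: webapp
spec:
  replicas: 5
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
        - name: webapp
          image: registry.example.com/webapp:v2   # placeholder image
          ports:
            - containerPort: 8080
  strategy:
    canary:
      steps:
        - setWeight: 20   # shift roughly 20% to the new version
        - pause: {}       # wait here until someone promotes the rollout
```

An empty `pause` has no duration, so the rollout stays at 20% until it is promoted from the dashboard or the CLI. After the last step, the new version takes over fully.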
Stress Testing with k6 for 3500 VU
I also ran a stress test with k6 at 3500 virtual users (VUs) just to see the pods scale up. It's a good way to validate that the HPA (Horizontal Pod Autoscaler) works correctly under real load before trusting it in production.
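For reference, an HPA that a load test like this would exercise could look roughly like the sketch below. The target name `webapp`, the replica bounds, and the 70% CPU threshold are assumptions for illustration, not my actual settings:

```yaml
# Sketch only: target name, replica bounds, and threshold are hypothetical.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU passes 70% of requests
```

CPU utilization is measured against the container's CPU requests, so the pods need `resources.requests.cpu` set for this to work.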
Details on How I Set Up My Infra
I use DigitalOcean for this setup because the pricing is transparent and managed Kubernetes is cheaper there. The first thing I did was deploy just one node to make sure everything went well before scaling up.
Deploying Grafana via GitOps
I started by deploying Grafana via GitOps instead of running kubectl apply -f grafana.yaml by hand. I render the manifest with helm template for a better workflow:
helm template grafana grafana/grafana -f values.yaml > grafana.yaml
Why GitOps? GitOps lets us edit manifests in VSCode instead of editing live in the terminal, which is painful when you're debugging at 2AM. Beyond that, your team can trace who made changes and what was changed: basically better history and better blame. That said, SRE culture is blameless, so a post-mortem is written after an incident rather than pointing fingers.
I made a video on how I deploy Grafana here:
Making Grafana Highly Available
To make sure Grafana is highly available, I use a managed Postgres database, because a block-storage PVC is typically ReadWriteOnce and can only be attached to one node at a time. When I was using a PVC and did DR testing, I couldn't even log in to Grafana. So I migrated from the PVC to a managed Postgres DB.
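As a rough sketch, the Helm values for pointing Grafana at an external Postgres could look like this. The host, database name, user, and secret name are placeholders, and I'm assuming the standalone grafana/grafana chart here (under kube-prometheus-stack the same keys sit under a top-level `grafana:` key):

```yaml
# values.yaml sketch: host, names, and the secret are hypothetical.
replicas: 2              # more than one pod is fine once state lives in Postgres
persistence:
  enabled: false         # no PVC needed for the Grafana database anymore
grafana.ini:
  database:
    type: postgres
    host: db.example.com:25060   # placeholder managed-DB endpoint
    name: grafana
    user: grafana
    ssl_mode: require
# Secret whose keys become env vars; it is assumed to hold GF_DATABASE_PASSWORD.
envFromSecret: grafana-db
```

Keeping the password in a Secret and passing it in as `GF_DATABASE_PASSWORD` means it never lands in Git with the rest of the values.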
But then another question came up: what happens to the dashboards we built when Grafana comes back up after a DR event? That's where dashboards as code come in. The traditional way is to build dashboards by hand in the UI with PromQL queries, but the better way is to put them in a ConfigMap so dashboards are defined as code. You can just kubectl apply -f dashboard-dr.yaml or sync it in ArgoCD. At this point, I think I'm already in love with GitOps lol.
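A dashboard ConfigMap can be sketched like this. I'm assuming the Grafana dashboard sidecar is enabled and watches the `grafana_dashboard` label (the default label in kube-prometheus-stack). The dashboard title, uid, and query are made up for illustration:

```yaml
# dashboard-dr.yaml sketch: title, uid, and query are hypothetical.
apiVersion: v1
kind: ConfigMap
metadata:
  name: dashboard-dr
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # assumed sidecar label; check your chart values
data:
  dashboard-dr.json: |
    {
      "title": "DR Overview",
      "uid": "dr-overview",
      "panels": [
        {
          "id": 1,
          "type": "timeseries",
          "title": "Pods per node",
          "gridPos": {"h": 8, "w": 24, "x": 0, "y": 0},
          "targets": [
            {"expr": "count by (node) (kube_pod_info)"}
          ]
        }
      ]
    }
```

Because the dashboard lives in Git, a fresh Grafana pod picks it up again after a DR event without anyone rebuilding it by hand.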
Kustomize for Multi-Environment
I chose Kustomize to handle multiple environments since it's very friendly. What I like about Kustomize is that you can disable a manifest by removing a single line from kustomization.yaml, without deleting the file itself. So you can add the resource back whenever you need it.
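Here is a small sketch of what that looks like in practice. The directory layout and file names are assumptions for illustration:

```yaml
# overlays/prod/kustomization.yaml (hypothetical layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: monitoring
resources:
  - ../../base
  - virtualservice.yaml
  # - dashboard-dr.yaml   # disabled: the file stays in the repo, just not rendered
```

Uncommenting the line brings the resource back on the next build or sync.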
Domain Management: cert-manager and external-dns
The next step was adding domain management. I use cert-manager and external-dns.
- cert-manager issues the cert for HTTPS and handles renewal, since certs from issuers like Let's Encrypt expire after 90 days.
- external-dns creates DNS records in your DNS provider, such as AWS Route 53 or Cloudflare.
I initially used AWS Route 53, but AWS charged me around $0.50/month, so I migrated to Cloudflare, which offers free DNS hosting.
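As an illustration, a cert-manager Certificate that produces the TLS secret for the gateway could be sketched like this. The ClusterIssuer name `letsencrypt-prod` is a placeholder, and I'm assuming the secret has to live in the same namespace as the Istio ingress gateway pods:

```yaml
# Sketch only: the issuer name is hypothetical.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: iqbalhakim-prod-tls
  namespace: istio-system        # assumed: namespace of the ingress gateway
spec:
  secretName: iqbalhakim-prod-tls   # secret that will hold the issued cert and key
  issuerRef:
    name: letsencrypt-prod       # placeholder ClusterIssuer
    kind: ClusterIssuer
  dnsNames:
    - iqbalhakim.ink
    - app.iqbalhakim.ink
    - grafana.prod.iqbalhakim.ink
```

cert-manager renews the certificate before it expires and updates the secret in place, so nothing has to be rotated manually.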
Secrets: AWS Secrets Manager vs HashiCorp Vault
For secrets, I use AWS Secrets Manager, which charges $0.40 per secret per month. I researched HashiCorp Vault, which offers open-source secret management, but you need to host it on your own server, and the cheapest droplet on DigitalOcean is $4/month. After working on this for a while, I estimated I'd have fewer than 10 secrets, so AWS Secrets Manager was the better choice.
Initially I just used Kubernetes Secrets and later migrated to AWS Secrets Manager. For the secret operator, I use External Secrets Operator, a tool that fetches secrets from AWS Secrets Manager and syncs them into the cluster as Kubernetes Secrets.
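A minimal ExternalSecret sketch is below. The store name, the secret path in AWS, and the key names are all placeholders. I'm also assuming a recent External Secrets Operator release that serves `external-secrets.io/v1` (older installs use `v1beta1`):

```yaml
# Sketch only: store name, AWS path, and keys are hypothetical.
apiVersion: external-secrets.io/v1   # use v1beta1 on older ESO versions
kind: ExternalSecret
metadata:
  name: grafana-db
  namespace: monitoring
spec:
  refreshInterval: 1h                # how often to re-read the value from AWS
  secretStoreRef:
    name: aws-secrets-manager        # placeholder ClusterSecretStore
    kind: ClusterSecretStore
  target:
    name: grafana-db                 # Kubernetes Secret that gets created
  data:
    - secretKey: GF_DATABASE_PASSWORD
      remoteRef:
        key: prod/grafana/db         # placeholder secret name in AWS
        property: password           # JSON field inside that secret
```

Only this manifest goes into Git. The actual value stays in AWS and is pulled into the cluster by the operator.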
Container Registry
I use DigitalOcean Container Registry since it's in the same ecosystem as DOKS and the starter tier is free.
Ingress: Istio
The community ingress-nginx controller has been announced for retirement, so I use Istio, which supports canary deployment through weighted routing. This is quite critical: canary rolls out gradually into the cluster so we can detect anomalies early. The first phase of a deployment only takes 20% of the traffic, and we can get alerts if something goes wrong in production. If that happens, it won't affect all users and we can roll back to the previous version.
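The 20% split can be sketched as a weighted Istio VirtualService. The `webapp-stable` and `webapp-canary` services and the `default` namespace are placeholders I made up. With Argo Rollouts traffic routing, the controller would adjust these weights for you:

```yaml
# Sketch only: service names and namespace are hypothetical.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: webapp
  namespace: default
spec:
  hosts:
    - app.iqbalhakim.ink
  gateways:
    - istio-system/istio-ingress-gateway
  http:
    - route:
        - destination:
            host: webapp-stable.default.svc.cluster.local
            port:
              number: 80
          weight: 80   # current version keeps most of the traffic
        - destination:
            host: webapp-canary.default.svc.cluster.local
            port:
              number: 80
          weight: 20   # new version gets the canary share
```

Setting the canary weight back to 0 is effectively the rollback.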
Handling TLS with Istio
In Istio I redirect port 80 (HTTP) to port 443 (HTTPS):
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: istio-ingress-gateway
  namespace: istio-system
spec:
  selector:
    app: istio-ingress
    istio: ingress
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - grafana.prod.iqbalhakim.ink
        - argo.prod.iqbalhakim.ink
        - rollouts.prod.iqbalhakim.ink
        - prometheus.prod.iqbalhakim.ink
        - litmus.prod.iqbalhakim.ink
        - app.iqbalhakim.ink
        - api.iqbalhakim.ink
        - iqbalhakim.ink
      tls:
        httpsRedirect: true
    - port:
        number: 443
        name: https
        protocol: HTTPS
      hosts:
        - grafana.prod.iqbalhakim.ink
        - argo.prod.iqbalhakim.ink
        - rollouts.prod.iqbalhakim.ink
        - prometheus.prod.iqbalhakim.ink
        - litmus.prod.iqbalhakim.ink
        - app.iqbalhakim.ink
        - api.iqbalhakim.ink
        - iqbalhakim.ink
      tls:
        mode: SIMPLE
        credentialName: iqbalhakim-prod-tls
Then the traffic routes to each service via a VirtualService:
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: grafana
  namespace: monitoring
spec:
  hosts:
    - grafana.prod.iqbalhakim.ink
  gateways:
    - istio-system/istio-ingress-gateway
  http:
    - route:
        - destination:
            host: kube-prometheus-stack-grafana.monitoring.svc.cluster.local
            port:
              number: 80
HA Setup: Terraform with HCP
For the HA cluster setup, I use Terraform with HCP to automate cloud provisioning:
resource "digitalocean_kubernetes_cluster" "prod" {
  name    = "prod"
  region  = "sgp1"
  version = "1.35.1-do.2"

  node_pool {
    name       = "prod-pool"
    size       = "s-4vcpu-8gb"
    node_count = 2
  }
}
Since I run a lot of apps, I picked s-4vcpu-8gb, which is enough to keep CPU usage around 50%.
Pod Spreading: Anti-Affinity and Topology Constraints
For pod spreading I use pod anti-affinity and topology spread constraints:
- Pod anti-affinity keeps pods from the same app off the same node.
- Topology spread constraints tell the scheduler how evenly to spread pods across nodes or zones, so when pods get rescheduled during disaster recovery they land on another available node.
You need to configure this in every app you deploy to ensure proper spreading across nodes.
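Here is a sketch of both settings on a Deployment. The app name, image, and replica count are placeholders, and I'm using the soft variants (`preferred...` and `ScheduleAnyway`) as an assumption so pods can still schedule when only one node is left:

```yaml
# Sketch only: name, image, and replica count are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      affinity:
        podAntiAffinity:
          # Prefer not to put two webapp pods on the same node.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: webapp
                topologyKey: kubernetes.io/hostname
      topologySpreadConstraints:
        # Keep the pod count per zone within 1 of each other when possible.
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: webapp
      containers:
        - name: webapp
          image: registry.example.com/webapp:v1   # placeholder image
```

With all nodes in one AZ, the zone constraint has nothing to spread across, so swapping its `topologyKey` to `kubernetes.io/hostname` spreads per node instead. The hard variants (`required...` and `DoNotSchedule`) give stricter spreading but can leave pods Pending during an outage.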
Here's a demo of node drain:
ArgoCD for Config Drift Prevention
I use ArgoCD to make sure all configurations stay aligned with what's in GitHub. Config drift is bad for production, and ArgoCD guards against it: if someone manually patches something in the cluster without updating the repo, ArgoCD detects the drift and marks the app OutOfSync, and with selfHeal enabled it reverts the change to match Git.
There are two ways to deploy an application in ArgoCD: create it via the UI, or — you can probably guess — the GitOps way. Here's a sample ArgoCD Application manifest:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: grafana
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/iqbalhakims/kube-promethues-stack-test
    targetRevision: main
    path: grafana/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  ignoreDifferences:
    - group: ""
      kind: Secret
      name: alertmanager-kube-prometheus-stack-alertmanager
      namespace: monitoring
      jsonPointers:
        - /data
    - group: ""
      kind: Secret
      name: kube-prometheus-stack-grafana
      namespace: monitoring
      jsonPointers:
        - /data
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true