Postgres Failover (CNPG)
This runbook covers how to respond when the Postgres primary becomes unavailable. Velocity uses CloudNative-PG (CNPG) for high-availability clustering.
Detection
Failover is automatic in CNPG. Watch for alerts:
alert: PgPrimaryDownPostgresClusterStatus: degraded Primary replicas: 0/1 Standby replicas: 2/2Or query directly:
kubectl get cluster velocity -o jsonpath='{.status.phaseTime}' -n velocity-system# If phase is "unhealthy", primary is unreachable
kubectl describe cluster velocity -n velocity-system# Look for Conditions: ready=False, archiveReady=FalseCheck CNPG logs:
kubectl logs -n velocity-system -l cnpg.io/cluster=velocity -c postgres --tail=50 | grep -i failoverAutomatic Recovery
CNPG performs automatic failover (Quorum-based):
- Detection (30 seconds): Primary unresponsive
- Election (5-10 seconds): Healthy standby elected as new primary
- Switchover (15-30 seconds): Other standbys reconnect
- Bootstrap (5 minutes): New primary rebuilds WAL streaming
Typical total downtime: 5-10 minutes. No manual intervention required.
Manual Intervention (If Needed)
Check Cluster Status
kubectl get cluster velocity -n velocity-systemOutput:
NAME AGE PHASE READYPG READINESSvelocity 45d unhealthy 1/3 FalseIf stuck in unhealthy, proceed to manual switchover.
Option 1: Restart Primary Pod (Preferred)
If the primary crashed but k8s didn’t detect it:
kubectl delete pod velocity-1 -n velocity-system# CNPG will:# 1. Promote standby-2 or standby-3 to primary# 2. Recreate deleted pod as standby# 3. Sync WAL from new primaryWait for recovery:
kubectl wait --for=condition=ready pod velocity-1 -n velocity-system --timeout=5mOption 2: Force Switchover
If the primary is hung but pod is running:
kubectl annotate cluster velocity \ cnpg.io/switchoverImmediateFlag="true" \ --overwrite -n velocity-systemCNPG initiates switchover immediately. New primary elected within 1 minute.
Option 3: Rebuild Replica
If a standby is corrupted:
# Identify bad replicakubectl get pod velocity-2 -n velocity-system -o wide# STATUS: CrashLoopBackOff or Not Ready
# Delete itkubectl delete pod velocity-2 -n velocity-system
# CNPG bootstraps from primarykubectl wait --for=condition=ready pod velocity-2 -n velocity-system --timeout=10mVerify Recovery
After failover completes:
# 1. Check cluster statuskubectl get cluster velocity -n velocity-system# Phase should be "healthy", ReadyPG 3/3
# 2. Check pod statuskubectl get pods -n velocity-system -l cnpg.io/cluster=velocity# All pods Ready 1/1
# 3. Check primary electedkubectl exec -it velocity-1 -n velocity-system -- pg_controldata | grep "Database cluster state"# Should show "in production" for primary
# 4. Connect and verifykubectl exec -it velocity-1 -n velocity-system -- psql -U velocity_admin -c "SELECT version();"
# 5. Check WAL replication lagkubectl exec -it velocity-1 -n velocity-system -- psql -U velocity_admin -c \ "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"# lag_bytes should be small (< 1 MB)DNS / Service Endpoint Update
Velocity API automatically connects to the primary through velocity-rw service:
# Verify DNS resolves to new primarynslookup velocity-rw.velocity-system.svc.cluster.local# Should return IPs of primary podNo manual DNS updates needed; CNPG manages the service endpoints.
Data Validation After Failover
Once primary is back in service:
# 1. Check record counts matchkubectl exec velocity-1 -n velocity-system -- psql -U velocity_admin velocity -c \ "SELECT schema_name, COUNT(*) FROM information_schema.tables WHERE table_type='BASE TABLE' GROUP BY schema_name;"
# 2. Check audit log integrityvelocity audit verify \ --schema acme/supply-chain/procurement/purchase-order/v1 \ --entity-id PO-001
# 3. Run health checksvelocity status# All components should show ReadyPreventing Future Failovers
Increase Resource Limits
If primary crashed due to OOM:
kubectl edit deployment -n velocity-system velocity-postgres# Increase memory request/limit# spec.resources.requests.memory: 8Gi → 16Gi# spec.resources.limits.memory: 8Gi → 16GiMonitor Replication Lag
Create alert for lagging standby:
alert: ReplicationLagHighexpr: max(pg_replication_slot_latest_restart_lsn) - min(pg_replication_slot_latest_restart_lsn) > 100000000 # 100 MBfor: 2mannotations: summary: "Replication lag > 100 MB (network bottleneck?)"Check Disk Space
kubectl exec velocity-1 -n velocity-system -- df -h /var/lib/postgresql/data# If > 80%, WAL archival may be stalledVerify Backup Schedule
CNPG auto-backs up to S3. Check if backup jobs are running:
kubectl get job -n velocity-system -l cnpg.io/cluster=velocity# Should see daily backup jobsIncident Report
Post-incident checklist:
- Record failover timestamp and duration
- Collect CNPG logs:
kubectl logs -l cnpg.io/cluster=velocity --since=2h - Check if any writes were lost (compare audit IDs before/after)
- Review Prometheus metrics for CPU/memory spikes
- File incident ticket with root cause
- Update runbook if steps were unclear
Contacts
- CNPG Team: #database Slack
- Operator Team: #platform Slack
- On-call: /oncall in Slack