Disaster Recovery for Self-Hosted Services: Planning, Testing, and Recovery
Your homelab server just died. The boot drive is unresponsive. Maybe a power surge, maybe the drive just gave up after five years of continuous operation. It doesn't matter why. What matters is what happens next.
If you're like most self-hosters, the answer is somewhere between "I think I have a backup somewhere" and panic. Disaster recovery planning is the difference between "I'll have everything back up in two hours" and "I've lost years of photos, documents, and configuration."
This guide is the plan you build before you need it.

Core Concepts: RTO and RPO
Before building a recovery plan, you need to answer two questions:
Recovery Time Objective (RTO) -- How long can you be down? If your Nextcloud instance is offline for a weekend, is that acceptable? If your Vaultwarden password manager is down for an hour, can you still log into critical accounts?
Recovery Point Objective (RPO) -- How much data can you afford to lose? If your last backup was 24 hours ago, losing a day of documents might be fine. Losing a day of photos during a family vacation is not.
These two numbers drive every decision in your DR plan:
| Service | Typical RTO | Typical RPO | Implication |
|---|---|---|---|
| Password manager (Vaultwarden) | 1 hour | 24 hours | Priority restore, daily backup is fine |
| Photo library (Immich) | 1 day | 1 hour | Large dataset, frequent backup needed |
| Media server (Jellyfin) | 1 week | N/A | Media can be re-downloaded, low priority |
| Home automation (Home Assistant) | 4 hours | 24 hours | Automations break, daily snapshots |
| Documents (Paperless-ngx) | 1 day | 24 hours | Important but not urgent |
| Monitoring (Grafana/Prometheus) | 1 day | 1 week | Historical data is nice, not critical |
Be honest with yourself. Not everything is equally important, and trying to treat everything as critical makes your DR plan expensive and complex.
The 3-2-1 Backup Rule for Homelabs
The 3-2-1 rule is the foundation of data protection: 3 copies of your data, on 2 different storage media, with 1 copy off-site.
Here's how to implement it practically:
Copy 1: Live data
Your running services and their Docker volumes. This is the data you use every day. It lives on your server's drives.
Copy 2: Local backup
A backup on a separate device on your local network. This could be:
- A NAS (Synology, TrueNAS, a Raspberry Pi with an external drive)
- A second internal drive in your server
- A USB drive that you keep plugged in
Use BorgBackup or Restic to create deduplicated, encrypted backups. Schedule them to run automatically -- nightly for most services, hourly for data with a tight RPO, like photos.
# Example: nightly Restic backup to local NAS
restic backup \
  /var/lib/docker/volumes/ \
  /opt/compose/ \
  /backups/db-dumps/ \
  --exclude '*.tmp' \
  --repo /mnt/nas/restic-repo
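The schedule itself can live in cron or a systemd timer. A minimal cron sketch, assuming the restic invocations above are wrapped in scripts -- the script names and times here are placeholders, not part of this guide's setup:

```shell
# /etc/cron.d/backups -- illustrative; adjust paths and times to your RPOs
# Nightly at 02:30 for general service data
30 2 * * * root /usr/local/bin/backup-nightly.sh
# Hourly for tight-RPO data such as the photo library
0 * * * * root /usr/local/bin/backup-photos.sh
```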
Copy 3: Off-site backup
A backup stored outside your home. This protects against fire, flood, theft, or a ransomware attack that encrypts everything on your network. Options:
- Backblaze B2: $6/TB/month. The go-to choice for most self-hosters. Restic has native B2 support.
- A friend's server: Run BorgBackup over SSH to a friend's machine. They back up to yours. Mutual off-site for free.
- Cloud object storage: AWS S3, Wasabi ($7/TB/month), or any S3-compatible provider.
# Restic to Backblaze B2
export B2_ACCOUNT_ID="your-id"
export B2_ACCOUNT_KEY="your-key"
restic backup /var/lib/docker/volumes/ --repo b2:mybucket:homelab
For a typical homelab with 200 GB of critical data, off-site backup costs roughly $1-2/month. That's cheaper than losing everything.
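That estimate is easy to recompute for your own dataset. A one-line sanity check with awk (the 200 GB figure is just the example above):

```shell
# Monthly off-site cost: size_in_GB / 1000 * rate_per_TB_month
awk -v gb=200 -v rate=6 'BEGIN { printf "$%.2f/month\n", gb / 1000 * rate }'
# -> $1.20/month
```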
Building a Recovery Kit
When disaster strikes, you need to be able to rebuild from scratch without access to the dead server. That means your recovery kit must exist independently of the infrastructure it protects.
What goes in the kit
Hardware inventory: What CPU, RAM, drives, and network configuration does your server need? Document it so you can buy replacement hardware quickly.
OS installation notes: Which Linux distribution, which version, any specific kernel parameters or driver requirements.
Docker Compose files: Every docker-compose.yml and .env file for every service you run. Keep these in a git repository.
Backup credentials: Repository passwords for Borg/Restic, cloud storage API keys, encryption keys. Store these in a password manager that isn't self-hosted (Bitwarden cloud, 1Password, or printed on paper in a safe).
DNS configuration: What domains point where? If you use Cloudflare, document your tunnel configs and DNS records.
Restore order: Which services need to come up first? Usually: reverse proxy, then databases, then applications.
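The restore order can be encoded in a small script so you don't have to remember it under stress. A sketch, assuming one compose directory per service under ~/compose; the directory names are illustrative, and the docker compose line is commented out so the script is safe to dry-run:

```shell
#!/bin/bash
set -euo pipefail

# Dependency order: reverse proxy, then databases, then applications
SERVICES=(traefik postgres nextcloud immich vaultwarden)

for svc in "${SERVICES[@]}"; do
  echo "Starting $svc"
  # (cd ~/compose/$svc && docker compose up -d)  # uncomment on the real host
done
```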
Recovery kit checklist
## Disaster Recovery Kit
### Credentials (stored in external password manager)
- [ ] Restic repository password
- [ ] Backblaze B2 API credentials
- [ ] Borg repository passphrase and key file
- [ ] Domain registrar login
- [ ] Cloudflare API token
- [ ] SSH keys for remote backup server
### Configuration (stored in git repo)
- [ ] All docker-compose.yml files
- [ ] All .env files (encrypted with age or gpg)
- [ ] Prometheus/Grafana configs
- [ ] Reverse proxy configs (Traefik/Caddy rules)
- [ ] Cron jobs and systemd timers
- [ ] /etc modifications (sysctl, fstab, etc.)
### Documentation (stored alongside configs)
- [ ] Hardware specifications
- [ ] Network diagram (IP assignments, VLANs)
- [ ] Service dependency map
- [ ] Step-by-step restore procedure
- [ ] Contact info for ISP, domain registrar
Step-by-Step Recovery Procedure
Write this procedure now, while everything is working. You will not think clearly during a real disaster.
Phase 1: Hardware (0-2 hours)
- Assess damage. Is the hardware salvageable? Can you swap a failed drive?
- If the machine is dead, provision a replacement. A VPS can serve as temporary infrastructure while you wait for hardware.
- Install the base OS. Use your documented installation notes.
Phase 2: Foundation (30 minutes)
# Install Docker
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
# Install restore tools
sudo apt install restic borgbackup age
# Clone your infrastructure repo
git clone https://github.com/yourname/homelab-config.git ~/compose
# Decrypt secrets
age -d -i ~/.secrets/age-key.txt secrets.tar.gz.age | tar xz
Phase 3: Restore data (1-4 hours depending on data volume)
# Restore from local backup (fastest)
restic restore latest \
  --target / \
  --include /var/lib/docker/volumes \
  --repo /mnt/nas/restic-repo
# Or from off-site backup (slower but always available)
restic restore latest \
  --target / \
  --include /var/lib/docker/volumes \
  --repo b2:mybucket:homelab
Phase 4: Restore databases (30 minutes)
Don't rely on volume restores for databases. Use your dump files:
# Start the database first
cd ~/compose/postgres && docker compose up -d
# Wait for it to be healthy
docker compose exec postgres pg_isready
# Restore from dump
gunzip -c /backups/db-dumps/pg_all_latest.sql.gz | \
  docker exec -i postgres psql -U appuser -d postgres
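The dump files this step reads have to be produced on a schedule long before any disaster. A sketch of a nightly dump job, assuming the same postgres container name and appuser role as the restore command above; the real invocation is commented out so the function itself can be exercised safely:

```shell
#!/bin/bash
set -euo pipefail

# dump_postgres DIR CMD: write a dated pg_dumpall to DIR and refresh the
# pg_all_latest.sql.gz symlink that the restore step reads.
dump_postgres() {
  local dump_dir=$1 dump_cmd=$2
  local stamp
  stamp=$(date +%F)
  mkdir -p "$dump_dir"
  $dump_cmd | gzip > "$dump_dir/pg_all_$stamp.sql.gz"
  ln -sf "pg_all_$stamp.sql.gz" "$dump_dir/pg_all_latest.sql.gz"
  # Prune dumps older than 14 days
  find "$dump_dir" -name 'pg_all_*.sql.gz' -mtime +14 -delete
}

# On the real host:
# dump_postgres /backups/db-dumps "docker exec postgres pg_dumpall -U appuser"
```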
Phase 5: Start services (30 minutes)
# Start services in dependency order
cd ~/compose/traefik && docker compose up -d
cd ~/compose/postgres && docker compose up -d # already running
cd ~/compose/nextcloud && docker compose up -d
cd ~/compose/immich && docker compose up -d
cd ~/compose/vaultwarden && docker compose up -d
Phase 6: Verify (1 hour)
- Log into every service and verify data is present
- Check that DNS is resolving correctly
- Verify SSL certificates are valid
- Run a test backup to confirm the backup pipeline works on the new system
- Check monitoring is reporting metrics
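The verification pass is worth scripting too, so it's repeatable after every test restore. A minimal smoke-test sketch; every hostname below is a placeholder for your own endpoints:

```shell
#!/bin/bash
# Post-restore smoke test: hit each service and report status.
set -uo pipefail

check() {
  local name=$1 url=$2
  # -k tolerates a certificate that hasn't reissued yet; drop it once certs are verified
  if curl -fsk -m 5 -o /dev/null "$url"; then
    echo "OK   $name"
  else
    echo "FAIL $name ($url)"
  fi
}

check vaultwarden https://vault.example.com
check nextcloud   https://cloud.example.com
check immich      https://photos.example.com
```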
Testing Your Recovery Plan
A recovery plan you've never tested is a hope, not a plan. Test regularly:
Monthly: Selective restore test
Pick a random service. Restore its database dump to a temporary database and verify the data is intact.
# Create a temporary PostgreSQL container
docker run --rm -d \
  --name test-restore \
  -e POSTGRES_PASSWORD=test \
  postgres:16-alpine
# Restore a dump into it
gunzip -c /backups/db-dumps/immich_2026-02-14.dump.gz | \
  docker exec -i test-restore pg_restore -U postgres -d postgres
# Verify
docker exec test-restore psql -U postgres -c "SELECT count(*) FROM assets;"
# Clean up
docker stop test-restore
Quarterly: Full DR simulation
Spin up a fresh VM or VPS. Follow your documented recovery procedure end-to-end. Time it. Fix the documentation where it's wrong (it will be wrong somewhere).
This is the most important thing you can do. Every quarterly test will reveal gaps in your documentation, missing credentials, or steps that no longer work because you changed something and forgot to update the plan.
After every infrastructure change
Added a new service? Changed a backup schedule? Moved to a new domain? Update the recovery kit immediately. The worst time to discover your DR docs are outdated is during an actual disaster.
Monitoring Backup Health
Backups that silently fail are worse than no backups -- they give you false confidence.
Healthchecks.io (free tier)
Create a check for each backup job. Ping the URL at the end of each successful backup. If the ping doesn't arrive on schedule, you get an email alert.
# Add to the end of your backup script
curl -fsS -m 10 --retry 5 https://hc-ping.com/your-uuid-here
Check backup freshness
Add a script that verifies your most recent backup is recent enough:
#!/bin/bash
set -euo pipefail
MAX_AGE_HOURS=26 # Alert if backup is older than 26 hours
LATEST=$(restic snapshots --latest 1 --json --repo /mnt/nas/restic-repo | \
  jq -r '.[0].time')
LATEST_EPOCH=$(date -d "$LATEST" +%s)
NOW_EPOCH=$(date +%s)
AGE_HOURS=$(( (NOW_EPOCH - LATEST_EPOCH) / 3600 ))
if [ "$AGE_HOURS" -gt "$MAX_AGE_HOURS" ]; then
  echo "CRITICAL: Latest backup is ${AGE_HOURS} hours old!"
  # Send notification via ntfy, email, or webhook
  curl -d "Backup is ${AGE_HOURS}h old (max: ${MAX_AGE_HOURS}h)" ntfy.sh/your-topic
  exit 1
fi
echo "OK: Latest backup is ${AGE_HOURS} hours old"
Monitor backup size
Sudden changes in backup size usually indicate a problem -- either data was deleted unexpectedly, or a new large dataset isn't being captured:
# Log backup sizes over time
restic stats latest --json --repo /mnt/nas/restic-repo | \
  jq '{date: (now | todate), size_gb: (.total_size / 1073741824 * 100 | round / 100)}'
Common Disaster Recovery Mistakes
Storing the recovery kit on the server it protects -- If the server dies, your recovery docs die with it. Keep the kit in a git repo, a cloud storage bucket, or both.
Backing up volumes but not configurations -- Your Docker Compose files, environment variables, and system configurations are just as critical as your data. A database dump without the application configuration to use it is half a recovery.
Not encrypting off-site backups -- Both Borg and Restic encrypt by default. But if you're using rsync to a remote server, that data is plaintext. Always encrypt before sending data off-site.
Relying on RAID as a backup -- RAID protects against drive failure. It does not protect against accidental deletion, ransomware, fire, or filesystem corruption. RAID is uptime insurance. Backups are data insurance. You need both.
Never testing the recovery -- This is the single most common and most dangerous mistake. Test quarterly. No exceptions.
The Bottom Line
Disaster recovery planning is unglamorous work. Nobody posts their DR runbooks on Reddit for karma. But the self-hosters who weather hardware failures, ransomware incidents, and accidental rm -rf disasters without losing data are the ones who did this work ahead of time.
Spend an afternoon building your recovery kit, writing your restore procedure, and testing it once. Then schedule quarterly tests. When the inevitable hardware failure happens, you'll be annoyed for an afternoon instead of devastated for months.
