Disaster Recovery for Self-Hosted Services: Planning, Testing, and Recovery
Your homelab server just died. The boot drive is unresponsive. Maybe a power surge, maybe the drive just gave up after five years of continuous operation. It doesn't matter why. What matters is what happens next.
If you're like most self-hosters, the answer is somewhere between "I think I have a backup somewhere" and panic. Disaster recovery planning is the difference between "I'll have everything back up in two hours" and "I've lost years of photos, documents, and configuration."
This guide is the plan you build before you need it.

Core Concepts: RTO and RPO
Before building a recovery plan, you need to answer two questions:
Recovery Time Objective (RTO) -- How long can you be down? If your Nextcloud instance is offline for a weekend, is that acceptable? If your Vaultwarden password manager is down for an hour, can you still log into critical accounts?
Recovery Point Objective (RPO) -- How much data can you afford to lose? If your last backup was 24 hours ago, losing a day of documents might be fine. Losing a day of photos during a family vacation is not.
These two numbers drive every decision in your DR plan:
| Service | Typical RTO | Typical RPO | Implication |
|---|---|---|---|
| Password manager (Vaultwarden) | 1 hour | 24 hours | Priority restore, daily backup is fine |
| Photo library (Immich) | 1 day | 1 hour | Large dataset, frequent backup needed |
| Media server (Jellyfin) | 1 week | N/A | Media can be re-downloaded, low priority |
| Home automation (Home Assistant) | 4 hours | 24 hours | Automations break, daily snapshots |
| Documents (Paperless-ngx) | 1 day | 24 hours | Important but not urgent |
| Monitoring (Grafana/Prometheus) | 1 day | 1 week | Historical data is nice, not critical |
Be honest with yourself. Not everything is equally important, and trying to treat everything as critical makes your DR plan expensive and complex.
The 3-2-1 Backup Rule for Homelabs
The 3-2-1 rule is the foundation of data protection: 3 copies of your data, on 2 different storage media, with 1 copy off-site.
Here's how to implement it practically:
Copy 1: Live data
Your running services and their Docker volumes. This is the data you use every day. It lives on your server's drives.
Copy 2: Local backup
A backup on a separate device on your local network. This could be:
- A NAS (Synology, TrueNAS, a Raspberry Pi with an external drive)
- A second internal drive in your server
- A USB drive that you keep plugged in
Use BorgBackup or Restic to create deduplicated, encrypted backups. Schedule them to run automatically -- nightly for most services, hourly for data with a tight RPO, like photos.
# Example: nightly Restic backup to local NAS
restic backup \
  /var/lib/docker/volumes/ \
  /opt/compose/ \
  /backups/db-dumps/ \
  --exclude '*.tmp' \
  --repo /mnt/nas/restic-repo
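The schedule itself can live in cron or a systemd timer. A minimal cron sketch, assuming the restic invocations above are wrapped in scripts -- the script names and times here are placeholders, not part of this guide's setup:

```shell
# /etc/cron.d/backups -- illustrative; adjust paths and times to your RPOs
# Nightly at 02:30 for general service data
30 2 * * * root /usr/local/bin/backup-nightly.sh
# Hourly for tight-RPO data such as the photo library
0 * * * * root /usr/local/bin/backup-photos.sh
```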
Copy 3: Off-site backup
A backup stored outside your home. This protects against fire, flood, theft, or a ransomware attack that encrypts everything on your network. Options:
- Backblaze B2: $6/TB/month. The go-to choice for most self-hosters. Restic has native B2 support.
- A friend's server: Run BorgBackup over SSH to a friend's machine. They back up to yours. Mutual off-site for free.
- Cloud object storage: AWS S3, Wasabi ($7/TB/month), or any S3-compatible provider.
# Restic to Backblaze B2
export B2_ACCOUNT_ID="your-id"
export B2_ACCOUNT_KEY="your-key"
restic backup /var/lib/docker/volumes/ --repo b2:mybucket:homelab
For a typical homelab with 200 GB of critical data, off-site backup costs roughly $1-2/month. That's cheaper than losing everything.
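That estimate is easy to recompute for your own dataset. A one-line sanity check with awk (the 200 GB figure is just the example above):

```shell
# Monthly off-site cost: size_in_GB / 1000 * rate_per_TB_month
awk -v gb=200 -v rate=6 'BEGIN { printf "$%.2f/month\n", gb / 1000 * rate }'
# -> $1.20/month
```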
Building a Recovery Kit
When disaster strikes, you need to be able to rebuild from scratch without access to the dead server. That means your recovery kit must exist independently of the infrastructure it protects.
What goes in the kit
Hardware inventory: What CPU, RAM, drives, and network configuration does your server need? Document it so you can buy replacement hardware quickly.
OS installation notes: Which Linux distribution, which version, any specific kernel parameters or driver requirements.
Docker Compose files: Every docker-compose.yml and .env file for every service you run. Keep these in a git repository.
Backup credentials: Repository passwords for Borg/Restic, cloud storage API keys, encryption keys. Store these in a password manager that isn't self-hosted (Bitwarden cloud, 1Password, or printed on paper in a safe).
DNS configuration: What domains point where? If you use Cloudflare, document your tunnel configs and DNS records.
Restore order: Which services need to come up first? Usually: reverse proxy, then databases, then applications.
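The restore order can be encoded in a small script so you don't have to remember it under stress. A sketch, assuming one compose directory per service under ~/compose; the directory names are illustrative, and the docker compose line is commented out so the script is safe to dry-run:

```shell
#!/bin/bash
set -euo pipefail

# Dependency order: reverse proxy, then databases, then applications
SERVICES=(traefik postgres nextcloud immich vaultwarden)

for svc in "${SERVICES[@]}"; do
  echo "Starting $svc"
  # (cd ~/compose/$svc && docker compose up -d)  # uncomment on the real host
done
```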
Recovery kit checklist
## Disaster Recovery Kit
### Credentials (stored in external password manager)
- [ ] Restic repository password
- [ ] Backblaze B2 API credentials
- [ ] Borg repository passphrase and key file
- [ ] Domain registrar login
- [ ] Cloudflare API token
- [ ] SSH keys for remote backup server
### Configuration (stored in git repo)
- [ ] All docker-compose.yml files
- [ ] All .env files (encrypted with age or gpg)
- [ ] Prometheus/Grafana configs
- [ ] Reverse proxy configs (Traefik/Caddy rules)
- [ ] Cron jobs and systemd timers
- [ ] /etc modifications (sysctl, fstab, etc.)
### Documentation (stored alongside configs)
- [ ] Hardware specifications
- [ ] Network diagram (IP assignments, VLANs)
- [ ] Service dependency map
- [ ] Step-by-step restore procedure
- [ ] Contact info for ISP, domain registrar
Step-by-Step Recovery Procedure
Write this procedure now, while everything is working. You will not think clearly during a real disaster.
Phase 1: Hardware (0-2 hours)
- Assess damage. Is the hardware salvageable? Can you swap a failed drive?
- If the machine is dead, provision a replacement. A VPS can serve as temporary infrastructure while you wait for hardware.
- Install the base OS. Use your documented installation notes.
Phase 2: Foundation (30 minutes)
# Install Docker
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
# Install restore tools
sudo apt install restic borgbackup age
# Clone your infrastructure repo
git clone https://github.com/yourname/homelab-config.git ~/compose
# Decrypt secrets
age -d -i ~/.secrets/age-key.txt secrets.tar.gz.age | tar xz
Phase 3: Restore data (1-4 hours depending on data volume)
# Restore from local backup (fastest)
restic restore latest \
  --target / \
  --include /var/lib/docker/volumes \
  --repo /mnt/nas/restic-repo
# Or from off-site backup (slower but always available)
restic restore latest \
  --target / \
  --include /var/lib/docker/volumes \
  --repo b2:mybucket:homelab
Phase 4: Restore databases (30 minutes)
Don't rely on volume restores for databases. Use your dump files:
# Start the database first
cd ~/compose/postgres && docker compose up -d
# Wait for it to be healthy
docker compose exec postgres pg_isready
# Restore from dump
gunzip -c /backups/db-dumps/pg_all_latest.sql.gz | \
  docker exec -i postgres psql -U appuser -d postgres
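The dump files this step reads have to be produced on a schedule long before any disaster. A sketch of a nightly dump job, assuming the same postgres container name and appuser role as the restore command above; the real invocation is commented out so the function itself can be exercised safely:

```shell
#!/bin/bash
set -euo pipefail

# dump_postgres DIR CMD: write a dated pg_dumpall to DIR and refresh the
# pg_all_latest.sql.gz symlink that the restore step reads.
dump_postgres() {
  local dump_dir=$1 dump_cmd=$2
  local stamp
  stamp=$(date +%F)
  mkdir -p "$dump_dir"
  $dump_cmd | gzip > "$dump_dir/pg_all_$stamp.sql.gz"
  ln -sf "pg_all_$stamp.sql.gz" "$dump_dir/pg_all_latest.sql.gz"
  # Prune dumps older than 14 days
  find "$dump_dir" -name 'pg_all_*.sql.gz' -mtime +14 -delete
}

# On the real host:
# dump_postgres /backups/db-dumps "docker exec postgres pg_dumpall -U appuser"
```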
Phase 5: Start services (30 minutes)
# Start services in dependency order
cd ~/compose/traefik && docker compose up -d
cd ~/compose/postgres && docker compose up -d # already running
cd ~/compose/nextcloud && docker compose up -d
cd ~/compose/immich && docker compose up -d
cd ~/compose/vaultwarden && docker compose up -d
Phase 6: Verify (1 hour)
- Log into every service and verify data is present
- Check that DNS is resolving correctly
- Verify SSL certificates are valid
- Run a test backup to confirm the backup pipeline works on the new system
- Check monitoring is reporting metrics
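The verification pass is worth scripting too, so it's repeatable after every test restore. A minimal smoke-test sketch; every hostname below is a placeholder for your own endpoints:

```shell
#!/bin/bash
# Post-restore smoke test: hit each service and report status.
set -uo pipefail

check() {
  local name=$1 url=$2
  # -k tolerates a certificate that hasn't reissued yet; drop it once certs are verified
  if curl -fsk -m 5 -o /dev/null "$url"; then
    echo "OK   $name"
  else
    echo "FAIL $name ($url)"
  fi
}

check vaultwarden https://vault.example.com
check nextcloud   https://cloud.example.com
check immich      https://photos.example.com
```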
Testing Your Recovery Plan
A recovery plan you've never tested is a hope, not a plan. Test regularly:
Monthly: Selective restore test
Pick a random service. Restore its database dump to a temporary database and verify the data is intact.
# Create a temporary PostgreSQL container
docker run --rm -d \
  --name test-restore \
  -e POSTGRES_PASSWORD=test \
  postgres:16-alpine
# Restore a dump into it
gunzip -c /backups/db-dumps/immich_2026-02-14.dump.gz | \
  docker exec -i test-restore pg_restore -U postgres -d postgres
# Verify
docker exec test-restore psql -U postgres -c "SELECT count(*) FROM assets;"
# Clean up
docker stop test-restore
Quarterly: Full DR simulation
Spin up a fresh VM or VPS. Follow your documented recovery procedure end-to-end. Time it. Fix the documentation where it's wrong (it will be wrong somewhere).
This is the most important thing you can do. Every quarterly test will reveal gaps in your documentation, missing credentials, or steps that no longer work because you changed something and forgot to update the plan.
After every infrastructure change
Added a new service? Changed a backup schedule? Moved to a new domain? Update the recovery kit immediately. The worst time to discover your DR docs are outdated is during an actual disaster.
Monitoring Backup Health
Backups that silently fail are worse than no backups -- they give you false confidence.
Healthchecks.io (free tier)
Create a check for each backup job. Ping the URL at the end of each successful backup. If the ping doesn't arrive on schedule, you get an email alert.
# Add to the end of your backup script
curl -fsS -m 10 --retry 5 https://hc-ping.com/your-uuid-here
Check backup freshness
Add a script that verifies your most recent backup is recent enough:
#!/bin/bash
set -euo pipefail
MAX_AGE_HOURS=26 # Alert if backup is older than 26 hours
LATEST=$(restic snapshots --latest 1 --json --repo /mnt/nas/restic-repo | \
  jq -r '.[0].time')
LATEST_EPOCH=$(date -d "$LATEST" +%s)
NOW_EPOCH=$(date +%s)
AGE_HOURS=$(( (NOW_EPOCH - LATEST_EPOCH) / 3600 ))
if [ "$AGE_HOURS" -gt "$MAX_AGE_HOURS" ]; then
  echo "CRITICAL: Latest backup is ${AGE_HOURS} hours old!"
  # Send notification via ntfy, email, or webhook
  curl -d "Backup is ${AGE_HOURS}h old (max: ${MAX_AGE_HOURS}h)" ntfy.sh/your-topic
  exit 1
fi
echo "OK: Latest backup is ${AGE_HOURS} hours old"
Monitor backup size
Sudden changes in backup size usually indicate a problem -- either data was deleted unexpectedly, or a new large dataset isn't being captured:
# Log backup sizes over time
restic stats latest --json --repo /mnt/nas/restic-repo | \
  jq '{date: (now | todate), size_gb: (.total_size / 1073741824 * 100 | round / 100)}'
Common Disaster Recovery Mistakes
Storing the recovery kit on the server it protects -- If the server dies, your recovery docs die with it. Keep the kit in a git repo, a cloud storage bucket, or both.
Backing up volumes but not configurations -- Your Docker Compose files, environment variables, and system configurations are just as critical as your data. A database dump without the application configuration to use it is half a recovery.
Not encrypting off-site backups -- Both Borg and Restic encrypt by default. But if you're using rsync to a remote server, that data is plaintext. Always encrypt before sending data off-site.
Relying on RAID as a backup -- RAID protects against drive failure. It does not protect against accidental deletion, ransomware, fire, or filesystem corruption. RAID is uptime insurance. Backups are data insurance. You need both.
Never testing the recovery -- This is the single most common and most dangerous mistake. Test quarterly. No exceptions.
The Bottom Line
Disaster recovery planning is unglamorous work. Nobody posts their DR runbooks on Reddit for karma. But the self-hosters who weather hardware failures, ransomware incidents, and accidental rm -rf disasters without losing data are the ones who did this work ahead of time.
Spend an afternoon building your recovery kit, writing your restore procedure, and testing it once. Then schedule quarterly tests. When the inevitable hardware failure happens, you'll be annoyed for an afternoon instead of devastated for months.
