Monitoring and Alerting for Self-Hosted Services: A Practical Operations Guide
Running self-hosted services without monitoring is like driving without a dashboard. You don't know how fast you're going, how much fuel you have, or that the engine temperature is climbing -- until something breaks and you're stranded on the side of the road.
Most self-hosters start with Uptime Kuma for basic "is it up?" checks, which is a good first step. But as your homelab grows, you need deeper visibility: why is Nextcloud slow? Is the database running out of connections? Which container is eating all the RAM? When will you run out of disk space?
This guide builds a complete observability stack from the ground up, with a focus on practical operations -- not just pretty dashboards, but actionable alerts and recovery procedures.

The Three Pillars of Observability
Modern monitoring is built on three complementary data types:
Metrics -- Numeric measurements over time. CPU usage, request latency, disk space, error rates. Prometheus is the standard tool for collecting and querying metrics. Metrics tell you what is happening.
Logs -- Timestamped text records from your applications. Error messages, access logs, audit trails. Loki (from the Grafana team) is the standard tool for aggregating and searching logs. Logs tell you why something happened.
Traces -- The path of a request through multiple services. Less relevant for most homelabs but critical for microservice architectures. Tempo or Jaeger handle traces. Traces tell you where in the chain something went wrong.
For a homelab, metrics and logs cover 95% of your needs. Here's how to set up both.
The Full Stack: Docker Compose
This deploys the complete monitoring stack -- Prometheus for metrics, Grafana for visualization, Loki for logs, and the necessary exporters:
```yaml
services:
  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "127.0.0.1:9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alert-rules.yml:/etc/prometheus/alert-rules.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=90d'
      - '--storage.tsdb.retention.size=10GB'
      - '--web.enable-lifecycle'
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:11.0.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "127.0.0.1:3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_SERVER_ROOT_URL: https://grafana.yourdomain.com
    networks:
      - monitoring

  loki:
    image: grafana/loki:3.0.0
    container_name: loki
    restart: unless-stopped
    ports:
      - "127.0.0.1:3100:3100"
    volumes:
      - ./loki/loki-config.yml:/etc/loki/local-config.yaml
      - loki_data:/loki
    command: -config.file=/etc/loki/local-config.yaml
    networks:
      - monitoring

  promtail:
    image: grafana/promtail:3.0.0
    container_name: promtail
    restart: unless-stopped
    volumes:
      - ./promtail/promtail-config.yml:/etc/promtail/config.yml
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    command: -config.file=/etc/promtail/config.yml
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v1.8.0
    container_name: node-exporter
    restart: unless-stopped
    pid: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /dev/disk:/dev/disk:ro
    devices:
      - /dev/kmsg
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "127.0.0.1:9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:
  loki_data:
  alertmanager_data:

networks:
  monitoring:
```
That's seven containers, but they're all lightweight. The entire stack uses roughly 1-2 GB of RAM on a typical homelab.
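The Grafana service reads `${GRAFANA_ADMIN_PASSWORD}` from the environment; Docker Compose interpolates it from a `.env` file sitting next to the compose file. A minimal sketch (the value is a placeholder -- and keep this file out of version control):

```
# .env -- lives next to docker-compose.yml
GRAFANA_ADMIN_PASSWORD=change-me-to-something-long
```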
Configuration Files
Prometheus configuration
```yaml
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - /etc/prometheus/alert-rules.yml

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
  - job_name: 'loki'
    static_configs:
      - targets: ['loki:3100']
```
Loki configuration
```yaml
# loki/loki-config.yml
auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  retention_period: 30d

compactor:
  working_directory: /loki/compactor
  retention_enabled: true
  # Loki 3.x requires a delete request store when retention is enabled
  delete_request_store: filesystem
```
Promtail configuration
```yaml
# promtail/promtail-config.yml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    static_configs:
      - targets:
          - localhost
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*-json.log
    pipeline_stages:
      - docker: {}
      - labeldrop:
          - filename
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: syslog
          __path__: /var/log/syslog
```
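Docker's default `json-file` logging driver writes one JSON object per line, which is exactly what the `docker: {}` pipeline stage unpacks into a message, a stream label, and a timestamp. A minimal Python sketch of that parse (the sample log line is hypothetical):

```python
import json

# One line from /var/lib/docker/containers/<id>/<id>-json.log,
# as written by the json-file logging driver
raw = '{"log":"GET /status 200\\n","stream":"stdout","time":"2024-05-01T12:00:00.000000000Z"}'

entry = json.loads(raw)
message = entry["log"].rstrip("\n")   # the actual log text
stream = entry["stream"]              # "stdout" or "stderr"
timestamp = entry["time"]             # RFC3339Nano timestamp

print(message, stream, timestamp)
```

Promtail turns `stream` into a label and uses `time` as the entry's timestamp, so your Loki queries can filter stderr-only without any extra parsing.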
Alert Strategy: Avoiding Alert Fatigue
The biggest mistake in monitoring is alerting on everything. When every minor fluctuation triggers a notification, you stop paying attention -- and then you miss the alerts that actually matter. This is alert fatigue, and it kills the value of monitoring entirely.
The alert severity model
Structure your alerts into three tiers:
| Severity | Action | Notification method | Example |
|---|---|---|---|
| Critical | Immediate action required | Push notification (ntfy, PagerDuty) | Disk 95% full, service down >5 min |
| Warning | Investigate within hours | Email or Discord | CPU >80% for 30 min, disk 80% full |
| Info | Review during maintenance | Dashboard only, no notification | Backup completed, certificate renewed |
Alert rules that actually work
```yaml
# prometheus/alert-rules.yml
groups:
  - name: critical
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} is down on {{ $labels.instance }}"
          description: "{{ $labels.job }} has been unreachable for more than 5 minutes."
      - alert: DiskSpaceCritical
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 5
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Root filesystem nearly full ({{ $value | printf \"%.1f\" }}% free)"
      - alert: MemoryExhausted
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Memory usage above 95% for 5 minutes"

  - name: warning
    rules:
      - alert: DiskSpaceWarning
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Root filesystem below 20% free ({{ $value | printf \"%.1f\" }}% remaining)"
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[10m])) * 100) > 80
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 80% for 30 minutes"
      - alert: ContainerRestarting
        # cAdvisor has no restart-count metric; count changes in the
        # container's start time instead
        expr: changes(container_start_time_seconds{name!=""}[1h]) > 3
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} has restarted {{ $value }} times in the last hour"
      - alert: DiskWillFillIn24Hours
        expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24 * 3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk will be full within 24 hours at current growth rate"
```
Notice the design principles at work:
- Critical alerts have short `for` durations -- you want to know immediately when something is truly broken.
- Warning alerts have longer `for` durations -- brief CPU spikes are normal; sustained high usage is a problem.
- Predictive alerts -- the `predict_linear` function warns you before you run out of disk, not after.
- No informational alerts -- those go to the dashboard, not your phone.
Alertmanager routing
```yaml
# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: default
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity = critical
      receiver: push-notification
      repeat_interval: 1h
    - matchers:
        - severity = warning
      receiver: email
      repeat_interval: 12h

receivers:
  - name: default
    webhook_configs: []
  - name: push-notification
    webhook_configs:
      # ntfy doesn't speak Alertmanager's webhook format natively; put a
      # small bridge (e.g. ntfy-alertmanager) between them
      - url: 'https://ntfy.yourdomain.com/alerts'
        send_resolved: true
  - name: email
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.yourdomain.com:587'
        auth_username: '[email protected]'
        auth_password: 'your-smtp-password'
        send_resolved: true
```
Critical alerts go to push notifications. Warnings go to email. Everything else stays on the dashboard.
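Alertmanager delivers webhooks as a JSON payload with a top-level `status` ("firing" or "resolved") and an `alerts` list, each entry carrying `labels` and `annotations`. A push service like ntfy wants a plain title and message, so a small bridge reshapes one into the other. A hedged Python sketch of that reshaping (`to_ntfy_message` and the sample payload are illustrative, not a real library):

```python
def to_ntfy_message(payload: dict) -> tuple[str, str]:
    """Reduce an Alertmanager webhook payload to a push title and body.

    Assumes the documented webhook shape: top-level "status" plus an
    "alerts" list whose entries carry "labels" and "annotations".
    """
    status = payload["status"]
    alerts = payload["alerts"]
    names = sorted({a["labels"].get("alertname", "unknown") for a in alerts})
    title = f"[{status.upper()}] {', '.join(names)}"
    # One line per alert: prefer the human-written summary annotation
    body = "\n".join(
        a["annotations"].get("summary", a["labels"].get("alertname", ""))
        for a in alerts
    )
    return title, body

# Example payload shaped like Alertmanager's webhook format
example = {
    "status": "firing",
    "alerts": [
        {"labels": {"alertname": "ServiceDown", "severity": "critical"},
         "annotations": {"summary": "nextcloud is down on host1"}},
    ],
}
title, body = to_ntfy_message(example)
print(title)  # [FIRING] ServiceDown
```

Because alerts arrive grouped (per the `group_by` setting), one push can cover several related alerts instead of flooding your phone.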
Operational Runbooks
An alert without a runbook is just an anxiety generator. For every alert, write a brief runbook that tells future-you (or anyone else) exactly what to do:
Example: DiskSpaceCritical runbook
```markdown
## Alert: DiskSpaceCritical

### Symptoms
Root filesystem has less than 5% free space.

### Impact
Services will crash when they can't write data. Databases are especially
vulnerable to corruption from out-of-disk conditions.

### Immediate actions
1. Check what's using space: `du -sh /* | sort -rh | head -20`
2. Check Docker's disk usage: `docker system df`
3. Clean Docker: `docker system prune -a` (removes unused images)
4. Check logs: `du -sh /var/log/*` -- rotate or truncate large logs

### If Docker is the culprit
- `docker system prune -a --volumes` (WARNING: removes unused volumes)
- Check for containers with no log limits; add logging config

### If /var/log is the culprit
- `journalctl --vacuum-size=500M`
- Check for applications writing excessive logs

### Prevention
- Set Docker log limits in daemon.json
- Set Prometheus retention with --storage.tsdb.retention.size
- Set a Loki retention period
```
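The prevention step about Docker log limits is worth spelling out, since unbounded container logs are the most common silent disk-filler. In `/etc/docker/daemon.json` it looks like this (applies only to containers created after a Docker restart):

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
```

With these limits, each container keeps at most three 10 MB log files before the oldest is rotated away.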
Keep these runbooks alongside your monitoring configuration in your git repository. When an alert fires at 2 AM, you want to follow a checklist, not debug from memory.
Capacity Planning
Monitoring data is useless if you only look at it when something breaks. Build a capacity planning dashboard that answers:
When will I run out of disk?

```promql
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[30d], 90 * 24 * 3600)
```

This predicts free disk space 90 days from now based on the trend over the last 30 days.

How is memory usage trending?

```promql
avg_over_time((1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)[30d:1d])
```

Which containers are growing?

```promql
topk(10, delta(container_fs_usage_bytes[7d]))
```

Note the use of `delta` rather than `rate` here: `container_fs_usage_bytes` is a gauge, and `rate` is only meaningful for counters.
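Under the hood, `predict_linear` fits a least-squares line through the samples in the range and extrapolates it forward. A self-contained Python sketch of the same idea, using hypothetical samples of free disk bytes:

```python
def predict_linear(samples, horizon_seconds):
    """Least-squares fit over (timestamp, value) samples, extrapolated
    horizon_seconds past the last sample -- the same idea as PromQL's
    predict_linear(). Assumes at least two distinct timestamps."""
    n = len(samples)
    ts = [t for t, _ in samples]
    vs = [v for _, v in samples]
    t_mean = sum(ts) / n
    v_mean = sum(vs) / n
    slope = (sum((t - t_mean) * (v - v_mean) for t, v in samples)
             / sum((t - t_mean) ** 2 for t in ts))
    intercept = v_mean - slope * t_mean
    return slope * (ts[-1] + horizon_seconds) + intercept

# Hypothetical trend: 100 GB free, shrinking by 1 GB per day for 30 days
DAY = 86400
samples = [(i * DAY, 100e9 - i * 1e9) for i in range(30)]
forecast = predict_linear(samples, 90 * DAY)
print(forecast)  # negative: the disk fills well before 90 days
```

A negative forecast is your signal to act now, which is exactly what the `DiskWillFillIn24Hours` alert automates on a shorter horizon.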
Review this dashboard monthly. If a trend line shows you'll run out of resources in the next quarter, plan ahead -- add storage, increase RAM, or optimize the offending service.
Grafana Dashboards to Import
Don't build dashboards from scratch. Import these community dashboards and customize as needed:
| Dashboard ID | Name | What it monitors |
|---|---|---|
| 1860 | Node Exporter Full | System metrics (CPU, RAM, disk, network) |
| 14282 | cAdvisor Exporter | Per-container resource usage |
| 13639 | Loki & Promtail | Log volume, error rates |
| 3662 | Prometheus 2.0 Overview | Prometheus self-monitoring |
| 9628 | PostgreSQL Database | Database performance (requires postgres_exporter) |
In Grafana, go to Dashboards > Import, enter the dashboard ID, select your data source, and you'll have professional-looking dashboards in seconds.
The Honest Take
Building a monitoring stack takes an afternoon. Maintaining it takes a few hours per month -- updating images, tuning alert thresholds, adding exporters for new services, and occasionally expanding storage.
The investment pays for itself the first time you catch a disk filling up before it crashes your database, or notice a memory leak before it takes down your server. Without monitoring, these problems announce themselves as outages. With monitoring, they're items on your maintenance checklist.
Start with the full Docker Compose stack above, import the recommended dashboards, configure the alert rules, and write a runbook for each alert. That's your operations foundation. Everything else -- additional exporters, custom dashboards, long-term storage with Thanos -- can be added incrementally as your homelab grows.
