Monitoring and Alerting for Self-Hosted Services: A Practical Operations Guide
Running self-hosted services without monitoring is like driving without a dashboard. You don't know how fast you're going, how much fuel you have, or that the engine temperature is climbing -- until something breaks and you're stranded on the side of the road.
Most self-hosters start with Uptime Kuma for basic "is it up?" checks, which is a good first step. But as your homelab grows, you need deeper visibility: why is Nextcloud slow? Is the database running out of connections? Which container is eating all the RAM? When will you run out of disk space?
This guide builds a complete observability stack from the ground up, with a focus on practical operations -- not just pretty dashboards, but actionable alerts and recovery procedures.

The Three Pillars of Observability
Modern monitoring is built on three complementary data types:
Metrics -- Numeric measurements over time. CPU usage, request latency, disk space, error rates. Prometheus is the standard tool for collecting and querying metrics. Metrics tell you what is happening.
Logs -- Timestamped text records from your applications. Error messages, access logs, audit trails. Loki (from the Grafana team) is the standard tool for aggregating and searching logs. Logs tell you why something happened.
Traces -- The path of a request through multiple services. Less relevant for most homelabs but critical for microservice architectures. Tempo or Jaeger handle traces. Traces tell you where in the chain something went wrong.
For a homelab, metrics and logs cover 95% of your needs. Here's how to set up both.
The Full Stack: Docker Compose
This deploys the complete monitoring stack -- Prometheus for metrics, Grafana for visualization, Loki for logs, and the necessary exporters:
```yaml
services:
  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "127.0.0.1:9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alert-rules.yml:/etc/prometheus/alert-rules.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=90d'
      - '--storage.tsdb.retention.size=10GB'
      - '--web.enable-lifecycle'
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:11.0.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "127.0.0.1:3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_SERVER_ROOT_URL: https://grafana.yourdomain.com
    networks:
      - monitoring

  loki:
    image: grafana/loki:3.0.0
    container_name: loki
    restart: unless-stopped
    ports:
      - "127.0.0.1:3100:3100"
    volumes:
      - ./loki/loki-config.yml:/etc/loki/local-config.yaml
      - loki_data:/loki
    command: -config.file=/etc/loki/local-config.yaml
    networks:
      - monitoring

  promtail:
    image: grafana/promtail:3.0.0
    container_name: promtail
    restart: unless-stopped
    volumes:
      - ./promtail/promtail-config.yml:/etc/promtail/config.yml
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    command: -config.file=/etc/promtail/config.yml
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v1.8.0
    container_name: node-exporter
    restart: unless-stopped
    pid: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /dev/disk:/dev/disk:ro
    devices:
      - /dev/kmsg
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "127.0.0.1:9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:
  loki_data:
  alertmanager_data:

networks:
  monitoring:
```
That's seven containers, but they're all lightweight. The entire stack uses roughly 1-2 GB of RAM on a typical homelab.
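The Grafana service reads `${GRAFANA_ADMIN_PASSWORD}` from the environment; Docker Compose interpolates it from a `.env` file sitting next to the compose file. A minimal sketch (the value is a placeholder -- and keep this file out of version control):

```
# .env -- lives next to docker-compose.yml
GRAFANA_ADMIN_PASSWORD=change-me-to-something-long
```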
Configuration Files
Prometheus configuration
```yaml
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - /etc/prometheus/alert-rules.yml

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
  - job_name: 'loki'
    static_configs:
      - targets: ['loki:3100']
```
Loki configuration
```yaml
# loki/loki-config.yml
auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  retention_period: 30d

compactor:
  working_directory: /loki/compactor
  retention_enabled: true
  # Loki 3.x requires a delete request store when retention is enabled
  delete_request_store: filesystem
```
Promtail configuration
```yaml
# promtail/promtail-config.yml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    static_configs:
      - targets:
          - localhost
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*-json.log
    pipeline_stages:
      - docker: {}
      - labeldrop:
          - filename
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: syslog
          __path__: /var/log/syslog
```
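Docker's default `json-file` logging driver writes one JSON object per line, which is exactly what the `docker: {}` pipeline stage unpacks into a message, a stream label, and a timestamp. A minimal Python sketch of that parse (the sample log line is hypothetical):

```python
import json

# One line from /var/lib/docker/containers/<id>/<id>-json.log,
# as written by the json-file logging driver
raw = '{"log":"GET /status 200\\n","stream":"stdout","time":"2024-05-01T12:00:00.000000000Z"}'

entry = json.loads(raw)
message = entry["log"].rstrip("\n")   # the actual log text
stream = entry["stream"]              # "stdout" or "stderr"
timestamp = entry["time"]             # RFC3339Nano timestamp

print(message, stream, timestamp)
```

Promtail turns `stream` into a label and uses `time` as the entry's timestamp, so your Loki queries can filter stderr-only without any extra parsing.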
Alert Strategy: Avoiding Alert Fatigue
The biggest mistake in monitoring is alerting on everything. When every minor fluctuation triggers a notification, you stop paying attention -- and then you miss the alerts that actually matter. This is alert fatigue, and it kills the value of monitoring entirely.
The alert severity model
Structure your alerts into three tiers:
| Severity | Action | Notification method | Example |
|---|---|---|---|
| Critical | Immediate action required | Push notification (ntfy, PagerDuty) | Disk 95% full, service down >5 min |
| Warning | Investigate within hours | Email or Discord | CPU >80% for 30 min, disk 80% full |
| Info | Review during maintenance | Dashboard only, no notification | Backup completed, certificate renewed |
Alert rules that actually work
```yaml
# prometheus/alert-rules.yml
groups:
  - name: critical
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} is down on {{ $labels.instance }}"
          description: "{{ $labels.job }} has been unreachable for more than 5 minutes."
      - alert: DiskSpaceCritical
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 5
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Root filesystem nearly full ({{ $value | printf \"%.1f\" }}% free)"
      - alert: MemoryExhausted
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Memory usage above 95% for 5 minutes"

  - name: warning
    rules:
      - alert: DiskSpaceWarning
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Root filesystem below 20% free ({{ $value | printf \"%.1f\" }}% remaining)"
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[10m])) * 100) > 80
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 80% for 30 minutes"
      - alert: ContainerRestarting
        # cAdvisor has no restart-count metric; count changes in the
        # container's start time instead
        expr: changes(container_start_time_seconds{name!=""}[1h]) > 3
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} has restarted {{ $value }} times in the last hour"
      - alert: DiskWillFillIn24Hours
        expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24 * 3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk will be full within 24 hours at current growth rate"
```
Notice the design principles at work:
- Critical alerts have short `for` durations -- you want to know immediately when something is truly broken.
- Warning alerts have longer `for` durations -- brief CPU spikes are normal; sustained high usage is a problem.
- Predictive alerts -- the `predict_linear` function warns you before you run out of disk, not after.
- No informational alerts -- those go to the dashboard, not your phone.
Alertmanager routing
```yaml
# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: default
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity = critical
      receiver: push-notification
      repeat_interval: 1h
    - matchers:
        - severity = warning
      receiver: email
      repeat_interval: 12h

receivers:
  - name: default
    webhook_configs: []
  - name: push-notification
    webhook_configs:
      # ntfy doesn't speak Alertmanager's webhook format natively; put a
      # small bridge (e.g. ntfy-alertmanager) between them
      - url: 'https://ntfy.yourdomain.com/alerts'
        send_resolved: true
  - name: email
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.yourdomain.com:587'
        auth_username: '[email protected]'
        auth_password: 'your-smtp-password'
        send_resolved: true
```
Critical alerts go to push notifications. Warnings go to email. Everything else stays on the dashboard.
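Alertmanager delivers webhooks as a JSON payload with a top-level `status` ("firing" or "resolved") and an `alerts` list, each entry carrying `labels` and `annotations`. A push service like ntfy wants a plain title and message, so a small bridge reshapes one into the other. A hedged Python sketch of that reshaping (`to_ntfy_message` and the sample payload are illustrative, not a real library):

```python
def to_ntfy_message(payload: dict) -> tuple[str, str]:
    """Reduce an Alertmanager webhook payload to a push title and body.

    Assumes the documented webhook shape: top-level "status" plus an
    "alerts" list whose entries carry "labels" and "annotations".
    """
    status = payload["status"]
    alerts = payload["alerts"]
    names = sorted({a["labels"].get("alertname", "unknown") for a in alerts})
    title = f"[{status.upper()}] {', '.join(names)}"
    # One line per alert: prefer the human-written summary annotation
    body = "\n".join(
        a["annotations"].get("summary", a["labels"].get("alertname", ""))
        for a in alerts
    )
    return title, body

# Example payload shaped like Alertmanager's webhook format
example = {
    "status": "firing",
    "alerts": [
        {"labels": {"alertname": "ServiceDown", "severity": "critical"},
         "annotations": {"summary": "nextcloud is down on host1"}},
    ],
}
title, body = to_ntfy_message(example)
print(title)  # [FIRING] ServiceDown
```

Because alerts arrive grouped (per the `group_by` setting), one push can cover several related alerts instead of flooding your phone.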
Operational Runbooks
An alert without a runbook is just an anxiety generator. For every alert, write a brief runbook that tells future-you (or anyone else) exactly what to do:
Example: DiskSpaceCritical runbook
```markdown
## Alert: DiskSpaceCritical

### Symptoms
Root filesystem has less than 5% free space.

### Impact
Services will crash when they can't write data. Databases are especially
vulnerable to corruption from out-of-disk conditions.

### Immediate actions
1. Check what's using space: `du -sh /* | sort -rh | head -20`
2. Check Docker's disk usage: `docker system df`
3. Clean Docker: `docker system prune -a` (removes unused images)
4. Check logs: `du -sh /var/log/*` -- rotate or truncate large logs

### If Docker is the culprit
- `docker system prune -a --volumes` (WARNING: removes unused volumes)
- Check for containers with no log limits; add logging config

### If /var/log is the culprit
- `journalctl --vacuum-size=500M`
- Check for applications writing excessive logs

### Prevention
- Set Docker log limits in daemon.json
- Set Prometheus retention with --storage.tsdb.retention.size
- Set a Loki retention period
```
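The prevention step about Docker log limits is worth spelling out, since unbounded container logs are the most common silent disk-filler. In `/etc/docker/daemon.json` it looks like this (applies only to containers created after a Docker restart):

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
```

With these limits, each container keeps at most three 10 MB log files before the oldest is rotated away.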
Keep these runbooks alongside your monitoring configuration in your git repository. When an alert fires at 2 AM, you want to follow a checklist, not debug from memory.
Capacity Planning
Monitoring data is useless if you only look at it when something breaks. Build a capacity planning dashboard that answers:
When will I run out of disk?

```promql
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[30d], 90 * 24 * 3600)
```

This predicts free disk space 90 days from now based on the trend over the last 30 days.

How is memory usage trending?

```promql
avg_over_time((1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)[30d:1d])
```

Which containers are growing?

```promql
topk(10, delta(container_fs_usage_bytes[7d]))
```

Note the use of `delta` rather than `rate` here: `container_fs_usage_bytes` is a gauge, and `rate` is only meaningful for counters.
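Under the hood, `predict_linear` fits a least-squares line through the samples in the range and extrapolates it forward. A self-contained Python sketch of the same idea, using hypothetical samples of free disk bytes:

```python
def predict_linear(samples, horizon_seconds):
    """Least-squares fit over (timestamp, value) samples, extrapolated
    horizon_seconds past the last sample -- the same idea as PromQL's
    predict_linear(). Assumes at least two distinct timestamps."""
    n = len(samples)
    ts = [t for t, _ in samples]
    vs = [v for _, v in samples]
    t_mean = sum(ts) / n
    v_mean = sum(vs) / n
    slope = (sum((t - t_mean) * (v - v_mean) for t, v in samples)
             / sum((t - t_mean) ** 2 for t in ts))
    intercept = v_mean - slope * t_mean
    return slope * (ts[-1] + horizon_seconds) + intercept

# Hypothetical trend: 100 GB free, shrinking by 1 GB per day for 30 days
DAY = 86400
samples = [(i * DAY, 100e9 - i * 1e9) for i in range(30)]
forecast = predict_linear(samples, 90 * DAY)
print(forecast)  # negative: the disk fills well before 90 days
```

A negative forecast is your signal to act now, which is exactly what the `DiskWillFillIn24Hours` alert automates on a shorter horizon.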
Review this dashboard monthly. If a trend line shows you'll run out of resources in the next quarter, plan ahead -- add storage, increase RAM, or optimize the offending service.
Grafana Dashboards to Import
Don't build dashboards from scratch. Import these community dashboards and customize as needed:
| Dashboard ID | Name | What it monitors |
|---|---|---|
| 1860 | Node Exporter Full | System metrics (CPU, RAM, disk, network) |
| 14282 | cAdvisor Exporter | Per-container resource usage |
| 13639 | Loki & Promtail | Log volume, error rates |
| 3662 | Prometheus 2.0 Overview | Prometheus self-monitoring |
| 9628 | PostgreSQL Database | Database performance (requires postgres_exporter) |
In Grafana, go to Dashboards > Import, enter the dashboard ID, select your data source, and you'll have professional-looking dashboards in seconds.
The Honest Take
Building a monitoring stack takes an afternoon. Maintaining it takes a few hours per month -- updating images, tuning alert thresholds, adding exporters for new services, and occasionally expanding storage.
The investment pays for itself the first time you catch a disk filling up before it crashes your database, or notice a memory leak before it takes down your server. Without monitoring, these problems announce themselves as outages. With monitoring, they're items on your maintenance checklist.
Start with the full Docker Compose stack above, import the recommended dashboards, configure the alert rules, and write a runbook for each alert. That's your operations foundation. Everything else -- additional exporters, custom dashboards, long-term storage with Thanos -- can be added incrementally as your homelab grows.
