
Centralized Logging with Loki

Guide for setting up and using the centralized logging stack (Loki + Promtail + Grafana).

Overview

The logging stack provides centralized log aggregation and visualization for all Docker containers:

  • Loki: Log aggregation backend (stores and indexes logs)
  • Promtail: Agent that collects logs from Docker containers
  • Grafana: Web UI for querying and visualizing logs

Why Centralized Logging?

Problems without it:

  • Logs scattered across many containers
  • Hard to correlate events across services
  • Logs lost when containers restart
  • No easy way to search historical logs

Benefits:

  • Single place to view all logs
  • Powerful search and filtering (LogQL)
  • Persist logs even after container restarts
  • Correlate events across services
  • Create dashboards and alerts
  • Configurable retention (30 days default)

Quick Setup

1. Configure Grafana Password

cd ~/homelab/compose/monitoring/logging
nano .env

Update:

GF_SECURITY_ADMIN_PASSWORD=<your-strong-password>

Generate password:

openssl rand -base64 20
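If you prefer a single step, a small sketch that generates the password and writes it into `.env` directly (assumes `.env` already contains a `GF_SECURITY_ADMIN_PASSWORD=` line, as above):

```shell
# Generate a password and update .env in place.
# base64 output never contains "|", so it is safe inside this sed expression.
NEW_PASS=$(openssl rand -base64 20)
sed -i "s|^GF_SECURITY_ADMIN_PASSWORD=.*|GF_SECURITY_ADMIN_PASSWORD=${NEW_PASS}|" .env
grep GF_SECURITY_ADMIN_PASSWORD .env  # confirm the update took
```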

2. Deploy

cd ~/homelab/compose/monitoring/logging
docker compose up -d

3. Access Grafana

Go to: https://logs.fig.systems

Login:

  • Username: admin
  • Password: <your GF_SECURITY_ADMIN_PASSWORD>

4. Start Exploring Logs

  1. Click Explore (compass icon) in left sidebar
  2. Loki datasource should be selected
  3. Start querying!

Basic Usage

View Logs from a Container

{container="jellyfin"}

View Logs in a Specific Time Window

{container="immich_server"}

Set the window (e.g. "Last 1 hour") with Grafana's time picker; LogQL itself has no inline timestamp filter.

Filter for Errors

{container="traefik"} |= "error"

Exclude Lines

{container="traefik"} != "404"

Multiple Containers

{container=~"jellyfin|immich.*"}

By Compose Project

{compose_project="media"}

Advanced Queries

Count Errors

sum(count_over_time({container="jellyfin"} |= "error" [5m]))

Error Rate

rate({container="traefik"} |= "error" [5m])

Parse JSON Logs

{container="linkwarden"} | json | level="error"

Top 10 Error Messages

topk(10,
  sum by (container) (
    count_over_time({job="docker"} |= "error" [24h])
  )
)
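The same queries can be run outside Grafana against Loki's HTTP API, which is handy for scripting. A sketch; it assumes port 3100 is reachable from the host (the public `loki.fig.systems` endpoint sits behind SSO):

```shell
# Ask Loki's query_range endpoint for recent error lines and
# extract just the raw log lines from the JSON response with jq.
# Each entry in .values is a [timestamp, line] pair.
curl -G -s "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={container="jellyfin"} |= "error"' \
  --data-urlencode 'limit=20' \
  | jq -r '.data.result[].values[][1]'
```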

Creating Dashboards

Import Pre-built Dashboard

  1. Go to Dashboards → Import
  2. Dashboard ID: 13639 (Docker logs)
  3. Select Loki as datasource
  4. Click Import

Create Custom Dashboard

  1. Click + → Dashboard
  2. Add panel
  3. Select Loki datasource
  4. Build query
  5. Choose visualization (logs, graph, table, etc.)
  6. Save

Example panels:

  • Error count by container
  • Log volume over time
  • Recent errors (table)
  • Top logging containers

Setting Up Alerts

Create Alert Rule

  1. Go to Alerting → Alert rules → New alert rule
  2. Query:
    sum(count_over_time({container="jellyfin"} |= "error" [5m])) > 10
    
  3. Condition: Alert when > 10 errors in 5 minutes
  4. Configure notification channel (email, webhook, etc.)
  5. Save

Example alerts:

  • Too many errors in service
  • Service stopped logging (might have crashed)
  • Authentication failures
  • Disk space warnings

Configuration

Change Log Retention

Default: 30 days

Edit .env:

LOKI_RETENTION_PERIOD=60d  # 60 days

Edit loki-config.yaml:

limits_config:
  retention_period: 60d

table_manager:
  retention_period: 60d

Restart:

docker compose restart loki
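To confirm the new value was actually loaded, Loki exposes its effective runtime configuration at `/config` (a quick check; assumes `wget` is available inside the container, as in the troubleshooting steps in this guide):

```shell
# Print the retention settings Loki is actually running with
docker exec loki wget -qO- http://localhost:3100/config | grep retention
```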

Adjust Resource Limits

For low-resource systems, edit loki-config.yaml:

limits_config:
  retention_period: 7d              # Shorter retention
  ingestion_rate_mb: 5              # Lower rate

query_range:
  results_cache:
    cache:
      embedded_cache:
        max_size_mb: 50             # Smaller cache

Add Labels to Services

Make services easier to find by adding labels:

Edit service compose.yaml:

services:
  myservice:
    labels:
      logging: "promtail"
      environment: "production"
      tier: "frontend"

Query with these labels:

{environment="production", tier="frontend"}
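You can verify the new labels are actually being ingested by asking Loki's label API which values it has seen (assumes port 3100 is reachable from the host):

```shell
# List all known values for the "environment" label
curl -s http://localhost:3100/loki/api/v1/label/environment/values | jq '.data'
```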

Troubleshooting

No Logs Appearing

Wait a few minutes: initial log collection takes time.

Check Promtail:

docker logs promtail

Check Loki:

docker logs loki

Verify Promtail can reach Loki:

docker exec promtail wget -O- http://loki:3100/ready

Grafana Can't Connect to Loki

Test from Grafana:

docker exec grafana wget -O- http://loki:3100/ready

Check datasource: Grafana → Configuration → Data sources → Loki

  • URL should be: http://loki:3100

High Disk Usage

Check size:

du -sh compose/monitoring/logging/loki-data
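To see which part of Loki's storage is actually growing, break the usage down per subdirectory:

```shell
# Per-directory breakdown, largest first (chunks/ is usually the culprit)
du -sh loki-data/* 2>/dev/null | sort -rh
```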

Reduce retention:

LOKI_RETENTION_PERIOD=7d

Manual cleanup (CAREFUL):

docker compose stop loki
rm -rf loki-data/chunks/*
docker compose start loki

Slow Queries

Optimize queries:

  • Use specific labels: {container="name"} not {container=~".*"}
  • Limit time range: Hours not days
  • Filter early: |= "error" before parsing
  • Avoid complex regex

Best Practices

Log Verbosity

Configure appropriate log levels per environment:

  • Production: info or warning
  • Debugging: debug or trace

Too verbose = wasted resources!

Retention Strategy

Match retention to importance:

  • Critical services: 60-90 days
  • Normal services: 30 days
  • High-volume services: 7-14 days

Useful Queries to Save

Create saved queries for common tasks:

Recent errors (set the time picker to "Last 15 minutes"):

{job="docker"} |= "error"

Service health check:

{container="traefik"} |= "request"

Failed logins:

{container="lldap"} |= "failed" |= "login"

Integration Tips

Embed in Homarr

Add Grafana dashboards to Homarr:

  1. Edit Homarr dashboard
  2. Add iFrame widget
  3. URL: https://logs.fig.systems/d/<dashboard-id>

Use with Backups

Include logging data in backups:

cd ~/homelab/compose/monitoring/logging
tar czf logging-backup-$(date +%Y%m%d).tar.gz loki-data/ grafana-data/

Combine with Metrics

Later you can add Prometheus for metrics:

  • Loki for logs
  • Prometheus for metrics (CPU, RAM, disk)
  • Both in Grafana dashboards

Common LogQL Patterns

Filter by Time

# Time ranges come from the Grafana time picker, not from LogQL itself:
# select "Last 5 minutes", "Last 1 hour", or an absolute range there, then query
{container="name"}

Pattern Matching

# Contains
{container="name"} |= "error"

# Does not contain
{container="name"} != "404"

# Regex match
{container="name"} |~ "error|fail|critical"

# Regex does not match
{container="name"} !~ "debug|trace"

Aggregations

# Count
count_over_time({container="name"}[5m])

# Rate
rate({container="name"}[5m])

# Sum
sum(count_over_time({job="docker"}[1h])) by (container)

# Average
avg_over_time({container="name"} | unwrap bytes [5m])

JSON Parsing

# Parse JSON and filter
{container="name"} | json | level="error"

# Extract field
{container="name"} | json | line_format "{{.message}}"

# Filter on a JSON field after parsing
{container="name"} | json | status_code="500"
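To preview which labels Loki's `json` parser would extract from one of your service's log lines, you can flatten a sample line locally with jq (each top-level key becomes a filterable label):

```shell
# Flatten a sample JSON log line into key=value pairs,
# mirroring what "| json" exposes for top-level fields
echo '{"level":"error","message":"db timeout","status_code":500}' \
  | jq -r 'to_entries[] | "\(.key)=\(.value)"'
```

This prints one `key=value` line per field, e.g. `level=error`.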

Resource Usage

Typical usage:

  • Loki: 200-500MB RAM, 1-5GB disk/week
  • Promtail: 50-100MB RAM
  • Grafana: 100-200MB RAM, ~100MB disk
  • Total: ~400-700MB RAM

Estimates assume ~20 containers with moderate log volume.

Next Steps

  1. Explore your logs in Grafana
  2. Create useful dashboards
  3. Set up alerts for critical errors
  4. Add Prometheus for metrics (future)
  5. Add Tempo for distributed tracing (future)
  6. Create log-based SLA tracking

Resources


Now debug issues 10x faster with centralized logs! 🔍