
Centralized Logging with Loki

Guide for setting up and using the centralized logging stack (Loki + Promtail + Grafana).

Overview

The logging stack provides centralized log aggregation and visualization for all Docker containers:

  • Loki: Log aggregation backend (stores and indexes logs)
  • Promtail: Agent that collects logs from Docker containers
  • Grafana: Web UI for querying and visualizing logs

Why Centralized Logging?

Problems without it:

  • Logs scattered across many containers
  • Hard to correlate events across services
  • Logs lost when containers restart
  • No easy way to search historical logs

Benefits:

  • Single place to view all logs
  • Powerful search and filtering (LogQL)
  • Persist logs even after container restarts
  • Correlate events across services
  • Create dashboards and alerts
  • Configurable retention (30 days default)

Quick Setup

1. Configure Grafana Password

cd ~/homelab/compose/monitoring/logging
nano .env

Update:

GF_SECURITY_ADMIN_PASSWORD=<your-strong-password>

Generate password:

openssl rand -base64 20
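If you prefer a single step, a small sketch that generates the password and writes it into `.env` directly (assumes `.env` already contains a `GF_SECURITY_ADMIN_PASSWORD=` line, as above):

```shell
# Generate a password and update .env in place.
# base64 output never contains "|", so it is safe inside this sed expression.
NEW_PASS=$(openssl rand -base64 20)
sed -i "s|^GF_SECURITY_ADMIN_PASSWORD=.*|GF_SECURITY_ADMIN_PASSWORD=${NEW_PASS}|" .env
grep GF_SECURITY_ADMIN_PASSWORD .env  # confirm the update took
```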

2. Deploy

cd ~/homelab/compose/monitoring/logging
docker compose up -d

3. Access Grafana

Go to: https://logs.fig.systems

Login:

  • Username: admin
  • Password: <your GF_SECURITY_ADMIN_PASSWORD>

4. Start Exploring Logs

  1. Click Explore (compass icon) in left sidebar
  2. Loki datasource should be selected
  3. Start querying!

Basic Usage

View Logs from a Container

{container="jellyfin"}

View Logs in a Specific Time Window

{container="immich_server"}

Set the window (e.g. "Last 1 hour") with Grafana's time picker; LogQL itself has no inline timestamp filter.

Filter for Errors

{container="traefik"} |= "error"

Exclude Lines

{container="traefik"} != "404"

Multiple Containers

{container=~"jellyfin|immich.*"}

By Compose Project

{compose_project="media"}

Advanced Queries

Count Errors

sum(count_over_time({container="jellyfin"} |= "error" [5m]))

Error Rate

rate({container="traefik"} |= "error" [5m])

Parse JSON Logs

{container="linkwarden"} | json | level="error"

Top 10 Error Messages

topk(10,
  sum by (container) (
    count_over_time({job="docker"} |= "error" [24h])
  )
)
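The same queries can be run outside Grafana against Loki's HTTP API, which is handy for scripting. A sketch; it assumes port 3100 is reachable from the host (the public `loki.fig.systems` endpoint sits behind SSO):

```shell
# Ask Loki's query_range endpoint for recent error lines and
# extract just the raw log lines from the JSON response with jq.
# Each entry in .values is a [timestamp, line] pair.
curl -G -s "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={container="jellyfin"} |= "error"' \
  --data-urlencode 'limit=20' \
  | jq -r '.data.result[].values[][1]'
```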

Creating Dashboards

Import Pre-built Dashboard

  1. Go to Dashboards → Import
  2. Dashboard ID: 13639 (Docker logs)
  3. Select Loki as datasource
  4. Click Import

Create Custom Dashboard

  1. Click + → Dashboard
  2. Add panel
  3. Select Loki datasource
  4. Build query
  5. Choose visualization (logs, graph, table, etc.)
  6. Save

Example panels:

  • Error count by container
  • Log volume over time
  • Recent errors (table)
  • Top logging containers

Setting Up Alerts

Create Alert Rule

  1. Go to Alerting → Alert rules → New alert rule
  2. Query:
    sum(count_over_time({container="jellyfin"} |= "error" [5m])) > 10
    
  3. Condition: Alert when > 10 errors in 5 minutes
  4. Configure notification channel (email, webhook, etc.)
  5. Save

Example alerts:

  • Too many errors in service
  • Service stopped logging (might have crashed)
  • Authentication failures
  • Disk space warnings

Configuration

Change Log Retention

Default: 30 days

Edit .env:

LOKI_RETENTION_PERIOD=60d  # 60 days

Edit loki-config.yaml:

limits_config:
  retention_period: 60d

table_manager:
  retention_period: 60d

Restart:

docker compose restart loki
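To confirm the new value was actually loaded, Loki exposes its effective runtime configuration at `/config` (a quick check; assumes `wget` is available inside the container, as in the troubleshooting steps in this guide):

```shell
# Print the retention settings Loki is actually running with
docker exec loki wget -qO- http://localhost:3100/config | grep retention
```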

Adjust Resource Limits

For low-resource systems, edit loki-config.yaml:

limits_config:
  retention_period: 7d              # Shorter retention
  ingestion_rate_mb: 5              # Lower rate

query_range:
  results_cache:
    cache:
      embedded_cache:
        max_size_mb: 50             # Smaller cache

Add Labels to Services

Make services easier to find by adding labels:

Edit service compose.yaml:

services:
  myservice:
    labels:
      logging: "promtail"
      environment: "production"
      tier: "frontend"

Query with these labels:

{environment="production", tier="frontend"}
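You can verify the new labels are actually being ingested by asking Loki's label API which values it has seen (assumes port 3100 is reachable from the host):

```shell
# List all known values for the "environment" label
curl -s http://localhost:3100/loki/api/v1/label/environment/values | jq '.data'
```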

Troubleshooting

No Logs Appearing

Wait a few minutes: initial log collection takes time.

Check Promtail:

docker logs promtail

Check Loki:

docker logs loki

Verify Promtail can reach Loki:

docker exec promtail wget -O- http://loki:3100/ready

Grafana Can't Connect to Loki

Test from Grafana:

docker exec grafana wget -O- http://loki:3100/ready

Check datasource: Grafana → Configuration → Data sources → Loki

  • URL should be: http://loki:3100

High Disk Usage

Check size:

du -sh compose/monitoring/logging/loki-data
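To see which part of Loki's storage is actually growing, break the usage down per subdirectory:

```shell
# Per-directory breakdown, largest first (chunks/ is usually the culprit)
du -sh loki-data/* 2>/dev/null | sort -rh
```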

Reduce retention:

LOKI_RETENTION_PERIOD=7d

Manual cleanup (CAREFUL):

docker compose stop loki
rm -rf loki-data/chunks/*
docker compose start loki

Slow Queries

Optimize queries:

  • Use specific labels: {container="name"} not {container=~".*"}
  • Limit time range: Hours not days
  • Filter early: |= "error" before parsing
  • Avoid complex regex

Best Practices

Log Verbosity

Configure appropriate log levels per environment:

  • Production: info or warning
  • Debugging: debug or trace

Too verbose = wasted resources!

Retention Strategy

Match retention to importance:

  • Critical services: 60-90 days
  • Normal services: 30 days
  • High-volume services: 7-14 days

Useful Queries to Save

Create saved queries for common tasks:

Recent errors (set the time picker to "Last 15 minutes"):

{job="docker"} |= "error"

Service health check:

{container="traefik"} |= "request"

Failed logins:

{container="lldap"} |= "failed" |= "login"

Integration Tips

Embed in Homarr

Add Grafana dashboards to Homarr:

  1. Edit Homarr dashboard
  2. Add iFrame widget
  3. URL: https://logs.fig.systems/d/<dashboard-id>

Use with Backups

Include logging data in backups:

cd ~/homelab/compose/monitoring/logging
tar czf logging-backup-$(date +%Y%m%d).tar.gz loki-data/ grafana-data/

Combine with Metrics

Later you can add Prometheus for metrics:

  • Loki for logs
  • Prometheus for metrics (CPU, RAM, disk)
  • Both in Grafana dashboards

Common LogQL Patterns

Filter by Time

# Time ranges come from the Grafana time picker, not from LogQL itself:
# select "Last 5 minutes", "Last 1 hour", or an absolute range there, then query
{container="name"}

Pattern Matching

# Contains
{container="name"} |= "error"

# Does not contain
{container="name"} != "404"

# Regex match
{container="name"} |~ "error|fail|critical"

# Regex does not match
{container="name"} !~ "debug|trace"

Aggregations

# Count
count_over_time({container="name"}[5m])

# Rate
rate({container="name"}[5m])

# Sum
sum(count_over_time({job="docker"}[1h])) by (container)

# Average
avg_over_time({container="name"} | unwrap bytes [5m])

JSON Parsing

# Parse JSON and filter
{container="name"} | json | level="error"

# Extract field
{container="name"} | json | line_format "{{.message}}"

# Filter on a JSON field after parsing
{container="name"} | json | status_code="500"
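To preview which labels Loki's `json` parser would extract from one of your service's log lines, you can flatten a sample line locally with jq (each top-level key becomes a filterable label):

```shell
# Flatten a sample JSON log line into key=value pairs,
# mirroring what "| json" exposes for top-level fields
echo '{"level":"error","message":"db timeout","status_code":500}' \
  | jq -r 'to_entries[] | "\(.key)=\(.value)"'
```

This prints one `key=value` line per field, e.g. `level=error`.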

Resource Usage

Typical usage:

  • Loki: 200-500MB RAM, 1-5GB disk/week
  • Promtail: 50-100MB RAM
  • Grafana: 100-200MB RAM, ~100MB disk
  • Total: ~400-700MB RAM

Estimates assume ~20 containers with moderate log volume.

Next Steps

  1. Explore your logs in Grafana
  2. Create useful dashboards
  3. Set up alerts for critical errors
  4. Add Prometheus for metrics (future)
  5. Add Tempo for distributed tracing (future)
  6. Create log-based SLA tracking

Resources


Now debug issues 10x faster with centralized logs! 🔍