Centralized Logging with Loki
Guide for setting up and using the centralized logging stack (Loki + Promtail + Grafana).
Overview
The logging stack provides centralized log aggregation and visualization for all Docker containers:
- Loki: Log aggregation backend (stores and indexes logs)
- Promtail: Agent that collects logs from Docker containers
- Grafana: Web UI for querying and visualizing logs
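A minimal compose sketch of the three services might look like this (image tags, mount paths, and config filenames are assumptions; adapt to your layout):

```yaml
# Sketch only - tags, paths, and filenames are assumptions
services:
  loki:
    image: grafana/loki:latest
    command: -config.file=/etc/loki/loki-config.yaml
    volumes:
      - ./loki-config.yaml:/etc/loki/loki-config.yaml:ro
      - ./loki-data:/loki

  promtail:
    image: grafana/promtail:latest
    volumes:
      - ./promtail-config.yaml:/etc/promtail/config.yml:ro
      # Docker socket access lets Promtail discover containers
      - /var/run/docker.sock:/var/run/docker.sock:ro

  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GF_SECURITY_ADMIN_PASSWORD}
    volumes:
      - ./grafana-data:/var/lib/grafana
```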
Why Centralized Logging?
Problems without it:
- Logs scattered across many containers
- Hard to correlate events across services
- Logs lost when containers restart
- No easy way to search historical logs
Benefits:
- ✅ Single place to view all logs
- ✅ Powerful search and filtering (LogQL)
- ✅ Persist logs even after container restarts
- ✅ Correlate events across services
- ✅ Create dashboards and alerts
- ✅ Configurable retention (30 days default)
Quick Setup
1. Configure Grafana Password
cd ~/homelab/compose/monitoring/logging
nano .env
Update:
GF_SECURITY_ADMIN_PASSWORD=<your-strong-password>
Generate password:
openssl rand -base64 20
2. Deploy
cd ~/homelab/compose/monitoring/logging
docker compose up -d
3. Access Grafana
Go to: https://logs.fig.systems
Login:
- Username: admin
- Password: <your GF_SECURITY_ADMIN_PASSWORD>
4. Start Exploring Logs
- Click Explore (compass icon) in left sidebar
- Loki datasource should be selected
- Start querying!
Basic Usage
View Logs from a Container
{container="jellyfin"}
View Last Hour's Logs
{container="immich_server"}
(Set the time range to "Last 1 hour" in Grafana's time picker; LogQL itself has no timestamp filter.)
Filter for Errors
{container="traefik"} |= "error"
Exclude Lines
{container="traefik"} != "404"
Multiple Containers
{container=~"jellyfin|immich.*"}
By Compose Project
{compose_project="media"}
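The container and compose_project labels used above come from Promtail's Docker service discovery. A sketch of the relevant promtail-config.yaml section (the filename and exact relabel set are assumptions, but the __meta_docker_* labels are what Docker discovery exposes):

```yaml
# Sketch - assumes Promtail can reach the Docker socket
scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      # Container names arrive as "/name"; strip the leading slash
      - source_labels: [__meta_docker_container_name]
        regex: /(.*)
        target_label: container
      - source_labels: [__meta_docker_container_label_com_docker_compose_project]
        target_label: compose_project
      - source_labels: [__meta_docker_container_label_com_docker_compose_service]
        target_label: compose_service
```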
Advanced Queries
Count Errors
sum(count_over_time({container="jellyfin"} |= "error" [5m]))
Error Rate
rate({container="traefik"} |= "error" [5m])
Parse JSON Logs
{container="linkwarden"} | json | level="error"
Top 10 Error Messages
topk(10,
sum by (container) (
count_over_time({job="docker"} |= "error" [24h])
)
)
Creating Dashboards
Import Pre-built Dashboard
- Go to Dashboards → Import
- Dashboard ID: 13639 (Docker logs)
- Select Loki as datasource
- Click Import
Create Custom Dashboard
- Click + → Dashboard
- Add panel
- Select Loki datasource
- Build query
- Choose visualization (logs, graph, table, etc.)
- Save
Example panels:
- Error count by container
- Log volume over time
- Recent errors (table)
- Top logging containers
Setting Up Alerts
Create Alert Rule
- Alerting → Alert rules → New alert rule
- Query:
sum(count_over_time({container="jellyfin"} |= "error" [5m])) > 10
- Condition: Alert when > 10 errors in 5 minutes
- Configure notification channel (email, webhook, etc.)
- Save
Example alerts:
- Too many errors in service
- Service stopped logging (might have crashed)
- Authentication failures
- Disk space warnings
Configuration
Change Log Retention
Default: 30 days
Edit .env:
LOKI_RETENTION_PERIOD=60d # 60 days
Edit loki-config.yaml:
limits_config:
  retention_period: 60d
table_manager:
  retention_period: 60d
Restart:
docker compose restart loki
Adjust Resource Limits
For low-resource systems, edit loki-config.yaml:
limits_config:
  retention_period: 7d   # Shorter retention
  ingestion_rate_mb: 5   # Lower ingestion rate
query_range:
  results_cache:
    cache:
      embedded_cache:
        max_size_mb: 50  # Smaller cache
Add Labels to Services
Make services easier to find by adding labels:
Edit service compose.yaml:
services:
  myservice:
    labels:
      logging: "promtail"
      environment: "production"
      tier: "frontend"
Query with these labels:
{environment="production", tier="frontend"}
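Container labels are not forwarded to Loki automatically; Promtail has to map them onto log labels. A sketch of the extra relabel_configs entries this would need (label names match the compose example above):

```yaml
# Sketch - maps custom container labels onto Loki log labels
relabel_configs:
  - source_labels: [__meta_docker_container_label_environment]
    target_label: environment
  - source_labels: [__meta_docker_container_label_tier]
    target_label: tier
```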
Troubleshooting
No Logs Appearing
Wait a few minutes - initial log collection takes time
Check Promtail:
docker logs promtail
Check Loki:
docker logs loki
Verify Promtail can reach Loki:
docker exec promtail wget -O- http://loki:3100/ready
Grafana Can't Connect to Loki
Test from Grafana:
docker exec grafana wget -O- http://loki:3100/ready
Check datasource: Grafana → Configuration → Data sources → Loki
- URL should be:
http://loki:3100
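If the datasource is missing entirely, it can be provisioned from a file mounted under /etc/grafana/provisioning/datasources/ in the Grafana container. A minimal sketch:

```yaml
# Grafana datasource provisioning file (sketch)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    isDefault: true
```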
High Disk Usage
Check size:
du -sh compose/monitoring/logging/loki-data
Reduce retention:
LOKI_RETENTION_PERIOD=7d
Manual cleanup (CAREFUL):
docker compose stop loki
rm -rf loki-data/chunks/*
docker compose start loki
Slow Queries
Optimize queries:
- Use specific labels: {container="name"} rather than {container=~".*"}
- Limit time range: hours, not days
- Filter early: put |= "error" before parsing stages
- Avoid complex regex
Best Practices
Log Verbosity
Configure appropriate log levels per environment:
- Production: info or warning
- Debugging: debug or trace
Too verbose = wasted resources!
Retention Strategy
Match retention to importance:
- Critical services: 60-90 days
- Normal services: 30 days
- High-volume services: 7-14 days
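A quick back-of-the-envelope check for sizing retention; the daily volume figure is an assumption you can replace with what du -sh on loki-data actually shows:

```shell
# Rough Loki disk estimate: daily ingest (MiB) x retention (days)
daily_mb=500        # assumed average daily log volume after compression
retention_days=30
total_gib=$(( daily_mb * retention_days / 1024 ))
echo "~${total_gib} GiB needed for ${retention_days} days of retention"
```

With these numbers the stack needs on the order of 14 GiB, which is why high-volume services are worth trimming to 7-14 days.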
Useful Queries to Save
Create saved queries for common tasks:
Recent errors (with the time range set to the last 15 minutes in the time picker):
{job="docker"} |= "error"
Service health check:
{container="traefik"} |= "request"
Failed logins:
{container="lldap"} |= "failed" |= "login"
Integration Tips
Embed in Homarr
Add Grafana dashboards to Homarr:
- Edit Homarr dashboard
- Add iFrame widget
- URL:
https://logs.fig.systems/d/<dashboard-id>
Use with Backups
Include logging data in backups (stop the stack first so Loki's chunks are consistent):
cd ~/homelab/compose/monitoring/logging
docker compose stop
tar czf logging-backup-$(date +%Y%m%d).tar.gz loki-data/ grafana-data/
docker compose start
Combine with Metrics
Later you can add Prometheus for metrics:
- Loki for logs
- Prometheus for metrics (CPU, RAM, disk)
- Both in Grafana dashboards
Common LogQL Patterns
Filter by Time
# Time ranges come from Grafana's time picker, not from LogQL itself
# e.g. select "Last 5 minutes" and run:
{container="name"}
# When querying the Loki HTTP API directly, pass start/end parameters instead
Pattern Matching
# Contains
{container="name"} |= "error"
# Does not contain
{container="name"} != "404"
# Regex match
{container="name"} |~ "error|fail|critical"
# Regex does not match
{container="name"} !~ "debug|trace"
Aggregations
# Count
count_over_time({container="name"}[5m])
# Rate
rate({container="name"}[5m])
# Sum
sum(count_over_time({job="docker"}[1h])) by (container)
# Average (unwrap needs a numeric extracted label, e.g. via | json)
avg_over_time({container="name"} | json | unwrap bytes [5m])
JSON Parsing
# Parse JSON and filter
{container="name"} | json | level="error"
# Extract field
{container="name"} | json | line_format "{{.message}}"
# Filter on an extracted JSON field
{container="name"} | json | status_code = "500"
Resource Usage
Typical usage:
- Loki: 200-500MB RAM, 1-5GB disk/week
- Promtail: 50-100MB RAM
- Grafana: 100-200MB RAM, ~100MB disk
- Total: ~400-700MB RAM
For 20 containers with moderate logging
Next Steps
- ✅ Explore your logs in Grafana
- ✅ Create useful dashboards
- ✅ Set up alerts for critical errors
- ⬜ Add Prometheus for metrics (future)
- ⬜ Add Tempo for distributed tracing (future)
- ⬜ Create log-based SLA tracking
Now debug issues 10x faster with centralized logs! 🔍