# Centralized Logging with Loki

Guide for setting up and using the centralized logging stack (Loki + Promtail + Grafana).

## Overview

The logging stack provides centralized log aggregation and visualization for all Docker containers:

- **Loki**: Log aggregation backend (stores and indexes logs)
- **Promtail**: Agent that collects logs from Docker containers
- **Grafana**: Web UI for querying and visualizing logs

### Why Centralized Logging?

**Problems without it:**
- Logs scattered across many containers
- Hard to correlate events across services
- Logs lost when containers restart
- No easy way to search historical logs

**Benefits:**
- ✅ Single place to view all logs
- ✅ Powerful search and filtering (LogQL)
- ✅ Persist logs even after container restarts
- ✅ Correlate events across services
- ✅ Create dashboards and alerts
- ✅ Configurable retention (30 days default)

## Quick Setup

### 1. Configure Grafana Password

```bash
cd ~/homelab/compose/monitoring/logging
nano .env
```

**Update:**
```env
GF_SECURITY_ADMIN_PASSWORD=<your-strong-password>
```

**Generate password:**
```bash
openssl rand -base64 20
```
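
Generating the password and writing it into `.env` can be done in one step. The following is a minimal sketch, not part of the stack itself: `set_env_var` is a hypothetical helper that replaces or appends a `KEY=value` line in an env file, and it assumes the file uses plain `KEY=value` lines without quoting.

```bash
# Hypothetical helper (not part of the stack): set KEY=value in an env file,
# replacing any existing assignment of the same key.
# $1 = env file, $2 = key, $3 = value
set_env_var() {
  local file="$1" key="$2" value="$3" tmp
  tmp="$(mktemp)"
  # Copy every line except an existing assignment of this key.
  # grep exits non-zero when it copies nothing, which is fine here.
  grep -v "^${key}=" "$file" > "$tmp" 2>/dev/null || true
  # Append the new assignment; printf keeps characters like + / = literal.
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  mv "$tmp" "$file"
}
```

From the logging directory, `set_env_var .env GF_SECURITY_ADMIN_PASSWORD "$(openssl rand -base64 20)"` would store a fresh password. Because the file is rebuilt from a `mktemp` file, it ends up readable only by its owner, which suits a file that holds secrets.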

### 2. Deploy

```bash
cd ~/homelab/compose/monitoring/logging
docker compose up -d
```

### 3. Access Grafana

Go to: **https://logs.fig.systems**

**Login:**
- Username: `admin`
- Password: `<your GF_SECURITY_ADMIN_PASSWORD>`

### 4. Start Exploring Logs

1. Click **Explore** (compass icon) in the left sidebar
2. The Loki datasource should already be selected
3. Start querying!

## Basic Usage

### View Logs from a Container

```logql
{container="jellyfin"}
```

### View Last Hour's Logs

LogQL has no time filter of its own; the time range comes from Grafana. Run the query below and set the time picker (top right of **Explore**) to **Last 1 hour**.

```logql
{container="immich_server"}
```

### Filter for Errors

```logql
{container="traefik"} |= "error"
```

### Exclude Lines

```logql
{container="traefik"} != "404"
```

### Multiple Containers

```logql
{container=~"jellyfin|immich.*"}
```

### By Compose Project

```logql
{compose_project="media"}
```

## Advanced Queries

### Count Errors

```logql
sum(count_over_time({container="jellyfin"} |= "error" [5m]))
```

### Error Rate

```logql
rate({container="traefik"} |= "error" [5m])
```

### Parse JSON Logs

```logql
{container="linkwarden"} | json | level="error"
```

### Top 10 Containers by Error Count

```logql
topk(10,
  sum by (container) (
    count_over_time({job="docker"} |= "error" [24h])
  )
)
```

## Creating Dashboards

### Import Pre-built Dashboard

1. Go to **Dashboards** → **Import**
2. Dashboard ID: **13639** (Docker logs)
3. Select **Loki** as datasource
4. Click **Import**

### Create Custom Dashboard

1. Click **+** → **Dashboard**
2. **Add panel**
3. Select **Loki** datasource
4. Build query
5. Choose visualization (logs, graph, table, etc.)
6. **Save**

**Example panels:**
- Error count by container
- Log volume over time
- Recent errors (table)
- Top logging containers

## Setting Up Alerts

### Create Alert Rule

1. **Alerting** → **Alert rules** → **New alert rule**
2. **Query:**
   ```logql
   sum(count_over_time({container="jellyfin"} |= "error" [5m])) > 10
   ```
3. **Condition**: Alert when > 10 errors in 5 minutes
4. **Configure** notification channel (email, webhook, etc.)
5. **Save**

**Example alerts:**
- Too many errors in service
- Service stopped logging (might have crashed)
- Authentication failures
- Disk space warnings

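## Configuration

### Change Log Retention

**Default: 30 days**

Edit `.env`:
```env
LOKI_RETENTION_PERIOD=60d # 60 days
```

Edit `loki-config.yaml` and keep both values in step with the `.env` setting:
```yaml
limits_config:
  retention_period: 60d

table_manager:
  retention_period: 60d
```

Loki only deletes old data when retention is switched on: `retention_enabled: true` under `compactor` for the `limits_config` value, or `retention_deletes_enabled: true` under `table_manager` for the table manager value. Check which of these your `loki-config.yaml` sets.

Restart:
```bash
docker compose restart loki
```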
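
Because the retention value appears in more than one place, the copies can drift apart. The sketch below checks that every `retention_period` line in a Loki config file carries the same value. It assumes plain `retention_period: <value>` lines as in the example above; `retention_values` and `retention_consistent` are hypothetical helper names, and the check does not read `.env`.

```bash
# Print each distinct retention_period value found in a Loki config file.
# $1 = path to the config file
retention_values() {
  grep -E '^[[:space:]]*retention_period:' "$1" \
    | sed -E 's/^[[:space:]]*retention_period:[[:space:]]*//; s/[[:space:]]*(#.*)?$//' \
    | sort -u
}

# Succeed only when the file contains exactly one distinct retention value.
retention_consistent() {
  local count
  # grep -c . counts the non-empty lines printed by retention_values.
  count="$(retention_values "$1" | grep -c . || true)"
  [ "$count" -eq 1 ]
}
```

Running `retention_consistent loki-config.yaml && echo OK` before restarting Loki gives a quick signal that the file was edited consistently.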

### Adjust Resource Limits

For low-resource systems, edit `loki-config.yaml`:

```yaml
limits_config:
  retention_period: 7d # Shorter retention
  ingestion_rate_mb: 5 # Lower rate

query_range:
  results_cache:
    cache:
      embedded_cache:
        max_size_mb: 50 # Smaller cache
```

### Add Labels to Services

Make services easier to find by adding labels:

**Edit service `compose.yaml`:**
```yaml
services:
  myservice:
    labels:
      logging: "promtail"
      environment: "production"
      tier: "frontend"
```

Query with these labels:
```logql
{environment="production", tier="frontend"}
```

Container labels only become Loki labels if the Promtail configuration maps them (for example through `relabel_configs`). If the query returns nothing, check that the Promtail config includes a mapping for `environment` and `tier`.

## Troubleshooting

### No Logs Appearing

**Wait a few minutes** - initial log collection takes time

**Check Promtail:**
```bash
docker logs promtail
```

**Check Loki:**
```bash
docker logs loki
```

**Verify Promtail can reach Loki:**
```bash
docker exec promtail wget -O- http://loki:3100/ready
```
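
Since Loki can take a while to report ready after startup, the check above may need repeating. A minimal sketch of a polling loop follows; `wait_until` is a hypothetical helper, and the attempt count and delay are arbitrary choices rather than values from the stack.

```bash
# Run a command repeatedly until it succeeds or the attempts run out.
# $1 = max attempts, $2 = seconds between attempts, rest = command to run
wait_until() {
  local attempts="$1" delay="$2" i
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    # Output is discarded; only the exit status matters.
    if "$@" > /dev/null 2>&1; then
      return 0
    fi
    # Skip the pause after the final failed attempt.
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  return 1
}
```

For example, `wait_until 30 10 docker exec promtail wget -qO- http://loki:3100/ready` retries the readiness check for up to about five minutes and returns non-zero if Loki never answers.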

### Grafana Can't Connect to Loki

**Test from Grafana:**
```bash
docker exec grafana wget -O- http://loki:3100/ready
```

**Check datasource:** Grafana → Configuration → Data sources → Loki
- URL should be: `http://loki:3100`

### High Disk Usage

**Check size:**
```bash
du -sh ~/homelab/compose/monitoring/logging/loki-data
```

**Reduce retention:**
```env
LOKI_RETENTION_PERIOD=7d
```

**Manual cleanup (CAREFUL: this permanently deletes stored logs):**
```bash
cd ~/homelab/compose/monitoring/logging
docker compose stop loki
rm -rf loki-data/chunks/*
docker compose start loki
```

### Slow Queries

**Optimize queries:**
- Use specific labels: `{container="name"}` not `{container=~".*"}`
- Limit the time range: hours, not days
- Filter early: `|= "error"` before parsing
- Avoid complex regex

## Best Practices

### Log Verbosity

Configure appropriate log levels per environment:
- **Production**: `info` or `warning`
- **Debugging**: `debug` or `trace`

Too verbose = wasted resources!

### Retention Strategy

Match retention to importance:
- **Critical services**: 60-90 days
- **Normal services**: 30 days
- **High-volume services**: 7-14 days

### Useful Queries to Save

Create saved queries for common tasks:

**Recent errors** (set the Grafana time picker to **Last 15 minutes**):
```logql
{job="docker"} |= "error"
```

**Service health check:**
```logql
{container="traefik"} |= "request"
```

**Failed logins:**
```logql
{container="lldap"} |= "failed" |= "login"
```

## Integration Tips

### Embed in Homarr

Add Grafana dashboards to Homarr:

1. Edit Homarr dashboard
2. Add **iFrame widget**
3. URL: `https://logs.fig.systems/d/<dashboard-id>`

### Use with Backups

Include logging data in backups:

```bash
cd ~/homelab/compose/monitoring/logging
tar czf logging-backup-$(date +%Y%m%d).tar.gz loki-data/ grafana-data/
```
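
When the backup runs from a script, it helps to confirm that the archive is readable before relying on it. The sketch below wraps the same `tar` command in a hypothetical `backup_logging` function that takes the stack directory and a destination directory as arguments; both paths are placeholders to adapt.

```bash
# Hypothetical helper: archive the data directories and print the archive path.
# $1 = logging stack directory, $2 = directory to write the archive into
backup_logging() {
  local src="$1" dest="$2" archive
  archive="${dest}/logging-backup-$(date +%Y%m%d).tar.gz"
  # -C switches into the stack directory so the archive holds relative paths.
  tar czf "$archive" -C "$src" loki-data grafana-data || return 1
  # Listing the archive confirms it can be read back before reporting success.
  tar tzf "$archive" > /dev/null || return 1
  printf '%s\n' "$archive"
}
```

A call such as `backup_logging ~/homelab/compose/monitoring/logging /path/to/backups` prints the archive path on success and returns non-zero if either `tar` step fails.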

### Combine with Metrics

Later you can add Prometheus for metrics:
- Loki for logs
- Prometheus for metrics (CPU, RAM, disk)
- Both in Grafana dashboards

## Common LogQL Patterns

### Filter by Time

```logql
# LogQL has no timestamp filter; the time range is set outside the query.
# Last 5 minutes: choose "Last 5 minutes" in the Grafana time picker.
{container="name"}

# Specific time range: choose an absolute range in the Grafana time picker.
```
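
Outside Grafana, the time range is passed to Loki's HTTP API as `start` and `end` parameters on `/loki/api/v1/query_range`, given as Unix epoch nanoseconds. The sketch below computes such a window and issues the request with `curl`. The function names and the `LOKI_URL` variable are assumptions for illustration, and `http://loki:3100` only resolves from inside the stack's Docker network.

```bash
# Print "<start> <end>" as Unix epoch nanoseconds for a window of $1 seconds.
# $2 = window end in epoch seconds (defaults to the current time).
loki_window() {
  local seconds="$1"
  local end_s="${2:-$(date +%s)}"
  local start_s=$(( end_s - seconds ))
  # Appending nine zeros converts whole seconds to nanoseconds.
  printf '%s000000000 %s000000000\n' "$start_s" "$end_s"
}

# Fetch up to 100 log lines matching LogQL query $1 from the last $2 seconds.
# LOKI_URL is an assumed override; the default matches the internal address.
loki_last() {
  local query="$1" seconds="$2" start end
  read -r start end <<< "$(loki_window "$seconds")"
  curl -sG "${LOKI_URL:-http://loki:3100}/loki/api/v1/query_range" \
    --data-urlencode "query=${query}" \
    --data-urlencode "start=${start}" \
    --data-urlencode "end=${end}" \
    --data-urlencode "limit=100"
}
```

For instance, `loki_last '{container="name"}' 300` requests the last five minutes of logs and prints Loki's JSON response.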

### Pattern Matching

```logql
# Contains
{container="name"} |= "error"

# Does not contain
{container="name"} != "404"

# Regex match
{container="name"} |~ "error|fail|critical"

# Regex does not match
{container="name"} !~ "debug|trace"
```

### Aggregations

```logql
# Count
count_over_time({container="name"}[5m])

# Rate
rate({container="name"}[5m])

# Sum
sum(count_over_time({job="docker"}[1h])) by (container)

# Average (assumes JSON logs with a numeric "bytes" field)
avg_over_time({container="name"} | json | unwrap bytes [5m])
```

### JSON Parsing

```logql
# Parse JSON and filter
{container="name"} | json | level="error"

# Extract field
{container="name"} | json | line_format "{{.message}}"

# Filter on JSON field
{container="name"} | json | status_code="500"
```

## Resource Usage

**Typical usage:**
- **Loki**: 200-500MB RAM, 1-5GB disk/week
- **Promtail**: 50-100MB RAM
- **Grafana**: 100-200MB RAM, ~100MB disk
- **Total**: ~400-700MB RAM

**For 20 containers with moderate logging**
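
The weekly disk figure for Loki can be scaled to a retention period to get a rough size for the data directory. The sketch below is a straight linear extrapolation from the 1-5GB/week range above and rounds up to whole gigabytes. `loki_disk_estimate_gb` is a hypothetical helper, and real usage depends on log volume and compression.

```bash
# Estimate Loki disk usage in GB for a retention period given in days.
# Prints "<low> <high>", rounded up, using the 1-5 GB/week range quoted above.
loki_disk_estimate_gb() {
  local days="$1"
  local low_per_week=1 high_per_week=5
  # Adding 6 before dividing by 7 rounds the integer division up.
  local low=$(( (days * low_per_week + 6) / 7 ))
  local high=$(( (days * high_per_week + 6) / 7 ))
  printf '%s %s\n' "$low" "$high"
}
```

With the default 30-day retention, `loki_disk_estimate_gb 30` prints `5 22`, so roughly 5 to 22 GB of disk is a reasonable planning range.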

## Next Steps

1. ✅ Explore your logs in Grafana
2. ✅ Create useful dashboards
3. ✅ Set up alerts for critical errors
4. ⬜ Add Prometheus for metrics (future)
5. ⬜ Add Tempo for distributed tracing (future)
6. ⬜ Create log-based SLA tracking

## Resources

- [Loki Documentation](https://grafana.com/docs/loki/latest/)
- [LogQL Reference](https://grafana.com/docs/loki/latest/logql/)
- [Grafana Dashboards](https://grafana.com/grafana/dashboards/)
- [Community Dashboards](https://grafana.com/grafana/dashboards/?search=loki)

---

**Now debug issues 10x faster with centralized logs!** 🔍