Merge pull request #4 from efigueroa/claude/centralized-logging-011CUqEzDETA2BqAzYUcXtjt

feat: Add centralized logging stack with Loki, Promtail, and Grafana
Commit 25aea7dc34 by Eduardo Figueroa, 2025-11-08 17:17:53 -08:00, committed via GitHub (GPG key ID: B5690EEEBB952194; no known key found for this signature in database).
10 changed files with 1305 additions and 0 deletions


@@ -31,6 +31,11 @@ compose/
│   ├── radarr/       # Movie management
│   ├── sabnzbd/      # Usenet downloader
│   └── qbittorrent/  # Torrent client
├── monitoring/       # Monitoring & logging
│   └── logging/      # Centralized logging stack
│       ├── loki/     # Log aggregation (loki.fig.systems)
│       ├── promtail/ # Log collection agent
│       └── grafana/  # Log visualization (logs.fig.systems)
└── services/         # Utility services
    ├── homarr/       # Dashboard (home.fig.systems)
    ├── backrest/     # Backup manager (backup.fig.systems)
@@ -58,6 +63,10 @@ All services are accessible via:
| Traefik Dashboard | traefik.fig.systems | ✅ |
| LLDAP | lldap.fig.systems | ✅ |
| Tinyauth | auth.fig.systems | ❌ |
| **Monitoring** | | |
| Grafana (Logs) | logs.fig.systems | ❌* |
| Loki (API) | loki.fig.systems | ✅ |
| **Dashboard & Management** | | |
| Homarr | home.fig.systems | ✅ |
| Backrest | backup.fig.systems | ✅ |
| Jellyfin | flix.fig.systems | ❌* |
@@ -149,6 +158,9 @@ cd compose/services/linkwarden && docker compose up -d
cd compose/services/vikunja && docker compose up -d
cd compose/services/homarr && docker compose up -d
cd compose/services/backrest && docker compose up -d
# Monitoring (optional but recommended)
cd compose/monitoring/logging && docker compose up -d
cd compose/services/lubelogger && docker compose up -d
cd compose/services/calibre-web && docker compose up -d
cd compose/services/booklore && docker compose up -d


@@ -0,0 +1,28 @@
# Centralized Logging Configuration
# Timezone
TZ=America/Los_Angeles
# Grafana Admin Credentials
# Default username: admin
# Change this password immediately after first login!
# Example format: MyGr@f@n@P@ssw0rd!2024
GF_SECURITY_ADMIN_PASSWORD=changeme_please_set_secure_grafana_password
# Grafana Configuration
GF_SERVER_ROOT_URL=https://logs.fig.systems
GF_SERVER_DOMAIN=logs.fig.systems
# Disable Grafana analytics (optional)
GF_ANALYTICS_REPORTING_ENABLED=false
GF_ANALYTICS_CHECK_FOR_UPDATES=false
# Allow embedding (for Homarr dashboard integration)
GF_SECURITY_ALLOW_EMBEDDING=true
# Loki Configuration
# Retention period in days (default: 30 days)
LOKI_RETENTION_PERIOD=30d
# Promtail Configuration
# No additional configuration needed - configured via promtail-config.yaml

compose/monitoring/logging/.gitignore

@@ -0,0 +1,13 @@
# Loki data
loki-data/
# Grafana data
grafana-data/
# Keep provisioning and config files
!grafana-provisioning/
!loki-config.yaml
!promtail-config.yaml
# Keep .env.example if created
!.env.example


@@ -0,0 +1,527 @@
# Centralized Logging Stack
Grafana Loki + Promtail + Grafana for centralized Docker container log aggregation and visualization.
## Overview
This stack provides centralized logging for all Docker containers in your homelab:
- **Loki**: Log aggregation backend (like Prometheus but for logs)
- **Promtail**: Agent that collects logs from Docker containers
- **Grafana**: Web UI for querying and visualizing logs
### Why This Stack?
- ✅ **Lightweight**: Minimal resource usage compared to ELK stack
- ✅ **Docker-native**: Automatically discovers and collects logs from all containers
- ✅ **Powerful search**: LogQL query language for filtering and searching
- ✅ **Retention**: Configurable log retention (default: 30 days)
- ✅ **Labels**: Automatic labeling by container, image, compose project
- ✅ **Integrated**: Works seamlessly with existing homelab services
## Quick Start
### 1. Configure Environment
```bash
cd ~/homelab/compose/monitoring/logging
nano .env
```
**Update:**
```env
# Change this!
GF_SECURITY_ADMIN_PASSWORD=<your-strong-password>
```
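If you prefer to script this step, one option (a sketch, assuming GNU `sed` and the `GF_SECURITY_ADMIN_PASSWORD=` line from this stack's `.env`) is:

```shell
# Generate a random password and splice it into .env (GNU sed assumed).
pass=$(openssl rand -base64 20)
if [ -f .env ]; then
  sed -i "s|^GF_SECURITY_ADMIN_PASSWORD=.*|GF_SECURITY_ADMIN_PASSWORD=${pass}|" .env
fi
echo "Grafana admin password set to: ${pass}"
```

Keep a copy of the generated password in your password manager before deploying.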
### 2. Deploy the Stack
```bash
docker compose up -d
```
### 3. Access Grafana
Go to: **https://logs.fig.systems**
**Default credentials:**
- Username: `admin`
- Password: `<your GF_SECURITY_ADMIN_PASSWORD>`
**⚠️ Change the password immediately after first login!**
### 4. View Logs
1. Click "Explore" (compass icon) in left sidebar
2. Select "Loki" datasource (should be selected by default)
3. Start querying logs!
## Usage
### Basic Log Queries
**View all logs from a container:**
```logql
{container="jellyfin"}
```
**View logs from a compose project:**
```logql
{compose_project="media"}
```
**View logs from specific service:**
```logql
{compose_service="lldap"}
```
**Filter by log level:**
```logql
{container="immich_server"} |= "error"
```
**Exclude lines:**
```logql
{container="traefik"} != "404"
```
**Multiple filters:**
```logql
{container="jellyfin"} |= "error" != "404"
```
### Advanced Queries
**Count errors per minute:**
```logql
sum(count_over_time({container="jellyfin"} |= "error" [1m])) by (container)
```
**Rate of logs:**
```logql
rate({container="traefik"}[5m])
```
**Logs from last hour:**
Set Grafana's time picker to "Last 1 hour". LogQL does not embed relative time ranges; the range comes from the time picker (or from the `start`/`end` parameters when calling Loki's API directly).
**Filter by multiple containers:**
```logql
{container=~"jellyfin|immich.*|sonarr"}
```
**Extract and filter JSON:**
```logql
{container="linkwarden"} | json | level="error"
```
## Configuration
### Log Retention
Default: **30 days**
To change retention period:
**Edit `.env`:**
```env
LOKI_RETENTION_PERIOD=60d # Keep logs for 60 days
```
**Edit `loki-config.yaml`:**
```yaml
limits_config:
retention_period: 60d # Must match .env
table_manager:
retention_period: 60d # Must match above
```
**Restart:**
```bash
docker compose restart loki
```
### Adjust Resource Limits
**Edit `loki-config.yaml`:**
```yaml
limits_config:
ingestion_rate_mb: 10 # MB/sec per stream
ingestion_burst_size_mb: 20 # Burst size
```
### Add Custom Labels
**Edit `promtail-config.yaml`:**
```yaml
scrape_configs:
- job_name: docker
docker_sd_configs:
- host: unix:///var/run/docker.sock
relabel_configs:
# Add custom label
- source_labels: ['__meta_docker_container_label_environment']
target_label: 'environment'
```
## How It Works
### Architecture
```
Docker Containers
↓ (logs via Docker socket)
Promtail (scrapes and ships)
↓ (HTTP push)
Loki (stores and indexes)
↓ (LogQL queries)
Grafana (visualization)
```
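The "HTTP push" hop is plain JSON over HTTP. A sketch of the payload shape Promtail sends (the commented `curl` assumes Loki reachable at `localhost:3100`; the `container=demo` label and log line are made up for illustration):

```shell
# Build a minimal Loki push payload: one stream, one log line.
ts="$(date +%s)000000000"   # Loki expects nanosecond-precision timestamps
payload=$(printf '{"streams":[{"stream":{"container":"demo"},"values":[["%s","hello from the push API"]]}]}' "$ts")
echo "$payload"
# To actually send it (requires a running Loki):
# curl -s -H "Content-Type: application/json" -X POST \
#   --data "$payload" http://localhost:3100/loki/api/v1/push
```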
### Log Collection
Promtail automatically collects logs from:
1. **All Docker containers** via Docker socket
2. **System logs** from `/var/log`
Logs are labeled with:
- `container`: Container name
- `image`: Docker image
- `compose_project`: Docker Compose project name
- `compose_service`: Service name from compose.yaml
- `stream`: stdout or stderr
### Storage
Logs are stored in:
- **Location**: `./loki-data/`
- **Format**: Compressed chunks
- **Index**: BoltDB
- **Retention**: Automatic cleanup after retention period
## Integration with Services
### Option 1: Automatic (Default)
Promtail automatically discovers all containers. No changes needed!
### Option 2: Explicit Labels (Recommended)
Add labels to services for better organization:
**Edit any service's `compose.yaml`:**
```yaml
services:
servicename:
# ... existing config ...
labels:
# ... existing labels ...
# Add logging labels
logging: "promtail"
log_level: "info"
environment: "production"
```
These labels will be available in Loki for filtering.
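With those labels in place, the extra labels can be used directly in queries, for example:

```logql
{logging="promtail", environment="production"}
```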
### Option 3: Send Logs Directly to Loki
Instead of Promtail scraping, send logs directly:
**Edit service `compose.yaml`:**
```yaml
services:
servicename:
# ... existing config ...
logging:
driver: loki
options:
loki-url: "http://loki:3100/loki/api/v1/push"
loki-external-labels: "container={{.Name}},compose_project={{index .Config.Labels \"com.docker.compose.project\"}}"
```
**Note**: This requires installing the Loki Docker logging driver plugin on the host; the Promtail approach above is simpler and needs no plugin.
## Grafana Dashboards
### Built-in Explore
Best way to start - use Grafana's Explore view:
1. Click "Explore" icon (compass)
2. Select "Loki" datasource
3. Use builder to create queries
4. Save interesting queries
### Pre-built Dashboards
You can import community dashboards:
1. Go to Dashboards → Import
2. Use dashboard ID: `13639` (Docker logs dashboard)
3. Select "Loki" as datasource
4. Import
### Create Custom Dashboard
1. Click "+" → "Dashboard"
2. Add panel
3. Select Loki datasource
4. Build query using LogQL
5. Save dashboard
**Example panels:**
- Error count by container
- Log volume over time
- Top 10 logging containers
- Recent errors table
## Alerting
### Create Log-Based Alerts
1. Go to Alerting → Alert rules
2. Create new alert rule
3. Query: `sum(count_over_time({container="jellyfin"} |= "error" [5m])) > 10`
4. Set thresholds and notification channels
5. Save
**Example alerts:**
- Too many errors in container
- Container restarted
- Disk space warnings
- Failed authentication attempts
## Troubleshooting
### Promtail Not Collecting Logs
**Check Promtail is running:**
```bash
docker logs promtail
```
**Verify Docker socket access:**
```bash
docker exec promtail ls -la /var/run/docker.sock
```
**Test Promtail config:**
```bash
docker exec promtail promtail -config.file=/etc/promtail/config.yaml -dry-run
```
### Loki Not Receiving Logs
**Check Loki health:**
```bash
curl http://localhost:3100/ready
```
**View Loki logs:**
```bash
docker logs loki
```
**Check Promtail is pushing:**
```bash
docker logs promtail | grep -i push
```
### Grafana Can't Connect to Loki
**Test Loki from Grafana container:**
```bash
docker exec grafana wget -O- http://loki:3100/ready
```
**Check datasource configuration:**
- Grafana → Configuration → Data sources → Loki
- URL should be: `http://loki:3100`
### No Logs Appearing
**Wait a few minutes** - logs take time to appear
**Check retention:**
```bash
# Logs older than retention period are deleted
grep retention_period loki-config.yaml
```
**Verify time range in Grafana:**
- Make sure selected time range includes recent logs
- Try "Last 5 minutes"
### High Disk Usage
**Check Loki data size:**
```bash
du -sh ./loki-data
```
**Reduce retention:**
```env
LOKI_RETENTION_PERIOD=7d # Shorter retention
```
**Manual cleanup:**
```bash
# Stop Loki
docker compose stop loki
# Remove old data (CAREFUL!)
rm -rf ./loki-data/chunks/*
# Restart
docker compose start loki
```
## Performance Tuning
### For Low Resources (< 8GB RAM)
**Edit `loki-config.yaml`:**
```yaml
limits_config:
retention_period: 7d # Shorter retention
ingestion_rate_mb: 5 # Lower rate
ingestion_burst_size_mb: 10 # Lower burst
query_range:
results_cache:
cache:
embedded_cache:
max_size_mb: 50 # Smaller cache
```
### For High Volume
**Edit `loki-config.yaml`:**
```yaml
limits_config:
ingestion_rate_mb: 20 # Higher rate
ingestion_burst_size_mb: 40 # Higher burst
query_range:
results_cache:
cache:
embedded_cache:
max_size_mb: 200 # Larger cache
```
## Best Practices
### Log Levels
Configure services to log appropriately:
- **Production**: `info` or `warning`
- **Development**: `debug`
- **Troubleshooting**: `trace`
Too much logging = higher resource usage!
### Retention Strategy
- **Critical services**: 60+ days
- **Normal services**: 30 days
- **High volume services**: 7-14 days
### Query Optimization
- **Use specific labels**: `{container="name"}` not `{container=~".*"}`
- **Limit time range**: Query hours not days when possible
- **Use filters early**: `|= "error"` before parsing
- **Avoid regex when possible**: `|= "string"` faster than `|~ "reg.*ex"`
### Storage Management
Monitor disk usage:
```bash
# Check regularly
du -sh compose/monitoring/logging/loki-data
# Set up alerts when > 80% disk usage
```
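A tiny helper along those lines (a sketch; the path and the 5 GB budget are assumptions to adjust for your setup):

```shell
# Warn when the Loki data directory exceeds a size budget (in MB).
check_loki_disk() {
  dir="$1"; limit_mb="$2"
  used_mb=$(du -sm "$dir" 2>/dev/null | cut -f1)
  used_mb=${used_mb:-0}   # missing directory counts as 0 MB
  if [ "$used_mb" -gt "$limit_mb" ]; then
    echo "WARN: ${dir} at ${used_mb}MB exceeds ${limit_mb}MB budget"
  else
    echo "OK: ${dir} at ${used_mb}MB (budget ${limit_mb}MB)"
  fi
}

# Example: a 5 GB budget for this stack's data directory.
check_loki_disk compose/monitoring/logging/loki-data 5120
```

Run it from cron and alert on `WARN` lines, or fold the same check into an existing monitoring job.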
## Integration with Homarr
Grafana will automatically appear in Homarr dashboard. You can also:
### Add Grafana Widget to Homarr
1. Edit Homarr dashboard
2. Add "iFrame" widget
3. URL: `https://logs.fig.systems/d/<dashboard-id>`
4. This embeds Grafana dashboards in Homarr
## Backup and Restore
### Backup
```bash
# Backup Loki data
tar czf loki-backup-$(date +%Y%m%d).tar.gz ./loki-data
# Backup Grafana dashboards and datasources
tar czf grafana-backup-$(date +%Y%m%d).tar.gz ./grafana-data ./grafana-provisioning
```
### Restore
```bash
# Restore Loki
docker compose down
tar xzf loki-backup-YYYYMMDD.tar.gz
docker compose up -d
# Restore Grafana
docker compose down
tar xzf grafana-backup-YYYYMMDD.tar.gz
docker compose up -d
```
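Before restoring over live data, it can be worth listing the archive to confirm it contains the expected top-level directory (a sketch; `loki-backup-YYYYMMDD.tar.gz` is the placeholder name from the backup step):

```shell
# Sanity-check a backup archive before restoring.
backup="loki-backup-YYYYMMDD.tar.gz"   # substitute the real date
if [ -f "$backup" ]; then
  tar tzf "$backup" | head
  tar tzf "$backup" | grep -q 'loki-data/' && echo "archive contains loki-data/"
fi
```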
## Updating
```bash
cd ~/homelab/compose/monitoring/logging
# Pull latest images
docker compose pull
# Restart with new images
docker compose up -d
```
## Resource Usage
**Typical usage:**
- **Loki**: 200-500MB RAM
- **Promtail**: 50-100MB RAM
- **Grafana**: 100-200MB RAM
- **Disk**: ~1-5GB per week (depends on log volume)
## Next Steps
1. ✅ Deploy the stack
2. ✅ Login to Grafana and explore logs
3. ✅ Create useful dashboards
4. ✅ Set up alerts for errors
5. ✅ Configure retention based on needs
6. ⬜ Add Prometheus for metrics (future)
7. ⬜ Add Tempo for distributed tracing (future)
## Resources
- [Loki Documentation](https://grafana.com/docs/loki/latest/)
- [LogQL Query Language](https://grafana.com/docs/loki/latest/logql/)
- [Promtail Configuration](https://grafana.com/docs/loki/latest/clients/promtail/configuration/)
- [Grafana Tutorials](https://grafana.com/tutorials/)
---
**Now you can see logs from all containers in one place!** 🎉


@@ -0,0 +1,123 @@
# Centralized Logging Stack - Loki + Promtail + Grafana
# Docs: https://grafana.com/docs/loki/latest/
services:
loki:
container_name: loki
image: grafana/loki:2.9.3
restart: unless-stopped
env_file:
- .env
volumes:
- ./loki-config.yaml:/etc/loki/local-config.yaml:ro
- ./loki-data:/loki
command: -config.file=/etc/loki/local-config.yaml
networks:
- homelab
- logging_internal
labels:
# Traefik (for API access)
traefik.enable: true
traefik.docker.network: homelab
# Loki API
traefik.http.routers.loki.rule: Host(`loki.fig.systems`) || Host(`loki.edfig.dev`)
traefik.http.routers.loki.entrypoints: websecure
traefik.http.routers.loki.tls.certresolver: letsencrypt
traefik.http.services.loki.loadbalancer.server.port: 3100
# SSO Protection
traefik.http.routers.loki.middlewares: tinyauth
# Homarr Discovery
homarr.name: Loki (Logs)
homarr.group: Monitoring
homarr.icon: mdi:math-log
healthcheck:
test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:3100/ready || exit 1"]
interval: 30s
timeout: 10s
retries: 3
start_period: 40s
promtail:
container_name: promtail
image: grafana/promtail:2.9.3
restart: unless-stopped
env_file:
- .env
volumes:
- ./promtail-config.yaml:/etc/promtail/config.yaml:ro
- /var/log:/var/log:ro
- /var/lib/docker/containers:/var/lib/docker/containers:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
command: -config.file=/etc/promtail/config.yaml
networks:
- logging_internal
depends_on:
loki:
condition: service_healthy
grafana:
container_name: grafana
image: grafana/grafana:10.2.3
restart: unless-stopped
env_file:
- .env
volumes:
- ./grafana-data:/var/lib/grafana
- ./grafana-provisioning:/etc/grafana/provisioning
networks:
- homelab
- logging_internal
depends_on:
loki:
condition: service_healthy
labels:
# Traefik
traefik.enable: true
traefik.docker.network: homelab
# Grafana Web UI
traefik.http.routers.grafana.rule: Host(`logs.fig.systems`) || Host(`logs.edfig.dev`)
traefik.http.routers.grafana.entrypoints: websecure
traefik.http.routers.grafana.tls.certresolver: letsencrypt
traefik.http.services.grafana.loadbalancer.server.port: 3000
# SSO Protection (optional - Grafana has its own auth)
# traefik.http.routers.grafana.middlewares: tinyauth
# Homarr Discovery
homarr.name: Grafana (Logs Dashboard)
homarr.group: Monitoring
homarr.icon: mdi:chart-line
healthcheck:
test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:3000/api/health || exit 1"]
interval: 30s
timeout: 10s
retries: 3
start_period: 40s
networks:
homelab:
external: true
logging_internal:
name: logging_internal
driver: bridge


@@ -0,0 +1,13 @@
apiVersion: 1
providers:
- name: 'Loki Dashboards'
orgId: 1
folder: 'Loki'
type: file
disableDeletion: false
updateIntervalSeconds: 10
allowUiUpdates: true
options:
path: /etc/grafana/provisioning/dashboards
foldersFromFilesStructure: true


@@ -0,0 +1,17 @@
apiVersion: 1
datasources:
- name: Loki
type: loki
access: proxy
url: http://loki:3100
isDefault: true
editable: true
jsonData:
maxLines: 1000
derivedFields:
# Extract traceID from logs for distributed tracing (optional)
- datasourceUid: tempo
matcherRegex: "traceID=(\\w+)"
name: TraceID
url: "$${__value.raw}"


@@ -0,0 +1,57 @@
auth_enabled: false
server:
http_listen_port: 3100
grpc_listen_port: 9096
common:
instance_addr: 127.0.0.1
path_prefix: /loki
storage:
filesystem:
chunks_directory: /loki/chunks
rules_directory: /loki/rules
replication_factor: 1
ring:
kvstore:
store: inmemory
query_range:
results_cache:
cache:
embedded_cache:
enabled: true
max_size_mb: 100
schema_config:
configs:
- from: 2020-10-24
store: boltdb-shipper
object_store: filesystem
schema: v11
index:
prefix: index_
period: 24h
ruler:
alertmanager_url: http://localhost:9093
# Retention - keeps logs for 30 days
limits_config:
retention_period: 30d
ingestion_rate_mb: 10
ingestion_burst_size_mb: 20
# Cleanup old logs
compactor:
working_directory: /loki/compactor
shared_store: filesystem
compaction_interval: 10m
retention_enabled: true
retention_delete_delay: 2h
retention_delete_worker_count: 150
# Table manager for retention
table_manager:
retention_deletes_enabled: true
retention_period: 30d


@@ -0,0 +1,70 @@
server:
http_listen_port: 9080
grpc_listen_port: 0
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
# Docker containers logs
- job_name: docker
docker_sd_configs:
- host: unix:///var/run/docker.sock
refresh_interval: 5s
filters:
- name: label
values: ["logging=promtail"]
relabel_configs:
# Use container name as job
- source_labels: ['__meta_docker_container_name']
regex: '/(.*)'
target_label: 'container'
# Use image name
- source_labels: ['__meta_docker_container_image']
target_label: 'image'
# Use container ID
- source_labels: ['__meta_docker_container_id']
target_label: 'container_id'
# Add all docker labels as labels
- action: labelmap
regex: __meta_docker_container_label_(.+)
# All Docker containers (fallback)
- job_name: docker_all
docker_sd_configs:
- host: unix:///var/run/docker.sock
refresh_interval: 5s
relabel_configs:
- source_labels: ['__meta_docker_container_name']
regex: '/(.*)'
target_label: 'container'
- source_labels: ['__meta_docker_container_image']
target_label: 'image'
- source_labels: ['__meta_docker_container_log_stream']
target_label: 'stream'
# Extract compose project and service
- source_labels: ['__meta_docker_container_label_com_docker_compose_project']
target_label: 'compose_project'
- source_labels: ['__meta_docker_container_label_com_docker_compose_service']
target_label: 'compose_service'
# System logs
- job_name: system
static_configs:
- targets:
- localhost
labels:
job: varlogs
__path__: /var/log/*log


@@ -0,0 +1,445 @@
# Centralized Logging with Loki
Guide for setting up and using the centralized logging stack (Loki + Promtail + Grafana).
## Overview
The logging stack provides centralized log aggregation and visualization for all Docker containers:
- **Loki**: Log aggregation backend (stores and indexes logs)
- **Promtail**: Agent that collects logs from Docker containers
- **Grafana**: Web UI for querying and visualizing logs
### Why Centralized Logging?
**Problems without it:**
- Logs scattered across many containers
- Hard to correlate events across services
- Logs lost when containers restart
- No easy way to search historical logs
**Benefits:**
- ✅ Single place to view all logs
- ✅ Powerful search and filtering (LogQL)
- ✅ Persist logs even after container restarts
- ✅ Correlate events across services
- ✅ Create dashboards and alerts
- ✅ Configurable retention (30 days default)
## Quick Setup
### 1. Configure Grafana Password
```bash
cd ~/homelab/compose/monitoring/logging
nano .env
```
**Update:**
```env
GF_SECURITY_ADMIN_PASSWORD=<your-strong-password>
```
**Generate password:**
```bash
openssl rand -base64 20
```
### 2. Deploy
```bash
cd ~/homelab/compose/monitoring/logging
docker compose up -d
```
### 3. Access Grafana
Go to: **https://logs.fig.systems**
**Login:**
- Username: `admin`
- Password: `<your GF_SECURITY_ADMIN_PASSWORD>`
### 4. Start Exploring Logs
1. Click **Explore** (compass icon) in left sidebar
2. Loki datasource should be selected
3. Start querying!
## Basic Usage
### View Logs from a Container
```logql
{container="jellyfin"}
```
### View Last Hour's Logs
Set Grafana's time picker to "Last 1 hour" and use a plain stream selector; LogQL queries do not embed relative time ranges themselves:
```logql
{container="immich_server"}
```
### Filter for Errors
```logql
{container="traefik"} |= "error"
```
### Exclude Lines
```logql
{container="traefik"} != "404"
```
### Multiple Containers
```logql
{container=~"jellyfin|immich.*"}
```
### By Compose Project
```logql
{compose_project="media"}
```
## Advanced Queries
### Count Errors
```logql
sum(count_over_time({container="jellyfin"} |= "error" [5m]))
```
### Error Rate
```logql
rate({container="traefik"} |= "error" [5m])
```
### Parse JSON Logs
```logql
{container="linkwarden"} | json | level="error"
```
### Top 10 Containers by Error Count
```logql
topk(10,
  sum by (container) (
    count_over_time({job="docker"} |= "error" [24h])
  )
)
```
## Creating Dashboards
### Import Pre-built Dashboard
1. Go to **Dashboards** → **Import**
2. Dashboard ID: **13639** (Docker logs)
3. Select **Loki** as datasource
4. Click **Import**
### Create Custom Dashboard
1. Click **+** → **Dashboard**
2. **Add panel**
3. Select **Loki** datasource
4. Build query
5. Choose visualization (logs, graph, table, etc.)
6. **Save**
**Example panels:**
- Error count by container
- Log volume over time
- Recent errors (table)
- Top logging containers
## Setting Up Alerts
### Create Alert Rule
1. **Alerting** → **Alert rules** → **New alert rule**
2. **Query:**
```logql
sum(count_over_time({container="jellyfin"} |= "error" [5m])) > 10
```
3. **Condition**: Alert when > 10 errors in 5 minutes
4. **Configure** notification channel (email, webhook, etc.)
5. **Save**
**Example alerts:**
- Too many errors in service
- Service stopped logging (might have crashed)
- Authentication failures
- Disk space warnings
## Configuration
### Change Log Retention
**Default: 30 days**
Edit `.env`:
```env
LOKI_RETENTION_PERIOD=60d # 60 days
```
Edit `loki-config.yaml`:
```yaml
limits_config:
retention_period: 60d
table_manager:
retention_period: 60d
```
Restart:
```bash
docker compose restart loki
```
### Adjust Resource Limits
For low-resource systems, edit `loki-config.yaml`:
```yaml
limits_config:
retention_period: 7d # Shorter retention
ingestion_rate_mb: 5 # Lower rate
query_range:
results_cache:
cache:
embedded_cache:
max_size_mb: 50 # Smaller cache
```
### Add Labels to Services
Make services easier to find by adding labels:
**Edit service `compose.yaml`:**
```yaml
services:
myservice:
labels:
logging: "promtail"
environment: "production"
tier: "frontend"
```
Query with these labels:
```logql
{environment="production", tier="frontend"}
```
## Troubleshooting
### No Logs Appearing
**Wait a few minutes** - initial log collection takes time
**Check Promtail:**
```bash
docker logs promtail
```
**Check Loki:**
```bash
docker logs loki
```
**Verify Promtail can reach Loki:**
```bash
docker exec promtail wget -O- http://loki:3100/ready
```
### Grafana Can't Connect to Loki
**Test from Grafana:**
```bash
docker exec grafana wget -O- http://loki:3100/ready
```
**Check datasource:** Grafana → Configuration → Data sources → Loki
- URL should be: `http://loki:3100`
### High Disk Usage
**Check size:**
```bash
du -sh compose/monitoring/logging/loki-data
```
**Reduce retention:**
```env
LOKI_RETENTION_PERIOD=7d
```
**Manual cleanup (CAREFUL):**
```bash
docker compose stop loki
rm -rf loki-data/chunks/*
docker compose start loki
```
### Slow Queries
**Optimize queries:**
- Use specific labels: `{container="name"}` not `{container=~".*"}`
- Limit time range: Hours not days
- Filter early: `|= "error"` before parsing
- Avoid complex regex
## Best Practices
### Log Verbosity
Configure appropriate log levels per environment:
- **Production**: `info` or `warning`
- **Debugging**: `debug` or `trace`
Too verbose = wasted resources!
### Retention Strategy
Match retention to importance:
- **Critical services**: 60-90 days
- **Normal services**: 30 days
- **High-volume services**: 7-14 days
### Useful Queries to Save
Create saved queries for common tasks:
**Recent errors** (with the time picker set to "Last 15 minutes"):
```logql
{job="docker"} |= "error"
```
**Service health check:**
```logql
{container="traefik"} |= "request"
```
**Failed logins:**
```logql
{container="lldap"} |= "failed" |= "login"
```
## Integration Tips
### Embed in Homarr
Add Grafana dashboards to Homarr:
1. Edit Homarr dashboard
2. Add **iFrame widget**
3. URL: `https://logs.fig.systems/d/<dashboard-id>`
### Use with Backups
Include logging data in backups:
```bash
cd ~/homelab/compose/monitoring/logging
tar czf logging-backup-$(date +%Y%m%d).tar.gz loki-data/ grafana-data/
```
### Combine with Metrics
Later you can add Prometheus for metrics:
- Loki for logs
- Prometheus for metrics (CPU, RAM, disk)
- Both in Grafana dashboards
## Common LogQL Patterns
### Filter by Time
Time ranges are not part of the LogQL expression. Set them with Grafana's time picker (for example "Last 5 minutes"), or with the `start` and `end` query parameters when calling Loki's HTTP API directly.
### Pattern Matching
```logql
# Contains
{container="name"} |= "error"
# Does not contain
{container="name"} != "404"
# Regex match
{container="name"} |~ "error|fail|critical"
# Regex does not match
{container="name"} !~ "debug|trace"
```
### Aggregations
```logql
# Count
count_over_time({container="name"}[5m])
# Rate
rate({container="name"}[5m])
# Sum
sum(count_over_time({job="docker"}[1h])) by (container)
# Average
avg_over_time({container="name"} | unwrap bytes [5m])
```
### JSON Parsing
```logql
# Parse JSON and filter
{container="name"} | json | level="error"
# Extract field
{container="name"} | json | line_format "{{.message}}"
# Filter on a JSON field
{container="name"} | json | status_code="500"
```
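The same LogQL works outside Grafana, against Loki's HTTP API (a sketch; the commented `curl` assumes Loki reachable at `localhost:3100`):

```shell
# A LogQL query to run against Loki's query_range endpoint.
query='{container="traefik"} |= "error"'
# curl -G URL-encodes the query for us (requires a running Loki):
# curl -s -G http://localhost:3100/loki/api/v1/query_range \
#   --data-urlencode "query=${query}" --data-urlencode "limit=50"
echo "would query: ${query}"
```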
## Resource Usage
**Typical usage:**
- **Loki**: 200-500MB RAM, 1-5GB disk/week
- **Promtail**: 50-100MB RAM
- **Grafana**: 100-200MB RAM, ~100MB disk
- **Total**: ~400-700MB RAM
Estimates assume roughly 20 containers with moderate log volume.
## Next Steps
1. ✅ Explore your logs in Grafana
2. ✅ Create useful dashboards
3. ✅ Set up alerts for critical errors
4. ⬜ Add Prometheus for metrics (future)
5. ⬜ Add Tempo for distributed tracing (future)
6. ⬜ Create log-based SLA tracking
## Resources
- [Loki Documentation](https://grafana.com/docs/loki/latest/)
- [LogQL Reference](https://grafana.com/docs/loki/latest/logql/)
- [Grafana Dashboards](https://grafana.com/grafana/dashboards/)
- [Community Dashboards](https://grafana.com/grafana/dashboards/?search=loki)
---
**Now debug issues 10x faster with centralized logs!** 🔍