homelab/docs/troubleshooting/common-issues.md
Claude 4adaa8e8be
docs: Add comprehensive documentation for homelab setup and operations
This commit adds extensive documentation covering all aspects of homelab setup,
configuration, and troubleshooting.

## Documentation Structure

### Main Documentation
- **docs/README.md**: Documentation hub with table of contents
- **docs/getting-started.md**: Complete setup guide from scratch
- **docs/quick-reference.md**: Fast reference for common tasks and commands

### Configuration Guides (docs/guides/)
- **secrets-management.md**: Environment variables and secrets configuration
  - How to generate secure secrets
  - Service-specific configuration
  - Automated secret generation scripts
  - Security best practices
  - Common mistakes to avoid

- **gpu-setup.md**: NVIDIA GTX 1070 GPU acceleration setup
  - Specific to Proxmox 9 on Debian 13
  - Complete passthrough configuration
  - Jellyfin hardware transcoding setup
  - Immich ML inference acceleration
  - Performance tuning and benchmarks
  - Troubleshooting GPU issues

### Troubleshooting (docs/troubleshooting/)
- **faq.md**: Frequently asked questions (60+ Q&A)
  - General questions about the homelab
  - Setup and configuration questions
  - SSL/TLS and SSO questions
  - Service-specific questions
  - Security and backup questions
  - Performance optimization

- **common-issues.md**: Common problems and solutions
  - Service startup failures
  - SSL certificate errors
  - SSO authentication issues
  - Access problems
  - Performance issues
  - Database errors
  - Network issues
  - GPU problems

### Services (docs/services/)
- **README.md**: Complete service overview
  - All 20 services with descriptions
  - Use cases for each service
  - Resource requirements
  - Deployment checklists
  - Service dependencies
  - Minimum viable setups

## Key Features

### Environment-Specific
All GPU documentation is specific to:
- **Platform**: Proxmox 9 (PVE)
- **OS**: Debian 13
- **GPU**: NVIDIA GTX 1070 (Pascal)
- Includes Proxmox-specific GPU passthrough
- VM guest setup on Debian 13
- NVIDIA Container Toolkit configuration

### Comprehensive Coverage
- 60+ FAQs answered
- 50+ common issues documented
- 100+ command examples
- Step-by-step procedures
- Troubleshooting decision trees
- Quick reference tables

### Practical Examples
- Actual command outputs
- Real-world scenarios
- Copy-paste ready commands
- Configuration file examples
- Debugging procedures

## Documentation Highlights

### Getting Started Guide
- Prerequisites checklist
- Docker installation
- Media directory setup
- DNS configuration
- Environment variable setup
- Service deployment order
- Initial service configuration
- Verification procedures

### Secrets Management
- Secret type identification
- Generation commands for each type
- Service-specific requirements
- Automated generation script
- Password manager integration
- Backup procedures
- Security best practices
- Common mistakes

### GPU Setup (Proxmox/Debian/GTX 1070)
- IOMMU enablement
- VFIO configuration
- PCI passthrough to VM
- NVIDIA driver installation on Debian 13
- Container toolkit setup
- Jellyfin NVENC configuration
- Immich CUDA acceleration
- Performance benchmarks
- NVENC stream limit unlock
- Monitoring and tuning

### Quick Reference
- All service URLs
- Common Docker Compose commands
- System check commands
- Secret generation commands
- Troubleshooting steps
- File locations
- Port reference
- Emergency procedures

### FAQ
Covers questions about:
- Hardware requirements
- Domain requirements
- Cost estimates
- Setup procedures
- Configuration details
- SSL certificates
- SSO authentication
- Service-specific issues
- Backup strategies
- Performance optimization
- Security considerations

### Common Issues
Solutions for:
- Container startup failures
- Environment variable errors
- Port conflicts
- Permission issues
- SSL certificate problems
- DNS issues
- SSO login failures
- Database connections
- Network connectivity
- GPU detection
- Resource constraints

### Services Overview
- Detailed description of all 20 services
- Use cases and features
- Required vs optional services
- Resource requirements by tier
- Service dependencies diagram
- Deployment checklists
- "When to use" guidance

## File Structure

```
docs/
├── README.md                           # Documentation hub
├── getting-started.md                  # Setup walkthrough
├── quick-reference.md                  # Command reference
├── guides/
│   ├── secrets-management.md           # Secrets configuration
│   └── gpu-setup.md                    # GPU acceleration (GTX 1070)
├── troubleshooting/
│   ├── faq.md                          # 60+ FAQs
│   └── common-issues.md                # Problem solving
└── services/
    └── README.md                       # Service overview
```

## Benefits

### For New Users
- Clear setup path from zero to running services
- Explains "why" not just "how"
- Common pitfalls documented and avoided
- Example configurations provided

### For Experienced Users
- Quick reference for commands
- Troubleshooting decision trees
- Performance tuning guides
- Advanced configurations

### For Maintenance
- Update procedures
- Backup and restore
- Monitoring guidelines
- Security hardening

## Documentation Standards

- Clear, concise writing
- Code blocks with syntax highlighting
- Examples with expected output
- Warning and tip callouts
- Cross-references between docs
- Tested commands and procedures

## Next Steps

Users should:
1. Start with getting-started.md
2. Configure secrets using secrets-management.md
3. Enable GPU if available (gpu-setup.md)
4. Use quick-reference.md for daily operations
5. Refer to faq.md and common-issues.md when stuck

---

**This documentation makes the homelab accessible to users of all skill levels!**
2025-11-06 19:32:10 +00:00


# Common Issues and Solutions
This guide covers the most common problems you might encounter and how to fix them.
## Table of Contents
- [Service Won't Start](#service-wont-start)
- [SSL/TLS Certificate Errors](#ssltls-certificate-errors)
- [SSO Authentication Issues](#sso-authentication-issues)
- [Access Issues](#access-issues)
- [Performance Problems](#performance-problems)
- [Database Errors](#database-errors)
- [Network Issues](#network-issues)
- [GPU Problems](#gpu-problems)
## Service Won't Start
### Symptom
Container exits immediately or shows "Exited (1)" status.
### Diagnosis
```bash
cd ~/homelab/compose/path/to/service
# Check container status
docker compose ps
# View logs
docker compose logs
# Check for specific errors
docker compose logs | grep -i error
```
### Common Causes and Fixes
#### 1. Environment Variables Not Set
**Error in logs:**
```
Error: POSTGRES_PASSWORD is not set
Error: required environment variable 'XXX' is missing
```
**Fix:**
```bash
# Check .env file exists
ls -la .env
# Check for changeme_ values
grep "changeme_" .env
# Update with proper secrets (see secrets guide)
nano .env
# Restart
docker compose up -d
```
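
The two checks above can be combined into one helper that lists every variable still needing a real value. This is a minimal sketch, assuming `.env` contains plain `KEY=value` lines and that placeholders start with `changeme_`; the function name `unset_env_keys` is hypothetical and not part of the repository.

```bash
# Print the names of variables in an env file that are empty or still
# hold a changeme_ placeholder. Comment lines and blank lines are skipped.
unset_env_keys() {
  local file="${1:-.env}"
  awk -F= '
    /^[[:space:]]*(#|$)/ { next }            # skip comments and blank lines
    {
      key = $1
      value = substr($0, length($1) + 2)     # everything after the first "="
      gsub(/^["\047]|["\047]$/, "", value)   # drop surrounding quotes
      if (value == "" || value ~ /^changeme_/) print key
    }
  ' "$file"
}
```

Run `unset_env_keys .env` in the service directory; no output means every variable has a non-placeholder value.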
#### 2. Port Already in Use
**Error in logs:**
```
Error: bind: address already in use
Error: failed to bind to port 80: address already in use
```
**Fix:**
```bash
# Find what's using the port (the trailing space keeps :80 from matching :8080)
sudo ss -tulpn | grep ':80 '
sudo ss -tulpn | grep ':443 '
# Stop conflicting service
sudo systemctl stop apache2 # Example
sudo systemctl stop nginx # Example
# Or change port in compose.yaml
```
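
A plain `grep :80` also matches ports such as 8080. The sketch below filters on the exact local port instead. It assumes `ss` from iproute2 is available and that the input comes from `ss -tulnH`, where the fifth column is the local address and port; `listeners_on_port` is a hypothetical helper name.

```bash
# Filter `ss -tulnH` output (stdin) down to sockets whose local port is
# exactly $1. In that output the fifth column is "Local Address:Port".
listeners_on_port() {
  awk -v port="$1" '{
    n = split($5, parts, ":")      # the port is the last ":"-separated field
    if (parts[n] == port) print
  }'
}

# Usage on a live system:
#   ss -tulnH | listeners_on_port 80
```

Note that adding `-p` (with `sudo`) to show the owning process appends a column but leaves the fifth column unchanged.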
#### 3. Network Not Created
**Error in logs:**
```
network homelab declared as external, but could not be found
```
**Fix:**
```bash
# Create network
docker network create homelab
# Verify
docker network ls | grep homelab
# Restart service
docker compose up -d
```
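
If you script your deployments, the network can be created idempotently so that repeated runs do not fail with "network already exists". A minimal sketch, assuming the external network is named `homelab` as above; `ensure_network` is a hypothetical helper name.

```bash
# Create the named Docker network only when it does not already exist,
# so the command is safe to run repeatedly.
ensure_network() {
  local name="${1:-homelab}"
  if docker network inspect "$name" >/dev/null 2>&1; then
    echo "network $name already exists"
  else
    docker network create "$name" >/dev/null && echo "network $name created"
  fi
}
```

Call `ensure_network homelab` before `docker compose up -d`.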
#### 4. Volume Permission Issues
**Error in logs:**
```
Permission denied: '/config'
mkdir: cannot create directory '/data': Permission denied
```
**Fix:**
```bash
# Check directory ownership
ls -la ./config ./data
# Fix ownership (replace 1000:1000 with your UID:GID)
sudo chown -R 1000:1000 ./config ./data
# Restart
docker compose up -d
```
#### 5. Dependency Not Running
**Error in logs:**
```
Failed to connect to database
Connection refused: postgres:5432
```
**Fix:**
```bash
# Start dependency first
cd ~/homelab/compose/path/to/dependency
docker compose up -d
# Wait for it to be healthy
docker compose logs -f
# Then start the service
cd ~/homelab/compose/path/to/service
docker compose up -d
```
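
Instead of watching the logs by hand, you can poll the dependency's health status. This sketch assumes the dependency container defines a Docker healthcheck (without one, `.State.Health` is not populated and the loop times out); `wait_healthy` and its 2-second poll interval are illustrative choices, not something the repository ships.

```bash
# Poll a container's health status until it reports "healthy" or the
# timeout (in seconds, default 60) runs out.
wait_healthy() {
  local name="$1" timeout="${2:-60}" waited=0 status
  while [ "$waited" -lt "$timeout" ]; do
    status="$(docker inspect --format '{{.State.Health.Status}}' "$name" 2>/dev/null)"
    if [ "$status" = "healthy" ]; then
      echo "$name is healthy"
      return 0
    fi
    sleep 2
    waited=$((waited + 2))
  done
  echo "$name not healthy after ${timeout}s" >&2
  return 1
}
```

For example, `wait_healthy <db_container> 120 && docker compose up -d` starts the service only once the database reports healthy.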
## SSL/TLS Certificate Errors
### Symptom
Browser shows "Your connection is not private" or "NET::ERR_CERT_AUTHORITY_INVALID".
### Diagnosis
```bash
# Check Traefik logs
docker logs traefik | grep -i certificate
docker logs traefik | grep -i letsencrypt
docker logs traefik | grep -i error
# Test certificate
echo | openssl s_client -servername home.fig.systems -connect home.fig.systems:443 2>/dev/null | openssl x509 -noout -dates
```
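
To turn the `notAfter` date into a number you can act on, the sketch below converts it to days remaining. It assumes GNU `date` (as on Debian) and takes the `notAfter=...` line printed by `openssl x509 -noout -enddate`; `days_until_expiry` is a hypothetical helper name.

```bash
# Convert an openssl "notAfter=..." line into whole days remaining.
# The optional second argument is "now" as a Unix timestamp (defaults
# to the current time). Requires GNU date for the -d option.
days_until_expiry() {
  local not_after="${1#notAfter=}" now="${2:-$(date +%s)}" expiry
  expiry="$(date -u -d "$not_after" +%s)" || return 1
  echo $(( (expiry - now) / 86400 ))
}

# Usage:
#   line="$(echo | openssl s_client -servername home.fig.systems \
#     -connect home.fig.systems:443 2>/dev/null | openssl x509 -noout -enddate)"
#   days_until_expiry "$line"
```

A negative result means the certificate has already expired.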
### Common Causes and Fixes
#### 1. DNS Not Configured
**Fix:**
```bash
# Test DNS resolution
dig home.fig.systems +short
# Should return your server's IP
# If not, configure DNS A records:
# *.fig.systems -> YOUR_SERVER_IP
```
#### 2. Port 80 Not Accessible
Let's Encrypt needs inbound access to port 80 to complete the HTTP-01 challenge.
**Fix:**
```bash
# Test from external network
curl -I http://home.fig.systems
# Check firewall
sudo ufw status
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
# Check port forwarding on router
# Ensure ports 80 and 443 are forwarded to server
```
#### 3. Rate Limiting
Let's Encrypt enforces rate limits, including 5 certificates per exact set of hostnames per week and 50 certificates per registered domain per week.
**Fix:**
```bash
# Check Traefik logs for rate limit errors
docker logs traefik | grep -i "rate limit"
# Wait for rate limit to reset (1 week)
# Or use Let's Encrypt staging environment for testing
# Enable staging in traefik/compose.yaml:
# - --certificatesresolvers.letsencrypt.acme.caserver=https://acme-staging-v02.api.letsencrypt.org/directory
```
#### 4. First Startup - Certificates Not Yet Generated
**Fix:**
```bash
# Wait 2-5 minutes for certificate generation
docker logs traefik -f
# Look for:
# "Certificate obtained for domain"
```
#### 5. Certificate Expired
Traefik should renew certificates automatically. If you need to force a renewal:
**Fix:**
```bash
# Remove old certificates (Traefik re-requests all of them, which counts toward rate limits)
cd ~/homelab/compose/core/traefik
rm -rf ./acme.json
# Restart Traefik
docker compose restart
# Wait for new certificates
docker logs traefik -f
```
## SSO Authentication Issues
### Symptom
- Can't login to SSO-protected services
- Redirected to auth page but login fails
- "Invalid credentials" error
### Diagnosis
```bash
# Check LLDAP is running
docker ps | grep lldap
# Check Tinyauth is running
docker ps | grep tinyauth
# View logs
docker logs lldap
docker logs tinyauth
```
### Common Causes and Fixes
#### 1. Password Mismatch
LDAP_BIND_PASSWORD must match LLDAP_LDAP_USER_PASS.
**Fix:**
```bash
# Check both passwords
grep LLDAP_LDAP_USER_PASS ~/homelab/compose/core/lldap/.env
grep LDAP_BIND_PASSWORD ~/homelab/compose/core/tinyauth/.env
# They must be EXACTLY the same!
# If different, update tinyauth/.env
cd ~/homelab/compose/core/tinyauth
nano .env
# Set LDAP_BIND_PASSWORD to match LLDAP_LDAP_USER_PASS
# Restart Tinyauth
docker compose restart
```
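
You can also compare the two values without printing either secret to the terminal. A minimal sketch, assuming both `.env` files use plain `KEY=value` lines, optionally quoted; `env_value` and `same_secret` are hypothetical helper names.

```bash
# Print the value of KEY ($1) from an env FILE ($2); the last assignment
# wins and one pair of surrounding quotes is stripped.
env_value() {
  sed -n "s/^$1=//p" "$2" | tail -n 1 \
    | sed -e 's/^"\(.*\)"$/\1/' -e "s/^'\(.*\)'\$/\1/"
}

# Succeed only when KEY1 in FILE1 equals KEY2 in FILE2.
# Nothing is printed, so the secrets never reach the terminal.
same_secret() {
  [ "$(env_value "$1" "$2")" = "$(env_value "$3" "$4")" ]
}

# Usage:
#   same_secret LLDAP_LDAP_USER_PASS ~/homelab/compose/core/lldap/.env \
#               LDAP_BIND_PASSWORD ~/homelab/compose/core/tinyauth/.env \
#     && echo "passwords match" || echo "passwords differ"
```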
#### 2. User Doesn't Exist in LLDAP
**Fix:**
```bash
# Access LLDAP web UI
# Go to: https://lldap.fig.systems
# Login with admin credentials
# Username: admin
# Password: <your LLDAP_LDAP_USER_PASS>
# Create user:
# - Click "Create user"
# - Set username, email, password
# - Add to "lldap_admin" group
# Try logging in again
```
#### 3. LLDAP or Tinyauth Not Running
**Fix:**
```bash
# Start LLDAP
cd ~/homelab/compose/core/lldap
docker compose up -d
# Wait for it to be ready
docker compose logs -f
# Start Tinyauth
cd ~/homelab/compose/core/tinyauth
docker compose up -d
docker compose logs -f
```
#### 4. Network Issue Between Tinyauth and LLDAP
**Fix:**
```bash
# Test connection
docker exec tinyauth nc -zv lldap 3890
# Should show: Connection to lldap 3890 port [tcp/*] succeeded!
# If not, check both are on homelab network
docker network inspect homelab
```
## Access Issues
### Symptom
- Can't access service from browser
- Connection timeout
- "This site can't be reached"
### Diagnosis
```bash
# Test from server
curl -I https://home.fig.systems
# Test DNS
dig home.fig.systems +short
# Check container is running
docker ps | grep servicename
# Check Traefik routing
docker logs traefik | grep servicename
```
### Common Causes and Fixes
#### 1. Service Not Running
**Fix:**
```bash
cd ~/homelab/compose/path/to/service
docker compose up -d
docker compose logs -f
```
#### 2. Traefik Not Running
**Fix:**
```bash
cd ~/homelab/compose/core/traefik
docker compose up -d
docker compose logs -f
```
#### 3. DNS Not Resolving
**Fix:**
```bash
# Check DNS
dig service.fig.systems +short
# Should return your server IP
# If not, add/update DNS A record
```
#### 4. Firewall Blocking
**Fix:**
```bash
# Check firewall
sudo ufw status
# Allow if needed
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
```
#### 5. Wrong Traefik Labels
**Fix:**
```bash
# Check compose.yaml has correct labels
cd ~/homelab/compose/path/to/service
grep -A 10 "labels:" compose.yaml
# Should have:
# traefik.enable: true
# traefik.http.routers.servicename.rule: Host(`service.fig.systems`)
# etc.
```
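
To see at a glance which hostnames a compose file asks Traefik to route, you can extract the `Host(...)` rules. This is a rough sketch that assumes rules are written with backtick-quoted hostnames as in the example above; `traefik_hosts` is a hypothetical helper name.

```bash
# List the unique hostnames found in Host(`...`) router rules of a
# compose file (default: compose.yaml in the current directory).
traefik_hosts() {
  grep -o 'Host(`[^`]*`)' "${1:-compose.yaml}" \
    | sed 's/^Host(`//; s/`)$//' \
    | sort -u
}
```

Each hostname printed by `traefik_hosts compose.yaml` should resolve to your server when checked with `dig <hostname> +short`.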
## Performance Problems
### Symptom
- Services running slowly
- High CPU/RAM usage
- System unresponsive
### Diagnosis
```bash
# Overall system
htop
# Docker resources
docker stats
# Disk usage
df -h
docker system df
```
### Common Causes and Fixes
#### 1. Insufficient RAM
**Fix:**
```bash
# Check RAM usage
free -h
# If low, either:
# 1. Add more RAM
# 2. Stop unused services
# 3. Add resource limits to compose files
# Example resource limit:
deploy:
  resources:
    limits:
      memory: 2G
    reservations:
      memory: 1G
```
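
To find which containers are worth limiting or stopping, you can filter `docker stats` output by memory percentage. A minimal sketch; `containers_over_mem` is a hypothetical helper name and the default threshold of 80 is arbitrary. Keep in mind that the percentage is relative to the container's memory limit, or to host memory when no limit is set.

```bash
# Read "name percent" pairs on stdin and print the containers whose
# memory percentage is above the threshold (default 80).
containers_over_mem() {
  awk -v limit="${1:-80}" '{
    pct = $2
    sub(/%$/, "", pct)                     # "62.10%" -> "62.10"
    if (pct + 0 > limit + 0) print $1, $2
  }'
}

# Usage:
#   docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' | containers_over_mem 50
```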
#### 2. Disk Full
**Fix:**
```bash
# Check disk usage
df -h
# Clean Docker (removes stopped containers, unused networks, and ALL unused images)
docker system prune -a
# Remove old logs
sudo journalctl --vacuum-time=7d
# Check media folder
du -sh /media/*
```
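
For a quick answer to "which filesystem is actually full?", the `df` output can be filtered by usage. A small sketch, assuming POSIX-format output from `df -P` and mount points without spaces; `filesystems_over` is a hypothetical helper name.

```bash
# Read `df -P` output on stdin and print the mount point and usage of
# every filesystem at or above the threshold percentage (default 90).
filesystems_over() {
  awk -v limit="${1:-90}" 'NR > 1 {
    used = $5
    sub(/%$/, "", used)                    # "95%" -> "95"
    if (used + 0 >= limit + 0) print $6, $5
  }'
}

# Usage:
#   df -P | filesystems_over 85
```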
#### 3. Too Many Services Running
**Fix:**
```bash
# Stop unused services
cd ~/homelab/compose/services/unused-service
docker compose down
# Or deploy only what you need
```
#### 4. Database Not Optimized
**Fix:**
```bash
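# For postgres services, add to .env. This only takes effect when the database
# is first initialized, and enables data checksums (corruption detection, not speed):
POSTGRES_INITDB_ARGS=--data-checksums
# Increase shared buffers (if enough RAM):
# Edit compose.yaml, add to postgres:
command: postgres -c shared_buffers=256MB -c max_connections=200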
```
## Database Errors
### Symptom
- "Connection refused" to database
- "Authentication failed for user"
- "Database does not exist"
### Diagnosis
```bash
# Check database container
docker ps | grep postgres
# View database logs
docker logs <postgres_container_name>
# Test connection from app
docker exec <app_container> nc -zv <db_container> 5432
```
### Common Causes and Fixes
#### 1. Password Mismatch
**Fix:**
```bash
# Check passwords match in .env
grep PASSWORD .env
# For example, in Vikunja:
# VIKUNJA_DATABASE_PASSWORD and POSTGRES_PASSWORD must match!
# Update if needed
nano .env
docker compose down
docker compose up -d
```
#### 2. Database Not Initialized
**Fix:**
```bash
# Remove database and reinitialize
docker compose down
rm -rf ./db/ # CAREFUL: This deletes all data!
docker compose up -d
```
#### 3. Database Still Starting
**Fix:**
```bash
# Wait for database to be ready
docker logs <postgres_container> -f
# Look for "database system is ready to accept connections"
# Then restart app
docker compose restart <app_service>
```
## Network Issues
### Symptom
- Containers can't communicate
- "Connection refused" between services
### Diagnosis
```bash
# Inspect network
docker network inspect homelab
# Test connectivity
docker exec container1 ping -c 3 container2
docker exec container1 nc -zv container2 PORT
```
### Common Causes and Fixes
#### 1. Containers Not on Same Network
**Fix:**
```bash
# Check compose.yaml has networks section
networks:
  homelab:
    external: true
# Ensure service is using the network
services:
  servicename:
    networks:
      - homelab
```
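
To confirm that a running container is really attached to the network, rather than reading the whole `docker network inspect` output, you can query the container directly. A minimal sketch; `on_network` is a hypothetical helper name.

```bash
# Succeed when container $1 is attached to the Docker network $2.
on_network() {
  docker inspect \
    --format '{{range $name, $net := .NetworkSettings.Networks}}{{$name}}{{"\n"}}{{end}}' \
    "$1" | grep -qx "$2"
}

# Usage:
#   on_network servicename homelab && echo "attached" || echo "NOT attached"
```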
#### 2. Network Doesn't Exist
**Fix:**
```bash
docker network create homelab
docker compose up -d
```
#### 3. DNS Resolution Between Containers
**Fix:**
```bash
# Use the container name, not localhost
# Wrong: localhost:5432
# Right: postgres:5432
# Or use the service name from compose.yaml
```
## GPU Problems
### Symptom
- "No hardware acceleration available"
- GPU not detected in container
- "Failed to open GPU"
### Diagnosis
```bash
# Check GPU on host
nvidia-smi
# Check GPU in container
docker exec jellyfin nvidia-smi
# Check Docker GPU runtime
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```
### Common Causes and Fixes
#### 1. NVIDIA Container Toolkit Not Installed
**Fix:**
```bash
# Install toolkit
sudo apt install nvidia-container-toolkit
# Configure runtime
sudo nvidia-ctk runtime configure --runtime=docker
# Restart Docker
sudo systemctl restart docker
```
#### 2. Runtime Not Specified in Compose
**Fix:**
```bash
# Edit compose.yaml
nano compose.yaml
# Uncomment:
runtime: nvidia
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all
          capabilities: [gpu]
# Restart
docker compose up -d
```
#### 3. GPU Already in Use
**Fix:**
```bash
# Check processes using GPU
nvidia-smi
# Kill process if needed
sudo kill <PID>
# Restart service
docker compose restart
```
#### 4. GPU Not Passed Through to VM (Proxmox)
**Fix:**
```bash
# From Proxmox host, check GPU passthrough
lspci | grep -i nvidia
# From VM, check GPU visible
lspci | grep -i nvidia
# If not visible, reconfigure passthrough (see GPU guide)
```
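
Seeing the card in `lspci` is not enough on its own: the kernel driver bound to it matters too. With passthrough configured, the Proxmox host would typically show `vfio-pci` for the GPU, while the VM should show `nvidia`. The sketch below pulls that out of `lspci -nnk` output; `nvidia_drivers` is a hypothetical helper name.

```bash
# Read `lspci -nnk` output on stdin and print "<bus-id> <driver>" for
# each NVIDIA device that has a kernel driver bound to it.
nvidia_drivers() {
  awk '
    /^[0-9a-f]/ { dev = ($0 ~ /NVIDIA/) ? $1 : "" }   # device header line
    dev != "" && /Kernel driver in use:/ { print dev, $NF }
  '
}

# Usage (run on the Proxmox host and again inside the VM):
#   lspci -nnk | nvidia_drivers
```

No output means either that no NVIDIA device is visible or that no driver is bound to it.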
## Getting More Help
If your issue isn't listed here:
1. **Check service-specific logs**:
   ```bash
   cd ~/homelab/compose/path/to/service
   docker compose logs --tail=200
   ```
2. **Search container logs for errors**:
   ```bash
   docker compose logs | grep -i error
   docker compose logs | grep -i fail
   ```
3. **Check FAQ**: See [FAQ](./faq.md)
4. **Debugging Guide**: See [Debugging Guide](./debugging.md)
5. **Service Documentation**: Check service's official documentation
---
**Most issues can be solved by checking logs and environment variables!**