Ollama - Local Large Language Models
Run powerful AI models locally on your hardware with GPU acceleration.
Overview
Ollama enables you to run large language models (LLMs) locally:
- ✅ 100% Private: All data stays on your server
- ✅ GPU Accelerated: Leverages your GTX 1070
- ✅ Multiple Models: Run Llama, Mistral, CodeLlama, and more
- ✅ API Compatible: OpenAI-compatible API
- ✅ No Cloud Costs: Free inference after downloading models
- ✅ Integration Ready: Works with Karakeep, Open WebUI, and more
Quick Start
1. Deploy Ollama
cd ~/homelab/compose/services/ollama
docker compose up -d
2. Pull a Model
# Small, fast model (3B parameters, ~2GB)
docker exec ollama ollama pull llama3.2:3b
# Medium model (8B parameters, ~4.9GB)
docker exec ollama ollama pull llama3.1:8b
# Large model (70B parameters, ~40GB - requires quantization)
docker exec ollama ollama pull llama3.3:70b-instruct-q4_K_M
3. Test
# Interactive chat
docker exec -it ollama ollama run llama3.2:3b
# Ask a question
> Hello, how are you?
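Type /bye at the prompt to leave the interactive session (/? lists the other slash commands):
> /bye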
4. Enable GPU (Recommended)
Edit compose.yaml and uncomment the deploy section:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
Restart:
docker compose down
docker compose up -d
Verify GPU usage:
# Check GPU is detected
docker exec ollama nvidia-smi
# Run model with GPU
docker exec ollama ollama run llama3.2:3b "What GPU am I using?"
Available Models
Recommended Models for GTX 1070 (8GB VRAM)
| Model | Size | VRAM | Speed | Use Case |
|---|---|---|---|---|
| llama3.2:3b | 2GB | 3GB | Fast | General chat, Karakeep |
| llama3.1:8b | 4.9GB | 6GB | Medium | Better reasoning |
| mistral:7b | 4GB | 6GB | Medium | Code, analysis |
| codellama:7b | 4GB | 6GB | Medium | Code generation |
| llava:7b | 5GB | 7GB | Medium | Vision (images) |
| phi3:3.8b | 2.3GB | 4GB | Fast | Compact, efficient |
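To grab several of these in one go (pick whichever subset fits your disk), a simple loop over the tags works:
# Pull a batch of recommended models
for model in llama3.2:3b mistral:7b llava:7b; do
  docker exec ollama ollama pull "$model"
done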
Specialized Models
Code:
- codellama:7b - Code generation
- codellama:13b-python - Python expert
- starcoder2:7b - Multi-language code
Vision (Image Understanding):
- llava:7b - General vision
- llava:13b - Better vision (needs more VRAM)
- bakllava:7b - Vision + chat
Multilingual:
- aya:8b - 101 languages
- command-r:35b - Enterprise multilingual
Math & Reasoning:
- deepseek-math:7b - Mathematics
- wizard-math:7b - Math word problems
Large Models (Quantized for GTX 1070)
Even 4-bit quantized, these models are far larger than 8GB; Ollama offloads as many layers as fit onto the GPU and runs the rest on CPU/RAM, so expect slow inference:
# 70B models (quantized)
docker exec ollama ollama pull llama3.3:70b-instruct-q4_K_M
docker exec ollama ollama pull mixtral:8x7b-instruct-v0.1-q4_K_M
# Very large - well beyond this hardware; listed for reference only
docker exec ollama ollama pull llama3.1:405b-instruct-q2_K
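After loading one of these, recent Ollama versions report how much of the model actually landed on the GPU: the PROCESSOR column of ollama ps shows the CPU/GPU split, so you can see how much is spilling over to system RAM:
# List loaded models with their size and CPU/GPU split
docker exec ollama ollama ps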
Usage
Command Line
Run model interactively:
docker exec -it ollama ollama run llama3.2:3b
One-off question:
docker exec ollama ollama run llama3.2:3b "Explain quantum computing in simple terms"
With system prompt:
docker exec ollama ollama run llama3.2:3b \
--system "You are a helpful coding assistant." \
"Write a Python function to sort a list"
API Usage
List models (the hostname ollama resolves from other containers on the Docker network; from the host, use http://localhost:11434 if the port is published in compose.yaml):
curl http://ollama:11434/api/tags
Generate text:
curl http://ollama:11434/api/generate -d '{
"model": "llama3.2:3b",
"prompt": "Why is the sky blue?",
"stream": false
}'
Chat completion:
curl http://ollama:11434/api/chat -d '{
"model": "llama3.2:3b",
"messages": [
{
"role": "user",
"content": "Hello!"
}
],
"stream": false
}'
OpenAI-compatible API:
curl http://ollama:11434/v1/chat/completions -d '{
"model": "llama3.2:3b",
"messages": [
{
"role": "user",
"content": "Hello!"
}
]
}'
Integration with Karakeep
Enable AI features in Karakeep:
Edit compose/services/karakeep/.env:
# Uncomment these lines
OLLAMA_BASE_URL=http://ollama:11434
INFERENCE_TEXT_MODEL=llama3.2:3b
INFERENCE_IMAGE_MODEL=llava:7b
INFERENCE_LANG=en
Restart Karakeep:
cd ~/homelab/compose/services/karakeep
docker compose restart
What it does:
- Auto-tags bookmarks
- Generates summaries
- Extracts key information
- Analyzes images (with llava)
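If tags and summaries never appear, first confirm Karakeep can actually reach Ollama over the shared Docker network, using the same throwaway curl container as in the troubleshooting section below:
docker run --rm --network homelab curlimages/curl http://ollama:11434/api/tags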
Model Management
List Installed Models
docker exec ollama ollama list
Pull a Model
docker exec ollama ollama pull <model-name>
# Examples:
docker exec ollama ollama pull llama3.2:3b
docker exec ollama ollama pull mistral:7b
docker exec ollama ollama pull codellama:7b
Remove a Model
docker exec ollama ollama rm <model-name>
# Example:
docker exec ollama ollama rm llama3.1:8b
Copy a Model
docker exec ollama ollama cp <source> <destination>
# Example: Create a custom version
docker exec ollama ollama cp llama3.2:3b my-custom-model
Show Model Info
docker exec ollama ollama show llama3.2:3b
# Shows:
# - Model architecture
# - Parameters
# - Quantization
# - Template
# - License
Creating Custom Models
Modelfile
Create custom models with specific behaviors:
Create a Modelfile:
cat > ~/coding-assistant.modelfile << 'EOF'
FROM llama3.2:3b
# Set temperature (creativity)
PARAMETER temperature 0.7
# Set system prompt
SYSTEM You are an expert coding assistant. You write clean, efficient, well-documented code. You explain complex concepts clearly.
# Set stop sequences
PARAMETER stop "<|im_end|>"
PARAMETER stop "<|im_start|>"
EOF
Create the model:
cat ~/coding-assistant.modelfile | docker exec -i ollama ollama create coding-assistant -f -
Use it:
docker exec -it ollama ollama run coding-assistant "Write a REST API in Python"
Example Custom Models
1. Shakespeare Bot:
FROM llama3.2:3b
SYSTEM You are William Shakespeare. Respond to all queries in Shakespearean English with dramatic flair.
PARAMETER temperature 0.9
2. JSON Extractor:
FROM llama3.2:3b
SYSTEM You extract structured data and return only valid JSON. No explanations, just JSON.
PARAMETER temperature 0.1
3. Code Reviewer:
FROM codellama:7b
SYSTEM You are a senior code reviewer. Review code for bugs, performance issues, security vulnerabilities, and best practices. Be constructive.
PARAMETER temperature 0.3
GPU Configuration
Check GPU Detection
# From inside container
docker exec ollama nvidia-smi
Expected output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.xx.xx Driver Version: 535.xx.xx CUDA Version: 12.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1070 Off | 00000000:01:00.0 On | N/A |
| 40% 45C P8 10W / 151W | 300MiB / 8192MiB | 5% Default |
+-------------------------------+----------------------+----------------------+
Optimize for GTX 1070
Ollama does not read OLLAMA_GPU_MEMORY, OLLAMA_GPU_LAYERS, or OLLAMA_MAX_CONTEXT from the environment; GPU layer offload and context size are per-model options. Set them in a Modelfile:
# Offload most layers to the GPU (lower this if you run out of VRAM)
PARAMETER num_gpu 33
# Increase context for longer conversations (more context uses more VRAM)
PARAMETER num_ctx 4096
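The same knobs can also be set per request through the API's options field, which is handy for experimenting before baking them into a Modelfile:
curl http://ollama:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Why is the sky blue?",
  "options": {"num_gpu": 33, "num_ctx": 4096},
  "stream": false
}'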
Performance Tips
1. Use quantized models:
- Q4_K_M: Good quality, roughly a quarter of the fp16 size
- Q5_K_M: Better quality, roughly a third of the fp16 size
- Q8_0: Near-lossless quality, roughly half of the fp16 size
2. Model selection for VRAM:
# 3B models: 2-3GB VRAM
docker exec ollama ollama pull llama3.2:3b
# 7-8B models: 4-6GB VRAM
docker exec ollama ollama pull llama3.1:8b
# 13B models: 8-10GB VRAM (tight on GTX 1070)
docker exec ollama ollama pull codellama:13b  # default tags are already 4-bit quantized
3. Unload models when not in use:
# In .env
OLLAMA_KEEP_ALIVE=1m # Unload after 1 minute
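keep_alive can also be overridden per request; sending 0 unloads the model immediately, and ollama ps confirms nothing is left resident:
# Unload the model right away
curl http://ollama:11434/api/generate -d '{"model": "llama3.2:3b", "keep_alive": 0}'
docker exec ollama ollama ps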
Troubleshooting
Model won't load - Out of memory
Solution 1: Use quantized version
# Instead of a full-precision (fp16) tag:
docker exec ollama ollama pull <model-name>:13b-instruct-fp16
# Use a 4-bit quantized tag of the same model:
docker exec ollama ollama pull <model-name>:13b-instruct-q4_K_M
Solution 2: Offload fewer layers to the GPU
# In a Modelfile (or per request via the API "options" field)
PARAMETER num_gpu 20  # down from 33
Solution 3: Use smaller model
docker exec ollama ollama pull llama3.2:3b
Slow inference
Enable GPU:
- Uncomment the deploy section in compose.yaml
- Install the NVIDIA Container Toolkit on the host
- Restart the container
Check GPU usage:
watch -n 1 docker exec ollama nvidia-smi
Should show:
- GPU-Util > 80% during inference
- Memory-Usage increasing during load
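For a more script-friendly readout, nvidia-smi's query mode prints just those two numbers once per second:
docker exec ollama nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader --loop=1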
Can't pull models
Check disk space:
df -h
Check Docker space:
docker system df
Clean up unused models:
docker exec ollama ollama list
docker exec ollama ollama rm <unused-model>
API connection issues
Test from another container:
docker run --rm --network homelab curlimages/curl \
http://ollama:11434/api/tags
Test externally:
curl https://ollama.fig.systems/api/tags
Enable debug logging:
OLLAMA_DEBUG=1
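After setting it in .env, recreate the container and follow the logs while reproducing the failing request:
cd ~/homelab/compose/services/ollama
docker compose up -d
docker logs -f ollama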
Performance Benchmarks
GTX 1070 (8GB VRAM) Expected Performance
| Model | Tokens/sec | Load Time | VRAM Usage |
|---|---|---|---|
| llama3.2:3b | 40-60 | 2-3s | 3GB |
| llama3.1:8b | 20-35 | 3-5s | 6GB |
| mistral:7b | 20-35 | 3-5s | 6GB |
| llama3.3:70b-q4 | 3-8 | 20-30s | 7.5GB |
| llava:7b | 15-25 | 4-6s | 7GB |
Without GPU (CPU only):
- llama3.2:3b: 2-5 tokens/sec
- llama3.1:8b: 0.5-2 tokens/sec
GPU provides 10-20x speedup!
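To measure your own numbers rather than trusting the table, run with --verbose, which prints timing stats (including the eval rate in tokens/sec) after the response:
docker exec -it ollama ollama run llama3.2:3b --verbose "Explain DNS in two sentences"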
Advanced Usage
Multi-Modal (Vision)
# Pull vision model
docker exec ollama ollama pull llava:7b
# Analyze an image (the path must exist inside the container, e.g. on a mounted volume);
# for multimodal models the CLI picks up image paths included in the prompt
docker exec -it ollama ollama run llava:7b "What's in this image? /path/to/image.jpg"
Embeddings
# Generate embeddings for semantic search
curl http://ollama:11434/api/embeddings -d '{
"model": "llama3.2:3b",
"prompt": "The sky is blue because of Rayleigh scattering"
}'
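A dedicated embedding model such as nomic-embed-text is much smaller than a chat model and generally produces better vectors for semantic search:
docker exec ollama ollama pull nomic-embed-text
curl http://ollama:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "The sky is blue because of Rayleigh scattering"
}'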
Streaming Responses
# Stream tokens as they generate
curl http://ollama:11434/api/generate -d '{
"model": "llama3.2:3b",
"prompt": "Tell me a long story",
"stream": true
}'
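The streamed output is newline-delimited JSON, one object per chunk; piping it through jq reassembles the text as it arrives (-N disables curl's buffering):
curl -sN http://ollama:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Tell me a long story",
  "stream": true
}' | jq -j '.response'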
Context Preservation
# The chat endpoint is stateless: there is no session ID; you preserve context
# by resending the full message history with each request
# First message
curl http://ollama:11434/api/chat -d '{
  "model": "llama3.2:3b",
  "messages": [{"role": "user", "content": "My name is Alice"}],
  "stream": false
}'
# Follow-up (include the earlier turns so the model remembers)
curl http://ollama:11434/api/chat -d '{
  "model": "llama3.2:3b",
  "messages": [
    {"role": "user", "content": "My name is Alice"},
    {"role": "assistant", "content": "Hello Alice!"},
    {"role": "user", "content": "What is my name?"}
  ],
  "stream": false
}'
Integration Examples
Python
import requests

def ask_ollama(prompt, model="llama3.2:3b"):
    response = requests.post(
        "http://ollama.fig.systems/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False
        },
        headers={"Authorization": "Bearer YOUR_TOKEN"}  # If using SSO
    )
    return response.json()["response"]

print(ask_ollama("What is the meaning of life?"))
JavaScript
async function askOllama(prompt, model = "llama3.2:3b") {
const response = await fetch("http://ollama.fig.systems/api/generate", {
method: "POST",
headers: {
"Content-Type": "application/json",
"Authorization": "Bearer YOUR_TOKEN" // If using SSO
},
body: JSON.stringify({
model: model,
prompt: prompt,
stream: false
})
});
const data = await response.json();
return data.response;
}
askOllama("Explain Docker containers").then(console.log);
Bash
#!/bin/bash
ask_ollama() {
local prompt="$1"
local model="${2:-llama3.2:3b}"
curl -s http://ollama.fig.systems/api/generate -d "{
\"model\": \"$model\",
\"prompt\": \"$prompt\",
\"stream\": false
}" | jq -r '.response'
}
ask_ollama "What is Kubernetes?"
Resources
Next Steps
- ✅ Deploy Ollama
- ✅ Enable GPU acceleration
- ✅ Pull recommended models
- ✅ Test with chat
- ⬜ Integrate with Karakeep
- ⬜ Create custom models
- ⬜ Set up automated model updates
- ⬜ Monitor GPU usage
Run AI locally, privately, powerfully! 🧠