Ollama - Local Large Language Models
Run powerful AI models locally on your hardware with GPU acceleration.
Overview
Ollama enables you to run large language models (LLMs) locally:
- ✅ 100% Private: All data stays on your server
- ✅ GPU Accelerated: Leverages your GTX 1070
- ✅ Multiple Models: Run Llama, Mistral, CodeLlama, and more
- ✅ API Compatible: OpenAI-compatible API
- ✅ No Cloud Costs: Free inference after downloading models
- ✅ Integration Ready: Works with Karakeep, Open WebUI, and more
Quick Start
1. Deploy Ollama
cd ~/homelab/compose/services/ollama
docker compose up -d
2. Pull a Model
# Small, fast model (3B parameters, ~2GB)
docker exec ollama ollama pull llama3.2:3b
# Medium model (8B parameters, ~4.9GB)
docker exec ollama ollama pull llama3.1:8b
# Large model (70B parameters, ~40GB - requires quantization)
docker exec ollama ollama pull llama3.3:70b-instruct-q4_K_M
3. Test
# Interactive chat
docker exec -it ollama ollama run llama3.2:3b
# Ask a question
> Hello, how are you?
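Type /bye at the prompt to leave the interactive session (/? lists the other slash commands):
> /bye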
4. Enable GPU (Recommended)
Edit compose.yaml and uncomment the deploy section:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
Restart:
docker compose down
docker compose up -d
Verify GPU usage:
# Check GPU is detected
docker exec ollama nvidia-smi
# Run model with GPU
docker exec ollama ollama run llama3.2:3b "What GPU am I using?"
Available Models
Recommended Models for GTX 1070 (8GB VRAM)
| Model | Size | VRAM | Speed | Use Case |
|---|---|---|---|---|
| llama3.2:3b | 2GB | 3GB | Fast | General chat, Karakeep |
| llama3.1:8b | 4.9GB | 6GB | Medium | Better reasoning |
| mistral:7b | 4GB | 6GB | Medium | Code, analysis |
| codellama:7b | 4GB | 6GB | Medium | Code generation |
| llava:7b | 5GB | 7GB | Medium | Vision (images) |
| phi3:3.8b | 2.3GB | 4GB | Fast | Compact, efficient |
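To grab several of these in one go (pick whichever subset fits your disk), a simple loop over the tags works:
# Pull a batch of recommended models
for model in llama3.2:3b mistral:7b llava:7b; do
  docker exec ollama ollama pull "$model"
done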
Specialized Models
Code:
- codellama:7b - Code generation
- codellama:13b-python - Python expert
- starcoder2:7b - Multi-language code
Vision (Image Understanding):
- llava:7b - General vision
- llava:13b - Better vision (needs more VRAM)
- bakllava:7b - Vision + chat
Multilingual:
- aya:8b - 101 languages
- command-r:35b - Enterprise multilingual
Math & Reasoning:
- deepseek-math:7b - Mathematics
- wizard-math:7b - Math word problems
Large Models (Quantized for GTX 1070)
Even 4-bit quantized, these models are far larger than 8GB; Ollama offloads as many layers as fit onto the GPU and runs the rest on CPU/RAM, so expect slow inference:
# 70B models (quantized)
docker exec ollama ollama pull llama3.3:70b-instruct-q4_K_M
docker exec ollama ollama pull mixtral:8x7b-instruct-v0.1-q4_K_M
# Very large - well beyond this hardware; listed for reference only
docker exec ollama ollama pull llama3.1:405b-instruct-q2_K
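After loading one of these, recent Ollama versions report how much of the model actually landed on the GPU: the PROCESSOR column of ollama ps shows the CPU/GPU split, so you can see how much is spilling over to system RAM:
# List loaded models with their size and CPU/GPU split
docker exec ollama ollama ps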
Usage
Command Line
Run model interactively:
docker exec -it ollama ollama run llama3.2:3b
One-off question:
docker exec ollama ollama run llama3.2:3b "Explain quantum computing in simple terms"
With system prompt:
docker exec ollama ollama run llama3.2:3b \
--system "You are a helpful coding assistant." \
"Write a Python function to sort a list"
API Usage
List models (the hostname ollama resolves from other containers on the Docker network; from the host, use http://localhost:11434 if the port is published in compose.yaml):
curl http://ollama:11434/api/tags
Generate text:
curl http://ollama:11434/api/generate -d '{
"model": "llama3.2:3b",
"prompt": "Why is the sky blue?",
"stream": false
}'
Chat completion:
curl http://ollama:11434/api/chat -d '{
"model": "llama3.2:3b",
"messages": [
{
"role": "user",
"content": "Hello!"
}
],
"stream": false
}'
OpenAI-compatible API:
curl http://ollama:11434/v1/chat/completions -d '{
"model": "llama3.2:3b",
"messages": [
{
"role": "user",
"content": "Hello!"
}
]
}'
Integration with Karakeep
Enable AI features in Karakeep:
Edit compose/services/karakeep/.env:
# Uncomment these lines
OLLAMA_BASE_URL=http://ollama:11434
INFERENCE_TEXT_MODEL=llama3.2:3b
INFERENCE_IMAGE_MODEL=llava:7b
INFERENCE_LANG=en
Restart Karakeep:
cd ~/homelab/compose/services/karakeep
docker compose restart
What it does:
- Auto-tags bookmarks
- Generates summaries
- Extracts key information
- Analyzes images (with llava)
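If tags and summaries never appear, first confirm Karakeep can actually reach Ollama over the shared Docker network, using the same throwaway curl container as in the troubleshooting section below:
docker run --rm --network homelab curlimages/curl http://ollama:11434/api/tags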
Model Management
List Installed Models
docker exec ollama ollama list
Pull a Model
docker exec ollama ollama pull <model-name>
# Examples:
docker exec ollama ollama pull llama3.2:3b
docker exec ollama ollama pull mistral:7b
docker exec ollama ollama pull codellama:7b
Remove a Model
docker exec ollama ollama rm <model-name>
# Example:
docker exec ollama ollama rm llama3.1:8b
Copy a Model
docker exec ollama ollama cp <source> <destination>
# Example: Create a custom version
docker exec ollama ollama cp llama3.2:3b my-custom-model
Show Model Info
docker exec ollama ollama show llama3.2:3b
# Shows:
# - Model architecture
# - Parameters
# - Quantization
# - Template
# - License
Creating Custom Models
Modelfile
Create custom models with specific behaviors:
Create a Modelfile:
cat > ~/coding-assistant.modelfile << 'EOF'
FROM llama3.2:3b
# Set temperature (creativity)
PARAMETER temperature 0.7
# Set system prompt
SYSTEM You are an expert coding assistant. You write clean, efficient, well-documented code. You explain complex concepts clearly.
# Set stop sequences
PARAMETER stop "<|im_end|>"
PARAMETER stop "<|im_start|>"
EOF
Create the model:
cat ~/coding-assistant.modelfile | docker exec -i ollama ollama create coding-assistant -f -
Use it:
docker exec -it ollama ollama run coding-assistant "Write a REST API in Python"
Example Custom Models
1. Shakespeare Bot:
FROM llama3.2:3b
SYSTEM You are William Shakespeare. Respond to all queries in Shakespearean English with dramatic flair.
PARAMETER temperature 0.9
2. JSON Extractor:
FROM llama3.2:3b
SYSTEM You extract structured data and return only valid JSON. No explanations, just JSON.
PARAMETER temperature 0.1
3. Code Reviewer:
FROM codellama:7b
SYSTEM You are a senior code reviewer. Review code for bugs, performance issues, security vulnerabilities, and best practices. Be constructive.
PARAMETER temperature 0.3
GPU Configuration
Check GPU Detection
# From inside container
docker exec ollama nvidia-smi
Expected output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.xx.xx Driver Version: 535.xx.xx CUDA Version: 12.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1070 Off | 00000000:01:00.0 On | N/A |
| 40% 45C P8 10W / 151W | 300MiB / 8192MiB | 5% Default |
+-------------------------------+----------------------+----------------------+
Optimize for GTX 1070
Ollama does not read OLLAMA_GPU_MEMORY, OLLAMA_GPU_LAYERS, or OLLAMA_MAX_CONTEXT from the environment; GPU layer offload and context size are per-model options. Set them in a Modelfile:
# Offload most layers to the GPU (lower this if you run out of VRAM)
PARAMETER num_gpu 33
# Increase context for longer conversations (more context uses more VRAM)
PARAMETER num_ctx 4096
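The same knobs can also be set per request through the API's options field, which is handy for experimenting before baking them into a Modelfile:
curl http://ollama:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Why is the sky blue?",
  "options": {"num_gpu": 33, "num_ctx": 4096},
  "stream": false
}'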
Performance Tips
1. Use quantized models:
- Q4_K_M: Good quality, roughly a quarter of the fp16 size
- Q5_K_M: Better quality, roughly a third of the fp16 size
- Q8_0: Near-lossless quality, roughly half of the fp16 size
2. Model selection for VRAM:
# 3B models: 2-3GB VRAM
docker exec ollama ollama pull llama3.2:3b
# 7-8B models: 4-6GB VRAM
docker exec ollama ollama pull llama3.1:8b
# 13B models: 8-10GB VRAM (tight on GTX 1070)
docker exec ollama ollama pull codellama:13b  # default tags are already 4-bit quantized
3. Unload models when not in use:
# In .env
OLLAMA_KEEP_ALIVE=1m # Unload after 1 minute
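keep_alive can also be overridden per request; sending 0 unloads the model immediately, and ollama ps confirms nothing is left resident:
# Unload the model right away
curl http://ollama:11434/api/generate -d '{"model": "llama3.2:3b", "keep_alive": 0}'
docker exec ollama ollama ps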
Troubleshooting
Model won't load - Out of memory
Solution 1: Use quantized version
# Instead of a full-precision (fp16) tag:
docker exec ollama ollama pull <model-name>:13b-instruct-fp16
# Use a 4-bit quantized tag of the same model:
docker exec ollama ollama pull <model-name>:13b-instruct-q4_K_M
Solution 2: Offload fewer layers to the GPU
# In a Modelfile (or per request via the API "options" field)
PARAMETER num_gpu 20  # down from 33
Solution 3: Use smaller model
docker exec ollama ollama pull llama3.2:3b
Slow inference
Enable GPU:
- Uncomment the deploy section in compose.yaml
- Install the NVIDIA Container Toolkit on the host
- Restart the container
Check GPU usage:
watch -n 1 docker exec ollama nvidia-smi
Should show:
- GPU-Util > 80% during inference
- Memory-Usage increasing during load
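For a more script-friendly readout, nvidia-smi's query mode prints just those two numbers once per second:
docker exec ollama nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader --loop=1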
Can't pull models
Check disk space:
df -h
Check Docker space:
docker system df
Clean up unused models:
docker exec ollama ollama list
docker exec ollama ollama rm <unused-model>
API connection issues
Test from another container:
docker run --rm --network homelab curlimages/curl \
http://ollama:11434/api/tags
Test externally:
curl https://ollama.fig.systems/api/tags
Enable debug logging:
OLLAMA_DEBUG=1
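After setting it in .env, recreate the container and follow the logs while reproducing the failing request:
cd ~/homelab/compose/services/ollama
docker compose up -d
docker logs -f ollama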
Performance Benchmarks
GTX 1070 (8GB VRAM) Expected Performance
| Model | Tokens/sec | Load Time | VRAM Usage |
|---|---|---|---|
| llama3.2:3b | 40-60 | 2-3s | 3GB |
| llama3.1:8b | 20-35 | 3-5s | 6GB |
| mistral:7b | 20-35 | 3-5s | 6GB |
| llama3.3:70b-q4 | 3-8 | 20-30s | 7.5GB |
| llava:7b | 15-25 | 4-6s | 7GB |
Without GPU (CPU only):
- llama3.2:3b: 2-5 tokens/sec
- llama3.1:8b: 0.5-2 tokens/sec
GPU provides 10-20x speedup!
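To measure your own numbers rather than trusting the table, run with --verbose, which prints timing stats (including the eval rate in tokens/sec) after the response:
docker exec -it ollama ollama run llama3.2:3b --verbose "Explain DNS in two sentences"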
Advanced Usage
Multi-Modal (Vision)
# Pull vision model
docker exec ollama ollama pull llava:7b
# Analyze an image (the path must exist inside the container, e.g. on a mounted volume);
# for multimodal models the CLI picks up image paths included in the prompt
docker exec -it ollama ollama run llava:7b "What's in this image? /path/to/image.jpg"
Embeddings
# Generate embeddings for semantic search
curl http://ollama:11434/api/embeddings -d '{
"model": "llama3.2:3b",
"prompt": "The sky is blue because of Rayleigh scattering"
}'
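A dedicated embedding model such as nomic-embed-text is much smaller than a chat model and generally produces better vectors for semantic search:
docker exec ollama ollama pull nomic-embed-text
curl http://ollama:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "The sky is blue because of Rayleigh scattering"
}'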
Streaming Responses
# Stream tokens as they generate
curl http://ollama:11434/api/generate -d '{
"model": "llama3.2:3b",
"prompt": "Tell me a long story",
"stream": true
}'
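The streamed output is newline-delimited JSON, one object per chunk; piping it through jq reassembles the text as it arrives (-N disables curl's buffering):
curl -sN http://ollama:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Tell me a long story",
  "stream": true
}' | jq -j '.response'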
Context Preservation
# The chat endpoint is stateless: there is no session ID; you preserve context
# by resending the full message history with each request
# First message
curl http://ollama:11434/api/chat -d '{
  "model": "llama3.2:3b",
  "messages": [{"role": "user", "content": "My name is Alice"}],
  "stream": false
}'
# Follow-up (include the earlier turns so the model remembers)
curl http://ollama:11434/api/chat -d '{
  "model": "llama3.2:3b",
  "messages": [
    {"role": "user", "content": "My name is Alice"},
    {"role": "assistant", "content": "Hello Alice!"},
    {"role": "user", "content": "What is my name?"}
  ],
  "stream": false
}'
Integration Examples
Python
import requests

def ask_ollama(prompt, model="llama3.2:3b"):
    response = requests.post(
        "http://ollama.fig.systems/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False
        },
        headers={"Authorization": "Bearer YOUR_TOKEN"}  # If using SSO
    )
    return response.json()["response"]

print(ask_ollama("What is the meaning of life?"))
JavaScript
async function askOllama(prompt, model = "llama3.2:3b") {
const response = await fetch("http://ollama.fig.systems/api/generate", {
method: "POST",
headers: {
"Content-Type": "application/json",
"Authorization": "Bearer YOUR_TOKEN" // If using SSO
},
body: JSON.stringify({
model: model,
prompt: prompt,
stream: false
})
});
const data = await response.json();
return data.response;
}
askOllama("Explain Docker containers").then(console.log);
Bash
#!/bin/bash
ask_ollama() {
local prompt="$1"
local model="${2:-llama3.2:3b}"
curl -s http://ollama.fig.systems/api/generate -d "{
\"model\": \"$model\",
\"prompt\": \"$prompt\",
\"stream\": false
}" | jq -r '.response'
}
ask_ollama "What is Kubernetes?"
Resources
Next Steps
- ✅ Deploy Ollama
- ✅ Enable GPU acceleration
- ✅ Pull recommended models
- ✅ Test with chat
- ⬜ Integrate with Karakeep
- ⬜ Create custom models
- ⬜ Set up automated model updates
- ⬜ Monitor GPU usage
Run AI locally, privately, powerfully! 🧠