
Self-Hosting AI with Ollama and Open WebUI: Run LLMs Locally

Infrastructure · 2026-02-09 · 6 min read · ollama, open-webui, llm, ai, machine-learning
By the Selfhosted Guides Editorial Team, self-hosting practitioners covering open source software, home lab infrastructure, and data sovereignty.

Running large language models locally used to require deep knowledge of Python environments, model quantization, and GPU drivers. Today, two tools make it remarkably simple: Ollama handles downloading, running, and serving LLMs through a clean CLI, and Open WebUI provides a polished chat interface on top of it. Together, they give you a private, self-hosted alternative to ChatGPT that runs entirely on your own hardware.


No API keys, no usage fees, no data leaving your network.


What Ollama Does

Ollama is a lightweight runtime for large language models. Think of it as Docker for LLMs -- you pull a model by name, and Ollama handles downloading the weights, loading them into memory, and serving an OpenAI-compatible API.

# Install and run a model in two commands
ollama pull llama3.1:8b
ollama run llama3.1:8b

Under the hood, Ollama uses llama.cpp for inference, which means it supports GGUF-quantized models and can run on both CPU and GPU. It exposes a REST API on port 11434, making it easy for other tools to interact with it.
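For a quick test from the host (assuming the default port and a model you've already pulled), you can talk to that API directly with curl; the prompts here are just placeholders:

# Native Ollama endpoint
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Explain GGUF quantization in one paragraph.",
  "stream": false
}'

# OpenAI-compatible endpoint, useful for tools that expect the OpenAI API shape
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1:8b", "messages": [{"role": "user", "content": "Hello!"}]}'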

What Open WebUI Provides

Open WebUI (formerly Ollama WebUI) is a self-hosted web interface that connects to Ollama's API. It gives you a ChatGPT-like experience in your browser, with features that go well beyond a basic chat box: multiple user accounts with admin controls, persistent conversation history, document uploads for retrieval-augmented generation (RAG), and the ability to switch between any model Ollama has pulled.

Architecture: your browser talks to Open WebUI (chat interface, port 3000), which calls the Ollama REST API (port 11434); Ollama runs the model weights on your GPU or CPU. All processing happens locally, and no data leaves your network.

Docker Compose Setup

The simplest way to run both together is a single Docker Compose file.

CPU-Only Setup

# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    volumes:
      - open_webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_data:
  open_webui_data:
Start the stack:

docker compose up -d

Open http://your-server:3000, create an account (the first account becomes admin), and you're ready to go. You still need to pull a model -- either from the Open WebUI interface or via the CLI:

docker exec -it ollama ollama pull llama3.1:8b
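If the web interface can't reach Ollama or a model pull stalls, the container logs usually show why:

# Follow logs for both services
docker compose logs -f ollama open-webui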

NVIDIA GPU Setup

For GPU acceleration with an NVIDIA card, install the NVIDIA Container Toolkit, then modify the Ollama service:

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped
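The toolkit install itself is a few host-side commands. Roughly, on an Ubuntu or Debian host with NVIDIA's apt repository already configured (see NVIDIA's documentation for the repository setup on your distribution):

# Install the toolkit, register it with Docker, and restart the daemon
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker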

Verify GPU access after starting:

docker exec -it ollama ollama run llama3.1:8b "Hello, are you using GPU?"
# Check nvidia-smi to confirm GPU utilization
nvidia-smi


GPU vs CPU: What to Expect

The performance difference between GPU and CPU inference is dramatic. Here are rough expectations for generating tokens with Llama 3.1 8B:

| Hardware | Tokens/sec | Experience |
| --- | --- | --- |
| Modern CPU (16 cores, AVX2) | 5-15 tok/s | Usable but noticeably slow |
| NVIDIA RTX 3060 (12 GB VRAM) | 40-60 tok/s | Smooth, real-time feel |
| NVIDIA RTX 3090 (24 GB VRAM) | 60-90 tok/s | Fast, comfortable |
| NVIDIA RTX 4090 (24 GB VRAM) | 90-130 tok/s | Near-instant responses |
| Apple M2 Pro (16 GB unified) | 30-50 tok/s | Good experience |

CPU inference is viable for small models and occasional use. If you plan to use LLMs regularly, a GPU with at least 8 GB of VRAM makes a significant quality-of-life difference.
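To see where your own hardware lands, ollama run accepts a --verbose flag that prints timing statistics, including the eval rate in tokens per second, after each response:

# Print token throughput stats after the response
docker exec -it ollama ollama run llama3.1:8b --verbose "Write a haiku about GPUs."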

Recommended Models

Not all models are equal, and bigger is not always better. Here's a practical guide to which models work well for self-hosting:

Best Starting Points

| Model | Size | VRAM Needed | Good For |
| --- | --- | --- | --- |
| Llama 3.1 8B | ~4.7 GB | 6 GB | General purpose, coding, reasoning |
| Mistral 7B | ~4.1 GB | 6 GB | Fast general use, good instruction following |
| Gemma 2 9B | ~5.4 GB | 8 GB | Strong reasoning, Google's quality |
| Phi-3 Mini 3.8B | ~2.2 GB | 4 GB | Surprisingly capable for its size |
| CodeLlama 7B | ~3.8 GB | 6 GB | Code generation and explanation |

Larger Models (If You Have the Hardware)

| Model | Size | VRAM Needed | Good For |
| --- | --- | --- | --- |
| Llama 3.1 70B | ~40 GB | 48 GB (or CPU offload) | Near-GPT-4 quality for many tasks |
| Mixtral 8x7B | ~26 GB | 32 GB | Excellent quality-to-speed ratio |
| Qwen 2.5 32B | ~18 GB | 24 GB | Strong multilingual and coding |

Pull any of these with:

docker exec -it ollama ollama pull llama3.1:8b
docker exec -it ollama ollama pull mistral:7b
docker exec -it ollama ollama pull gemma2:9b
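To see what's already downloaded and how much disk space each model uses:

# List locally available models and their sizes
docker exec -it ollama ollama list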

Hardware Requirements

Minimum Viable Setup

Any reasonably modern multi-core CPU with enough free RAM to hold the model will get you started: 7B-class models run on CPU alone, slowly but usably, and small models like Phi-3 Mini are fine on modest hardware.

Comfortable Setup

A dedicated NVIDIA GPU with 8 GB or more of VRAM (or an Apple Silicon Mac with 16 GB of unified memory) runs the 7B-9B models above at interactive speeds and makes daily use pleasant.

Memory Rule of Thumb

For GGUF quantized models (which Ollama uses by default), expect roughly:

- 0.5-0.6 GB of memory per billion parameters at the default 4-bit quantization (about 4-5 GB for a 7B-8B model, about 40 GB for a 70B model)
- another 1-2 GB of headroom for the context window and runtime overhead

If a model doesn't fit entirely in VRAM, Ollama will split it between GPU and CPU memory, which works but reduces speed.
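You can check how a loaded model was placed with ollama ps, which shows whether it is running entirely on the GPU or partially offloaded to CPU memory:

# Show loaded models and the GPU/CPU split for each
docker exec -it ollama ollama ps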

Practical Configuration Tips

Customizing Model Behavior

Create a custom model with a system prompt:

docker exec -it ollama bash -c 'cat > /tmp/Modelfile <<EOF
FROM llama3.1:8b
SYSTEM "You are a helpful technical assistant. Be concise and accurate."
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF
ollama create my-assistant -f /tmp/Modelfile'
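Once created, the custom model behaves like any other model name:

# Chat with the customized assistant
docker exec -it ollama ollama run my-assistant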

Increasing Context Length

By default, most models run with a 2048-token context window. For longer conversations:

# Start an interactive session, then raise the context window from the REPL
docker exec -it ollama ollama run llama3.1:8b
>>> /set parameter num_ctx 8192

More context uses more memory. A 7B model with 8192 context needs roughly 2 GB more RAM than the default.
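If you drive Ollama from scripts instead of the chat REPL, the context size can also be set per request through the options field of the generate API; a minimal sketch (the prompt is a placeholder):

# Request a larger context window for a single API call
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Summarize the following document: ...",
  "options": { "num_ctx": 8192 }
}'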

Exposing Ollama to Your Network

If you run Ollama directly on the host rather than in Docker, it only listens on localhost by default. To allow other devices (or Open WebUI on a different machine) to connect, start it bound to all interfaces:

OLLAMA_HOST=0.0.0.0 ollama serve

The Docker Compose setup above doesn't need this: the official image already listens on all interfaces, and the "11434:11434" port mapping is what exposes it to your network.

If you do this, make sure Ollama is behind a firewall or VPN. There is no built-in authentication.
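One simple safeguard, if the host uses ufw, is to restrict the port to your LAN; the 192.168.1.0/24 subnet below is only an example, so adjust it to match your network:

# Allow only local-network clients to reach the Ollama API
sudo ufw allow from 192.168.1.0/24 to any port 11434 proto tcp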

Practical Use Cases

Self-hosted LLMs shine in specific scenarios: working with privacy-sensitive documents that should never leave your network, offline or air-gapped environments, high-volume experimentation without per-token costs, and wiring an LLM into other self-hosted tools through the local API.

Self-Hosted LLMs vs Cloud APIs

| Feature | Self-Hosted (Ollama) | Cloud (ChatGPT/Claude) |
| --- | --- | --- |
| Privacy | Complete -- nothing leaves your machine | Data sent to provider |
| Cost | Hardware only (one-time) | Per-token or subscription |
| Model quality | Good (7B-70B class) | State-of-the-art (GPT-4o, Claude) |
| Speed | Depends on hardware | Consistently fast |
| Internet required | No (after model download) | Yes |
| Customization | Full control (system prompts, fine-tuning) | Limited |
| Maintenance | You manage updates and hardware | Zero maintenance |

The honest truth: cloud models like GPT-4o and Claude are still significantly more capable than anything you can run locally on consumer hardware. Self-hosted LLMs are best as a complement to cloud services, not a replacement. Use them for privacy-sensitive tasks and experimentation, and cloud APIs when you need maximum quality.

Keeping Things Updated

Ollama and Open WebUI both move quickly. Update regularly:

docker compose pull
docker compose up -d

To update a model to the latest version:

docker exec -it ollama ollama pull llama3.1:8b

Verdict

Ollama and Open WebUI are the easiest way to run LLMs on your own hardware. The setup takes five minutes with Docker Compose, and the experience is genuinely good -- especially with a decent GPU. You won't match GPT-4 quality with a 7B model, but for many everyday tasks, local models are more than capable. The privacy and zero ongoing cost make it worth running alongside whatever cloud services you already use.
