Self-Hosting AI with Ollama and Open WebUI: Run LLMs Locally
Running large language models locally used to require deep knowledge of Python environments, model quantization, and GPU drivers. Today, two tools make it remarkably simple: Ollama handles downloading, running, and serving LLMs through a clean CLI, and Open WebUI provides a polished chat interface on top of it. Together, they give you a private, self-hosted alternative to ChatGPT that runs entirely on your own hardware.
No API keys, no usage fees, no data leaving your network.
What Ollama Does
Ollama is a lightweight runtime for large language models. Think of it as Docker for LLMs -- you pull a model by name, and Ollama handles downloading the weights, loading them into memory, and serving an OpenAI-compatible API.
# Install and run a model in two commands
ollama pull llama3.1:8b
ollama run llama3.1:8b
Under the hood, Ollama uses llama.cpp for inference, which means it supports GGUF-quantized models and can run on both CPU and GPU. It exposes a REST API on port 11434, making it easy for other tools to interact with it.
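For example, once Ollama is running and a model is pulled, you can hit that API directly. The sketch below uses the native /api/generate endpoint and the OpenAI-compatible /v1/chat/completions route (it assumes you've pulled llama3.1:8b):
# Native API: streams the response as JSON lines
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Explain quantization in one sentence."
}'
# OpenAI-compatible endpoint -- point existing OpenAI SDKs and tools at this URL
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'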
What Open WebUI Provides
Open WebUI (formerly Ollama WebUI) is a self-hosted web interface that connects to Ollama's API. It gives you a ChatGPT-like experience in your browser, with features that go well beyond a basic chat box:
- Multi-model switching -- swap between installed models mid-conversation
- Conversation history -- persistent chat logs stored locally
- Document upload -- basic RAG (retrieval-augmented generation) for chatting with files
- User management -- multiple accounts with separate conversation histories
- Model management -- pull, delete, and configure models from the UI
- Prompt templates -- save and reuse system prompts
Docker Compose Setup
The simplest way to run both together is a single Docker Compose file.
CPU-Only Setup
# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    volumes:
      - open_webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_data:
  open_webui_data:
docker compose up -d
Open http://your-server:3000, create an account (the first account becomes admin), and you're ready to go. You still need to pull a model -- either from the Open WebUI interface or via the CLI:
docker exec -it ollama ollama pull llama3.1:8b
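To confirm everything is wired up, list the models Ollama knows about, both from inside the container and over the API (the second command also verifies that port 11434 is reachable from the host):
# List installed models from inside the container
docker exec -it ollama ollama list
# The same list over the REST API
curl http://localhost:11434/api/tags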
NVIDIA GPU Setup
For GPU acceleration with an NVIDIA card, install the NVIDIA Container Toolkit, then modify the Ollama service:
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped
Verify GPU access after starting:
docker exec -it ollama ollama run llama3.1:8b "Hello, are you using GPU?"
# Check nvidia-smi to confirm GPU utilization
nvidia-smi
GPU vs CPU: What to Expect
The performance difference between GPU and CPU inference is dramatic. Here are rough expectations for generating tokens with Llama 3.1 8B:
| Hardware | Tokens/sec | Experience |
|---|---|---|
| Modern CPU (16 cores, AVX2) | 5-15 tok/s | Usable but noticeably slow |
| NVIDIA RTX 3060 (12 GB VRAM) | 40-60 tok/s | Smooth, real-time feel |
| NVIDIA RTX 3090 (24 GB VRAM) | 60-90 tok/s | Fast, comfortable |
| NVIDIA RTX 4090 (24 GB VRAM) | 90-130 tok/s | Near-instant responses |
| Apple M2 Pro (16 GB unified) | 30-50 tok/s | Good experience |
CPU inference is viable for small models and occasional use. If you plan to use LLMs regularly, a GPU with at least 8 GB of VRAM makes a significant quality-of-life difference.
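To see where your own hardware lands in this table, Ollama's --verbose flag prints timing statistics after each response, including the generation speed:
# The "eval rate" line in the printed stats is the generation speed in tokens/s
docker exec -it ollama ollama run llama3.1:8b --verbose "Write a haiku about self-hosting."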
Recommended Models
Not all models are equal, and bigger is not always better. Here's a practical guide to which models work well for self-hosting:
Best Starting Points
| Model | Size | VRAM Needed | Good For |
|---|---|---|---|
| Llama 3.1 8B | ~4.7 GB | 6 GB | General purpose, coding, reasoning |
| Mistral 7B | ~4.1 GB | 6 GB | Fast general use, good instruction following |
| Gemma 2 9B | ~5.4 GB | 8 GB | Strong reasoning, Google's quality |
| Phi-3 Mini 3.8B | ~2.2 GB | 4 GB | Surprisingly capable for its size |
| CodeLlama 7B | ~3.8 GB | 6 GB | Code generation and explanation |
Larger Models (If You Have the Hardware)
| Model | Size | VRAM Needed | Good For |
|---|---|---|---|
| Llama 3.1 70B | ~40 GB | 48 GB (or CPU offload) | Near-GPT-4 quality for many tasks |
| Mixtral 8x7B | ~26 GB | 32 GB | Excellent quality-to-speed ratio |
| Qwen 2.5 32B | ~18 GB | 24 GB | Strong multilingual and coding |
Pull any of these with:
docker exec -it ollama ollama pull llama3.1:8b
docker exec -it ollama ollama pull mistral:7b
docker exec -it ollama ollama pull gemma2:9b
Hardware Requirements
Minimum Viable Setup
- CPU: Any modern x86_64 with AVX2 support (most CPUs from 2015+ -- a quick check is shown after this list)
- RAM: 8 GB (for 7B models with CPU inference)
- Storage: 10 GB per model (varies by quantization level)
- GPU: Not required, but strongly recommended
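If you're not sure whether your CPU supports AVX2, a quick check on Linux is to grep the CPU flags:
# Prints "avx2" if the CPU advertises the instruction set (Linux)
grep -o -m1 avx2 /proc/cpuinfo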
Comfortable Setup
- CPU: 8+ cores
- RAM: 32 GB
- GPU: NVIDIA card with 12+ GB VRAM (RTX 3060 12 GB is the sweet spot for price/performance)
- Storage: 100 GB SSD for model storage
Memory Rule of Thumb
For GGUF quantized models (which Ollama uses by default), expect roughly:
- Q4_0 quantization: Model parameter count in billions x 0.6 = GB needed
- Example: Llama 3.1 8B Q4_0 needs about 4.7 GB of RAM/VRAM
If a model doesn't fit entirely in VRAM, Ollama will split it between GPU and CPU memory, which works but reduces speed.
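To see how a loaded model actually got split, ollama ps lists running models along with how much of each is resident in GPU versus CPU memory:
# Shows loaded models and the CPU/GPU share for each
docker exec -it ollama ollama ps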
Practical Configuration Tips
Customizing Model Behavior
Create a custom model with a system prompt:
docker exec -it ollama bash -c 'cat > /tmp/Modelfile <<EOF
FROM llama3.1:8b
SYSTEM "You are a helpful technical assistant. Be concise and accurate."
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF
ollama create my-assistant -f /tmp/Modelfile'
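Once created, my-assistant behaves like any other installed model and shows up in Open WebUI's model picker:
docker exec -it ollama ollama run my-assistant "Summarize what a Modelfile does."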
Increasing Context Length
By default, most models run with a 2048-token context window. For longer conversations, raise num_ctx -- either with a PARAMETER line in a Modelfile (as above) or interactively inside a chat session:
docker exec -it ollama ollama run llama3.1:8b
>>> /set parameter num_ctx 8192
More context uses more memory. A 7B model with 8192 context needs roughly 2 GB more RAM than the default.
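If you're calling the API directly instead, the context size can be set per request through the options field (a sketch using the native /api/generate endpoint):
# Per-request override of the context window
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Summarize this section.",
  "options": { "num_ctx": 8192 }
}'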
Exposing Ollama to Your Network
A native (non-Docker) Ollama install listens only on localhost. To allow other devices (or Open WebUI on a different machine) to connect, set OLLAMA_HOST in its environment:
environment:
  - OLLAMA_HOST=0.0.0.0
In the Docker Compose setup above this is already handled: the 11434:11434 mapping publishes the API on all host interfaces. Either way, make sure Ollama sits behind a firewall or VPN -- there is no built-in authentication.
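If you'd rather keep the API private to the host, one option (a sketch, not required for the setup above) is to bind the published port to the loopback interface; Open WebUI still reaches Ollama over the internal Compose network:
ports:
  # Publish Ollama's API only on the host's loopback interface
  - "127.0.0.1:11434:11434"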
Practical Use Cases
Self-hosted LLMs shine in specific scenarios:
- Privacy-sensitive queries -- medical questions, legal research, financial planning, anything you wouldn't type into a cloud service
- Code assistance -- local Copilot-like functionality without sending your codebase to a third party
- Document analysis -- upload contracts, papers, or reports to Open WebUI and ask questions about them
- Offline access -- works without internet once models are downloaded
- Learning and experimentation -- try different models, fine-tune prompts, understand how LLMs work without per-token costs
Self-Hosted LLMs vs Cloud APIs
| Feature | Self-Hosted (Ollama) | Cloud (ChatGPT/Claude) |
|---|---|---|
| Privacy | Complete -- nothing leaves your machine | Data sent to provider |
| Cost | Hardware only (one-time) | Per-token or subscription |
| Model quality | Good (7B-70B class) | State-of-the-art (GPT-4o, Claude) |
| Speed | Depends on hardware | Consistently fast |
| Internet required | No (after model download) | Yes |
| Customization | Full control (system prompts, fine-tuning) | Limited |
| Maintenance | You manage updates and hardware | Zero maintenance |
The honest truth: cloud models like GPT-4o and Claude are still significantly more capable than anything you can run locally on consumer hardware. Self-hosted LLMs are best as a complement to cloud services, not a replacement. Use them for privacy-sensitive tasks and experimentation, and cloud APIs when you need maximum quality.
Keeping Things Updated
Ollama and Open WebUI both move quickly. Update regularly:
docker compose pull
docker compose up -d
To update a model to the latest version:
docker exec -it ollama ollama pull llama3.1:8b
Verdict
Ollama and Open WebUI are the easiest way to run LLMs on your own hardware. The setup takes five minutes with Docker Compose, and the experience is genuinely good -- especially with a decent GPU. You won't match GPT-4 quality with a 7B model, but for many everyday tasks, local models are more than capable. The privacy and zero ongoing cost make it worth running alongside whatever cloud services you already use.
