Self-Hosting AI with Ollama and Open WebUI: Run LLMs Locally
Running large language models locally used to require deep knowledge of Python environments, model quantization, and GPU drivers. Today, two tools make it remarkably simple: Ollama handles downloading, running, and serving LLMs through a clean CLI, and Open WebUI provides a polished chat interface on top of it. Together, they give you a private, self-hosted alternative to ChatGPT that runs entirely on your own hardware.
No API keys, no usage fees, no data leaving your network.
What Ollama Does
Ollama is a lightweight runtime for large language models. Think of it as Docker for LLMs -- you pull a model by name, and Ollama handles downloading the weights, loading them into memory, and serving an OpenAI-compatible API.
# Install and run a model in two commands
ollama pull llama3.1:8b
ollama run llama3.1:8b
Under the hood, Ollama uses llama.cpp for inference, which means it supports GGUF-quantized models and can run on both CPU and GPU. It exposes a REST API on port 11434, making it easy for other tools to interact with it.
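Once a model is pulled, you can hit that API directly -- for example, a one-off completion through the /api/generate endpoint (the prompt here is just a placeholder):
# Request a single completion; stream is disabled so the JSON arrives in one piece
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Explain GGUF quantization in one sentence.",
  "stream": false
}'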
What Open WebUI Provides
Open WebUI (formerly Ollama WebUI) is a self-hosted web interface that connects to Ollama's API. It gives you a ChatGPT-like experience in your browser, with features that go well beyond a basic chat box:
- Multi-model switching -- swap between installed models mid-conversation
- Conversation history -- persistent chat logs stored locally
- Document upload -- basic RAG (retrieval-augmented generation) for chatting with files
- User management -- multiple accounts with separate conversation histories
- Model management -- pull, delete, and configure models from the UI
- Prompt templates -- save and reuse system prompts
Docker Compose Setup
The simplest way to run both together is a single Docker Compose file.
CPU-Only Setup
# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    volumes:
      - open_webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_data:
  open_webui_data:
docker compose up -d
Open http://your-server:3000, create an account (the first account becomes admin), and you're ready to go. You still need to pull a model -- either from the Open WebUI interface or via the CLI:
docker exec -it ollama ollama pull llama3.1:8b
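To confirm which models are installed (and how much disk space they use):
docker exec -it ollama ollama list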
NVIDIA GPU Setup
For GPU acceleration with an NVIDIA card, install the NVIDIA Container Toolkit, then modify the Ollama service:
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped
Verify GPU access after starting:
docker exec -it ollama ollama run llama3.1:8b "Hello, are you using GPU?"
# Check nvidia-smi to confirm GPU utilization
nvidia-smi
GPU vs CPU: What to Expect
The performance difference between GPU and CPU inference is dramatic. Here are rough expectations for generating tokens with Llama 3.1 8B:
| Hardware | Tokens/sec | Experience |
|---|---|---|
| Modern CPU (16 cores, AVX2) | 5-15 tok/s | Usable but noticeably slow |
| NVIDIA RTX 3060 (12 GB VRAM) | 40-60 tok/s | Smooth, real-time feel |
| NVIDIA RTX 3090 (24 GB VRAM) | 60-90 tok/s | Fast, comfortable |
| NVIDIA RTX 4090 (24 GB VRAM) | 90-130 tok/s | Near-instant responses |
| Apple M2 Pro (16 GB unified) | 30-50 tok/s | Good experience |
CPU inference is viable for small models and occasional use. If you plan to use LLMs regularly, a GPU with at least 8 GB of VRAM makes a significant quality-of-life difference.
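To benchmark your own hardware, run a one-off prompt with the --verbose flag; Ollama prints timing statistics after the response, including an eval rate in tokens per second (the prompt itself is arbitrary):
# The stats printed after the response include prompt eval rate and eval rate
docker exec -it ollama ollama run llama3.1:8b --verbose "Write a haiku about winter."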
Recommended Models
Not all models are equal, and bigger is not always better. Here's a practical guide to which models work well for self-hosting:
Best Starting Points
| Model | Size | VRAM Needed | Good For |
|---|---|---|---|
| Llama 3.1 8B | ~4.7 GB | 6 GB | General purpose, coding, reasoning |
| Mistral 7B | ~4.1 GB | 6 GB | Fast general use, good instruction following |
| Gemma 2 9B | ~5.4 GB | 8 GB | Strong reasoning, Google's quality |
| Phi-3 Mini 3.8B | ~2.2 GB | 4 GB | Surprisingly capable for its size |
| CodeLlama 7B | ~3.8 GB | 6 GB | Code generation and explanation |
Larger Models (If You Have the Hardware)
| Model | Size | VRAM Needed | Good For |
|---|---|---|---|
| Llama 3.1 70B | ~40 GB | 48 GB (or CPU offload) | Near-GPT-4 quality for many tasks |
| Mixtral 8x7B | ~26 GB | 32 GB | Excellent quality-to-speed ratio |
| Qwen 2.5 32B | ~18 GB | 24 GB | Strong multilingual and coding |
Pull any of these with:
docker exec -it ollama ollama pull llama3.1:8b
docker exec -it ollama ollama pull mistral:7b
docker exec -it ollama ollama pull gemma2:9b
Hardware Requirements
Minimum Viable Setup
- CPU: Any modern x86_64 with AVX2 support (most CPUs from 2015+)
- RAM: 8 GB (for 7B models with CPU inference)
- Storage: 10 GB per model (varies by quantization level)
- GPU: Not required, but strongly recommended
Comfortable Setup
- CPU: 8+ cores
- RAM: 32 GB
- GPU: NVIDIA card with 12+ GB VRAM (RTX 3060 12 GB is the sweet spot for price/performance)
- Storage: 100 GB SSD for model storage
Memory Rule of Thumb
For GGUF quantized models (which Ollama uses by default), expect roughly:
- Q4_0 quantization: Model parameter count in billions x 0.6 = GB needed
- Example: Llama 3.1 8B Q4_0 needs about 4.7 GB of RAM/VRAM
If a model doesn't fit entirely in VRAM, Ollama will split it between GPU and CPU memory, which works but reduces speed.
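You can see how a loaded model was placed with ollama ps, which reports its size in memory and how it was scheduled (exact column layout varies by Ollama version):
docker exec -it ollama ollama ps
# The PROCESSOR column shows the split, e.g. "100% GPU" or "48%/52% CPU/GPU"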
Practical Configuration Tips
Customizing Model Behavior
Create a custom model with a system prompt:
# Write a Modelfile inside the container, then build a named model from it
docker exec -it ollama bash -c 'cat > /tmp/Modelfile <<EOF
FROM llama3.1:8b
SYSTEM "You are a helpful technical assistant. Be concise and accurate."
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF
ollama create my-assistant -f /tmp/Modelfile'
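Then run it like any other model:
docker exec -it ollama ollama run my-assistant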
Increasing Context Length
By default, most models run with a 2048-token context window (the exact default depends on the Ollama version). The ollama run command has no context-length flag, so for longer conversations set the parameter inside an interactive session, or bake it into a custom model with PARAMETER num_ctx as shown above:
docker exec -it ollama ollama run llama3.1:8b
>>> /set parameter num_ctx 8192
More context uses more memory. A 7B model with 8192 context needs roughly 2 GB more RAM than the default.
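If you talk to the API directly instead of the CLI, the same setting can be passed per request through the options field of the request body:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Summarize the key points of this document.",
  "stream": false,
  "options": { "num_ctx": 8192 }
}'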
Exposing Ollama to Your Network
A native (non-Docker) Ollama install listens only on localhost. To allow other devices (or Open WebUI on a different machine) to connect, set the OLLAMA_HOST environment variable -- in a Compose file, that looks like:
    environment:
      - OLLAMA_HOST=0.0.0.0
Note that in the Docker setup above, the "11434:11434" port mapping already publishes the API on the host's network interfaces. Either way, make sure Ollama sits behind a firewall or VPN: the API has no built-in authentication.
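From another machine, a quick reachability check is to list the installed models over the API (replace your-server with the host's address):
curl http://your-server:11434/api/tags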
Practical Use Cases
Self-hosted LLMs shine in specific scenarios:
- Privacy-sensitive queries -- medical questions, legal research, financial planning, anything you wouldn't type into a cloud service
- Code assistance -- local Copilot-like functionality without sending your codebase to a third party
- Document analysis -- upload contracts, papers, or reports to Open WebUI and ask questions about them
- Offline access -- works without internet once models are downloaded
- Learning and experimentation -- try different models, fine-tune prompts, understand how LLMs work without per-token costs
Self-Hosted LLMs vs Cloud APIs
| Feature | Self-Hosted (Ollama) | Cloud (ChatGPT/Claude) |
|---|---|---|
| Privacy | Complete -- nothing leaves your machine | Data sent to provider |
| Cost | Hardware only (one-time) | Per-token or subscription |
| Model quality | Good (7B-70B class) | State-of-the-art (GPT-4o, Claude) |
| Speed | Depends on hardware | Consistently fast |
| Internet required | No (after model download) | Yes |
| Customization | Full control (system prompts, fine-tuning) | Limited |
| Maintenance | You manage updates and hardware | Zero maintenance |
The honest truth: cloud models like GPT-4o and Claude are still significantly more capable than anything you can run locally on consumer hardware. Self-hosted LLMs are best as a complement to cloud services, not a replacement. Use them for privacy-sensitive tasks and experimentation, and cloud APIs when you need maximum quality.
Keeping Things Updated
Ollama and Open WebUI both move quickly. Update regularly:
docker compose pull
docker compose up -d
To update a model to the latest version:
docker exec -it ollama ollama pull llama3.1:8b
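To confirm what you're running after an update:
docker exec -it ollama ollama --version
docker compose images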
Verdict
Ollama and Open WebUI are the easiest way to run LLMs on your own hardware. The setup takes five minutes with Docker Compose, and the experience is genuinely good -- especially with a decent GPU. You won't match GPT-4 quality with a 7B model, but for many everyday tasks, local models are more than capable. The privacy and zero ongoing cost make it worth running alongside whatever cloud services you already use.