Self-Hosting AI with Ollama and Open WebUI: Run LLMs Locally
Running large language models locally used to require deep knowledge of Python environments, model quantization, and GPU drivers. Today, two tools make it remarkably simple: Ollama handles downloading, running, and serving LLMs through a clean CLI, and Open WebUI provides a polished chat interface on top of it. Together, they give you a private, self-hosted alternative to ChatGPT that runs entirely on your own hardware.
No API keys, no usage fees, no data leaving your network.
What Ollama Does
Ollama is a lightweight runtime for large language models. Think of it as Docker for LLMs -- you pull a model by name, and Ollama handles downloading the weights, loading them into memory, and serving an OpenAI-compatible API.
# Install and run a model in two commands
ollama pull llama3.1:8b
ollama run llama3.1:8b
Under the hood, Ollama uses llama.cpp for inference, which means it supports GGUF-quantized models and can run on both CPU and GPU. It exposes a REST API on port 11434, making it easy for other tools to interact with it.
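For example, once Ollama is running and a model is pulled, you can hit that API directly. The sketch below uses the native /api/generate endpoint and the OpenAI-compatible /v1/chat/completions route (it assumes you've pulled llama3.1:8b):
# Native API: streams the response as JSON lines
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Explain quantization in one sentence."
}'
# OpenAI-compatible endpoint -- point existing OpenAI SDKs and tools at this URL
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'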
What Open WebUI Provides
Open WebUI (formerly Ollama WebUI) is a self-hosted web interface that connects to Ollama's API. It gives you a ChatGPT-like experience in your browser, with features that go well beyond a basic chat box:
- Multi-model switching -- swap between installed models mid-conversation
- Conversation history -- persistent chat logs stored locally
- Document upload -- basic RAG (retrieval-augmented generation) for chatting with files
- User management -- multiple accounts with separate conversation histories
- Model management -- pull, delete, and configure models from the UI
- Prompt templates -- save and reuse system prompts
Docker Compose Setup
The simplest way to run both together is a single Docker Compose file.
CPU-Only Setup
# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    volumes:
      - open_webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_data:
  open_webui_data:
docker compose up -d
Open http://your-server:3000, create an account (the first account becomes admin), and you're ready to go. You still need to pull a model -- either from the Open WebUI interface or via the CLI:
docker exec -it ollama ollama pull llama3.1:8b
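To confirm everything is wired up, list the models Ollama knows about, both from inside the container and over the API (the second command also verifies that port 11434 is reachable from the host):
# List installed models from inside the container
docker exec -it ollama ollama list
# The same list over the REST API
curl http://localhost:11434/api/tags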
NVIDIA GPU Setup
For GPU acceleration with an NVIDIA card, install the NVIDIA Container Toolkit, then modify the Ollama service:
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped
Verify GPU access after starting:
docker exec -it ollama ollama run llama3.1:8b "Hello, are you using GPU?"
# Check nvidia-smi to confirm GPU utilization
nvidia-smi
GPU vs CPU: What to Expect
The performance difference between GPU and CPU inference is dramatic. Here are rough expectations for generating tokens with Llama 3.1 8B:
| Hardware | Tokens/sec | Experience |
|---|---|---|
| Modern CPU (16 cores, AVX2) | 5-15 tok/s | Usable but noticeably slow |
| NVIDIA RTX 3060 (12 GB VRAM) | 40-60 tok/s | Smooth, real-time feel |
| NVIDIA RTX 3090 (24 GB VRAM) | 60-90 tok/s | Fast, comfortable |
| NVIDIA RTX 4090 (24 GB VRAM) | 90-130 tok/s | Near-instant responses |
| Apple M2 Pro (16 GB unified) | 30-50 tok/s | Good experience |
CPU inference is viable for small models and occasional use. If you plan to use LLMs regularly, a GPU with at least 8 GB of VRAM makes a significant quality-of-life difference.
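To see where your own hardware lands in this table, Ollama's --verbose flag prints timing statistics after each response, including the generation speed:
# The "eval rate" line in the printed stats is the generation speed in tokens/s
docker exec -it ollama ollama run llama3.1:8b --verbose "Write a haiku about self-hosting."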
Recommended Models
Not all models are equal, and bigger is not always better. Here's a practical guide to which models work well for self-hosting:
Best Starting Points
| Model | Size | VRAM Needed | Good For |
|---|---|---|---|
| Llama 3.1 8B | ~4.7 GB | 6 GB | General purpose, coding, reasoning |
| Mistral 7B | ~4.1 GB | 6 GB | Fast general use, good instruction following |
| Gemma 2 9B | ~5.4 GB | 8 GB | Strong reasoning, Google's quality |
| Phi-3 Mini 3.8B | ~2.2 GB | 4 GB | Surprisingly capable for its size |
| CodeLlama 7B | ~3.8 GB | 6 GB | Code generation and explanation |
Larger Models (If You Have the Hardware)
| Model | Size | VRAM Needed | Good For |
|---|---|---|---|
| Llama 3.1 70B | ~40 GB | 48 GB (or CPU offload) | Near-GPT-4 quality for many tasks |
| Mixtral 8x7B | ~26 GB | 32 GB | Excellent quality-to-speed ratio |
| Qwen 2.5 32B | ~18 GB | 24 GB | Strong multilingual and coding |
Pull any of these with:
docker exec -it ollama ollama pull llama3.1:8b
docker exec -it ollama ollama pull mistral:7b
docker exec -it ollama ollama pull gemma2:9b
Hardware Requirements
Minimum Viable Setup
- CPU: Any modern x86_64 with AVX2 support (most CPUs from 2015+ -- a quick check is shown after this list)
- RAM: 8 GB (for 7B models with CPU inference)
- Storage: 10 GB per model (varies by quantization level)
- GPU: Not required, but strongly recommended
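If you're not sure whether your CPU supports AVX2, a quick check on Linux is to grep the CPU flags:
# Prints "avx2" if the CPU advertises the instruction set (Linux)
grep -o -m1 avx2 /proc/cpuinfo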
Comfortable Setup
- CPU: 8+ cores
- RAM: 32 GB
- GPU: NVIDIA card with 12+ GB VRAM (RTX 3060 12 GB is the sweet spot for price/performance)
- Storage: 100 GB SSD for model storage
Memory Rule of Thumb
For GGUF quantized models (which Ollama uses by default), expect roughly:
- Q4_0 quantization: Model parameter count in billions x 0.6 = GB needed
- Example: Llama 3.1 8B Q4_0 needs about 4.7 GB of RAM/VRAM
If a model doesn't fit entirely in VRAM, Ollama will split it between GPU and CPU memory, which works but reduces speed.
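To see how a loaded model actually got split, ollama ps lists running models along with how much of each is resident in GPU versus CPU memory:
# Shows loaded models and the CPU/GPU share for each
docker exec -it ollama ollama ps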
Practical Configuration Tips
Customizing Model Behavior
Create a custom model with a system prompt:
docker exec -it ollama bash -c 'cat > /tmp/Modelfile <<EOF
FROM llama3.1:8b
SYSTEM "You are a helpful technical assistant. Be concise and accurate."
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF
ollama create my-assistant -f /tmp/Modelfile'
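Once created, my-assistant behaves like any other installed model and shows up in Open WebUI's model picker:
docker exec -it ollama ollama run my-assistant "Summarize what a Modelfile does."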
Increasing Context Length
By default, most models run with a 2048-token context window. For longer conversations, raise num_ctx -- either with a PARAMETER line in a Modelfile (as above) or interactively inside a chat session:
docker exec -it ollama ollama run llama3.1:8b
>>> /set parameter num_ctx 8192
More context uses more memory. A 7B model with 8192 context needs roughly 2 GB more RAM than the default.
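If you're calling the API directly instead, the context size can be set per request through the options field (a sketch using the native /api/generate endpoint):
# Per-request override of the context window
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Summarize this section.",
  "options": { "num_ctx": 8192 }
}'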
Exposing Ollama to Your Network
A native (non-Docker) Ollama install listens only on localhost. To allow other devices (or Open WebUI on a different machine) to connect, set OLLAMA_HOST in its environment:
environment:
  - OLLAMA_HOST=0.0.0.0
In the Docker Compose setup above this is already handled: the 11434:11434 mapping publishes the API on all host interfaces. Either way, make sure Ollama sits behind a firewall or VPN -- there is no built-in authentication.
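If you'd rather keep the API private to the host, one option (a sketch, not required for the setup above) is to bind the published port to the loopback interface; Open WebUI still reaches Ollama over the internal Compose network:
ports:
  # Publish Ollama's API only on the host's loopback interface
  - "127.0.0.1:11434:11434"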
Practical Use Cases
Self-hosted LLMs shine in specific scenarios:
- Privacy-sensitive queries -- medical questions, legal research, financial planning, anything you wouldn't type into a cloud service
- Code assistance -- local Copilot-like functionality without sending your codebase to a third party
- Document analysis -- upload contracts, papers, or reports to Open WebUI and ask questions about them
- Offline access -- works without internet once models are downloaded
- Learning and experimentation -- try different models, fine-tune prompts, understand how LLMs work without per-token costs
Self-Hosted LLMs vs Cloud APIs
| Feature | Self-Hosted (Ollama) | Cloud (ChatGPT/Claude) |
|---|---|---|
| Privacy | Complete -- nothing leaves your machine | Data sent to provider |
| Cost | Hardware only (one-time) | Per-token or subscription |
| Model quality | Good (7B-70B class) | State-of-the-art (GPT-4o, Claude) |
| Speed | Depends on hardware | Consistently fast |
| Internet required | No (after model download) | Yes |
| Customization | Full control (system prompts, fine-tuning) | Limited |
| Maintenance | You manage updates and hardware | Zero maintenance |
The honest truth: cloud models like GPT-4o and Claude are still significantly more capable than anything you can run locally on consumer hardware. Self-hosted LLMs are best as a complement to cloud services, not a replacement. Use them for privacy-sensitive tasks and experimentation, and cloud APIs when you need maximum quality.
Keeping Things Updated
Ollama and Open WebUI both move quickly. Update regularly:
docker compose pull
docker compose up -d
To update a model to the latest version:
docker exec -it ollama ollama pull llama3.1:8b
Verdict
Ollama and Open WebUI are the easiest way to run LLMs on your own hardware. The setup takes five minutes with Docker Compose, and the experience is genuinely good -- especially with a decent GPU. You won't match GPT-4 quality with a 7B model, but for many everyday tasks, local models are more than capable. The privacy and zero ongoing cost make it worth running alongside whatever cloud services you already use.
