Self-Hosting Stable Diffusion with ComfyUI: Local AI Image Generation
Cloud image generation services charge per image, impose content filters you can't control, and send every prompt to someone else's server. Self-hosting Stable Diffusion eliminates all three problems. You get unlimited generations, full control over what you create, and complete privacy -- all running on your own GPU.
ComfyUI is the best way to run Stable Diffusion locally. It's a node-based workflow editor that exposes the entire diffusion pipeline as a visual graph. Instead of hiding complexity behind a single "Generate" button, ComfyUI lets you wire together CLIP text encoding, KSampler nodes, VAE decoding, ControlNet conditioning, and LoRA loading exactly how you want. That sounds intimidating, but the default workflow works out of the box -- and the node graph means you can understand and modify every step of the generation process.

Why ComfyUI Over Automatic1111
AUTOMATIC1111's Web UI (A1111) was the default Stable Diffusion interface for years. It's still popular, but ComfyUI has pulled ahead for several reasons:
- Performance -- ComfyUI only re-executes nodes that changed. Edit your prompt and it skips VAE decoding from the previous run. A1111 re-runs the entire pipeline every time.
- Memory efficiency -- ComfyUI uses aggressive model offloading. It can run SDXL on 6 GB VRAM cards that choke under A1111.
- Workflow flexibility -- Node graphs let you build complex pipelines (img2img chains, ControlNet stacking, multi-LoRA blending) that would require extension hacks in A1111.
- Reproducibility -- Workflows are JSON files. Save them, share them, version-control them. Someone else can load your exact pipeline and get identical results.
- Active development -- ComfyUI supports new model architectures (Flux, SD3, Stable Cascade) faster than A1111.
The tradeoff: A1111 has a simpler interface for basic text-to-image. If you just want to type a prompt and click Generate, A1111 is more approachable. But the moment you want to do anything beyond basic generation, ComfyUI's node system is dramatically more powerful.
System Requirements
Image generation is GPU-bound. CPU inference exists but is impractically slow -- expect 10+ minutes per image instead of seconds.
Minimum (Functional)
- GPU: NVIDIA card with 6 GB VRAM (GTX 1660 Super, RTX 2060)
- RAM: 16 GB system memory
- Storage: 20 GB for ComfyUI + one model checkpoint
- OS: Linux recommended, Windows works, macOS via MPS (slower)
Recommended
- GPU: NVIDIA RTX 3060 12 GB or RTX 4060 Ti 16 GB
- RAM: 32 GB system memory
- Storage: 100+ GB SSD (checkpoints are 2-7 GB each, and you will collect them)
- Driver: NVIDIA 535+ with CUDA 12.1+
VRAM Guidelines
| Model | Resolution | VRAM Needed | Time/Image (RTX 3060) |
|---|---|---|---|
| SD 1.5 | 512x512 | 4 GB | ~3 sec |
| SDXL | 1024x1024 | 6-8 GB | ~8 sec |
| Flux Dev | 1024x1024 | 10-12 GB | ~15 sec |
| SD 1.5 + ControlNet | 512x512 | 6 GB | ~5 sec |
| SDXL + LoRA + ControlNet | 1024x1024 | 10 GB | ~12 sec |
AMD GPUs work via ROCm but expect rougher edges. Intel Arc has experimental support. Apple Silicon runs through MPS -- functional but 2-3x slower than equivalent NVIDIA hardware.
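Before installing anything else, it's worth confirming the host driver is actually visible. A minimal preflight sketch (the GPU_INFO variable name is just for illustration):

```shell
# Sketch of a preflight check: confirm the NVIDIA driver is installed
# and report the card, VRAM, and driver version before setting up containers.
if command -v nvidia-smi >/dev/null 2>&1; then
    GPU_INFO=$(nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader)
else
    GPU_INFO="no NVIDIA driver detected -- install driver 535+ first"
fi
echo "GPU: $GPU_INFO"
```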
Docker Deployment
The cleanest way to run ComfyUI is with Docker and NVIDIA Container Toolkit.
First, install the NVIDIA Container Toolkit and verify your GPU is visible from inside a container:

```shell
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```
```yaml
# docker-compose.yml
services:
  comfyui:
    image: ghcr.io/ai-dock/comfyui:latest
    container_name: comfyui
    ports:
      - "8188:8188"
    volumes:
      - ./models:/opt/ComfyUI/models
      - ./output:/opt/ComfyUI/output
      - ./custom_nodes:/opt/ComfyUI/custom_nodes
    environment:
      - CLI_ARGS=--listen 0.0.0.0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
```
Create the bind-mount directories, then start the container:

```shell
mkdir -p models/checkpoints models/loras models/controlnet models/vae output custom_nodes
docker compose up -d
```
Open http://your-server:8188 and you'll see the ComfyUI node editor. It ships with a default text-to-image workflow -- but you'll need to download a model checkpoint first.
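If you'd rather verify from the command line before opening a browser, ComfyUI exposes a /system_stats endpoint that returns JSON about the detected devices and VRAM. A hedged sketch (adjust COMFY_URL to your host; the COMFY_STATE variable is illustrative):

```shell
# Smoke-test the ComfyUI API; /system_stats reports device and VRAM info.
COMFY_URL="${COMFY_URL:-http://localhost:8188}"
if curl -sf "$COMFY_URL/system_stats" >/dev/null 2>&1; then
    COMFY_STATE="up"
    echo "ComfyUI is reachable at $COMFY_URL"
else
    COMFY_STATE="down"
    echo "Not reachable yet -- check 'docker compose logs comfyui'"
fi
```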
Model Management
Models are the core of image generation. You need at least one checkpoint to get started.
Downloading Your First Checkpoint
```shell
# SD 1.5 -- small, fast, huge ecosystem of LoRAs and embeddings
wget -P models/checkpoints/ \
  "https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/resolve/main/v1-5-pruned-emaonly.safetensors"

# SDXL -- higher quality, higher VRAM usage
wget -P models/checkpoints/ \
  "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors"
```
Community fine-tunes on CivitAI and Hugging Face are where the real variety lives. Models like Realistic Vision (photorealism), DreamShaper (artistic), and Juggernaut XL (general purpose SDXL) are popular starting points.
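One gotcha worth checking for: a gated or moved model can leave wget with a small HTML error page saved under the checkpoint's filename. A rough sanity-check sketch (the function name and the 1 MB threshold are arbitrary -- real checkpoints are gigabytes):

```shell
# Flag any "checkpoint" that is implausibly small -- usually a failed or
# gated download that saved an error page instead of the model.
check_checkpoints() {
    dir="${1:-models/checkpoints}"
    for f in "$dir"/*.safetensors; do
        [ -e "$f" ] || { echo "no checkpoints found in $dir"; return 0; }
        size=$(wc -c < "$f")
        if [ "$size" -lt 1000000 ]; then
            echo "SUSPECT: $f is only $size bytes"
        else
            echo "OK: $f"
        fi
    done
}
check_checkpoints
```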
Model Directory Structure
ComfyUI expects models in specific subdirectories:
```
models/
├── checkpoints/      # Main model files (.safetensors)
├── loras/            # LoRA fine-tunes (style/subject adapters)
├── controlnet/       # ControlNet models (pose, depth, canny)
├── vae/              # VAE decoders (affects color/detail)
├── embeddings/       # Textual inversions
├── upscale_models/   # Upscaler models (RealESRGAN, etc.)
└── clip/             # CLIP text encoder models
```
Drop files into the right directory and refresh the ComfyUI browser page. No restart needed.
Workflow Basics
ComfyUI's node graph can look overwhelming at first. Here's what the default text-to-image workflow does:
- Load Checkpoint -- loads the model (UNet, CLIP, VAE) into memory
- CLIP Text Encode (Positive) -- converts your prompt into embeddings the model understands
- CLIP Text Encode (Negative) -- encodes things you don't want in the image
- KSampler -- the denoising loop that actually generates the image from noise
- VAE Decode -- converts the latent output into a visible image
- Save Image -- writes the result to disk
Each node has inputs and outputs you can rewire. Want to add a LoRA? Insert a "Load LoRA" node between the checkpoint and the CLIP encoder. Want ControlNet? Add a "Load ControlNet Model" and "Apply ControlNet" node before the KSampler. The graph makes the data flow explicit.
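That same default workflow, exported via "Save (API Format)", is plain JSON: each node gets an ID, a class_type, and inputs that reference other nodes by ID and output index. A trimmed sketch of roughly what it looks like (node IDs, prompt text, and the checkpoint filename are illustrative):

```json
{
  "4": {"class_type": "CheckpointLoaderSimple",
        "inputs": {"ckpt_name": "v1-5-pruned-emaonly.safetensors"}},
  "5": {"class_type": "EmptyLatentImage",
        "inputs": {"width": 512, "height": 512, "batch_size": 1}},
  "6": {"class_type": "CLIPTextEncode",
        "inputs": {"text": "a watercolor fox, forest light", "clip": ["4", 1]}},
  "7": {"class_type": "CLIPTextEncode",
        "inputs": {"text": "blurry, low quality", "clip": ["4", 1]}},
  "3": {"class_type": "KSampler",
        "inputs": {"seed": 42, "steps": 20, "cfg": 7.0,
                   "sampler_name": "euler", "scheduler": "normal", "denoise": 1.0,
                   "model": ["4", 0], "positive": ["6", 0],
                   "negative": ["7", 0], "latent_image": ["5", 0]}},
  "8": {"class_type": "VAEDecode",
        "inputs": {"samples": ["3", 0], "vae": ["4", 2]}},
  "9": {"class_type": "SaveImage",
        "inputs": {"filename_prefix": "ComfyUI", "images": ["8", 0]}}
}
```

Entries like ["4", 1] mean "output 1 of node 4" -- that explicit wiring is what makes workflows diffable and version-controllable.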
Useful Workflow Patterns
- Hires Fix: Generate at base resolution, then upscale with a second KSampler pass. Dramatically improves detail.
- ControlNet Posing: Feed a reference image through a pose estimator, then condition generation on the skeleton. Consistent character poses without prompt gymnastics.
- LoRA Stacking: Chain multiple LoRAs to combine styles. A "cinematic lighting" LoRA plus a "watercolor" LoRA creates interesting hybrids.
- Batch Generation: Set the KSampler batch size to generate multiple variations in one pass.
Essential Custom Nodes
ComfyUI's plugin ecosystem lives in the custom_nodes directory. Install ComfyUI Manager first -- it adds a UI button for browsing and installing everything else:
```shell
(cd custom_nodes && git clone https://github.com/ltdrdata/ComfyUI-Manager.git)
docker compose restart comfyui
```
From there, the must-haves: Impact-Pack (face detection, regional prompting), IPAdapter_plus (style transfer from reference images), AnimateDiff-Evolved (prompt-to-animation), and UltimateSDUpscale (tile-based upscaling without VRAM limits).
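ComfyUI Manager handles Python dependencies for the nodes it installs; if you clone nodes by hand instead, something like this hypothetical helper covers the gap (the function name and default path are assumptions -- run it inside the container or whatever environment ComfyUI uses):

```shell
# Install Python dependencies declared by manually cloned custom nodes.
install_node_deps() {
    node_dir="${1:-./custom_nodes}"
    installed=0
    for d in "$node_dir"/*/; do
        if [ -f "${d}requirements.txt" ]; then
            pip install -r "${d}requirements.txt"
            installed=$((installed + 1))
        fi
    done
    echo "installed deps for $installed node(s)"
}
install_node_deps
```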
Performance Tips
- Enable FP16/FP8: Add --force-fp16 to CLI_ARGS for half-precision inference. Uses less VRAM with negligible quality loss.
- VAE tiling: For high-resolution images, enable VAE tiling to avoid out-of-memory errors during decode.
- Model caching: ComfyUI keeps the last-used model in VRAM. Switching models frequently thrashes memory. Stick to one checkpoint per session when possible.
- SSD storage: Model loading time is dominated by disk read speed. NVMe SSDs load a 6 GB checkpoint in under 2 seconds; spinning disks take 20+.
- Queue system: ComfyUI has a built-in queue. You can stack up multiple generations and walk away.
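The queue is also scriptable: POST an API-format workflow export to the /prompt endpoint and ComfyUI queues it just like a browser submission. A sketch, assuming you've exported a workflow via "Save (API Format)" to workflow_api.json (filenames and COMFY_URL are placeholders):

```shell
# Queue a saved workflow against a running ComfyUI instance.
COMFY_URL="${COMFY_URL:-http://localhost:8188}"
WORKFLOW_FILE="${WORKFLOW_FILE:-workflow_api.json}"
if [ -f "$WORKFLOW_FILE" ]; then
    # The /prompt endpoint expects the node graph wrapped in a "prompt" key.
    printf '{"prompt": %s}' "$(cat "$WORKFLOW_FILE")" > /tmp/queue_payload.json
    curl -s -X POST "$COMFY_URL/prompt" \
         -H "Content-Type: application/json" \
         -d @/tmp/queue_payload.json || echo "queue failed -- is ComfyUI running?"
else
    echo "export a workflow to $WORKFLOW_FILE first"
fi
```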
Securing Your Instance
ComfyUI has no built-in authentication. If you expose port 8188 to your network:
- Reverse proxy with auth: Put Caddy or Nginx in front with basic auth or SSO
- VPN/Tailscale: Only expose ComfyUI over your private network
- Cloudflare Tunnel: Zero-trust access without port forwarding
Never expose ComfyUI directly to the internet. It executes arbitrary Python through custom nodes -- an unauthenticated instance is a remote code execution vulnerability.
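As a sketch of the reverse-proxy option, a minimal Caddyfile with basic auth might look like the following. The domain and username are placeholders; generate the password hash with caddy hash-password, and note that Caddy releases before 2.8 spell the directive basicauth:

```
comfyui.example.com {
    basic_auth {
        # Replace with the bcrypt hash printed by: caddy hash-password
        admin <bcrypt-hash>
    }
    reverse_proxy 127.0.0.1:8188
}
```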
Verdict
ComfyUI is the power-user's choice for local image generation. The node-based interface has a steeper learning curve than A1111's form-based UI, but it pays dividends immediately: better performance, lower VRAM usage, reproducible workflows, and the ability to build generation pipelines that simply aren't possible in other interfaces. If you have an NVIDIA GPU with 8+ GB of VRAM, you can be generating images in under ten minutes with the Docker Compose setup above. The entire Stable Diffusion ecosystem -- thousands of checkpoints, LoRAs, ControlNets, and community workflows -- is available to you, running entirely on your own hardware, with no per-image costs and no content restrictions beyond what you choose.
