Paperless-ngx Complete Setup Guide: Docker, OCR, Tagging, and Workflow Automation
Somewhere in your house, there's a drawer. Maybe a filing cabinet. Maybe a cardboard box in a closet. It's full of documents you might need someday: tax receipts, insurance papers, warranty cards, medical records, that letter from your bank about account changes. Finding any specific document in there takes anywhere from 5 minutes to "I'll just request a new copy."
Paperless-ngx replaces that drawer with a searchable, organized, automatically tagged digital archive. Scan a document, drop it in a folder, and Paperless-ngx will OCR it, classify it, tag it, detect the date, identify the sender, and file it. Months later, you type "property tax 2024" into the search bar and find it in two seconds.
This guide walks through a complete Paperless-ngx setup: Docker installation, OCR tuning, consumption directories, the tagging system, and practical workflow tips that make the difference between "I set this up once and never used it" and "this is now essential to my household."
What Paperless-ngx Actually Does
When a document enters Paperless-ngx (via file upload, email, scanner, or API), it goes through a pipeline:
- File detection — Identifies the file type (PDF, image, Office document)
- OCR processing — Runs Tesseract OCR on images and scanned PDFs, creating a searchable text layer
- Date detection — Scans the text for dates and picks the most likely document date
- Correspondent matching — Identifies who sent the document (bank, utility company, employer)
- Document type classification — Categorizes it (invoice, receipt, letter, contract)
- Tag suggestion — Machine learning suggests tags based on content similarity to previously tagged documents
- Storage — Archives the original file plus a searchable PDF/A version
- Indexing — Adds the full text to the search index
The result: every document you've ever received becomes searchable by content, date, correspondent, type, or tag. The search is fast — even with thousands of documents, results appear instantly.
Docker Compose Installation
Here's a production-ready Docker Compose configuration. This uses PostgreSQL (better performance than SQLite for large collections) and Redis (required for background task processing):
# docker-compose.yml
services:
  broker:
    image: redis:7-alpine
    container_name: paperless-redis
    restart: unless-stopped
    volumes:
      - redis_data:/data
  db:
    image: postgres:16-alpine
    container_name: paperless-db
    restart: unless-stopped
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: your-secure-db-password
    volumes:
      - pgdata:/var/lib/postgresql/data
  webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    container_name: paperless
    restart: unless-stopped
    depends_on:
      - db
      - broker
    ports:
      - "8000:8000"
    environment:
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_DBNAME: paperless
      PAPERLESS_DBUSER: paperless
      PAPERLESS_DBPASS: your-secure-db-password
      PAPERLESS_SECRET_KEY: generate-a-random-64-char-string-here
      PAPERLESS_URL: https://paperless.yourdomain.com
      PAPERLESS_TIME_ZONE: America/New_York
      PAPERLESS_OCR_LANGUAGE: eng
      PAPERLESS_ADMIN_USER: admin
      PAPERLESS_ADMIN_PASSWORD: your-admin-password
      USERMAP_UID: 1000
      USERMAP_GID: 1000
    volumes:
      - data:/usr/src/paperless/data
      - media:/usr/src/paperless/media
      - export:/usr/src/paperless/export
      - ./consume:/usr/src/paperless/consume
volumes:
  redis_data:
  pgdata:
  data:
  media:
  export:
Start everything:
docker compose up -d
Wait about 30 seconds for the database to initialize and the web server to start. Then visit http://your-server:8000 and log in with the admin credentials you set in the environment variables.
Generating a Secret Key
The PAPERLESS_SECRET_KEY must be a random string. Generate one with:
python3 -c 'import secrets; print(secrets.token_hex(32))'
Or with OpenSSL:
openssl rand -hex 32
OCR Configuration
Paperless-ngx uses Tesseract for OCR. The default configuration handles English documents well, but there are several settings worth tuning.
Multi-Language OCR
If you receive documents in multiple languages, configure Tesseract to recognize them:
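environment:
  PAPERLESS_OCR_LANGUAGE: eng+deu+fra # English, German, French
The Docker image bundles only a handful of language packs (English, German, French, Italian, and Spanish, according to the project documentation). To use any other language, list its pack in PAPERLESS_OCR_LANGUAGES so it is installed when the container starts, then reference it in PAPERLESS_OCR_LANGUAGE. Common Tesseract codes: eng (English), deu (German), fra (French), spa (Spanish), ita (Italian), por (Portuguese), nld (Dutch), jpn (Japanese), chi_sim (Simplified Chinese), chi_tra (Traditional Chinese). Note that PAPERLESS_OCR_LANGUAGES expects the package spelling with a hyphen (chi-sim), while PAPERLESS_OCR_LANGUAGE uses the underscore form (chi_sim).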
OCR Mode
Paperless-ngx has several OCR modes:
environment:
  PAPERLESS_OCR_MODE: skip # Default
  # Options:
  # skip: OCR only the pages that have no text layer yet
  # redo: OCR every page and replace text layers left by a previous OCR pass
  # force: rasterize every page and OCR it, discarding any existing text (larger files)
  # skip_noarchive: like skip, but don't create an archive version when text is found
  #                 (newer releases deprecate this in favor of PAPERLESS_OCR_SKIP_ARCHIVE_FILE)
For most users, skip is the best option, and it is also the default. It avoids re-processing PDFs that already have embedded text (like digitally generated bank statements) while still running OCR on scanned documents.
OCR Output Type
environment:
  PAPERLESS_OCR_OUTPUT_TYPE: pdfa # Default
  # pdfa: PDF/A format (archival standard, recommended)
  # pdf: Standard PDF
  # pdfa-1: PDF/A-1b specifically
  # pdfa-2: PDF/A-2b specifically
PDF/A is the archival standard — it ensures your documents remain readable decades from now. Stick with the default.
Image DPI for OCR
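Paperless-ngx needs to know the resolution of an image to convert it into a correctly sized PDF. Most scanners embed this in the file, and Paperless-ngx reads it automatically. If your scanner produces images without resolution metadata and the results look wrong, set a fallback value that matches your scanner's output:
environment:
  PAPERLESS_OCR_IMAGE_DPI: 300 # Fallback only; used when an image carries no DPI information
300 DPI is the sweet spot for scanning text documents. Going higher rarely improves OCR accuracy and significantly increases processing time.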
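To make the processing-time point concrete, here is a small arithmetic sketch. It assumes an A4 page (roughly 8.27 x 11.69 inches) and simply counts pixels; it is not how Paperless-ngx measures anything, and the helper name is made up for illustration.

```python
# A4 page size in inches (210 mm x 297 mm). Assumption: your documents are A4.
A4_INCHES = (8.27, 11.69)


def page_pixels(dpi: int, size_inches: tuple = A4_INCHES) -> tuple:
    """Return (width_px, height_px, total_px) for one page scanned at `dpi`."""
    width = round(size_inches[0] * dpi)
    height = round(size_inches[1] * dpi)
    return width, height, width * height


for dpi in (300, 600):
    w, h, total = page_pixels(dpi)
    print(f"{dpi} DPI: {w} x {h} px, about {total / 1e6:.1f} megapixels")
```

Doubling the resolution quadruples the number of pixels the OCR engine has to work through, which is why scans above 300 DPI tend to cost far more time than they return in accuracy.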
Consumption Directories
The consumption directory is where you drop files for Paperless-ngx to ingest. In our Docker Compose setup, it's mapped to ./consume on the host. Any file you place in this directory gets automatically processed and added to the archive.
Setting Up a Network Scanner
Most modern scanners (Brother, Fujitsu ScanSnap, Epson) support "scan to folder" over SMB/CIFS or FTP. Point your scanner at the consumption directory:
- Share the consume directory via Samba:
# /etc/samba/smb.conf
[paperless-consume]
    path = /path/to/consume
    writable = yes
    valid users = scanner
    create mask = 0664
    directory mask = 0775
- Configure your scanner to save to \\your-server\paperless-consume
Every scan now automatically flows into Paperless-ngx.
Email Consumption
Paperless-ngx can fetch documents from an email inbox automatically. Configure an email account that receives your bills and statements:
environment:
  PAPERLESS_EMAIL_TASK_CRON: "*/10 * * * *" # Check every 10 minutes
Then in the Paperless-ngx admin panel (Settings > Mail), add a mail account:
- IMAP server: imap.yourprovider.com
- Port: 993 (SSL)
- Username/Password: Your email credentials
- Folder: INBOX (or a dedicated subfolder)
Create a mail rule specifying which attachments to consume (PDFs, images) and what to do with processed emails (mark as read, move to folder, delete).
This is particularly powerful if you set up email forwarding: configure your bank, utility companies, and insurance providers to email statements to a dedicated address that Paperless-ngx monitors.
Subdirectory Consumption
You can use subdirectories within the consumption folder to automatically assign tags:
environment:
  PAPERLESS_CONSUMER_RECURSIVE: "true" # Required, otherwise subdirectories are not watched
  PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS: "true"
With this enabled:
- Files in consume/taxes/ get tagged "taxes"
- Files in consume/medical/ get tagged "medical"
- Files in consume/receipts/ get tagged "receipts"
This is handy if you have household members who scan documents but don't want to interact with the Paperless-ngx web interface — they just drop files in the right folder.
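The folder-to-tag mapping can be pictured with a few lines of Python. This is only an illustration of the rule as the documentation describes it (each subdirectory between the consume root and the file contributes one tag); the function name and paths are hypothetical, and Paperless-ngx's own implementation may differ in details.

```python
from pathlib import Path


def tags_from_subdirs(file_path: str, consume_root: str) -> list:
    """Return the subdirectory names between the consume root and the file."""
    relative = Path(file_path).relative_to(consume_root)
    # Every path component except the file name itself becomes a tag.
    return list(relative.parts[:-1])


print(tags_from_subdirs("consume/taxes/2024/w2.pdf", "consume"))
print(tags_from_subdirs("consume/scan.pdf", "consume"))
```

Under this reading, nesting folders is a cheap way to apply several tags at once, while files dropped directly into the consume root get none.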
The Tagging System
Paperless-ngx has four classification axes, each serving a different purpose:
Correspondents
A correspondent is who sent or created the document. Examples: "Bank of America," "State Farm Insurance," "City Water Department," "Dr. Smith's Office."
Paperless-ngx learns correspondents over time. After you manually assign a correspondent to a few documents from the same sender, the ML system starts suggesting it automatically for new documents with similar content.
Document Types
Document types categorize what the document is. Examples: "Invoice," "Receipt," "Contract," "Medical Record," "Tax Form," "Warranty Card," "Insurance Policy."
Keep document types broad. You don't need "Electric Bill" and "Water Bill" as separate types — "Invoice" covers both. Use tags for granularity.
Tags
Tags are the flexible classification layer. Unlike correspondents and document types (which are one-per-document), a document can have multiple tags. Examples:
- "2024," "2025" (year)
- "tax-deductible" (financial relevance)
- "car," "house," "health" (category)
- "action-required" (workflow status)
- "important" (priority)
Storage Paths
Storage paths control how Paperless-ngx organizes the archived files on disk. By default, everything goes into a flat structure. With storage paths, you can create hierarchical filing:
archive/
  taxes/
    2024/
      W2-employer.pdf
      1099-bank.pdf
    2025/
      W2-employer.pdf
  insurance/
    auto-policy-2024.pdf
    home-policy-2024.pdf
  medical/
    lab-results-2024-03.pdf
Configure storage path templates in the admin panel. A typical template:
{correspondent}/{document_type}/{created_year}/{title}
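To see what a template like this expands to, here is a rough sketch using plain Python string formatting. The document metadata is invented, and Paperless-ngx performs its own rendering (including filename sanitizing, and a different template syntax in newer releases), so treat this purely as an illustration of placeholder substitution.

```python
TEMPLATE = "{correspondent}/{document_type}/{created_year}/{title}"


def render_storage_path(template: str, **fields: str) -> str:
    """Fill the placeholders with str.format and append a .pdf suffix."""
    return template.format(**fields) + ".pdf"


# Hypothetical metadata for a single document.
path = render_storage_path(
    TEMPLATE,
    correspondent="State Farm Insurance",
    document_type="Insurance Policy",
    created_year="2024",
    title="auto-policy",
)
print(path)
```

Each placeholder becomes one directory level, so the order of placeholders in the template decides whether you browse by sender first or by year first.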
Automatic Matching
Paperless-ngx supports several matching algorithms for auto-assigning correspondents, types, and tags:
- Exact match — Document content contains the exact string
- Regular expression — Content matches a regex pattern
- Fuzzy match — Content approximately matches (handles OCR errors)
- Auto (ML) — Machine learning based on previously classified documents
For correspondents, regex matching works well. For example, match "Bank of America" with the pattern (?i)bank\s+of\s+america|bofa|boa\s+statement. The (?i) makes it case-insensitive.
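Before saving a pattern, it can help to try it against a few lines of sample text. The snippet below is a quick check with Python's re module, on the assumption that this is close to how the pattern will be evaluated (Paperless-ngx is a Python application); the sample strings are made up.

```python
import re

# The pattern from the paragraph above; (?i) makes every alternative case-insensitive.
PATTERN = re.compile(r"(?i)bank\s+of\s+america|bofa|boa\s+statement")


def matches_correspondent(text: str) -> bool:
    """Return True if the pattern occurs anywhere in the text."""
    return PATTERN.search(text) is not None


samples = [
    "Your BANK OF  AMERICA statement is ready",
    "BofA checking account summary",
    "Chase Bank annual statement",
]
for sample in samples:
    print(f"{matches_correspondent(sample)!s:5}  {sample}")
```

Because the search matches anywhere in the text, short alternatives such as bofa can also hit inside longer words, so keep alternatives as specific as the OCR quality allows.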
For tags, the ML auto-matching is surprisingly accurate after you've manually tagged about 20-30 documents. The more documents you correctly tag, the better the suggestions become.
Training the ML System
When you first set up Paperless-ngx, spend 30 minutes manually classifying your first batch of documents:
- Upload or scan 30-50 documents
- For each, set the correspondent, document type, and relevant tags
- After classifying this initial batch, enable auto-matching (ML) on your correspondents, types, and tags
From this point forward, Paperless-ngx will suggest classifications for new documents. Accept correct suggestions and fix incorrect ones — the system learns from corrections.
Practical Workflow
Here's a daily workflow that keeps your document archive current without becoming a chore:
Incoming Mail
When physical mail arrives:
- Open it
- Decide if you need to keep it (most mail is junk)
- If keeping: scan it with your phone (using a community app such as Paperless Mobile, or any scanning app that saves to a folder), or feed it through a desktop scanner
- The document appears in Paperless-ngx within minutes
- Verify the auto-classification is correct
- Recycle the paper original (unless you need the original for legal purposes)
Digital Documents
For documents that arrive by email (statements, receipts, confirmations):
- If email consumption is configured, they arrive automatically
- If not, save the PDF and drop it in the consumption directory
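If you script the "drop it in the consumption directory" step, one cautious approach is to copy the file to a staging folder first and then move it into place in a single step, so the consumer never sees a half-written file. The sketch below assumes the staging folder lives on the same filesystem as the consume folder; the function and folder names are hypothetical.

```python
import os
import shutil
from pathlib import Path


def drop_into_consume(source: Path, consume_dir: Path, staging_dir: Path) -> Path:
    """Copy `source` into staging, then move it into the consume folder."""
    staging_dir.mkdir(parents=True, exist_ok=True)
    staged = staging_dir / source.name
    # The slow part (copying bytes) happens outside the consume folder.
    shutil.copy2(source, staged)
    target = consume_dir / source.name
    # os.replace is a single rename when both folders share a filesystem;
    # it raises OSError if they do not.
    os.replace(staged, target)
    return target
```

This matters mostly for large files or slow network copies; for small PDFs saved locally, dropping the file straight into the folder is usually fine.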
Monthly Review
Once a month, spend 10 minutes:
- Check the Paperless-ngx inbox for any unclassified documents
- Review auto-assigned tags for accuracy
- Update any correspondents that weren't recognized
- Create new tags or document types if a pattern has emerged
Backup and Restore
Paperless-ngx stores data in three places that need backup:
- PostgreSQL database: all metadata (tags, correspondents, document types, users, and settings)
- Media directory: original documents, OCR'd archive versions, and thumbnails
- Data directory: the search index and the trained classification model, both of which can be rebuilt from the database and media files
Built-in Export
Paperless-ngx has a built-in export function:
docker compose exec webserver document_exporter ../export
This creates a manifest file plus all original documents — a portable backup you can import into a fresh installation.
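A backup is only useful if it is complete, so it can be worth sanity-checking an export after it runs. The sketch below assumes the export folder contains a manifest.json holding a list of records that each carry a "model" field, with documents stored under "documents.document". That layout is an assumption about the export format, so compare it against a real export before relying on the count.

```python
import json
from pathlib import Path


def count_exported_documents(export_dir: str) -> int:
    """Count document records in an export's manifest.json (assumed layout)."""
    manifest_path = Path(export_dir) / "manifest.json"
    with manifest_path.open(encoding="utf-8") as fh:
        entries = json.load(fh)
    # Assumption: each record is a dict with a "model" key.
    return sum(1 for entry in entries if entry.get("model") == "documents.document")
```

Comparing this number with the document count shown in the web UI is a quick way to notice an export that was interrupted or pointed at the wrong folder.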
Database Backup
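For a faster, lightweight backup of the metadata, dump the database separately. pg_dump always writes a full dump, not an incremental one, but it is small compared with the documents themselves. The -T flag disables the pseudo-terminal so the redirected output stays clean:
docker compose exec -T db pg_dump -U paperless paperless > paperless-backup-$(date +%Y%m%d).sql
Include this dump plus the media and data volumes in your regular Restic/Borg backup, and let that tool handle the incremental part.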
Restore from Export
docker compose exec webserver document_importer ../export
This recreates all documents, metadata, tags, and classifications from a previous export.
Performance Tuning
Worker Processes
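If documents are processing slowly, adjust the number of worker processes and the threads each one may use:
environment:
  PAPERLESS_TASK_WORKERS: 2 # Default: 1
  PAPERLESS_THREADS_PER_WORKER: 2 # Default: derived from the CPU core count
Each worker can process one document at a time, and the threads setting controls how many cores a single document's OCR may use. With 2 workers, you can process two documents simultaneously. Keep workers multiplied by threads at or below your CPU core count, or the processes will compete for the same cores.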
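The relationship between workers, threads, and cores is simple arithmetic. The helper below mirrors the default described in the project documentation as I understand it (cores divided by workers, rounded down, with a minimum of one); the function name is made up, and you should confirm the formula against the documentation for your version.

```python
import os


def threads_per_worker(cpu_cores: int, task_workers: int) -> int:
    """Largest thread count that keeps workers * threads within the core count."""
    return max(cpu_cores // task_workers, 1)


cores = os.cpu_count() or 1
for workers in (1, 2, 4):
    threads = threads_per_worker(cores, workers)
    print(f"{workers} worker(s) on {cores} core(s): {threads} thread(s) each")
```

More workers favor throughput when many documents arrive at once; more threads per worker favor speed on a single large document.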
Thumbnail Generation
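Thumbnails are generated once for every document during consumption and are used in the web UI grid view, so their cost is part of document processing and is governed by the worker settings above. If the web UI itself feels sluggish while browsing a large archive, you can give the web server more worker processes, at the cost of additional memory:
environment:
  PAPERLESS_WEBSERVER_WORKERS: 2 # Default: 1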
Search Optimization
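Paperless-ngx uses Whoosh for full-text search. The built-in search works well for typical household archives but may slow down on very large ones (10,000+ documents). As of this writing, the configuration documentation does not appear to describe a setting for swapping in a different search backend, so verify any such option before relying on it. If search feels slow or returns stale results, rebuild the index with the built-in management command: docker compose exec webserver document_index reindex.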
Mobile Access
Paperless Mobile App
The community-built Paperless Mobile app (available for Android and iOS) provides a native interface for browsing, searching, and uploading documents. It connects to your Paperless-ngx instance via the REST API.
To use it, you'll need:
- Your Paperless-ngx URL accessible from outside your home network (via reverse proxy, VPN, or Cloudflare Tunnel)
- Your username and password
- API access enabled (it's on by default)
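The same REST API that mobile apps use is available to your own scripts. The sketch below builds an upload request for the document upload endpoint (/api/documents/post_document/) using token authentication, without sending it. The hostname, token, and function name are placeholders, and you would create the token for your own user in the web UI.

```python
import mimetypes
import uuid
from urllib.request import Request


def build_upload_request(base_url: str, token: str, filename: str, content: bytes) -> Request:
    """Build (but do not send) a multipart POST that uploads one document."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        # The file is sent in the form field named "document".
        f'Content-Disposition: form-data; name="document"; filename="{filename}"\r\n'.encode(),
        f"Content-Type: {mime}\r\n\r\n".encode(),
        content,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return Request(
        url=f"{base_url.rstrip('/')}/api/documents/post_document/",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# Placeholder values; sending is left to the caller via urllib.request.urlopen(request).
request = build_upload_request(
    "https://paperless.example.com", "your-api-token", "scan.pdf", b"%PDF-1.4"
)
print(request.get_method(), request.full_url)
```

Uploading through the API places the file in the same processing queue as the consumption directory, so OCR and classification behave the same way.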
Scanning from Your Phone
Any scanning app that can save to a folder works with Paperless-ngx. The workflow:
- Scan the document with your phone's camera
- Save the PDF to a folder synced to your server (via Nextcloud, Syncthing, or similar)
- That folder is the Paperless-ngx consumption directory
- Document appears in Paperless-ngx within minutes
For Android, OpenScan or Office Lens work well. For iOS, the built-in document scanner (in Files or Notes) produces excellent scans.
Reverse Proxy Setup
For remote access with HTTPS, put Paperless-ngx behind a reverse proxy. With Caddy:
paperless.yourdomain.com {
    reverse_proxy paperless:8000
}
With Nginx:
server {
    listen 443 ssl http2;
    server_name paperless.yourdomain.com;
    client_max_body_size 100M; # Allow large document uploads
    ssl_certificate /etc/letsencrypt/live/paperless.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/paperless.yourdomain.com/privkey.pem;
    location / {
        proxy_pass http://localhost:8000;
        # WebSocket support, used by the web UI for live task status updates
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
Note the client_max_body_size directive — without it, Nginx will reject uploads larger than 1MB.
Is Paperless-ngx Worth It?
If you regularly deal with physical or digital documents (and who doesn't?), Paperless-ngx is one of the most immediately useful self-hosted services you can run. The setup takes about an hour. The initial classification effort takes maybe two hours. After that, the ongoing maintenance is measured in minutes per month.
The payoff comes the first time you need to find a specific document and it takes 5 seconds instead of 20 minutes. Or when tax season arrives and every deductible receipt is already tagged and searchable. Or when you need to reference an insurance policy and it's right there, with the exact clause highlighted by the full-text search.
It's the kind of tool that makes you wonder how you managed without it.
