
Paperless-ngx Complete Setup Guide: Docker, OCR, Tagging, and Workflow Automation

Productivity · 2026-02-15 · 9 min read
Tags: paperless-ngx, document-management, ocr, docker, organization, productivity
By Selfhosted Guides Editorial Team
Self-hosting practitioners covering open source software, home lab infrastructure, and data sovereignty.

Somewhere in your house, there's a drawer. Maybe a filing cabinet. Maybe a cardboard box in a closet. It's full of documents you might need someday: tax receipts, insurance papers, warranty cards, medical records, that letter from your bank about account changes. Finding any specific document in there takes anywhere from 5 minutes to "I'll just request a new copy."

Paperless-ngx replaces that drawer with a searchable, organized, automatically tagged digital archive. Scan a document, drop it in a folder, and Paperless-ngx will OCR it, classify it, tag it, detect the date, identify the sender, and file it. Months later, you type "property tax 2024" into the search bar and find it in two seconds.

This guide walks through a complete Paperless-ngx setup: Docker installation, OCR tuning, consumption directories, the tagging system, and practical workflow tips that make the difference between "I set this up once and never used it" and "this is now essential to my household."

What Paperless-ngx Actually Does

When a document enters Paperless-ngx (via file upload, email, scanner, or API), it goes through a pipeline:

  1. File detection — Identifies the file type (PDF, image, Office document)
  2. OCR processing — Runs Tesseract OCR on images and scanned PDFs, creating a searchable text layer
  3. Date detection — Scans the text for dates and picks the most likely document date
  4. Correspondent matching — Identifies who sent the document (bank, utility company, employer)
  5. Document type classification — Categorizes it (invoice, receipt, letter, contract)
  6. Tag suggestion — Machine learning suggests tags based on content similarity to previously tagged documents
  7. Storage — Archives the original file plus a searchable PDF/A version
  8. Indexing — Adds the full text to the search index

The result: every document you add becomes searchable by content, date, correspondent, type, or tag. Search stays fast: even with thousands of documents, results come back almost immediately.
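To make the date detection step more concrete, here is a minimal sketch of the idea. It is not Paperless-ngx's actual parser, which understands many more formats; the helper name `first_plausible_date` and the restriction to ISO-style dates are assumptions made for illustration.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical helper for illustration only; the real parser handles many more formats.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def first_plausible_date(text: str, today: date) -> Optional[date]:
    """Return the first ISO-formatted date in the text that is valid and not in the future."""
    for match in ISO_DATE.finditer(text):
        year, month, day = (int(part) for part in match.groups())
        try:
            candidate = date(year, month, day)
        except ValueError:
            continue  # skip strings like 2024-13-45 that only look like dates
        if candidate <= today:
            return candidate
    return None
```

The practical takeaway is that the first sensible date in the OCR text usually wins, so a letter that quotes an older date near the top may be filed under that date and need a manual fix.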

Docker Compose Installation

Here's a production-ready Docker Compose configuration. This uses PostgreSQL (better performance than SQLite for large collections) and Redis (required for background task processing):

# docker-compose.yml
services:
  broker:
    image: redis:7-alpine
    container_name: paperless-redis
    restart: unless-stopped
    volumes:
      - redis_data:/data

  db:
    image: postgres:16-alpine
    container_name: paperless-db
    restart: unless-stopped
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: your-secure-db-password
    volumes:
      - pgdata:/var/lib/postgresql/data

  webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    container_name: paperless
    restart: unless-stopped
    depends_on:
      - db
      - broker
    ports:
      - "8000:8000"
    environment:
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_DBNAME: paperless
      PAPERLESS_DBUSER: paperless
      PAPERLESS_DBPASS: your-secure-db-password
      PAPERLESS_SECRET_KEY: generate-a-random-64-char-string-here
      PAPERLESS_URL: https://paperless.yourdomain.com
      PAPERLESS_TIME_ZONE: America/New_York
      PAPERLESS_OCR_LANGUAGE: eng
      PAPERLESS_ADMIN_USER: admin
      PAPERLESS_ADMIN_PASSWORD: your-admin-password
      USERMAP_UID: 1000
      USERMAP_GID: 1000
    volumes:
      - data:/usr/src/paperless/data
      - media:/usr/src/paperless/media
      - export:/usr/src/paperless/export
      - ./consume:/usr/src/paperless/consume

volumes:
  redis_data:
  pgdata:
  data:
  media:
  export:

Start everything:

docker compose up -d

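Wait about 30 seconds for the database to initialize and the web server to start. Then visit http://your-server:8000 and log in with the admin credentials you set in the environment variables. If the login fails with a CSRF or "Bad Request" error, check PAPERLESS_URL: it should match the address you actually type into the browser, so point it at http://your-server:8000 until the reverse proxy described later is in place.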

Generating a Secret Key

The PAPERLESS_SECRET_KEY must be a random string. Generate one with:

python3 -c 'import secrets; print(secrets.token_hex(32))'

Or with OpenSSL:

openssl rand -hex 32

OCR Configuration

Paperless-ngx uses Tesseract for OCR. The default configuration handles English documents well, but there are several settings worth tuning.

Multi-Language OCR

If you receive documents in multiple languages, configure Tesseract to recognize them:

environment:
  PAPERLESS_OCR_LANGUAGE: eng+deu+fra  # English, German, French

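The official Docker image bundles only a handful of language packs (English, German, French, Spanish, and Italian at the time of writing). To install others when the container starts, list them in PAPERLESS_OCR_LANGUAGES, for example PAPERLESS_OCR_LANGUAGES: por nld jpn chi-sim. The codes are Tesseract's: eng (English), deu (German), fra (French), spa (Spanish), ita (Italian), por (Portuguese), nld (Dutch), jpn (Japanese), chi_sim (Simplified Chinese). Note the spelling difference: PAPERLESS_OCR_LANGUAGES takes the package name with a hyphen (chi-sim), while PAPERLESS_OCR_LANGUAGE takes the Tesseract code with an underscore (chi_sim).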

OCR Mode

Paperless-ngx has several OCR modes:

environment:
  PAPERLESS_OCR_MODE: skip  # Default
  # Options:
  # skip           : OCR only pages that have no text layer yet
  # redo           : OCR every page and replace existing text layers
  # force          : Rasterize every page and OCR it, discarding the original text
  # skip_noarchive : Skip OCR if text exists and don't create an archive version
  #                  (deprecated in newer releases; see PAPERLESS_OCR_SKIP_ARCHIVE_FILE)

For most users, skip is the best option. It avoids re-processing PDFs that already have embedded text (like digitally-generated bank statements) while still OCR-ing scanned documents.

OCR Output Type

environment:
  PAPERLESS_OCR_OUTPUT_TYPE: pdfa  # Default
  # pdfa     — PDF/A format (archival standard, recommended)
  # pdf      — Standard PDF
  # pdfa-1   — PDF/A-1b specifically
  # pdfa-2   — PDF/A-2b specifically

PDF/A is the archival standard — it ensures your documents remain readable decades from now. Stick with the default.

Image DPI for OCR

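If your scanner produces image files (JPEG, PNG, TIFF) without embedded resolution metadata, tell Paperless-ngx which DPI to assume. The setting is a fallback and is ignored when the image carries its own DPI information:

environment:
  PAPERLESS_OCR_IMAGE_DPI: 300  # Fallback only; default: calculated automatically

300 DPI is the sweet spot for scanning text documents. Going higher rarely improves OCR accuracy and significantly increases processing time.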
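To put a rough number on the cost of higher resolutions, the sketch below counts the pixels in a full A4 page at a given DPI. The function name `scan_pixels` is made up for this example, and pixel count is only a proxy for OCR time, not a measurement of it.

```python
from typing import Tuple

A4_INCHES = (8.27, 11.69)  # A4 page size in inches (210 mm x 297 mm)


def scan_pixels(dpi: int, page_inches: Tuple[float, float] = A4_INCHES) -> int:
    """Approximate number of pixels in a full-page scan at the given DPI."""
    width, height = page_inches
    return round(width * dpi) * round(height * dpi)


# Doubling the DPI quadruples the number of pixels the OCR engine has to look at.
ratio = scan_pixels(600) / scan_pixels(300)
```

A 600 DPI scan carries four times the pixels of a 300 DPI scan of the same page, which is why processing time climbs so quickly.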


Consumption Directories

The consumption directory is where you drop files for Paperless-ngx to ingest. In our Docker Compose setup, it's mapped to ./consume on the host. Any file you place in this directory gets automatically processed and added to the archive.
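If a script of your own feeds the consumption directory, it helps to avoid exposing half-written files to the consumer. The sketch below copies the file into a staging directory first and then moves it into place. The function name and the staging directory are assumptions for illustration, and the final move is only atomic when both directories live on the same filesystem.

```python
import os
import shutil
from pathlib import Path


def drop_into_consume(source: Path, consume_dir: Path, staging_dir: Path) -> Path:
    """Copy a file into the consume directory without exposing a partial file."""
    staging_dir.mkdir(parents=True, exist_ok=True)
    consume_dir.mkdir(parents=True, exist_ok=True)

    staged = staging_dir / source.name
    shutil.copy2(source, staged)  # the slow copy happens outside consume/

    target = consume_dir / source.name
    os.replace(staged, target)  # a rename, atomic on a single filesystem
    return target
```

Paperless-ngx has its own safeguards against reading files that are still being written, so treat this as a belt-and-braces measure for slow network copies rather than a requirement.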

Setting Up a Network Scanner

Most modern scanners (Brother, Fujitsu ScanSnap, Epson) support "scan to folder" over SMB/CIFS or FTP. Point your scanner at the consumption directory:

  1. Share the consume directory via Samba:
# /etc/samba/smb.conf
[paperless-consume]
  path = /path/to/consume
  writable = yes
  valid users = scanner
  create mask = 0664
  directory mask = 0775
  2. Configure your scanner to save to \\your-server\paperless-consume

Every scan now automatically flows into Paperless-ngx.

Email Consumption

Paperless-ngx can fetch documents from an email inbox automatically. Configure an email account that receives your bills and statements:

environment:
  PAPERLESS_EMAIL_TASK_CRON: "*/10 * * * *"  # Check every 10 minutes

Then in the Paperless-ngx admin panel (Settings > Mail), add a mail account with the IMAP server, port, and login credentials for that inbox.

Create a mail rule specifying which attachments to consume (PDFs, images) and what to do with processed emails (mark as read, move to folder, delete).

This is particularly powerful if you set up email forwarding: configure your bank, utility companies, and insurance providers to email statements to a dedicated address that Paperless-ngx monitors.

Subdirectory Consumption

You can use subdirectories within the consumption folder to automatically assign tags:

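environment:
  PAPERLESS_CONSUMER_RECURSIVE: "true"  # Required, otherwise subdirectories are not scanned
  PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS: "true"

With this enabled, every subdirectory name on the way to a file becomes a tag: a scan saved to consume/taxes/2024/receipt.pdf is tagged taxes and 2024, and tags that don't exist yet are created automatically.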

This is handy if you have household members who scan documents but don't want to interact with the Paperless-ngx web interface — they just drop files in the right folder.
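The mapping from folder to tags can be sketched in a few lines. This mirrors the documented behavior rather than reproducing Paperless-ngx's own code; the function name `tags_from_path` and the example paths are hypothetical.

```python
from pathlib import PurePosixPath
from typing import List


def tags_from_path(file_path: str, consume_root: str) -> List[str]:
    """Return the subdirectory names between the consume root and the file."""
    relative = PurePosixPath(file_path).relative_to(consume_root)
    return list(relative.parts[:-1])  # drop the file name, keep the folders


example_tags = tags_from_path("/consume/taxes/2024/receipt.pdf", "/consume")
```

A file dropped directly into the root of the consumption directory gets no tags from this mechanism, so it falls back to the normal matching rules.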

The Tagging System

Paperless-ngx has four classification axes, each serving a different purpose:

Correspondents

A correspondent is who sent or created the document. Examples: "Bank of America," "State Farm Insurance," "City Water Department," "Dr. Smith's Office."

Paperless-ngx learns correspondents over time. After you manually assign a correspondent to a few documents from the same sender, the ML system starts suggesting it automatically for new documents with similar content.

Document Types

Document types categorize what the document is. Examples: "Invoice," "Receipt," "Contract," "Medical Record," "Tax Form," "Warranty Card," "Insurance Policy."

Keep document types broad. You don't need "Electric Bill" and "Water Bill" as separate types — "Invoice" covers both. Use tags for granularity.

Tags

Tags are the flexible classification layer. Unlike correspondents and document types (which are one-per-document), a document can have multiple tags. A receipt, for example, might carry both a tax-deductible tag and a tag for the tax year.

Storage Paths

Storage paths control how Paperless-ngx organizes the archived files on disk. By default, everything goes into a flat structure. With storage paths, you can create hierarchical filing:

archive/
  taxes/
    2024/
      W2-employer.pdf
      1099-bank.pdf
    2025/
      W2-employer.pdf
  insurance/
    auto-policy-2024.pdf
    home-policy-2024.pdf
  medical/
    lab-results-2024-03.pdf

Configure storage path templates in the admin panel. A typical template:

{correspondent}/{document_type}/{created_year}/{title}
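To see what such a template expands to, the sketch below fills it in with Python's `str.format`. This stands in for Paperless-ngx's own renderer, which additionally sanitizes names and appends the file extension; the field values are invented for the example.

```python
TEMPLATE = "{correspondent}/{document_type}/{created_year}/{title}"


def render_storage_path(template: str, **fields: str) -> str:
    """Fill a storage path template with document metadata."""
    return template.format(**fields)


example_path = render_storage_path(
    TEMPLATE,
    correspondent="State Farm Insurance",
    document_type="Insurance Policy",
    created_year="2024",
    title="auto-policy-2024",
)
```

Putting the correspondent first groups everything from one sender together on disk; swapping the order of the placeholders changes the folder hierarchy without touching the metadata.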

Automatic Matching

Paperless-ngx supports several matching algorithms for auto-assigning correspondents, types, and tags:

  1. Exact match — Document content contains the exact string
  2. Regular expression — Content matches a regex pattern
  3. Fuzzy match — Content approximately matches (handles OCR errors)
  4. Auto (ML) — Machine learning based on previously classified documents

For correspondents, regex matching works well. For example, match "Bank of America" with the pattern (?i)bank\s+of\s+america|bofa|boa\s+statement. The (?i) makes it case-insensitive.
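It is worth testing a pattern against a few sample strings before saving it as a matching rule. The check below uses Python's `re` module, which is an assumption about the regex dialect being close enough to test with; the sample strings are made up.

```python
import re

# The pattern from the text; (?i) at the start makes every alternative case-insensitive.
PATTERN = re.compile(r"(?i)bank\s+of\s+america|bofa|boa\s+statement")

samples = [
    "BANK OF  AMERICA, N.A.",        # extra whitespace from OCR still matches
    "Your BofA checking summary",
    "boa constrictor care sheet",    # "boa" alone is not enough
    "City Water Department",
]

results = {text: bool(PATTERN.search(text)) for text in samples}
for text, matched in results.items():
    print(f"{matched!s:5}  {text}")
```

Requiring "statement" after "boa" is what keeps the third sample from matching; loosening the pattern to a bare "boa" would start catching unrelated documents.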

For tags, the ML auto-matching is surprisingly accurate after you've manually tagged about 20-30 documents. The more documents you correctly tag, the better the suggestions become.

Training the ML System

When you first set up Paperless-ngx, spend 30 minutes manually classifying your first batch of documents:

  1. Upload or scan 30-50 documents
  2. For each, set the correspondent, document type, and relevant tags
  3. After classifying this initial batch, enable auto-matching (ML) on your correspondents, types, and tags

From this point forward, Paperless-ngx will suggest classifications for new documents. Accept correct suggestions and fix incorrect ones — the system learns from corrections.

Practical Workflow

Here's a daily workflow that keeps your document archive current without becoming a chore:

Incoming Mail

When physical mail arrives:

  1. Open it
  2. Decide if you need to keep it (most mail is junk)
  3. If keeping: scan it with your phone (using the Paperless-ngx mobile app or any scanning app that saves to a folder), or feed it through a desktop scanner
  4. The document appears in Paperless-ngx within minutes
  5. Verify the auto-classification is correct
  6. Recycle the paper original (unless you need the original for legal purposes)

Digital Documents

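For documents that arrive by email (statements, receipts, confirmations), forward the message to the mailbox Paperless-ngx monitors, or save the attachment into the consumption directory. Either way, the document goes through the same pipeline as a scan.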

Monthly Review

Once a month, spend 10 minutes:

  1. Check the Paperless-ngx inbox for any unclassified documents
  2. Review auto-assigned tags for accuracy
  3. Update any correspondents that weren't recognized
  4. Create new tags or document types if a pattern has emerged

Backup and Restore

Paperless-ngx stores data in three places that need backup:

  1. PostgreSQL database: All metadata, including tags, correspondents, document types, and the extracted text
  2. Media directory: Original documents, OCR'd archive versions, and thumbnails
  3. Data directory: The search index, the trained classification model, and other application state

Built-in Export

Paperless-ngx has a built-in export function:

docker compose exec webserver document_exporter ../export

This creates a manifest file plus all original documents — a portable backup you can import into a fresh installation.

Database Backup

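For a quicker backup that captures only the metadata, dump the database separately. This is a full dump each time, not an incremental one. The -T flag turns off TTY allocation so the redirected output is written cleanly:

docker compose exec -T db pg_dump -U paperless paperless > paperless-backup-$(date +%Y%m%d).sql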

Include this dump plus the media volume in your regular Restic/Borg backup.
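If you would rather drive the dump from a scheduled script than from a shell one-liner, a minimal wrapper could look like this. The function names and the target directory are assumptions; the service, user, and database names are the ones from the Compose file above.

```python
import subprocess
from datetime import date
from pathlib import Path
from typing import List


def dump_filename(day: date) -> str:
    """Build a dated file name such as paperless-backup-20260215.sql."""
    return f"paperless-backup-{day:%Y%m%d}.sql"


def dump_command() -> List[str]:
    # -T disables TTY allocation so the captured output is not altered.
    return ["docker", "compose", "exec", "-T", "db",
            "pg_dump", "-U", "paperless", "paperless"]


def run_backup(target_dir: Path, day: date) -> Path:
    """Run pg_dump inside the db container and write the output to target_dir."""
    target = target_dir / dump_filename(day)
    with target.open("wb") as out:
        # check=True raises CalledProcessError if pg_dump exits with an error.
        subprocess.run(dump_command(), stdout=out, check=True)
    return target
```

Call `run_backup(Path("/backups"), date.today())` from the directory that contains docker-compose.yml. If the command fails, the partially written file is left behind, so a real script should delete it before exiting.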

Restore from Export

docker compose exec webserver document_importer ../export

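This recreates all documents, metadata, tags, and classifications from a previous export. Run it against a fresh, empty installation so the imported records don't collide with existing ones.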

Performance Tuning

Worker Processes

If documents are processing slowly, increase the number of worker processes:

environment:
  PAPERLESS_TASK_WORKERS: 2  # Default: 1
  PAPERLESS_THREADS_PER_WORKER: 2  # Default: derived from the CPU core count

Each worker processes one document at a time, so two workers handle two documents in parallel, and each worker can use several threads for OCR. Keep workers multiplied by threads at or below your CPU core count.
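One way to sanity-check a combination of the two settings is to compare workers times threads against the number of cores. The default-threads formula below follows our reading of the Paperless-ngx configuration reference and should be treated as an assumption; the function names are invented for this sketch.

```python
def default_threads_per_worker(cpu_count: int, task_workers: int) -> int:
    """Threads each worker gets when PAPERLESS_THREADS_PER_WORKER is left unset."""
    return max(cpu_count // task_workers, 1)


def fits_on_cpu(cpu_count: int, task_workers: int, threads_per_worker: int) -> bool:
    """True when the combined thread count does not oversubscribe the CPU."""
    return task_workers * threads_per_worker <= cpu_count
```

On a four-core machine, two workers with two threads each fit exactly, while four workers with two threads each would oversubscribe the CPU and tend to slow everything down.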

Thumbnail Generation

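Thumbnails are generated once, when a document is consumed, and the web server then serves them in the grid view. If the grid loads slowly on a large archive, adding web server workers can help, at the cost of extra memory per worker: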

environment:
  PAPERLESS_WEBSERVER_WORKERS: 2  # Default: 1

Search Optimization

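Paperless-ngx uses Whoosh for full-text search. It handles typical household archives comfortably, though queries can slow down on very large archives (10,000+ documents). If search feels slow or returns stale results, rebuild or optimize the index with the document_index management command (document_index reindex or document_index optimize). We could not confirm a PAPERLESS_SEARCH_BACKEND option in the configuration reference, so check the documentation for your version before relying on it.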

Mobile Access

Paperless Mobile App

The community-built Paperless Mobile app for Android provides a native interface for browsing, searching, and uploading documents; on iOS, the community app Swift Paperless fills the same role. Both connect to your Paperless-ngx instance via the REST API.

To use it, you'll need:

  1. Your Paperless-ngx URL accessible from outside your home network (via reverse proxy, VPN, or Cloudflare Tunnel)
  2. Your username and password
  3. API access enabled (it's on by default)
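The same REST API is available to your own scripts. As a minimal sketch, the function below builds an authenticated full-text search request using only the standard library. The function name is hypothetical and the token is a placeholder; the /api/documents/ endpoint with a query parameter and the Token authorization header follow the Paperless-ngx API documentation as we understand it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_search_request(base_url: str, token: str, query: str) -> Request:
    """Build a GET request for a full-text search against the documents endpoint."""
    url = f"{base_url.rstrip('/')}/api/documents/?{urlencode({'query': query})}"
    headers = {
        "Authorization": f"Token {token}",  # API token created in the web UI
        "Accept": "application/json",
    }
    return Request(url, headers=headers)


request = build_search_request(
    "https://paperless.yourdomain.com", "your-api-token", "property tax 2024"
)
```

Pass the request to `urllib.request.urlopen` and parse the response body with `json.load` to get the matching documents. A token is preferable to a username and password here because it can be revoked without changing your login.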

Scanning from Your Phone

Any scanning app that can save to a folder works with Paperless-ngx. The workflow:

  1. Scan the document with your phone's camera
  2. Save the PDF to a folder synced to your server (via Nextcloud, Syncthing, or similar)
  3. That folder is the Paperless-ngx consumption directory
  4. Document appears in Paperless-ngx within minutes

For Android, OpenScan or Office Lens work well. For iOS, the built-in document scanner (in Files or Notes) produces excellent scans.

Reverse Proxy Setup

For remote access with HTTPS, put Paperless-ngx behind a reverse proxy. With Caddy:

paperless.yourdomain.com {
    reverse_proxy paperless:8000
}

With Nginx:

server {
    listen 443 ssl http2;
    server_name paperless.yourdomain.com;

    client_max_body_size 100M;  # Allow large document uploads

    ssl_certificate /etc/letsencrypt/live/paperless.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/paperless.yourdomain.com/privkey.pem;

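    location / {
        proxy_pass http://localhost:8000;

        # WebSocket support, used by the web UI for live processing status
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";

        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }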
}

Note the client_max_body_size directive — without it, Nginx will reject uploads larger than 1MB.

Is Paperless-ngx Worth It?

If you regularly deal with physical or digital documents (and who doesn't?), Paperless-ngx is one of the most immediately useful self-hosted services you can run. The setup takes about an hour. The initial classification effort takes maybe two hours. After that, the ongoing maintenance is measured in minutes per month.

The payoff comes the first time you need to find a specific document and it takes 5 seconds instead of 20 minutes. Or when tax season arrives and every deductible receipt is already tagged and searchable. Or when you need to reference an insurance policy and it's right there, with the exact clause highlighted by the full-text search.

It's the kind of tool that makes you wonder how you managed without it.
