Paperless-NGX: Complete Guide to Self-Hosted Document Management

Productivity · 2026-03-04 · 4 min read
Tags: paperless-ngx, document management, ocr, self-hosted, docker, scanning, open-source, paperless
By Selfhosted Guides Editorial Team — Self-hosting practitioners covering open source software, home lab infrastructure, and data sovereignty.

Paper accumulates. Tax documents, insurance policies, medical records, receipts, warranty cards: managing these physically is tedious, and finding anything later is worse. Paperless-NGX is a self-hosted document management system that ingests your scanned documents, OCRs them into searchable text, and tags and organizes them automatically.


What Paperless-NGX Does

OCR: Converts scanned images and image-based PDFs into searchable text using Tesseract
Full-text search: Search across all document content, not just titles
Automatic tagging: Define rules to tag documents based on content, correspondent, or file metadata
Correspondent tracking: Group documents by sender/source (USPS, IRS, insurance company)
Document types: Categorize by type (invoice, statement, contract, receipt)
Storage backend: Documents stored on your filesystem in an organized structure
API: Full REST API for integrating with other tools and automation

It is a community fork of Paperless-ng, which was itself a fork of the original Paperless project, and it is the actively maintained version of that lineage.

Docker Compose Deployment

Paperless-NGX needs a database (PostgreSQL), Redis (as the task queue broker), and optionally Apache Tika plus Gotenberg (for Office document conversion):

services:
  broker:
    image: redis:7-alpine
    restart: unless-stopped

  db:
    image: postgres:15
    restart: unless-stopped
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: paperless-db-password
    volumes:
      - pgdata:/var/lib/postgresql/data

  webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    restart: unless-stopped
    depends_on:
      - db
      - broker
    ports:
      - "8000:8000"
    volumes:
      - data:/usr/src/paperless/data
      - media:/usr/src/paperless/media
      - ./export:/usr/src/paperless/export
      - ./consume:/usr/src/paperless/consume  # Incoming document folder
    environment:
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_DBPASS: paperless-db-password
      PAPERLESS_SECRET_KEY: change-this-to-a-random-string
      PAPERLESS_OCR_LANGUAGE: eng
      PAPERLESS_TIME_ZONE: America/Los_Angeles
      PAPERLESS_URL: https://paperless.yourdomain.com
      PAPERLESS_ADMIN_USER: admin
      PAPERLESS_ADMIN_PASSWORD: initial-admin-password

volumes:
  pgdata:
  data:
  media:

Start with docker compose up -d. On first run, the database initializes. Navigate to http://your-server:8000 and log in with the admin credentials you set.
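The compose file above uses placeholder values for the secret key and the passwords. One way to replace them is to generate random strings with Python's standard library. This is a minimal sketch: the helper name make_secret and the chosen lengths are illustrative assumptions, not anything Paperless-NGX requires.

```python
import secrets


def make_secret(num_bytes: int = 48) -> str:
    """Return a URL-safe random string built from num_bytes of randomness."""
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    # The variable names match the placeholders in the compose file above.
    print(f"PAPERLESS_SECRET_KEY: {make_secret()}")
    print(f"POSTGRES_PASSWORD and PAPERLESS_DBPASS: {make_secret(24)}")
    print(f"PAPERLESS_ADMIN_PASSWORD: {make_secret(18)}")
```

Paste the generated values into the compose file (quoting them is a safe habit in YAML), and use the same value for POSTGRES_PASSWORD and PAPERLESS_DBPASS so the webserver can reach the database.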

Document Ingestion

Paperless-NGX watches a consume directory (./consume in the compose above). Files placed there are automatically processed.

Methods for getting documents in:

1. Watch folder: Mount a network share or local path as the consume directory. Scanners with a "scan to folder" feature work well here.

2. Email: Configure Paperless to poll an email address for attachments:

environment:
  PAPERLESS_EMAIL_TASK_CRON: "*/10 * * * *"  # Check every 10 minutes

Then set up mail accounts and mail rules in the UI. Recent versions have a dedicated Mail page in the sidebar; older versions keep it under Settings → Mail.

3. Web upload: Drag and drop through the web interface.

4. Mobile app: Several third-party iOS/Android apps support uploading to Paperless via its API.

5. API: POST a file to /api/documents/post_document/.
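As a rough illustration of the API route, the sketch below builds a multipart upload for /api/documents/post_document/ using only the Python standard library. It assumes token authentication (an "Authorization: Token ..." header) and a form field named document, with an optional title field. The function names, server URL, and token are placeholders to adapt, and the details are worth checking against the API documentation for your version.

```python
import mimetypes
import urllib.request
import uuid
from pathlib import Path
from typing import Optional


def build_upload_request(base_url: str, token: str, file_path: Path,
                         title: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST for one document."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(file_path.name)[0] or "application/octet-stream"

    parts = []
    if title is not None:
        # Optional metadata field; assumed to be accepted alongside the file.
        parts.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="title"\r\n\r\n'
             f"{title}\r\n").encode()
        )
    # The file itself goes in the form field named "document".
    parts.append(
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="document"; '
         f'filename="{file_path.name}"\r\n'
         f"Content-Type: {mime}\r\n\r\n").encode()
        + file_path.read_bytes()
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())

    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/documents/post_document/",
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def upload(base_url: str, token: str, file_path: Path) -> str:
    """Send the upload and return the raw response body."""
    request = build_upload_request(base_url, token, file_path)
    with urllib.request.urlopen(request) as response:
        return response.read().decode()
```

Calling upload("http://your-server:8000", "YOUR_API_TOKEN", Path("scan.pdf")) queues the file for consumption. OCR and classification then run in the background, as they do for files dropped into the consume folder.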


After OCR: Automatic Classification

After Paperless ingests a document, it OCRs the text and can automatically apply tags, correspondents, and document types.

Navigate to Correspondents (under Manage in the sidebar) to create correspondents (entities that send you documents):

Name: "Internal Revenue Service"
Matching Algorithm: "Regular expression" (the pattern below uses | for alternatives; "Any word" expects space-separated words instead)
Match: "Internal Revenue Service|IRS|Department of Treasury"

Create Document Types:

Name: "Tax Document"
Match: "1099|W-2|1040|Schedule [A-Z]"
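Match patterns like the ones above are worth testing before you rely on them. The sketch below approximates a case-insensitive "Regular expression" match with Python's re module. It is a local stand-in, not Paperless-NGX's own matching code, and the sample texts are invented. It shows one pitfall: a bare IRS alternative also matches inside ordinary words.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Case-insensitive regex search, approximating a regular-expression match rule."""
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


# Pattern as written in the correspondent example above.
loose = "Internal Revenue Service|IRS|Department of Treasury"
# Same pattern with word boundaries around the short acronym.
strict = r"Internal Revenue Service|\bIRS\b|Department of Treasury"

letter = "Notice from the IRS regarding your 2023 return"
invoice = "This is your first invoice from Acme Plumbing"

print(matches(loose, letter))    # True
print(matches(loose, invoice))   # True: "irs" occurs inside "first"
print(matches(strict, letter))   # True
print(matches(strict, invoice))  # False
```

Wrapping short acronyms in \b word boundaries keeps an unrelated invoice from being filed under the IRS correspondent.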

Create Tags:

"to-do" — documents needing action
"medical" — health-related
"financial" — financial statements
"insurance" — insurance documents

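Workflows (Manage → Workflows in Paperless-NGX 2.x) tie it together. Each workflow pairs a trigger with one or more actions, for example:

Trigger: document added, correspondent = IRS → Action: assign tag "tax"
Trigger: document added, content matches "EOB" → Action: assign type "Explanation of Benefits" and tag "medical"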

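Tags, correspondents, and document types set to the "Auto" matching algorithm use a classifier trained on the documents already in your archive. It is retrained periodically, so your manual corrections improve later assignments.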

Workflow: Physical Document to Searchable Archive

1. Scan: Use a document scanner (Brother, Fujitsu, Canon) with "scan to folder", which drops a PDF into your consume folder
2. Paperless processes: OCR runs, and the classifier assigns tags, type, and correspondent
3. Review (optional): Log into the web UI and verify or correct the classification
4. Shred the original: Only once the document is verified in Paperless and included in a backup

For occasional single pages: phone scanner apps (Microsoft Lens, Adobe Scan) can email or upload directly.

Search and Retrieval

Paperless's search is one of its best features. It uses full-text search across the OCR'd content of all documents:

Search insurance premium 2024 → finds documents whose content contains all three terms
Filter by tag and date: tag:medical created:[2023 to 2024]
Search by correspondent: correspondent:chase statements

The web interface supports faceted filtering: filter by date range, correspondent, document type, tag, or combine them.
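The same full-text search is available over the REST API, which is handy for scripts. The sketch below assumes the documents endpoint accepts a query parameter and returns a paginated JSON object with a results list whose entries carry a title. The function names are illustrative, and the token and server URL are placeholders.

```python
import json
import urllib.parse
import urllib.request


def build_search_url(base_url: str, query: str) -> str:
    """Return the documents endpoint URL with the search query URL-encoded."""
    params = urllib.parse.urlencode({"query": query})
    return f"{base_url.rstrip('/')}/api/documents/?{params}"


def search_titles(base_url: str, token: str, query: str) -> list:
    """Run a search and return the titles on the first page of results."""
    request = urllib.request.Request(
        build_search_url(base_url, query),
        headers={
            "Authorization": f"Token {token}",
            "Accept": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # Assumed response shape: {"count": ..., "results": [{"title": ...}, ...]}
    return [doc["title"] for doc in payload["results"]]
```

For example, search_titles("http://your-server:8000", "YOUR_API_TOKEN", "tag:medical insurance premium") sends the same query you would type into the search bar.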

Storage and Backup Structure

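Paperless stores original files under media/documents/originals/. By default they are named by document ID in a single folder; with PAPERLESS_FILENAME_FORMAT set, they land in an organized directory tree such as: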

media/documents/originals/
  2024/
    01/
      document-001.pdf
      document-002.pdf
    02/
      ...

The filename pattern is configurable through PAPERLESS_FILENAME_FORMAT. You can include the title, correspondent, or date in the path and filename.

Backup strategy:

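Back up the PostgreSQL database (stores metadata, tags, correspondents)
Back up the media volume (original files, archived versions, thumbnails)
Back up the data volume if convenient (search index and classifier model, both of which can be rebuilt)
The consume folder doesn't need backup. The export folder only matters if you use the built-in document_exporter, whose output is itself a portable backup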

Consider syncing media to cloud storage (Backblaze B2, S3) via rclone.
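The backup steps can be scripted. The sketch below only assembles the shell commands so you can review them before running anything. It assumes the service names (db, webserver) and database credentials from the compose file above. The backup directory, the function name, and the rclone remote name b2-backup are hypothetical placeholders.

```python
import shlex
from datetime import date
from typing import Optional


def backup_commands(backup_dir: str, remote: str = "b2-backup:paperless",
                    today: Optional[date] = None) -> list:
    """Return shell commands for a database dump, a document export, and an offsite sync."""
    stamp = (today or date.today()).isoformat()
    dump_file = shlex.quote(f"{backup_dir}/paperless-db-{stamp}.sql")
    return [
        # Plain-SQL dump of the metadata database; -T disables TTY allocation.
        f"docker compose exec -T db pg_dump -U paperless paperless > {dump_file}",
        # Built-in exporter; writes documents plus a manifest to the export mount.
        "docker compose exec -T webserver document_exporter ../export",
        # Offsite copy of the export folder (remote name is an assumption).
        f"rclone sync ./export {shlex.quote(remote)}",
    ]


if __name__ == "__main__":
    for command in backup_commands("/srv/backups"):
        print(command)
```

Run the printed commands from the directory that holds the compose file, for example from a nightly cron job, and test a restore at least once before shredding originals.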

Multi-Language OCR

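The Docker image ships with Tesseract language packs for English, German, French, Italian, and Spanish. PAPERLESS_OCR_LANGUAGE selects which languages OCR uses; PAPERLESS_OCR_LANGUAGES installs additional packs when the container starts:

environment:
  PAPERLESS_OCR_LANGUAGE: eng+deu  # OCR with English + German
  PAPERLESS_OCR_LANGUAGES: nld pol  # Install extra packs not bundled with the image (here Dutch and Polish)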

Adding Tika for Office Documents

Apache Tika extracts text from Word, Excel, and other Office formats, and Gotenberg converts those files to PDF for archiving. Add both under services in your compose file:

  tika:
    image: docker.io/apache/tika:latest
    restart: unless-stopped

  gotenberg:
    image: docker.io/gotenberg/gotenberg:8
    restart: unless-stopped
    command:
      - "gotenberg"
      - "--chromium-disable-javascript=true"
      - "--chromium-allow-list=file:///tmp/.*"

Then add to the webserver environment:

PAPERLESS_TIKA_ENABLED: 1
PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
PAPERLESS_TIKA_ENDPOINT: http://tika:9998

This enables processing .docx, .xlsx, .pptx, HTML, and other formats.

Resource Requirements

Minimum: 2GB RAM, 1 CPU core
Comfortable: 4GB RAM, 2 CPU cores
Storage: Depends on document volume. A decade of household documents is usually 5-20GB.

OCR is CPU-intensive. Initial bulk imports process faster on multi-core systems.

OpenProject vs. Paperless (common question)

These solve different problems. OpenProject manages ongoing work (tasks, projects, timelines). Paperless-NGX manages documents (storage, OCR, retrieval). If you have a home office, you'd likely use both: Paperless for document archiving, a project tool for tracking tasks and projects.

The repository is at paperless-ngx/paperless-ngx on GitHub and is under active development. It's one of the most polished self-hosted home tools available.