Paperless-NGX: Complete Guide to Self-Hosted Document Management
Paper accumulates. Tax documents, insurance policies, medical records, receipts, warranty cards — physically managing this is tedious and finding anything later is worse. Paperless-NGX is a self-hosted document management system that digitizes your physical documents, OCRs them into searchable text, and tags and organizes them automatically.
What Paperless-NGX Does
- OCR: Converts scanned images and image-based PDFs into searchable text using Tesseract
- Full-text search: Search across all document content, not just titles
- Automatic tagging: Define rules to tag documents based on content, correspondent, or file metadata
- Correspondent tracking: Group documents by sender/source (USPS, IRS, insurance company)
- Document types: Categorize by type (invoice, statement, contract, receipt)
- Storage backend: Documents stored on your filesystem in an organized structure
- API: Full REST API for integrating with other tools and automation
It is a community-maintained fork of Paperless-ng, which was itself a fork of the original Paperless project, and it is the most actively maintained of the three.
Docker Compose Deployment
Paperless-NGX needs a database (PostgreSQL), Redis (for task queue), and optionally Apache Tika (for Office document conversion):
services:
  broker:
    image: redis:7-alpine
    restart: unless-stopped
  db:
    image: postgres:15
    restart: unless-stopped
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: paperless-db-password
    volumes:
      - pgdata:/var/lib/postgresql/data
  webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    restart: unless-stopped
    depends_on:
      - db
      - broker
    ports:
      - 8000:8000
    volumes:
      - data:/usr/src/paperless/data
      - media:/usr/src/paperless/media
      - ./export:/usr/src/paperless/export
      - ./consume:/usr/src/paperless/consume # Incoming document folder
    environment:
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_DBPASS: paperless-db-password
      PAPERLESS_SECRET_KEY: change-this-to-a-random-string
      PAPERLESS_OCR_LANGUAGE: eng
      PAPERLESS_TIME_ZONE: America/Los_Angeles
      PAPERLESS_URL: https://paperless.yourdomain.com
      PAPERLESS_ADMIN_USER: admin
      PAPERLESS_ADMIN_PASSWORD: initial-admin-password
volumes:
  pgdata:
  data:
  media:
Start with docker compose up -d. On first run, the database initializes. Navigate to http://your-server:8000 and log in with the admin credentials you set.
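The compose file leaves PAPERLESS_SECRET_KEY as a placeholder that should be replaced with a random value. One way to produce such a value is sketched below using Python's standard library; the function name `generate_secret_key` and the 64-character length are illustrative choices, not requirements of Paperless-NGX.

```python
import secrets
import string


def generate_secret_key(length: int = 64) -> str:
    """Return a random alphanumeric string from a cryptographically secure source."""
    # Letters and digits only, so the value needs no quoting or escaping in YAML.
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


if __name__ == "__main__":
    # Paste the printed value into PAPERLESS_SECRET_KEY.
    print(generate_secret_key())
```

Restricting the alphabet to letters and digits avoids characters such as `#`, `:` or `$` that YAML or Docker Compose might interpret.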
Document Ingestion
Paperless-NGX watches a consume directory (./consume in the compose above). Files placed there are automatically processed.
Methods for getting documents in:
1. Watch folder: Mount a network share or local path to the consume directory. PDF scanners with "scan to folder" capability work perfectly.
2. Email: Configure Paperless to poll an email address for attachments:
   environment:
     PAPERLESS_EMAIL_TASK_CRON: "*/10 * * * *" # Check every 10 minutes
Then set up email accounts in the UI under Settings → Mail.
3. Web upload: Drag and drop through the web interface.
4. Mobile app: Several third-party iOS/Android apps support uploading to Paperless via its API.
5. API: POST a file to /api/documents/post_document/.
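For the API route, the upload is a multipart form POST. The sketch below builds such a request with only the standard library, assuming token authentication and that the file goes in a form field named `document` with an optional `title` field. The function name `build_upload_request`, the server URL, and the token are placeholders; check your instance's API documentation before relying on the details.

```python
import urllib.request
import uuid


def build_upload_request(base_url, token, filename, content, title=None):
    """Build (but do not send) a multipart POST for /api/documents/post_document/."""
    boundary = uuid.uuid4().hex
    body = bytearray()

    def add_field(name, value):
        # A plain text form field.
        body.extend(f"--{boundary}\r\n".encode())
        body.extend(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        body.extend(value.encode("utf-8"))
        body.extend(b"\r\n")

    if title is not None:
        add_field("title", title)

    # The file part; "document" is the assumed field name for the upload.
    body.extend(f"--{boundary}\r\n".encode())
    body.extend(
        f'Content-Disposition: form-data; name="document"; filename="{filename}"\r\n'.encode()
    )
    body.extend(b"Content-Type: application/octet-stream\r\n\r\n")
    body.extend(content)
    body.extend(f"\r\n--{boundary}--\r\n".encode())

    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/documents/post_document/",
        data=bytes(body),
        method="POST",
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

To send it, read a file into `content`, build the request, and pass it to `urllib.request.urlopen`. Building the request separately from sending it makes the headers and body easy to inspect first.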
After OCR: Automatic Classification
After Paperless ingests a document, it OCRs the text and can automatically apply tags, correspondents, and document types.
Navigate to Admin → Correspondent to create correspondents (entities that send you documents):
- Name: "Internal Revenue Service"
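- Matching Algorithm: "Regular expression" (the pipe-separated pattern in the next line is a regex; "Any word" would split it on whitespace instead of on the pipes)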
- Match: "Internal Revenue Service|IRS|Department of Treasury"
Create Document Types:
- Name: "Tax Document"
- Matching Algorithm: "Regular expression"
- Match: "1099|W-2|1040|Schedule [A-Z]"
Create Tags:
- "to-do" — documents needing action
- "medical" — health-related
- "financial" — financial statements
- "insurance" — insurance documents
Workflows (listed as Workflows in the web UI of recent releases; the exact name and menu location depend on your version) tie it together:
- If correspondent = IRS → add tag "tax"
- If content contains "EOB" → type = "Explanation of Benefits", tag = "medical"
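Any tag, correspondent, or document type whose matching algorithm is set to "Auto" is assigned by a machine-learning classifier trained on the documents you have already filed, so its results improve as you correct misclassified documents.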
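The match patterns above are worth testing before you rely on them. Paperless-NGX is written in Python, so its regular-expression matching should behave much like Python's `re` module. The sketch below is not Paperless's internal code; it assumes case-insensitive matching, and `STRICT_CORRESPONDENT_PATTERN` is a suggested variant rather than something from the original configuration.

```python
import re

# Patterns from the examples above.
CORRESPONDENT_PATTERN = r"Internal Revenue Service|IRS|Department of Treasury"
DOCUMENT_TYPE_PATTERN = r"1099|W-2|1040|Schedule [A-Z]"

# Suggested variant: word boundaries stop "IRS" from matching inside other words.
STRICT_CORRESPONDENT_PATTERN = (
    r"\b(?:Internal Revenue Service|IRS|Department of Treasury)\b"
)


def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern occurs anywhere in the text, ignoring case."""
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


if __name__ == "__main__":
    samples = [
        "Notice from the Internal Revenue Service",
        "Your first payment is due",  # contains "irs" inside "first"
        "Please schedule a call",     # "schedule a" fits "Schedule [A-Z]" when case is ignored
    ]
    for text in samples:
        print(
            text,
            matches(CORRESPONDENT_PATTERN, text),
            matches(STRICT_CORRESPONDENT_PATTERN, text),
            matches(DOCUMENT_TYPE_PATTERN, text),
        )
```

Short alternatives such as "IRS" or "1040" can match inside unrelated words or numbers when case is ignored, so anchoring them with `\b` is a cheap way to cut false positives.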
Workflow: Physical Document to Searchable Archive
- Scan: Use a document scanner (Brother, Fujitsu, Canon) with "scan to folder" → drops PDF to your consume folder
- Paperless processes: OCR runs, classifier assigns tags/type/correspondent
- Review (optional): Log into web UI and verify or correct classification
- Shred the original: Once verified in Paperless with backup
For occasional single pages: phone scanner apps (Microsoft Lens, Adobe Scan) can email or upload directly.
Search and Retrieval
Paperless's search is one of its best features. It uses full-text search across the OCR'd content of all documents:
- Search: insurance premium 2024 → finds all insurance documents mentioning premiums from 2024
- Filter by tag and date: tag:medical created:[2023 to 2024]
- Search by correspondent: correspondent:chase statements
The web interface supports faceted filtering: filter by date range, correspondent, document type, tag, or combine them.
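The same full-text search is reachable through the REST API. The sketch below only builds the URL, assuming the documents endpoint accepts the search text in a `query` parameter; the helper name `search_url` and the host are placeholders.

```python
from urllib.parse import urlencode


def search_url(base_url: str, query: str) -> str:
    """Build a full-text search URL for the documents endpoint."""
    # urlencode escapes spaces and characters such as ":" so the query survives intact.
    params = urlencode({"query": query})
    return f"{base_url.rstrip('/')}/api/documents/?{params}"


if __name__ == "__main__":
    print(search_url("http://localhost:8000", "insurance premium 2024"))
```

Sending a GET request to that URL with an `Authorization: Token ...` header should return the matching documents as JSON.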
Storage and Backup Structure
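Paperless stores original files under the media directory. By default they get sequential numeric filenames; with a filename format that includes the creation year and month, the tree looks like this:
media/documents/originals/
  2024/
    01/
      document-001.pdf
      document-002.pdf
    02/
      ...
The filename pattern is set with PAPERLESS_FILENAME_FORMAT. You can include title, correspondent, or date in the filename.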
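To illustrate how a filename pattern turns metadata into a path, here is a small sketch that expands a template with Python's `str.format`. It imitates the idea rather than Paperless's own implementation: the placeholder names mirror ones used by PAPERLESS_FILENAME_FORMAT, but `render_path` is hypothetical, and newer releases may expect a different placeholder syntax.

```python
from datetime import date


def render_path(template: str, title: str, correspondent: str, created: date) -> str:
    """Expand a filename template into a relative path ending in .pdf."""
    fields = {
        "title": title,
        "correspondent": correspondent,
        "created_year": f"{created.year:04d}",
        "created_month": f"{created.month:02d}",
        "created_day": f"{created.day:02d}",
    }
    return template.format(**fields) + ".pdf"


if __name__ == "__main__":
    print(
        render_path(
            "{created_year}/{created_month}/{correspondent}-{title}",
            title="Statement",
            correspondent="Chase",
            created=date(2024, 1, 15),
        )
    )
```

A year/month prefix keeps directories small and makes the archive browsable even without the web interface.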
Backup strategy:
- Back up the PostgreSQL database (stores metadata, tags, correspondents)
- Back up the media volume (original files)
- The consume and export volumes don't need backup
Consider syncing the media directory to cloud storage (Backblaze B2, S3) via rclone.
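As a sketch of the file half of that strategy, the snippet below packs a directory into a timestamped tar.gz archive. It assumes the media directory is reachable as an ordinary path (for example through a bind mount rather than a named Docker volume); `archive_directory` and the paths are hypothetical. It does not cover the PostgreSQL database, which needs its own dump.

```python
import tarfile
from datetime import datetime
from pathlib import Path


def archive_directory(source: Path, dest_dir: Path) -> Path:
    """Create dest_dir/<source name>-<timestamp>.tar.gz containing source."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive = dest_dir / f"{source.name}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the source folder.
        tar.add(source, arcname=source.name)
    return archive


if __name__ == "__main__":
    # Hypothetical locations; adjust to where your media directory lives.
    print(archive_directory(Path("./media"), Path("./backups")))
```

The resulting archive is a single file, which is convenient to hand to rclone or any other sync tool.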
Multi-Language OCR
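The official image bundles Tesseract data for English, German, French, Italian, and Spanish. Set PAPERLESS_OCR_LANGUAGE to the languages your documents use, and list any language that is not bundled in PAPERLESS_OCR_LANGUAGES so its pack is installed at startup:
environment:
  PAPERLESS_OCR_LANGUAGE: eng+deu # OCR with English + German
  PAPERLESS_OCR_LANGUAGES: nld pol # Install extra language packs (here Dutch and Polish)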
Adding Tika for Office Documents
Apache Tika extracts text from Word, Excel, and other Office formats, and Gotenberg converts those files to PDF. Add both under services in your compose file:
  tika:
    image: ghcr.io/paperless-ngx/tika:latest
    restart: unless-stopped
  gotenberg:
    image: docker.io/gotenberg/gotenberg:8
    restart: unless-stopped
    command:
      - "gotenberg"
      - "--chromium-disable-javascript=true"
      - "--chromium-allow-list=file:///tmp/.*"
Then add to the webserver environment:
PAPERLESS_TIKA_ENABLED: 1
PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
PAPERLESS_TIKA_ENDPOINT: http://tika:9998
This enables processing .docx, .xlsx, .pptx, HTML, and other formats.
Resource Requirements
- Minimum: 2GB RAM, 1 CPU core
- Comfortable: 4GB RAM, 2 CPU cores
- Storage: Depends on document volume. A decade of household documents is usually 5-20GB.
OCR is CPU-intensive. Initial bulk imports process faster on multi-core systems.
OpenProject vs. Paperless (common question)
These solve different problems. OpenProject manages ongoing work (tasks, projects, timelines). Paperless-NGX manages documents (storage, OCR, retrieval). If you have a home office, you'd likely use both: Paperless for document archiving, a project tool for tracking tasks and projects.
The project is actively developed in the paperless-ngx/paperless-ngx repository on GitHub. It's one of the most polished self-hosted home tools available.
