Every year the same ritual: hunting through years of email to find last year's home insurance policy, the car MOT certificate, or proof of an Amazon return. It works — barely — but it is not a system. It is a memory test.
Paperless-NGX solves this properly. It is an open-source document management system that ingests PDFs and images, runs OCR on everything, and gives you a searchable, tagged archive of your entire document history. Think of it as a personal Google Drive that you control entirely, that runs on your own hardware, and that understands what is inside every file.
This post walks through how I set it up as part of a self-hosted homelab, with automatic ingestion from Gmail and persistent storage on a TrueNAS NAS. By the end, new documents from Octopus Energy, HSBC, HMRC, and others arrive in Paperless automatically — tagged, searchable, and backed up — without any manual work.
At its core, Paperless-NGX does three things:
Ingestion — it watches for new documents via email, a consume folder, or direct API upload. When a new file arrives, it gets queued for processing.
Processing — OCR runs on every document (using Tesseract under the hood), extracting the full text. HTML emails get converted to PDFs via Gotenberg and Chromium. Complex file formats like Office documents are handled by Apache Tika.
Storage and retrieval — documents are filed by correspondent, type, and date, with full-text search across everything. Metadata lives in PostgreSQL. The actual files live wherever you mount them.
The result is that searching for "car insurance 2024" finds the exact document in under a second, regardless of which insurer you were with that year.
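The same search is available outside the web UI through the REST API. As a rough sketch, assuming a Paperless instance at a hypothetical http://paperless.local:8000 and an API token created in the UI, a full-text query against the documents endpoint could be built like this (the host, token and helper names are placeholders, not part of the setup described here):

```python
import json
import urllib.parse
import urllib.request

# Hypothetical values: replace with your own host and API token.
BASE_URL = "http://paperless.local:8000"
API_TOKEN = "replace-me"


def build_search_request(query, base_url=BASE_URL, token=API_TOKEN):
    """Build (but do not send) a full-text search request."""
    params = urllib.parse.urlencode({"query": query})
    url = f"{base_url}/api/documents/?{params}"
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Token {token}",
            "Accept": "application/json",
        },
    )


def search_titles(query):
    """Send the request and return the titles on the first page of results."""
    with urllib.request.urlopen(build_search_request(query)) as response:
        payload = json.load(response)
    return [doc["title"] for doc in payload["results"]]
```

Calling search_titles("car insurance 2024") against a running instance should return the titles of the matching documents; build_search_request on its own is handy for checking the URL without touching the network.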
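Paperless-NGX is not a single container. It is a stack of five services that work together on a self-hosted Docker setup, and each container has a specific job: Paperless-NGX itself serves the web UI and runs ingestion and OCR, PostgreSQL stores the metadata, Redis acts as the broker for the background task queue, Gotenberg converts HTML emails to PDF, and Apache Tika handles Office documents.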
The document files themselves (originals, OCR'd archives, thumbnails) live on TrueNAS rather than inside the container, so they survive any container rebuild.
Manually uploading documents defeats the purpose. The real power comes from connecting Paperless directly to your email — so when Octopus Energy sends your monthly bill, it ends up in Paperless without you touching it.
The setup uses two layers of filtering:
Gmail does the pre-filtering. You create a Gmail filter for each sender domain (@octopus.energy, @hsbc.co.uk, @hmrc.gov.uk, and so on) that matches only emails with PDF attachments. Any match gets labelled Paperless automatically. Gmail does this for free — Paperless never sees the marketing noise, only documents that were pre-qualified.
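With more than a handful of senders, it can help to generate the filter criteria rather than type them. The sketch below builds a Gmail search string per sender domain using the from:, has:attachment and filename: operators; the helper name is made up for illustration, and the exact form Gmail displays for a saved filter may differ slightly.

```python
def gmail_filter_query(sender_domain, pdf_only=True):
    """Return a Gmail search string matching mail from one sender domain."""
    domain = sender_domain.lstrip("@")
    parts = [f"from:(@{domain})"]
    if pdf_only:
        # Restrict the match to messages that carry a PDF attachment.
        parts += ["has:attachment", "filename:pdf"]
    return " ".join(parts)


for domain in ("octopus.energy", "hsbc.co.uk", "hmrc.gov.uk"):
    print(gmail_filter_query(domain))
```

Each printed line can be pasted into Gmail's search box to preview what the filter would catch before saving it with the Paperless label.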
Paperless polls the label via IMAP. Every ten minutes, Paperless connects to your Gmail account via IMAP and checks for unread messages in the Paperless label. Any email it finds gets consumed: PDF attachments are ingested directly, and HTML email bodies are converted to PDF via Gotenberg.
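To make the polling step concrete, here is a minimal sketch of the same idea using Python's standard imaplib. It is not Paperless's own implementation, only an illustration of "select the label, search for unread mail". It assumes IMAP is enabled on the Gmail account and that you authenticate with an app password; the function names are hypothetical.

```python
import imaplib

IMAP_HOST = "imap.gmail.com"  # Gmail's IMAP endpoint (SSL, port 993)
FOLDER = "Paperless"          # label name from the text, case-sensitive


def unread_message_ids(conn, folder=FOLDER):
    """Return the IDs of unread messages in `folder` without modifying them."""
    # readonly=True means nothing gets marked as read by this check.
    status, _ = conn.select(f'"{folder}"', readonly=True)
    if status != "OK":
        raise LookupError(f"IMAP folder not found: {folder!r}")
    status, data = conn.search(None, "UNSEEN")
    if status != "OK" or not data or not data[0]:
        return []
    return data[0].split()


def poll_once(user, app_password):
    """Connect, list unread messages in the label, and disconnect."""
    with imaplib.IMAP4_SSL(IMAP_HOST) as conn:
        conn.login(user, app_password)
        return unread_message_ids(conn)
```

Running poll_once by hand is a quick way to confirm that the label exists and that Gmail is filing mail into it before pointing Paperless at the account.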
The key insight is that Gmail filters are free to run and very fast — it makes no sense to have Paperless download and parse hundreds of marketing emails just to discard them. Use Gmail as the gatekeeper.
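From the moment a relevant email arrives to when it appears in Paperless, the sequence is: Gmail's filter applies the Paperless label, Paperless finds the unread message on its next IMAP poll, PDF attachments are ingested (and HTML bodies converted via Gotenberg), OCR runs, and the document lands in the searchable archive.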
One thing worth noting: Paperless marks emails as read after processing rather than deleting them. Your Gmail archive stays intact — Paperless is reading it, not consuming it.
Where data lives matters enormously with Docker. The temptation is to let Docker manage everything via named volumes, but these are opaque, hard to back up, and easily wiped by a careless docker volume prune. A better approach is to split storage deliberately:
Document files (originals, archives, thumbnails) live directly on TrueNAS, mounted via SMB. These are the irreplaceable assets — the actual PDFs and images. TrueNAS provides ZFS checksumming, snapshots, and resilience for free.
The PostgreSQL database lives on a bind mount on the Docker host's local disk (/opt/appdata/paperless/pgdata). Databases need fast local I/O, and they cannot be backed up safely with a raw filesystem copy while they are running. Use pg_dump instead: it produces a consistent SQL snapshot that can be restored into the same or a newer PostgreSQL version.
Backups run nightly via a simple shell script: pg_dump to TrueNAS for the database, and the files are already there. The script also checks that the TrueNAS mount is live before doing anything, so a missed backup is immediately obvious in the log rather than silently producing empty dump files.
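The original script is shell, but the logic is easy to show in a few lines of Python. This is a sketch under stated assumptions, not the script used here: the mount point, backup directory, container name and database credentials below are all hypothetical and need adjusting to your own setup.

```python
import os
import subprocess
from datetime import date

# Hypothetical names: adjust to your own mount point, container and database.
MOUNT_POINT = "/mnt/truenas"
BACKUP_DIR = "/mnt/truenas/backups/paperless"
DB_CONTAINER = "paperless-db"
DB_USER = "paperless"
DB_NAME = "paperless"


def dump_path(backup_dir, day):
    """Return the dated target path for one nightly dump."""
    return os.path.join(backup_dir, f"paperless-{day.isoformat()}.sql")


def run_backup(mount_point=MOUNT_POINT, backup_dir=BACKUP_DIR, day=None):
    """Dump the database to the NAS, refusing to run if the mount is missing."""
    if not os.path.ismount(mount_point):
        raise RuntimeError(f"{mount_point} is not mounted; skipping backup")
    target = dump_path(backup_dir, day or date.today())
    partial = target + ".partial"
    command = ["docker", "exec", DB_CONTAINER, "pg_dump", "-U", DB_USER, DB_NAME]
    # Write to a temporary name first so a failed run never looks like a good dump.
    with open(partial, "wb") as out:
        subprocess.run(command, stdout=out, check=True)
    if os.path.getsize(partial) == 0:
        os.remove(partial)
        raise RuntimeError("pg_dump produced an empty file")
    os.replace(partial, target)
    return target
```

Checking the mount first and renaming only after a non-empty dump are the two guards that turn a silent failure into a loud one in the log.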
This approach means the system can survive a complete Docker environment rebuild: the files are on TrueNAS, and the database can be restored from the last nightly dump. The worst case is losing up to 24 hours of document ingestion.
Paperless's mail rules are where most of the intelligence lives. Each rule specifies, among other things, the IMAP folder to poll (Paperless; the capital P is important because the name is case-sensitive).
A practical setup is one rule per correspondent (Octopus Energy, HSBC, Zen Internet, HMRC, and so on), each targeting the Paperless label. Paperless will also auto-learn correspondents over time from OCR'd content: after seeing enough Octopus bills, it recognises them without needing an explicit rule.
One gotcha: Paperless's email consumer is designed for PDF attachments. Emails that are HTML-only (like Amazon marketing emails or review requests) get sent to Gotenberg for conversion — but Gotenberg's Chromium renderer sometimes chokes on complex HTML with external resource dependencies. For this reason, Gmail filters should always include has:attachment filename:pdf for senders that send marketing noise alongside real documents. Let Gmail do the heavy lifting; Paperless should only see what is worth keeping.
Before ingesting anything, set up your taxonomy. Paperless uses three classification dimensions:
Document Types describe what the document is — Invoice, Policy Document, Renewal Notice, Statement, Certificate, Receipt, Letter. These are generic enough to apply across all correspondents.
Correspondents are who sent it — Octopus Energy, HSBC, HMRC, Zen Internet. Create these as they appear rather than front-loading every company you can think of.
Tags are the subject matter category — Insurance, Utilities, Banking, Vehicle, Medical, Tax, Subscriptions. These are the dimension you will actually filter by when searching ("show me all Insurance documents from 2023").
The filename format {{ created_year }}/{{ created_month }}/{{ correspondent }}/{{ title }} produces a clean directory structure on TrueNAS that makes sense even without the Paperless UI — useful if you ever need to find a document without the system running.
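To see what that template produces for a given document, the sketch below mimics it with plain string formatting. It is an illustration only: Paperless renders the real template itself, and the function name and sample values are invented for the example.

```python
from datetime import date
from pathlib import PurePosixPath


def archive_path(created, correspondent, title):
    """Mimic created_year/created_month/correspondent/title for one document."""
    return PurePosixPath(
        f"{created.year:04d}",
        f"{created.month:02d}",  # two-digit month, so folders sort correctly
        correspondent,
        title,
    )


print(archive_path(date(2024, 3, 14), "Octopus Energy", "March bill"))
# 2024/03/Octopus Energy/March bill
```

Year first, then month, then correspondent means a plain directory listing is already in chronological order.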
A few things that were not obvious from the documentation:
Gmail IMAP folder names are case-sensitive. The label in Gmail shows as Paperless (capital P). If your mail rule specifies paperless (lowercase), it will fail with a folder-not-found error every time it runs. Check the logs — Paperless will list all available folders when it cannot find the one you specified, which makes the correct name obvious.
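The same check can be done by hand once you have the folder list from the log. A small sketch (the helper name and the sample folder list are made up for illustration) that picks out the correctly cased name:

```python
def suggest_folder(requested, available):
    """Return the exact folder name, or the correctly cased one if only case differs."""
    if requested in available:
        return requested
    matches = [name for name in available if name.lower() == requested.lower()]
    if len(matches) == 1:
        return matches[0]
    raise LookupError(f"No folder matching {requested!r} in {sorted(available)}")


folders = ["INBOX", "Paperless", "[Gmail]/All Mail"]
print(suggest_folder("paperless", folders))  # Paperless
```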
Named Docker volumes are fragile. They live in /var/lib/docker/volumes/ and can be wiped by docker compose down -v or docker volume prune. Always use bind mounts to explicit paths for anything you care about keeping. This applies to both the database and any application config.
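A quick way to audit an existing Compose file is to look at the left-hand side of each volume entry. The sketch below classifies short-syntax entries on a Linux host, on the assumption that a source starting with /, . or ~ is a host path and anything else is a volume name; the helper name and sample entries are illustrative only.

```python
def mount_kind(volume_spec):
    """Classify a Compose short-syntax volume entry as 'bind', 'named' or 'anonymous'."""
    parts = volume_spec.split(":")
    if len(parts) == 1:
        return "anonymous"  # only a container path was given
    source = parts[0]
    if source.startswith(("/", ".", "~")):
        return "bind"  # explicit host path
    return "named"  # Docker-managed volume


specs = [
    "/opt/appdata/paperless/pgdata:/var/lib/postgresql/data",
    "pgdata:/var/lib/postgresql/data",
]
for spec in specs:
    print(mount_kind(spec), spec)
```

Anything reported as named or anonymous is a candidate for moving to an explicit path.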
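The PostgreSQL volume mount path matters. For the official postgres image up to version 17, mount to /var/lib/postgresql/data, not /var/lib/postgresql. The image declares its own volume at the data directory, so a mount on the parent does not actually hold the database files, and the data can disappear when the container is recreated. The official image for version 18 and later moved the data directory to a version-specific path under /var/lib/postgresql and expects the mount on the parent instead, so check the image documentation for the version you run.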
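Watchtower updates can silently leave you with an empty database. Watchtower recreates the container when it updates it, and if the data sits in a volume that Docker does not reattach (an anonymous volume, for example, which is what you get when a named volume is mounted at the wrong path), the new container starts with a freshly initialised database. Bind mounts to explicit paths avoid the problem. This is not hypothetical.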
With this setup running, the workflow becomes: an Octopus Energy bill arrives in Gmail, Gmail's filter labels it within seconds, Paperless picks it up on the next 10-minute poll, OCR runs, and it appears in the archive — tagged as Utilities, filed under Octopus Energy, with every word in the bill searchable.
Finding last year's car insurance renewal is a two-second search rather than a ten-minute email archaeology expedition. The same goes for bank statements, medical letters, Amazon invoices, subscription receipts, and anything else that arrives as a PDF.
The system runs entirely on local hardware, costs nothing beyond the electricity, and does not depend on any third-party cloud service for storage or processing. Everything is on infrastructure you control.
The version used in this setup is Paperless-NGX v2.20.11, running on PostgreSQL 18 and Redis 8, with Gotenberg 8.15 and Apache Tika 3.1.0.