A Go tool that walks a maildir-style directory tree, parses each RFC 822 email file, and bulk-uploads the parsed records to ZincSearch for full-text search.
Built originally to index the Enron email corpus (~500k messages, 1.5 GB on disk) and feed the companion search UI in email-search-engine.
- Recursively walks a maildir-style directory tree
- Parses each file as an RFC 822 email (Message-ID, From, To, Subject, headers, body)
- Repairs common header continuation bugs in Enron-style dumps before parsing
- Deduplicates at the storage layer by using each email's Message-ID as the ZincSearch document_id (re-indexing the same email replaces the existing record instead of duplicating it)
- Buffers records into ndjson chunks (capped at 5,000 entries or 10 MB, whichever comes first)
- Posts chunks concurrently to ZincSearch's /_multi bulk endpoint with a semaphore-bounded worker pool
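
The semaphore-bounded worker pool from the last bullet can be sketched roughly as follows. This is a minimal illustration, not the code in helpers/bulk_data.go: `postAll`, `maxInFlight`, and the `post` callback are hypothetical names, and the callback stands in for the real HTTP request to ZincSearch.

```go
package main

import (
	"fmt"
	"sync"
)

// postAll hands every chunk to post, running at most maxInFlight calls at once.
// It returns the number of chunks whose post call returned an error.
// All names here are illustrative assumptions.
func postAll(chunks [][]byte, maxInFlight int, post func([]byte) error) int {
	sem := make(chan struct{}, maxInFlight) // buffered channel used as a counting semaphore
	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		failed int
	)
	for _, chunk := range chunks {
		sem <- struct{}{} // blocks while maxInFlight uploads are already running
		wg.Add(1)
		go func(c []byte) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when this upload finishes
			if err := post(c); err != nil {
				mu.Lock()
				failed++
				mu.Unlock()
			}
		}(chunk)
	}
	wg.Wait()
	return failed
}

func main() {
	chunks := [][]byte{[]byte("a"), []byte("b"), []byte("c")}
	failed := postAll(chunks, 2, func(c []byte) error {
		// Stand-in for the HTTP POST to the bulk endpoint; always succeeds here.
		return nil
	})
	fmt.Println("failed chunks:", failed)
}
```

Acquiring the semaphore slot in the producing loop, rather than inside the goroutine, also keeps the number of live goroutines bounded, so a slow ZincSearch instance applies backpressure to the reader.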
main.go
├── helpers/
│   ├── directory_reader.go / directory_checker.go   Filesystem traversal
│   ├── emails.go                                    RFC 822 parsing
│   ├── headers.go                                   Repair multi-line header bugs before mail.ReadMessage()
│   └── bulk_data.go                                 Buffered ndjson chunker + ZincSearch HTTP client
└── models/
    └── email.go                                     Email struct mapping every header to a JSON field
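
To illustrate why headers are repaired before mail.ReadMessage() is called, here is a hedged sketch of one possible repair. It assumes the bug is a wrapped header line that lost its leading whitespace, and that files use LF line endings; the actual rules in helpers/headers.go may differ. `Email`, `repairHeaders`, and `parseEmail` are hypothetical names, and the struct is trimmed to four fields.

```go
package main

import (
	"fmt"
	"io"
	"net/mail"
	"strings"
)

// Email is a hypothetical, trimmed-down record used only for this sketch.
type Email struct {
	MessageID string `json:"message_id"`
	To        string `json:"to"`
	Subject   string `json:"subject"`
	Body      string `json:"body"`
}

// repairHeaders indents header lines that are neither a "Name: value" field
// nor an already-indented continuation, so the parser folds them into the
// previous header. Assumption: this is the kind of bug being repaired.
func repairHeaders(raw string) string {
	head, body, found := strings.Cut(raw, "\n\n")
	lines := strings.Split(head, "\n")
	for i, line := range lines {
		if i == 0 || line == "" || line[0] == ' ' || line[0] == '\t' {
			continue
		}
		name, _, hasColon := strings.Cut(line, ":")
		if !hasColon || strings.ContainsAny(name, " \t") {
			lines[i] = "\t" + line // turn the stray line into a continuation
		}
	}
	fixed := strings.Join(lines, "\n")
	if found {
		fixed += "\n\n" + body
	}
	return fixed
}

// parseEmail repairs the headers, then parses the message with net/mail.
func parseEmail(raw string) (Email, error) {
	msg, err := mail.ReadMessage(strings.NewReader(repairHeaders(raw)))
	if err != nil {
		return Email{}, fmt.Errorf("read message: %w", err)
	}
	body, err := io.ReadAll(msg.Body)
	if err != nil {
		return Email{}, fmt.Errorf("read body: %w", err)
	}
	return Email{
		MessageID: msg.Header.Get("Message-ID"),
		To:        msg.Header.Get("To"),
		Subject:   msg.Header.Get("Subject"),
		Body:      string(body),
	}, nil
}

func main() {
	// The second recipient sits on its own unindented line, which net/mail
	// would otherwise reject as a malformed header.
	raw := "Message-ID: <1@example.com>\nTo: a@example.com,\nb@example.com\nSubject: Hi\n\nHello\n"
	e, err := parseEmail(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%+v\n", e)
}
```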
The chunk-and-stream design keeps memory bounded regardless of corpus size: the whole Enron dataset is indexed within a few hundred MB of resident memory, because messages are streamed in chunks rather than all loaded at once.
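
A bounded ndjson buffer along these lines could look like the sketch below. It is an assumption-laden illustration rather than the contents of helpers/bulk_data.go: the `chunker` type, its fields, and the `flush` callback are hypothetical. The point is that only one chunk's worth of records is held in memory at a time.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// chunker buffers records as ndjson and emits a chunk when either cap is hit.
// Type and field names are illustrative assumptions.
type chunker struct {
	buf        bytes.Buffer
	entries    int
	maxEntries int                // e.g. 5000
	maxBytes   int                // e.g. 10 << 20 (10 MB)
	flush      func(chunk []byte) // receives a copy of each finished chunk
}

// Add appends one record as a JSON line, flushing when a cap is reached.
func (c *chunker) Add(record any) error {
	line, err := json.Marshal(record)
	if err != nil {
		return err
	}
	// Flush first if this record would push the chunk past the byte cap.
	if c.entries > 0 && c.buf.Len()+len(line)+1 > c.maxBytes {
		c.Flush()
	}
	c.buf.Write(line)
	c.buf.WriteByte('\n')
	c.entries++
	if c.entries >= c.maxEntries {
		c.Flush()
	}
	return nil
}

// Flush emits whatever is buffered and resets the buffer for reuse.
func (c *chunker) Flush() {
	if c.entries == 0 {
		return
	}
	chunk := append([]byte(nil), c.buf.Bytes()...) // copy: buf is reused
	c.flush(chunk)
	c.buf.Reset()
	c.entries = 0
}

func main() {
	chunks := 0
	// A tiny entry cap is used here so the behavior is visible with 5 records.
	c := &chunker{maxEntries: 2, maxBytes: 10 << 20, flush: func([]byte) { chunks++ }}
	for i := 0; i < 5; i++ {
		if err := c.Add(map[string]int{"n": i}); err != nil {
			fmt.Println("marshal error:", err)
			return
		}
	}
	c.Flush() // emit the final partial chunk
	fmt.Println("chunks:", chunks) // prints "chunks: 3"
}
```

Calling Flush once after the last record matters: without it, a trailing partial chunk would never be uploaded.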
- Go 1.26+
- A reachable ZincSearch instance (defaults to http://localhost:4080)
    cp .env.example .env

| Variable | Default | Purpose |
|---|---|---|
| ZINCSEARCH_URL | http://localhost:4080 | Base URL of the ZincSearch instance |
| ZINCSEARCH_INDEX | emails | Target index name |
| ZINCSEARCH_USERNAME | "" | ZincSearch admin user |
| ZINCSEARCH_PASSWORD | "" | ZincSearch admin password |
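
For reference, the defaults in the table could be applied with a small helper like the one below. This is only a sketch: `config`, `envOr`, and `loadConfig` are hypothetical names, and it assumes the variables are already present in the process environment (how the tool itself loads .env is not shown here).

```go
package main

import (
	"fmt"
	"os"
)

// config mirrors the four variables from the table; the type is illustrative.
type config struct {
	URL      string
	Index    string
	Username string
	Password string
}

// envOr returns the value of key, or fallback when it is unset or empty.
func envOr(key, fallback string) string {
	if v, ok := os.LookupEnv(key); ok && v != "" {
		return v
	}
	return fallback
}

// loadConfig reads the environment and applies the documented defaults.
func loadConfig() config {
	return config{
		URL:      envOr("ZINCSEARCH_URL", "http://localhost:4080"),
		Index:    envOr("ZINCSEARCH_INDEX", "emails"),
		Username: envOr("ZINCSEARCH_USERNAME", ""),
		Password: envOr("ZINCSEARCH_PASSWORD", ""),
	}
}

func main() {
	cfg := loadConfig()
	// Credentials are deliberately not printed.
	fmt.Println("url:", cfg.URL)
	fmt.Println("index:", cfg.Index)
}
```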
    go run . /path/to/maildir

The argument must point at the root of a maildir-style tree. For the Enron corpus that's typically enron_mail_<date>/maildir/.
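
As a rough picture of what walking the tree from that root involves, the sketch below collects every regular file with the standard library's filepath.WalkDir. `listFiles` is a hypothetical helper, not the traversal code in helpers/directory_reader.go, and unlike the real tool it falls back to the current directory when no argument is given.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// listFiles returns the path of every regular file under root.
// Directories, symlinks, and other special entries are skipped.
func listFiles(root string) ([]string, error) {
	var files []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err // stop on the first unreadable entry
		}
		if d.Type().IsRegular() {
			files = append(files, path)
		}
		return nil
	})
	return files, err
}

func main() {
	root := "." // fallback for this sketch only
	if len(os.Args) > 1 {
		root = os.Args[1]
	}
	files, err := listFiles(root)
	if err != nil {
		fmt.Fprintln(os.Stderr, "walk error:", err)
		return
	}
	fmt.Println("email files found:", len(files))
}
```

For a corpus the size of Enron, streaming each path to the parser as it is discovered would use less memory than collecting the full slice first; the slice is used here only to keep the example short.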
    go test ./...

A tiny synthetic email lives under helpers/testdata/maildir/sample/sent/1 so tests pass on a fresh clone without any external data.
MIT — see LICENSE.