This guide explains how to contribute to MailCLI's shared parser without weakening the machine-facing contract.
The parser is one of the highest-value parts of the project.
It is also one of the easiest places to cause subtle regressions.
The parser takes raw message bytes and returns a StandardMessage.
Its job is to:
- decode MIME structure and transfer encodings
- normalize charsets into UTF-8 text
- choose the best body representation for agent use
- clean noisy HTML into Markdown-friendly content
- extract machine-usable actions, codes, and bounce context
- estimate token usage conservatively
Its job is not to:
- talk to mailbox providers
- apply provider-specific business policy
- decide inbox workflow policy for a specific product
- hide transport failures that belong in drivers
Start here before changing parser behavior:
- pkg/parser/parser.go
- pkg/parser/mime.go
- pkg/parser/charset.go
- pkg/parser/html.go
- pkg/parser/actions.go
- pkg/parser/codes.go
- pkg/parser/token.go
- pkg/parser/parser_test.go
Fixture and golden data live in:
- testdata/emails/
- testdata/golden/
CLI JSON snapshots that protect downstream output shape live in:
- cmd/json_snapshot_test.go
- cmd/testdata/snapshots/
Use this order unless there is a strong reason not to:
- Add a focused failing test in pkg/parser/parser_test.go.
- Add or update a fixture in testdata/emails/ if the case is end-to-end.
- Update the matching golden file in testdata/golden/ only when the new output is intentional.
- Run targeted parser tests first.
- Run the full test suite before opening a PR.
Recommended commands:
go test ./pkg/parser -run TestParse
go test ./pkg/parser -run TestCleanHTML
go test ./...
If your parser change affects command JSON output, also run:
go test ./cmd
Parser work is allowed to be heuristic.
Parser work is not allowed to be vague.
Good parser heuristics:
- are motivated by a concrete fixture or failure case
- are conservative when evidence is weak
- keep useful structure like headings, links, tables, and key images
- come with regression coverage
- improve agent-facing clarity without inventing meaning
Bad parser heuristics:
- special-case one provider brand without a reusable pattern
- delete content because it "looks like marketing" without evidence
- rewrite URLs that are not obvious tracking wrappers
- classify actions from weak hints alone
- change output shape without updating docs and tests
When working on html.go, optimize for these outcomes:
- keep the primary body content
- remove obvious chrome such as preheaders, footers, navigation, and preference links
- avoid throwing away short transactional bodies
- preserve valuable structure for Markdown conversion
Do not assume that footer-like words mean the whole container is safe to delete.
Prefer scoring and narrow removal over broad keyword deletion.
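To make "scoring and narrow removal" concrete, here is a minimal sketch in which several independent signals must agree before a container is treated as chrome. The `block` type, the hint list, the weights, and the threshold are all assumptions chosen for illustration. They are not taken from html.go.

```go
package main

import (
	"fmt"
	"strings"
)

// block is a hypothetical, simplified view of one HTML container.
type block struct {
	Text      string
	LinkCount int
}

// footerHints are illustrative keywords; a real list would be fixture-driven.
var footerHints = []string{"unsubscribe", "manage preferences", "view in browser"}

// removeThreshold is an assumed cutoff; one keyword alone never reaches it.
const removeThreshold = 4

// chromeScore adds up weak signals instead of deleting on a single keyword.
func chromeScore(b block) int {
	score := 0
	lower := strings.ToLower(b.Text)
	for _, hint := range footerHints {
		if strings.Contains(lower, hint) {
			score += 2
		}
	}
	words := len(strings.Fields(b.Text))
	if b.LinkCount > 0 && words <= 4*b.LinkCount {
		score += 2 // link-dense containers look like navigation or footers
	}
	if words > 40 {
		score -= 3 // long prose is more likely primary content
	}
	return score
}

// isChrome reports whether the combined evidence is strong enough to remove.
func isChrome(b block) bool {
	return chromeScore(b) >= removeThreshold
}

func main() {
	footer := block{Text: "Unsubscribe | Manage preferences | View in browser", LinkCount: 3}
	body := block{Text: "Your order has shipped. If you no longer want updates you can unsubscribe.", LinkCount: 1}
	fmt.Println(isChrome(footer), isChrome(body))
}
```

In this sketch the short transactional body survives even though it contains the word "unsubscribe", because a single footer-like word scores below the threshold.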
When working on actions.go:
- normalize only obvious redirect wrappers
- preserve legitimate destination URLs
- keep labels readable for agents
- deduplicate actions deterministically
- avoid turning generic product links into semantic actions without strong evidence
If a URL cleanup rule could destroy a legitimate app link, the rule is too aggressive.
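The sketch below shows one conservative way to follow these rules: a URL is unwrapped only when its host is on an explicit allowlist of redirect wrappers and the destination parameter holds an absolute http(s) URL. The wrapper host, the `action` type, and the function names are placeholders invented for this example. They do not describe the actual actions.go.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// knownWrappers maps wrapper hosts to the query parameter carrying the
// destination. The single entry is a made-up placeholder host.
var knownWrappers = map[string]string{
	"click.example-tracker.test": "url",
}

// unwrapRedirect returns the destination of an obvious redirect wrapper and
// returns every other URL unchanged.
func unwrapRedirect(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return raw
	}
	param, ok := knownWrappers[strings.ToLower(u.Hostname())]
	if !ok {
		return raw // unknown host: never guess
	}
	dest, err := url.Parse(u.Query().Get(param))
	if err != nil || dest.Host == "" ||
		(dest.Scheme != "http" && dest.Scheme != "https") {
		return raw // weak evidence: keep the original link
	}
	return dest.String()
}

// action is a hypothetical, reduced action record.
type action struct {
	Label string
	URL   string
}

// dedupeActions keeps the first action per unwrapped URL and preserves input
// order, so the result is deterministic.
func dedupeActions(in []action) []action {
	seen := make(map[string]bool)
	out := make([]action, 0, len(in))
	for _, a := range in {
		a.URL = unwrapRedirect(a.URL)
		if seen[a.URL] {
			continue
		}
		seen[a.URL] = true
		out = append(out, a)
	}
	return out
}

func main() {
	wrapped := "https://click.example-tracker.test/r?url=https%3A%2F%2Fapp.example.com%2Finvoice%2F42"
	fmt.Println(unwrapRedirect(wrapped))
	fmt.Println(unwrapRedirect("https://app.example.com/open?url=https%3A%2F%2Fother.example.com"))
}
```

The second call in `main` returns its input untouched: a legitimate app link that happens to carry a `url` parameter is not a wrapper unless its host is explicitly listed.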
Prefer fixtures organized by behavior, not by brand.
Good fixture categories:
- newsletter or promo
- transactional status update
- verification or sign-in
- invoice or payment
- bounce or delivery failure
- security reset
- attachment entry
Representative fixtures already in the repo:
- mercury.eml: large HTML-heavy newsletter or promo mail
- confirm_subscription.eml: email-address verification that confirms a subscription, not a sign-in
- verification.eml: straightforward verification flow
- reply_quoted_verification.eml: verification content inside quoted thread noise
- invoice.eml: transactional billing or invoice flow
- attachment_notice.eml: message where the main user action is attachment access
- bounce.eml and postfix_bounce.eml: delivery failure and machine-generated bounce context
- unsubscribe_mixed.eml: mixed-content mail where unsubscribe must stay available but not dominate the body
When adding a fixture:
- anonymize addresses, ids, and tokens when possible
- keep the raw mail realistic enough to preserve the failure mode
- add the smallest fixture that still protects the behavior
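One possible way to handle the anonymization step is a small throwaway helper like the sketch below. It replaces each distinct address with a stable placeholder so that repeated senders and recipients still line up after rewriting. The function name and the regular expression are assumptions, and the pattern is intentionally simple rather than RFC-complete. It also rewrites anything else shaped like `local@domain`, such as Message-ID values.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// emailRe is a deliberately simple address pattern for fixture cleanup.
var emailRe = regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`)

// anonymizeAddresses maps each distinct address (case-insensitive) to
// userN@example.com, numbering addresses in order of first appearance.
func anonymizeAddresses(raw string) string {
	placeholders := make(map[string]string)
	return emailRe.ReplaceAllStringFunc(raw, func(addr string) string {
		key := strings.ToLower(addr)
		if repl, ok := placeholders[key]; ok {
			return repl
		}
		repl := fmt.Sprintf("user%d@example.com", len(placeholders)+1)
		placeholders[key] = repl
		return repl
	})
}

func main() {
	headers := "From: Alice <alice@corp.io>\nTo: bob@corp.io\nCc: ALICE@corp.io\n"
	fmt.Print(anonymizeAddresses(headers))
}
```

A helper like this only sees plain text. Addresses inside base64 or quoted-printable bodies, and tokens embedded in URLs, still need a manual pass, so review the resulting fixture by hand before committing it.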
MailCLI already carries multilingual action coverage, including Chinese transactional and preference-management cases.
Representative multilingual fixtures:
- unsubscribe_cn.eml
- confirm_subscription_cn.eml
- invoice_cn.eml
- security_reset_cn.eml
- security_verify_cn.eml
- verification_cn_fullwidth.eml
- view_online_abuse_cn.eml
- attachment_notice_cn.eml
When expanding multilingual coverage:
- prefer adding a real action pattern over adding generic translated copy
- cover the action label and the surrounding body context together
- include at least one negative case if a new keyword could overfire
- keep heuristics language-aware, but still pattern-based rather than sender-branded
If a new rule only works because one provider always uses one exact phrase, it probably belongs in a narrower test or should not land yet.
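A language-aware but pattern-based rule can be as small as a character fold. The fixture name verification_cn_fullwidth.eml suggests that fullwidth digits and punctuation matter for code extraction, so the sketch below folds the Unicode fullwidth ASCII variants (U+FF01 to U+FF5E) and the ideographic space (U+3000) to their ASCII counterparts. `foldFullwidth` is a hypothetical name, and this is not necessarily how the parser handles the case.

```go
package main

import (
	"fmt"
	"strings"
)

// foldFullwidth converts fullwidth ASCII variants to plain ASCII and leaves
// every other rune, including CJK text, untouched.
func foldFullwidth(s string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r >= 0xFF01 && r <= 0xFF5E:
			return r - 0xFEE0 // fixed offset between the two Unicode blocks
		case r == 0x3000:
			return ' ' // ideographic space
		}
		return r
	}, s)
}

func main() {
	fmt.Println(foldFullwidth("验证码:123456"))
}
```

Because the fold depends on a Unicode block rather than on any sender's wording, it generalizes across providers. A negative fixture is still worthwhile to confirm that folded digits in, say, a phone number or order number are not then misread as a verification code.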
Golden files are contract protection, not a shortcut for approving changed output.
Before updating a golden:
- read the full diff, not only the changed lines
- confirm the new output is easier for agents to consume
- confirm token or structure changes are intentional
- check whether the change also affects cmd/json_snapshot_test.go
Avoid bundling unrelated parser cleanup into the same golden refresh. Small, explainable diffs are easier to review and safer to keep.
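When a golden comparison fails, it helps if the failure points at a specific line instead of dumping two large blobs. The helper below is a minimal sketch of that idea. `firstDiff` is a hypothetical name and is not part of the repository's test helpers. It locates only the first mismatch, so it complements reading the full diff and does not replace it.

```go
package main

import (
	"fmt"
	"strings"
)

// firstDiff returns the 1-based number of the first line where got and want
// differ. If one input is a prefix of the other, it reports the first line
// that is missing. differs is false when the inputs are identical.
func firstDiff(got, want string) (line int, differs bool) {
	g := strings.Split(got, "\n")
	w := strings.Split(want, "\n")
	for i := 0; i < len(g) || i < len(w); i++ {
		if i >= len(g) || i >= len(w) || g[i] != w[i] {
			return i + 1, true
		}
	}
	return 0, false
}

func main() {
	line, differs := firstDiff("# Title\nbody\n", "# Title\nBody\n")
	fmt.Println(line, differs)
}
```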
For parser-focused work, this is the preferred loop:
go test ./pkg/parser -run TestParse
go test ./pkg/parser -run TestExtractActions
go test ./pkg/parser -run TestExtractCodes
go test ./cmd
go test ./...
go build -o /tmp/mailcli ./cmd/mailcli
If your change affects local demo artifacts or thread-facing JSON output, also run:
make demo-local-thread-refresh
make demo-local-thread-check
The first command regenerates the checked-in demo artifacts.
The second command proves the refreshed artifacts still match repository expectations.
If your parser change affects:
- StandardMessage
- action or code shape
- command JSON output
- any documented stable field
open an RFC-style issue first instead of only sending a code PR.
See:
A parser contribution is stronger when:
- the behavior is easier for agents to consume
- the regression is clearly covered by tests
- the rule generalizes beyond one provider
- the output contract becomes clearer, not noisier
If you are unsure whether a behavior belongs in the parser, default to a smaller change and document the uncertainty in the PR.