Summary
A self-hosted instance running honcho-api + honcho-deriver (release ~2026-04-20) suffered a silent regression where:
- `peer.chat()` returned 500 errors on every call for ~4 days
- the dreamer produced zero new deductive/inductive observations during that period
- the explicit deriver kept running, masking the issue
Root cause: environment variables intended to override the LLM provider/model for `DIALECTIC`, `DERIVER`, `SUMMARY` and `DREAM` were named at the wrong level of nesting (`DIALECTIC_LEVELS__medium__PROVIDER=google` instead of `DIALECTIC_LEVELS__medium__MODEL_CONFIG__TRANSPORT=gemini`). Because all relevant settings classes set `extra="ignore"`, Pydantic accepted the misnamed vars without complaint and silently fell back to the hardcoded default `transport="openai", model="gpt-5.4-mini"`. With no `OPENAI_API_KEY` configured, `client_for_model_config` raised `ValidationException("Missing API key for openai model config")` at `src/llm/registry.py:126`.
This issue groups six related defects that together made a 5-minute config bug burn 4 days of memory curation. Each could be fixed independently.
Defect 1 — Silent fallback on misnamed env vars
`SettingsConfigDict(env_prefix="DIALECTIC_", env_nested_delimiter="__", extra="ignore")` (cf. `src/config.py:872, 1090, 710`) means typos and stale variable names are accepted without warning. The hardcoded defaults `transport="openai", model="gpt-5.4-mini"` then take over.
Reproduction
```bash
# Operator follows out-of-date docs/blog post
echo 'DIALECTIC_LEVELS__medium__PROVIDER=google' >> .env
echo 'DIALECTIC_LEVELS__medium__MODEL=gemini-3.1-flash-lite-preview' >> .env
echo 'LLM_GEMINI_API_KEY=...' >> .env
docker compose up -d
# Boot succeeds. Dialectic crashes only on first peer.chat() call, hours later.
```
Suggested fix
- Switch to `extra="forbid"` for all `*Settings` classes whose schema is well-defined, OR
- Emit a startup warning listing any env var matching the prefix that did not map to a known field (similar to systemd's `unrecognized assignment` warnings).
- Add a config self-check on startup: log the resolved `transport` + `model` for each major subsystem (DIALECTIC levels, DERIVER, SUMMARY, DREAM specialists). One log line per resolved model config, at INFO level. Operators currently have no way to see what was actually loaded without dropping into a Python REPL.
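The startup warning from the second bullet could look roughly like the sketch below. It uses only the standard library; `find_unknown_env_vars`, `warn_on_unknown_env_vars`, and the hand-written `known_paths` set are hypothetical names for illustration, not existing Honcho code. It also assumes env var names are matched case-insensitively, which would need checking against the actual settings configuration.

```python
import logging
from collections.abc import Mapping

logger = logging.getLogger("config_check")


def find_unknown_env_vars(
    environ: Mapping[str, str],
    prefix: str,
    known_paths: set[str],
) -> list[str]:
    """Return env var names under `prefix` that match no known field path.

    `known_paths` holds upper-cased paths relative to the prefix, for example
    "LEVELS__MEDIUM__MODEL_CONFIG__TRANSPORT" (hypothetical, hand-maintained here).
    """
    unknown = []
    for name in sorted(environ):
        upper = name.upper()
        if not upper.startswith(prefix):
            continue  # belongs to another subsystem
        if upper[len(prefix):] not in known_paths:
            unknown.append(name)
    return unknown


def warn_on_unknown_env_vars(
    environ: Mapping[str, str], prefix: str, known_paths: set[str]
) -> None:
    """Log one warning per env var that would otherwise be silently ignored."""
    for name in find_unknown_env_vars(environ, prefix, known_paths):
        logger.warning(
            "Unrecognized env var %s: matches prefix %s but no known field",
            name,
            prefix,
        )
```

In a real implementation the set of known paths would more likely be derived from the settings schema than maintained by hand.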
Defect 2 — Inconsistent config layout across subsystems
| Subsystem | Resolved path |
| --- | --- |
| DIALECTIC | `DIALECTIC.LEVELS[<level>].MODEL_CONFIG` |
| DERIVER | `DERIVER.MODEL_CONFIG` |
| SUMMARY | `SUMMARY.MODEL_CONFIG` |
| DREAM | `DREAM.DEDUCTION_MODEL_CONFIG` and `DREAM.INDUCTION_MODEL_CONFIG` (no root `MODEL_CONFIG`) |
An operator who correctly fixes `DIALECTIC_LEVELS__medium__MODEL_CONFIG__TRANSPORT` will instinctively try `DREAM_MODEL_CONFIG__TRANSPORT` for symmetry, and that variable is silently a no-op (see Defect 1).
Suggested fix
- Either harmonize: give `DREAM` a root `MODEL_CONFIG` that acts as the default for both specialists.
- Or document the asymmetry prominently in `.env.template` and add explicit `DREAM_DEDUCTION_MODEL_CONFIG__*` and `DREAM_INDUCTION_MODEL_CONFIG__*` examples (currently only generic `DREAM_*` variants appear in some examples / older docs).
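A minimal sketch of the "harmonize" option, using plain dataclasses rather than the real settings classes. The `DreamSettings` shape, the `model_config_default` field, and the `resolve` method are all hypothetical, and the model names are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelConfig:
    transport: str
    model: str


@dataclass(frozen=True)
class DreamSettings:
    # Hypothetical root default; per this issue, no such field exists today.
    model_config_default: Optional[ModelConfig] = None
    deduction_model_config: Optional[ModelConfig] = None
    induction_model_config: Optional[ModelConfig] = None

    def resolve(self, specialist: str) -> ModelConfig:
        """Prefer the specialist-specific config, else fall back to the root."""
        specific = {
            "deduction": self.deduction_model_config,
            "induction": self.induction_model_config,
        }[specialist]
        resolved = specific if specific is not None else self.model_config_default
        if resolved is None:
            # Fail loudly instead of falling back to a hardcoded provider.
            raise ValueError(f"No model config for dream specialist {specialist!r}")
        return resolved
```

With this shape, setting only the root config would cover both specialists, while a specialist-specific override still wins when present.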
Defect 3 — transport value naming clashes with operator mental model
The `transport` field is typed `Literal["anthropic", "openai", "gemini"]` (cf. `src/config.py:25`). The transport key for Google's API is `gemini`, but operator-facing material (env templates, blog posts, possibly older docs) commonly uses `google` as the provider name. Operators write `transport=google` and Pydantic refuses; or they write `_PROVIDER=google` (Defect 1) and it's silently dropped.
Suggested fix
- Accept both `gemini` and `google` as transport values via a `BeforeValidator` that normalizes to `gemini`.
- Or rename the literal to `google` and keep `gemini` as a deprecated alias with a warning.
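The normalization in the first bullet can be sketched as a plain function. `normalize_transport` and the two constants are hypothetical names; in Pydantic v2 such a function could presumably be attached to the field with `Annotated[..., BeforeValidator(normalize_transport)]`, but that wiring is not shown or verified here.

```python
# Canonical values taken from the Literal quoted in this issue.
CANONICAL_TRANSPORTS = ("anthropic", "openai", "gemini")

# Hypothetical alias table: operator-facing name -> canonical transport.
TRANSPORT_ALIASES = {"google": "gemini"}


def normalize_transport(value: str) -> str:
    """Map known aliases to their canonical transport and reject anything else."""
    canonical = TRANSPORT_ALIASES.get(value, value)
    if canonical not in CANONICAL_TRANSPORTS:
        allowed = ", ".join(CANONICAL_TRANSPORTS + tuple(TRANSPORT_ALIASES))
        raise ValueError(f"Unknown transport {value!r}; expected one of: {allowed}")
    return canonical
```

Listing the accepted aliases in the error message means an operator who mistypes the value sees the valid options immediately.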
Defect 4 — MCP chat tool swallows the underlying error
`src/mcp/server.ts:807` propagates only a generic `"Error: An unexpected error occurred"` to the MCP client when the upstream `/peers/{id}/chat` returns 500. The actual cause (`Missing API key for openai model config`) is lost. Operators must SSH to the host and grep `docker logs honcho-api-1` to discover the issue.
Suggested fix
- Forward the `detail` field from FastAPI 4xx/5xx responses to the MCP error message (or at minimum HTTP status + first line of response body).
- Add a `honcho_status` MCP tool that pings `/health` + checks resolved model configs + reports last successful `chat()` and last successful `dream` cycle.
Defect 5 — Tenacity retry masks non-retryable validation errors
`src/dialectic/core.py:413` raises `tenacity.RetryError[ValidationException]` after retrying a `Missing API key` failure. A config error cannot succeed on retry; the retry just delays the failure and obscures the root exception in stack traces.
Suggested fix
- Mark `ValidationException` (and any other config-time exception) as non-retryable in the tenacity policy.
- Surface the original exception via `raise ... from` (already done in some places, missed here).
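The intended behaviour can be illustrated with a hand-rolled retry loop. This is a stdlib stand-in, not the project's tenacity policy: `call_with_retry`, `NON_RETRYABLE`, and the local `ValidationException` class are hypothetical. In tenacity itself, the `retry_if_not_exception_type` predicate looks like the natural way to express the same rule.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class ValidationException(Exception):
    """Stand-in for the project's config-time exception."""


# Exceptions that can never succeed on retry (hypothetical list).
NON_RETRYABLE = (ValidationException,)


def call_with_retry(
    fn: Callable[[], T], attempts: int = 3, delay_seconds: float = 0.0
) -> T:
    """Retry transient failures, but re-raise config errors immediately."""
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except NON_RETRYABLE:
            # Fail fast and keep the original exception type and traceback.
            raise
        except Exception as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay_seconds)
    # Chain the last transient error so the root cause stays visible.
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

With this policy, a missing API key surfaces as a single `ValidationException` on the first attempt instead of a wrapped retry error.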
Defect 6 — Dreamer failures are invisible
When the dream specialists' model config is broken:
- `schedule_dream` (or the auto-trigger from `crud/representation.py:177`) fails before `enqueue_dream` is called
- No row appears in the `queue` table with `payload->>task_type = 'dream'`
- The deriver continues processing `representation` / `summary` / `webhook` tasks normally
- The peer card decays silently (no contradiction detection, no induction, no consolidation) because the curation pipeline is dead, yet every other observable metric (queue depth, explicit observation count, API uptime) looks healthy
In our case: 0 dreams for 4 days, only discovered while debugging an unrelated peer.chat() issue.
Suggested fix
- Add a Prometheus metric `honcho_last_dream_completion_timestamp_seconds{workspace,observer}`. Operators can alert on it.
- Add a "stale dreams" warning to the API health endpoint when no dream has completed for `MIN_HOURS_BETWEEN_DREAMS * 3` hours even though dreaming is enabled.
- Capture and log scheduler-side exceptions distinctly from worker-side exceptions, so a "dreamer can't start" condition emits a loud structured error instead of being lost.
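The "stale dreams" check from the second bullet reduces to a small predicate. The sketch below is a guess at its shape: `dreams_are_stale` and its parameters are hypothetical, and treating "no dream has ever completed" as stale is an assumption that may be too noisy for fresh installs.

```python
from datetime import datetime, timedelta
from typing import Optional


def dreams_are_stale(
    last_completed: Optional[datetime],
    now: datetime,
    min_hours_between_dreams: float,
    enabled: bool,
    multiplier: int = 3,
) -> bool:
    """Return True when dreaming is enabled but none has completed recently."""
    if not enabled:
        return False  # nothing to alert on when dreaming is switched off
    if last_completed is None:
        # Assumption: no completed dream at all counts as stale.
        return True
    threshold = timedelta(hours=min_hours_between_dreams * multiplier)
    return now - last_completed > threshold
```

The health endpoint could call this with the latest dream completion timestamp and attach a warning to its response whenever it returns `True`.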
Combined impact
A single misnamed env var (typo, stale doc, copy from older Honcho version) currently silently degrades the system over days:
- The dialectic returns 500s
- The dreamer stops and the peer card stops self-curating
- Operator sees the dialectic 500 first (because it's user-facing), fixes that, but doesn't realize the dreamer was equally affected unless they check observation counts manually.
Each individual defect is small. Together they produce a high-cost, low-visibility failure mode that's hostile to self-hosted operators. The two highest-leverage fixes IMO are: (a) boot-time config validation that logs resolved model configs and warns on unknown env vars, and (b) a last_dream health signal to detect silent dreamer death.
Environment
- Honcho self-hosted via `docker-compose.yml` (5 containers: api, deriver, postgres, redis, tei)
- Embedding via local TEI service (Hugging Face), other LLMs via Gemini
- ~5800 conclusions accumulated over ~3 weeks of usage
- Config via `.env` file (no `config.toml`)