Config validation is silent and inconsistent: misnamed env vars are dropped without warning, dreamer/dialectic break invisibly, and MCP errors lose the underlying cause

## Summary

A self-hosted instance running `honcho-api` + `honcho-deriver` (release ~2026-04-20) suffered a silent regression where:

- `peer.chat()` returned a 500 error on every call for ~4 days
- the dreamer produced **zero** new deductive/inductive observations during that period
- the explicit deriver kept running, masking the issue

Root cause: **environment variables intended to override LLM provider/model for `DIALECTIC`, `DERIVER`, `SUMMARY` and `DREAM` were named at the wrong level of nesting** (`DIALECTIC_LEVELS__medium__PROVIDER=google` instead of `DIALECTIC_LEVELS__medium__MODEL_CONFIG__TRANSPORT=gemini`). Because all relevant settings classes set `extra="ignore"`, Pydantic accepted the misnamed vars without complaint and silently fell back to the hardcoded default `transport="openai", model="gpt-5.4-mini"`. With no `OPENAI_API_KEY` configured, `client_for_model_config` raised `ValidationException("Missing API key for openai model config")` at `src/llm/registry.py:126`.

This issue groups six related defects that together made a 5-minute config bug burn 4 days of memory curation. Each could be fixed independently.

## Defect 1 — Silent fallback on misnamed env vars

`SettingsConfigDict(env_prefix="DIALECTIC_", env_nested_delimiter="__", extra="ignore")` (cf. `src/config.py:872, 1090, 710`) means typos and stale variable names are accepted without warning. The hardcoded defaults `transport="openai", model="gpt-5.4-mini"` then take over.

**Reproduction**

```bash
# Operator follows out-of-date docs/blog post
echo 'DIALECTIC_LEVELS__medium__PROVIDER=google' >> .env
echo 'DIALECTIC_LEVELS__medium__MODEL=gemini-3.1-flash-lite-preview' >> .env
echo 'LLM_GEMINI_API_KEY=...' >> .env
docker compose up -d
# Boot succeeds. Dialectic crashes only on first peer.chat() call, hours later.
```

**Suggested fix**

- Switch to `extra="forbid"` for all `*Settings` classes whose schema is well-defined, OR
- Emit a startup warning listing any env var matching the prefix that did **not** map to a known field (similar to systemd's `unrecognized assignment` warnings).
- Add a config self-check on startup: log the resolved `transport` + `model` for each major subsystem (DIALECTIC levels, DERIVER, SUMMARY, DREAM specialists). One log line per resolved model config, at INFO level. Operators currently have no way to see what was actually loaded without dropping into a Python REPL.
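
The unknown-variable warning suggested above could look roughly like the following. This is a minimal sketch using only the standard library: `find_unmapped_env_vars` and the `known` list are hypothetical names for illustration, and a real implementation would derive the known paths from the `*Settings` schema rather than hardcode them. Matching is case-insensitive on the assumption that the settings classes keep the pydantic-settings default.

```python
import os
from typing import Iterable, List, Mapping


def find_unmapped_env_vars(
    prefix: str,
    known_paths: Iterable[str],
    environ: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return env var names that carry `prefix` but match no known field path.

    `known_paths` holds the expected suffixes after the prefix, with "__" as
    the nested delimiter. Comparison is case-insensitive.
    """
    known = {path.upper() for path in known_paths}
    unmapped = []
    for name in sorted(environ):
        if not name.upper().startswith(prefix.upper()):
            continue  # belongs to another subsystem
        suffix = name[len(prefix):].upper()
        if suffix not in known:
            unmapped.append(name)
    return unmapped


# Illustrative environment reproducing the misnamed variable from this report.
env = {
    "DIALECTIC_LEVELS__medium__PROVIDER": "google",
    "DIALECTIC_LEVELS__medium__MODEL_CONFIG__TRANSPORT": "gemini",
    "DERIVER_MODEL_CONFIG__TRANSPORT": "gemini",
}
# Hypothetical, hand-written subset of the DIALECTIC schema.
known = [
    "LEVELS__medium__MODEL_CONFIG__TRANSPORT",
    "LEVELS__medium__MODEL_CONFIG__MODEL",
]
unmapped = find_unmapped_env_vars("DIALECTIC_", known, env)
for name in unmapped:
    print(f"WARNING: env var {name} matches prefix DIALECTIC_ but maps to no known field")
```

Run at startup for each prefix, a check like this would have flagged `DIALECTIC_LEVELS__medium__PROVIDER` at boot instead of at the first `peer.chat()` call.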

## Defect 2 — Inconsistent config layout across subsystems

| Subsystem | Resolved path |
|---|---|
| `DIALECTIC` | `DIALECTIC.LEVELS[<level>].MODEL_CONFIG` |
| `DERIVER` | `DERIVER.MODEL_CONFIG` |
| `SUMMARY` | `SUMMARY.MODEL_CONFIG` |
| `DREAM` | `DREAM.DEDUCTION_MODEL_CONFIG` **and** `DREAM.INDUCTION_MODEL_CONFIG` (no root `MODEL_CONFIG`) |

An operator who correctly fixes `DIALECTIC_LEVELS__medium__MODEL_CONFIG__TRANSPORT` will instinctively try `DREAM_MODEL_CONFIG__TRANSPORT` for symmetry — and silently no-op (see Defect 1).

**Suggested fix**

- Either harmonize: give `DREAM` a root `MODEL_CONFIG` that acts as default for both specialists.
- Or document the asymmetry **prominently** in `.env.template` and add explicit `DREAM_DEDUCTION_MODEL_CONFIG__*` and `DREAM_INDUCTION_MODEL_CONFIG__*` examples (currently only generic `DREAM_*` variants appear in some examples / older docs).
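
The harmonized option could be sketched as below, assuming a root default that each specialist falls back to. Plain dataclasses stand in for the real settings classes, and `default_model_config` and `resolve` are hypothetical names; per the table above, no root field exists on `DREAM` today.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelConfig:
    transport: str
    model: str


@dataclass
class DreamSettings:
    # Hypothetical root default shared by both specialists.
    default_model_config: Optional[ModelConfig] = None
    deduction_model_config: Optional[ModelConfig] = None
    induction_model_config: Optional[ModelConfig] = None

    def resolve(self, specialist: str) -> ModelConfig:
        """Return the specialist's own config, else the root default."""
        per_specialist = {
            "deduction": self.deduction_model_config,
            "induction": self.induction_model_config,
        }
        if specialist not in per_specialist:
            raise ValueError(f"Unknown DREAM specialist: {specialist!r}")
        resolved = per_specialist[specialist] or self.default_model_config
        if resolved is None:
            raise ValueError(f"No model config resolved for DREAM specialist {specialist!r}")
        return resolved


# A root default covers deduction; induction keeps its explicit override.
settings = DreamSettings(
    default_model_config=ModelConfig("gemini", "gemini-3.1-flash-lite-preview"),
    induction_model_config=ModelConfig("anthropic", "example-model"),
)
print(settings.resolve("deduction"))
print(settings.resolve("induction"))
```

With this shape, an operator who sets only the root config gets a working dreamer, and the specialist-level variables remain available as overrides.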

## Defect 3 — `transport` value naming clashes with operator mental model

The `transport` field is typed as `Literal["anthropic", "openai", "gemini"]` (cf. `src/config.py:25`). The transport key for Google's API is `gemini`, but operator-facing material (env templates, blog posts, possibly older docs) commonly uses `google` as the provider name. Operators either write `transport=google`, which Pydantic rejects, or write `_PROVIDER=google` (Defect 1), which is silently dropped.

**Suggested fix**

- Accept both `gemini` and `google` as transport values via a `BeforeValidator` that normalizes to `gemini`.
- Or rename the literal to `google` and keep `gemini` as a deprecated alias with a warning.
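
The first option amounts to a small normalizing function, sketched here with the standard library only. `normalize_transport` and `TRANSPORT_ALIASES` are hypothetical names. With Pydantic v2 the function could presumably be attached to the field as `Annotated[Literal["anthropic", "openai", "gemini"], BeforeValidator(normalize_transport)]`; that wiring is an assumption about how the existing field would be adapted, not a description of the current code.

```python
# Alias table: operator-facing provider name -> canonical transport key.
TRANSPORT_ALIASES = {"google": "gemini"}


def normalize_transport(value: object) -> object:
    """Map known aliases to their canonical transport key.

    Anything that is not a recognized alias is returned unchanged, so the
    existing Literal check still rejects genuinely invalid values.
    """
    if not isinstance(value, str):
        return value
    cleaned = value.strip().lower()
    return TRANSPORT_ALIASES.get(cleaned, value)


print(normalize_transport("google"))
print(normalize_transport("gemini"))
```

Only alias matches are case-folded and trimmed here; canonical values pass through untouched, which keeps the change to validation behaviour as narrow as possible.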

## Defect 4 — MCP `chat` tool swallows the underlying error

`src/mcp/server.ts:807` propagates only a generic `"Error: An unexpected error occurred"` to the MCP client when the upstream `/peers/{id}/chat` returns 500. The actual cause (`Missing API key for openai model config`) is lost. Operators must SSH to the host and grep `docker logs honcho-api-1` to discover the issue.

**Suggested fix**

- Forward the `detail` field from FastAPI 4xx/5xx responses to the MCP error message (or at minimum HTTP status + first line of response body).
- Add a `honcho_status` MCP tool that pings `/health` + checks resolved model configs + reports last successful `chat()` and last successful `dream` cycle.
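
The forwarding logic is small. The MCP server itself is TypeScript, so the Python below is only a language-neutral sketch of the rule: prefer the `detail` field, fall back to the first line of the body, and always include the status. `describe_upstream_error` is a hypothetical helper, and whether the API's 500 response actually carries a `detail` field for this failure is an assumption.

```python
import json


def describe_upstream_error(status: int, body: str) -> str:
    """Build an MCP-facing error message from an upstream HTTP error response."""
    detail = ""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        payload = None

    if isinstance(payload, dict) and payload.get("detail"):
        raw = payload["detail"]
        # FastAPI validation errors put a list here, so serialize non-strings.
        detail = raw if isinstance(raw, str) else json.dumps(raw)
    elif body.strip():
        # Not JSON, or no detail field: keep the first line, capped in length.
        detail = body.strip().splitlines()[0][:200]

    prefix = f"Error: upstream returned HTTP {status}"
    return f"{prefix}: {detail}" if detail else prefix


print(describe_upstream_error(500, '{"detail": "Missing API key for openai model config"}'))
```

Capping the fallback line avoids dumping an HTML error page into the MCP client while still giving the operator something to search for.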

## Defect 5 — Tenacity retry masks non-retryable validation errors

`src/dialectic/core.py:413` raises `tenacity.RetryError[ValidationException]` after retrying a `Missing API key` failure. A config error cannot succeed on retry; the retry just delays the failure and obscures the root exception in stack traces.

**Suggested fix**

- Mark `ValidationException` (and any other config-time exception) as non-retryable in the tenacity policy.
- Surface the original exception via `raise ... from` (already done in some places, missed here).
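
The intended policy can be shown with a hand-rolled retry loop. This is a standard-library sketch, not the project's tenacity setup: `call_with_retry` is hypothetical and `ValidationException` here is a stand-in for the project's class. In tenacity the equivalent is probably `retry=retry_if_not_exception_type(ValidationException)` combined with `reraise=True`, but how that fits the existing decorator at `src/dialectic/core.py:413` is not verified.

```python
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


class ValidationException(Exception):
    """Stand-in for the project's exception of the same name."""


def call_with_retry(
    fn: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 0.0,
    non_retryable: Tuple[Type[BaseException], ...] = (ValidationException,),
) -> T:
    """Retry transient failures, but let config-time errors escape immediately."""
    last_error: Optional[BaseException] = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except non_retryable:
            # A config error cannot succeed on retry: re-raise it unwrapped.
            raise
        except Exception as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay_seconds)
    # Chain the last failure so the root cause stays in the traceback.
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


calls = []


def missing_key() -> str:
    calls.append(1)
    raise ValidationException("Missing API key for openai model config")


try:
    call_with_retry(missing_key)
except ValidationException as exc:
    print(f"failed after {len(calls)} call(s): {exc}")
```

The caller sees the original `ValidationException` after a single attempt, rather than a `RetryError` wrapper after several.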

## Defect 6 — Dreamer failures are invisible

When the dream specialists' model config is broken:

- `schedule_dream` (or the auto-trigger from `crud/representation.py:177`) fails before `enqueue_dream` is called
- No row appears in the `queue` table with `payload->>task_type = 'dream'`
- The deriver continues processing `representation` / `summary` / `webhook` tasks normally
- The peer card decays silently (no contradiction detection, no induction, no consolidation) because the curation pipeline is dead — but every other observable metric (queue depth, explicit observation count, API uptime) looks healthy

In our case: **0 dreams for 4 days**, only discovered while debugging an unrelated `peer.chat()` issue.

**Suggested fix**

- Add a Prometheus metric `honcho_last_dream_completion_timestamp_seconds{workspace,observer}`. Operators can alert on it.
- Add a "stale dreams" warning in the API health endpoint when no dream has completed for `MIN_HOURS_BETWEEN_DREAMS * 3` hours even though dreaming is enabled.
- Capture and log scheduler-side exceptions distinctly from worker-side exceptions, so a "dreamer can't start" condition emits a loud structured error instead of being lost.
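
The staleness rule for the health endpoint reduces to one comparison. The sketch below assumes the last completion timestamp is already available from somewhere (a metric, a table, or in-process state). `dreams_are_stale` is a hypothetical helper, and the value `8` for `MIN_HOURS_BETWEEN_DREAMS` is illustrative only.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def dreams_are_stale(
    last_completion: Optional[datetime],
    now: datetime,
    min_hours_between_dreams: float,
    enabled: bool,
    multiplier: int = 3,
) -> bool:
    """Return True when dreaming is enabled but no dream completed recently."""
    if not enabled:
        return False  # nothing to warn about when the feature is off
    if last_completion is None:
        # No dream has ever completed; treated as stale in this sketch.
        return True
    threshold = timedelta(hours=min_hours_between_dreams * multiplier)
    return now - last_completion > threshold


now = datetime(2026, 4, 24, tzinfo=timezone.utc)
four_days_ago = now - timedelta(days=4)
print(dreams_are_stale(four_days_ago, now, min_hours_between_dreams=8, enabled=True))
```

Treating "never completed" as stale would warn immediately on a fresh install; a real check might compare against process start time instead.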

## Combined impact

A single misnamed env var (typo, stale doc, copy from older Honcho version) currently silently degrades the system over days:

1. The dialectic returns 500s
2. The dreamer stops, peer card stops self-curating
3. Operator sees the dialectic 500 first (because it's user-facing), fixes that, but doesn't realize the dreamer was equally affected unless they check observation counts manually.

Each individual defect is small. Together they produce a high-cost, low-visibility failure mode that's hostile to self-hosted operators. The two highest-leverage fixes IMO are: (a) **boot-time config validation that logs resolved model configs and warns on unknown env vars**, and (b) **a `last_dream` health signal** to detect silent dreamer death.

## Environment

- Honcho self-hosted via `docker-compose.yml` (5 containers: api, deriver, postgres, redis, tei)
- Embedding via local TEI service (Hugging Face), other LLMs via Gemini
- ~5800 conclusions accumulated over ~3 weeks of usage
- Config via `.env` file (no `config.toml`)

