
Config validation is silent and inconsistent: misnamed env vars are dropped without warning, dreamer/dialectic break invisibly, and MCP errors lose the underlying cause #608

@haroldparis

Summary

A self-hosted instance running honcho-api + honcho-deriver (release ~2026-04-20) suffered a silent regression where:

  • peer.chat() returned a 500 error on every call for ~4 days
  • the dreamer produced zero new deductive/inductive observations during that period
  • the explicit deriver kept running, masking the issue

Root cause: environment variables intended to override LLM provider/model for DIALECTIC, DERIVER, SUMMARY and DREAM were named at the wrong level of nesting (DIALECTIC_LEVELS__medium__PROVIDER=google instead of DIALECTIC_LEVELS__medium__MODEL_CONFIG__TRANSPORT=gemini). Because all relevant settings classes set extra="ignore", Pydantic accepted the misnamed vars without complaint and silently fell back to the hardcoded default transport="openai", model="gpt-5.4-mini". With no OPENAI_API_KEY configured, client_for_model_config raised ValidationException("Missing API key for openai model config") at src/llm/registry.py:126.

This issue groups six related defects that together made a 5-minute config bug burn 4 days of memory curation. Each could be fixed independently.

Defect 1 — Silent fallback on misnamed env vars

SettingsConfigDict(env_prefix="DIALECTIC_", env_nested_delimiter="__", extra="ignore") (cf. src/config.py:872, 1090, 710) means typos and stale variable names are accepted without warning. The hardcoded defaults transport="openai", model="gpt-5.4-mini" then take over.

Reproduction

```bash
# Operator follows out-of-date docs/blog post
echo 'DIALECTIC_LEVELS__medium__PROVIDER=google' >> .env
echo 'DIALECTIC_LEVELS__medium__MODEL=gemini-3.1-flash-lite-preview' >> .env
echo 'LLM_GEMINI_API_KEY=...' >> .env
docker compose up -d

# Boot succeeds. Dialectic crashes only on first peer.chat() call, hours later.
```

Suggested fix

  • Switch to extra="forbid" for all *Settings classes whose schema is well-defined, OR
  • Emit a startup warning listing any env var matching the prefix that did not map to a known field (similar to systemd's unrecognized assignment warnings).
  • Add a config self-check on startup: log the resolved transport + model for each major subsystem (DIALECTIC levels, DERIVER, SUMMARY, DREAM specialists). One log line per resolved model config, at INFO level. Operators currently have no way to see what was actually loaded without dropping into a Python REPL.
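
To make the second option concrete, here is a minimal sketch of the unknown-variable check. It uses only the standard library and a hand-written stand-in schema, so `DIALECTIC_SCHEMA`, `unknown_env_vars` and the `"*"` wildcard convention are all hypothetical names for illustration, not Honcho's actual settings classes. Matching is case-insensitive, which is an assumption about how the real loader behaves.

```python
from typing import Any, Mapping

_MISSING = object()

# Hypothetical, simplified stand-in for the DIALECTIC settings schema.
# Nested dicts are sections, None marks a leaf field, and "*" matches any
# key (for example a level name such as "medium").
DIALECTIC_SCHEMA: dict[str, Any] = {
    "levels": {"*": {"model_config": {"transport": None, "model": None}}},
}


def unknown_env_vars(
    prefix: str,
    schema: Mapping[str, Any],
    environ: Mapping[str, str],
    delimiter: str = "__",
) -> list[str]:
    """Return env var names that carry the prefix but map to no schema field."""
    unknown = []
    for name in sorted(environ):
        if not name.upper().startswith(prefix.upper()):
            continue
        node: Any = schema
        # Walk the schema one path segment at a time.
        for part in name[len(prefix):].lower().split(delimiter):
            if isinstance(node, Mapping) and part in node:
                node = node[part]
            elif isinstance(node, Mapping) and "*" in node:
                node = node["*"]
            else:
                node = _MISSING  # walked off the schema: unknown field
                break
        if node is _MISSING:
            unknown.append(name)
    return unknown
```

Called at startup as `unknown_env_vars("DIALECTIC_", DIALECTIC_SCHEMA, os.environ)`, this would flag both `DIALECTIC_LEVELS__medium__PROVIDER` and `DIALECTIC_LEVELS__medium__MODEL` from the reproduction above, while leaving `DIALECTIC_LEVELS__medium__MODEL_CONFIG__TRANSPORT` alone. A real implementation would presumably derive the schema from the settings classes rather than duplicate it by hand.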

Defect 2 — Inconsistent config layout across subsystems

| Subsystem | Resolved path |
| --- | --- |
| DIALECTIC | `DIALECTIC.LEVELS[<level>].MODEL_CONFIG` |
| DERIVER | `DERIVER.MODEL_CONFIG` |
| SUMMARY | `SUMMARY.MODEL_CONFIG` |
| DREAM | `DREAM.DEDUCTION_MODEL_CONFIG` and `DREAM.INDUCTION_MODEL_CONFIG` (no root `MODEL_CONFIG`) |

An operator who correctly fixes DIALECTIC_LEVELS__medium__MODEL_CONFIG__TRANSPORT will instinctively try DREAM_MODEL_CONFIG__TRANSPORT for symmetry. That variable matches no field, so it silently does nothing (see Defect 1).

Suggested fix

  • Either harmonize: give DREAM a root MODEL_CONFIG that acts as default for both specialists.
  • Or document the asymmetry prominently in .env.template and add explicit DREAM_DEDUCTION_MODEL_CONFIG__* and DREAM_INDUCTION_MODEL_CONFIG__* examples (currently only generic DREAM_* variants appear in some examples / older docs).
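
For the harmonize option, the fallback could look roughly like the sketch below. It uses plain dataclasses rather than the real settings classes, and the root `MODEL_CONFIG` field, the `resolve` method and the `ModelConfig` shape are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelConfig:
    # Simplified stand-in; the real model config has more fields.
    transport: str
    model: str


@dataclass
class DreamSettings:
    # Hypothetical root default shared by both specialists.
    MODEL_CONFIG: Optional[ModelConfig] = None
    DEDUCTION_MODEL_CONFIG: Optional[ModelConfig] = None
    INDUCTION_MODEL_CONFIG: Optional[ModelConfig] = None

    def resolve(self, specialist: str) -> ModelConfig:
        """Return the specialist's own config, else the root default."""
        if specialist not in ("deduction", "induction"):
            raise ValueError(f"unknown dream specialist: {specialist!r}")
        specific = getattr(self, f"{specialist.upper()}_MODEL_CONFIG")
        resolved = specific or self.MODEL_CONFIG
        if resolved is None:
            raise ValueError(f"no model config set for dream specialist {specialist!r}")
        return resolved
```

With this shape, setting only the root config would cover both specialists, and a specialist-level value would still win when present.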

Defect 3 — transport value naming clashes with operator mental model

Literal["anthropic", "openai", "gemini"] (cf. src/config.py:25). The transport key for Google's API is gemini, but operator-facing material (env templates, blog posts, possibly older docs) commonly uses google as the provider name. Operators write transport=google and Pydantic refuses; or they write _PROVIDER=google (Defect 1) and it's silently dropped.

Suggested fix

  • Accept both gemini and google as transport values via a BeforeValidator that normalizes to gemini.
  • Or rename the literal to google and keep gemini as a deprecated alias with a warning.
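
A minimal sketch of the normalizing step for the first option, kept free of Pydantic so it runs on its own. `normalize_transport` and `_TRANSPORT_ALIASES` are hypothetical names; the function also lower-cases and strips its input, which is an extra leniency beyond what this issue asks for.

```python
from typing import Any

# Operator-facing provider names mapped to the transport literal they mean.
_TRANSPORT_ALIASES = {"google": "gemini"}


def normalize_transport(value: Any) -> Any:
    """Map known aliases to their canonical transport name.

    Non-string input is returned unchanged so that the downstream
    Literal check can reject it with its usual error.
    """
    if not isinstance(value, str):
        return value
    cleaned = value.strip().lower()
    return _TRANSPORT_ALIASES.get(cleaned, cleaned)
```

In the settings classes this function would presumably be attached to the transport field through `Annotated[..., BeforeValidator(normalize_transport)]`, so that unknown values such as `"bogus"` still fail the existing Literal validation.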

Defect 4 — MCP chat tool swallows the underlying error

src/mcp/server.ts:807 propagates only a generic "Error: An unexpected error occurred" to the MCP client when the upstream /peers/{id}/chat returns 500. The actual cause (Missing API key for openai model config) is lost. Operators must SSH to the host and grep docker logs honcho-api-1 to discover the issue.

Suggested fix

  • Forward the detail field from FastAPI 4xx/5xx responses to the MCP error message (or at minimum HTTP status + first line of response body).
  • Add a honcho_status MCP tool that pings /health + checks resolved model configs + reports last successful chat() and last successful dream cycle.
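
The MCP server itself is TypeScript, so the following is only a language-neutral sketch of the forwarding logic, written in Python to match the rest of the codebase. `describe_upstream_error` is a hypothetical helper, and it assumes the API returns FastAPI's default `{"detail": ...}` error body.

```python
import json


def describe_upstream_error(status: int, body: str, max_len: int = 300) -> str:
    """Build an error message that keeps the upstream cause visible."""
    detail = None
    try:
        parsed = json.loads(body)
    except ValueError:
        parsed = None  # body is not JSON (for example a proxy error page)
    if isinstance(parsed, dict) and "detail" in parsed:
        raw = parsed["detail"]
        # "detail" may be a string or a structured validation error list.
        detail = raw if isinstance(raw, str) else json.dumps(raw)
    if not detail:
        # Fall back to the first line of the raw body.
        stripped = body.strip()
        detail = stripped.splitlines()[0] if stripped else "no response body"
    return f"Upstream error (HTTP {status}): {detail[:max_len]}"
```

For the failure described in this issue, the MCP client would then see `Upstream error (HTTP 500): Missing API key for openai model config` instead of the generic message. Whether internal error details should be truncated or filtered before reaching clients is a separate decision for the maintainers.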

Defect 5 — Tenacity retry masks non-retryable validation errors

src/dialectic/core.py:413 raises tenacity.RetryError[ValidationException] after retrying a Missing API key failure. A config error cannot succeed on retry; the retry just delays the failure and obscures the root exception in stack traces.

Suggested fix

  • Mark ValidationException (and any other config-time exception) as non-retryable in the tenacity policy.
  • Surface the original exception via raise ... from (already done in some places, missed here).
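
The intended behaviour can be sketched with a hand-rolled retry loop, shown here instead of tenacity so the example has no dependencies. `call_with_retry` and `NON_RETRYABLE` are hypothetical names, and the `ValidationException` class below is a stand-in for the project's own.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ValidationException(Exception):
    """Stand-in for the project's config-time exception."""


# Errors that cannot succeed on retry and should fail immediately.
NON_RETRYABLE = (ValidationException,)


def call_with_retry(func: Callable[[], T], attempts: int = 3, delay_seconds: float = 0.0) -> T:
    """Retry transient failures, but let config errors through untouched."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except NON_RETRYABLE:
            raise  # original exception and traceback are preserved
        except Exception as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay_seconds)
    # Chain the last transient error so the root cause stays in the trace.
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

In tenacity terms this corresponds roughly to passing `retry=retry_if_not_exception_type(ValidationException)` together with `reraise=True` to the existing policy, though the exact policy at src/dialectic/core.py may need other adjustments.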

Defect 6 — Dreamer failures are invisible

When the dream specialists' model config is broken:

  • schedule_dream (or the auto-trigger from crud/representation.py:177) fails before enqueue_dream is called
  • No row appears in the queue table with payload->>task_type = 'dream'
  • The deriver continues processing representation / summary / webhook tasks normally
  • The peer card decays silently (no contradiction detection, no induction, no consolidation) because the curation pipeline is dead — but every other observable metric (queue depth, explicit observation count, API uptime) looks healthy

In our case: 0 dreams for 4 days, only discovered while debugging an unrelated peer.chat() issue.

Suggested fix

  • Add a Prometheus metric `honcho_last_dream_completion_timestamp_seconds{workspace,observer}`. Operators can alert on it.
  • Add a "stale dreams" warning in the API health endpoint when no dream has completed for `MIN_HOURS_BETWEEN_DREAMS * 3` hours despite enabled flag.
  • Capture and log scheduler-side exceptions distinctly from worker-side exceptions, so a "dreamer can't start" condition emits a loud structured error instead of being lost.
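
The "stale dreams" check from the second bullet could be as small as the sketch below. `stale_dream_warning` and its parameters are hypothetical; where the last completion timestamp comes from (a queue table, a metric, a dedicated column) is left open. Note that this version also warns when no dream has ever completed, which may be too noisy on a freshly booted instance.

```python
from typing import Optional


def stale_dream_warning(
    last_completion_ts: Optional[float],
    now_ts: float,
    min_hours_between_dreams: float,
    enabled: bool = True,
    multiplier: float = 3.0,
) -> Optional[str]:
    """Return a warning string if dreams look stalled, else None.

    Timestamps are Unix seconds. The threshold follows the suggestion
    above: MIN_HOURS_BETWEEN_DREAMS * 3 hours.
    """
    if not enabled:
        return None
    threshold_hours = min_hours_between_dreams * multiplier
    if last_completion_ts is None:
        return "no dream has ever completed"
    age_hours = (now_ts - last_completion_ts) / 3600
    if age_hours > threshold_hours:
        return (
            f"last dream completed {age_hours:.1f}h ago "
            f"(threshold {threshold_hours:.1f}h)"
        )
    return None
```

The health endpoint could include the returned string in its payload, and the same timestamp could back the proposed Prometheus gauge.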

Combined impact

A single misnamed env var (typo, stale doc, copy from older Honcho version) currently silently degrades the system over days:

  1. The dialectic returns 500s
  2. The dreamer stops, peer card stops self-curating
  3. Operator sees the dialectic 500 first (because it's user-facing), fixes that, but doesn't realize the dreamer was equally affected unless they check observation counts manually.

Each individual defect is small. Together they produce a high-cost, low-visibility failure mode that's hostile to self-hosted operators. The two highest-leverage fixes IMO are: (a) boot-time config validation that logs resolved model configs and warns on unknown env vars, and (b) a last_dream health signal to detect silent dreamer death.

Environment

  • Honcho self-hosted via `docker-compose.yml` (5 containers: api, deriver, postgres, redis, tei)
  • Embedding via local TEI service (Hugging Face), other LLMs via Gemini
  • ~5800 conclusions accumulated over ~3 weeks of usage
  • Config via `.env` file (no `config.toml`)
