Skip to content

docs(experiments): add model experiments spec#3463

Merged
markijbema merged 4 commits into
mainfrom
mark-model-experiments-spec
May 25, 2026
Merged

docs(experiments): add model experiments spec#3463
markijbema merged 4 commits into
mainfrom
mark-model-experiments-spec

Conversation

@markijbema
Copy link
Copy Markdown
Contributor

Summary

  • Add .specs/model-experiments.md as the durable source of truth for model experiment scope, routing, lifecycle, retention, reporting, and secret-handling rules.
  • Register the spec in AGENTS.md so future model-experiment changes read the business rules before editing the domain.

Verification

  • Not manually exercised; docs-only spec and registration change.

Visual Changes

N/A

Reviewer Notes

  • The spec is intentionally business-rule focused and excludes implementation-plan details unless they protect an invariant.

@kilo-code-bot
Copy link
Copy Markdown
Contributor

kilo-code-bot Bot commented May 25, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Executive Summary

Docs-only PR adding the model-experiments spec; all four incremental commits are spec refinements with no logic or security concerns. The activation-minimum WARNING from the previous review is now resolved.

Incremental Changes Reviewed (all 4 commits)
  • docs(experiments): add model experiments spec — Initial spec file and AGENTS.md registration.
  • remove redundant info — Removed the "Status" (Draft) and "Conventions" (RFC 2119) boilerplate sections.
  • docs(experiments): loosen prompt capture cap requirement — Changed prompt-body cap from MUST to MAY, with conditional MUST rules applying only when truncation is used.
  • docs(experiments): allow single-variant activation — Changed activation minimum from "at least two variants" to "at least one variant". This resolves the previous WARNING that flagged a conflict with the implementation (which only rejects variants.length < 1). The PR author confirmed this is intentional: the spec should support sequential (single-variant) testing, not only A/B.
Previous Issues — Status
File Issue Status
.specs/model-experiments.md line 11 WARNING: Scope exclusions (BYOK, kilo-internal) not enforced by current routing/admin path Open (acknowledged by author as intentional scoping)
.specs/model-experiments.md WARNING: Activation minimum conflicts with current behavior (spec said ≥2, impl accepts ≥1) Resolved — changed to "at least one variant" in latest commit
Files Reviewed (2 files)
  • .specs/model-experiments.md — 122-line spec; activation-minimum WARNING resolved
  • AGENTS.md — spec registration row; no issues

Reviewed by claude-sonnet-4.6 · 286,158 tokens

Review guidance: REVIEW.md from base branch main

Copy link
Copy Markdown
Contributor

@jeanduplessis jeanduplessis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left two warning comments where the proposed source-of-truth spec appears stricter than current model-experiment behavior.


Model experiments exist only to A/B test preview or otherwise experimental model checkpoints in partnership with model providers. They are not a general-purpose traffic-splitting or rollout mechanism for production models.

An experimented `public_model_id` MUST be a dedicated preview or experiment id that users explicitly select. Production model ids MUST NOT be silently bucketed. Experimented ids MUST NOT be added to `kilo-auto` candidate sets, presets, or other automatic selection paths unless the spec is explicitly changed to allow that behavior.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING: Scope exclusions are not enforced by the current routing/admin path.

Trace:

  1. The spec adds hard guarantees that experimented IDs are dedicated preview/experiment IDs and that kilo-internal/... traffic is outside model-experiment routing.
  2. Current admin/routing code only requires public_model_id to be a non-empty string; assertActivatable does not reject production IDs or kilo-internal/...; getProvider checks experiment membership before kilo-internal/... custom LLM routing.
  3. An admin can create/activate an experiment for a production model ID or kilo-internal/... ID, after which membership causes the experiment path to run despite the new spec saying those states MUST NOT happen.

Impact: The merged spec would claim opt-in/preview-only and kilo-internal/... exclusion invariants that current code does not preserve. A misconfigured active experiment can silently reroute production or internal-model traffic and capture experiment attribution/prompts under a contract that says that traffic is out of scope.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is true; i scoped this out. I think we might want to reconsider how exactly we serve these experiments (it might make sense to combine them with a fallback for when an experiment pauses for instance, and put them behind another model (like kilo-auto/free). However, we do these changes manually so i think we should always do this carefully.

Comment thread .specs/model-experiments.md Outdated
- `paused`: gateway returns a local model-unavailable response for the experimented public id and MUST NOT silently fall through to default model routing.
- `completed`: historical and non-routing; removed from routing membership and eligible to coexist with a draft or active replacement for the same public id.

Activation MUST validate that the experiment has at least two variants, every variant has positive weight, every variant has a current version effective at or before activation time, and no other active or paused experiment targets the same public id.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING: Activation minimum conflicts with current behavior.

Trace:

  1. The spec requires activation to validate at least two variants.
  2. Current implementation rejects only variants.length < 1; the admin UI says “at least 1 variant”; and model-experiments-router.test.ts explicitly activates a one-variant experiment successfully.
  3. The repository already permits and tests a one-variant active experiment, while the new source-of-truth spec says activation MUST reject it.

Impact: Future reviewers/implementers will rely on a false invariant. One-variant experiments can be activated today, so the experiment may not be an A/B test and cohort/reporting assumptions based on “at least two variants” can be wrong.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh nice catch, this is actually wrong. Though we started out with a/b testing i also want this to be used with purely sequential testing

@markijbema markijbema merged commit 2994039 into main May 25, 2026
14 checks passed
@markijbema markijbema deleted the mark-model-experiments-spec branch May 25, 2026 14:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants