References

This page collects sources, evidence levels, and experiment details for the Meta-Programming documentation.

Evidence Levels

Marker	Level	Meaning
🟢	Proven	Our experiment, our data, measured result
🟡	Trusted source	Anthropic, Microsoft Research, peer-reviewed — we read the primary source
🟠	Community reports	Widely observed, not independently verified by us
🔴	Unverified	Heard, not checked
⚪	Opinion	Our synthesis, reasoned but not proven

Our Experiments

#	Description	Key Finding	Pages
1	EventBus refactor A/B (5 variants, 509 TS files, DDD, Cloudflare Workers)	Process beats information: Scout→Spec→Worker→Review pipeline ($8.45) outperformed raw context injection ($2.84 fail, $14.30 pass). Code maps hurt: $9.99 fail vs $6.63 pass without map.	pipeline, principles
2	Telegram bot without spec (support reply feature)	Two consecutive deterministic failures without spec. First corrupted existing handler, second missed intent entirely.	specification, principles
3	24-file type refactor	253 tests, zero regressions, $5.50 API cost. Review required three iterations (both failure types were missing pre-flight checks). 106-turn task didn't contaminate subsequent tasks due to context reset.	index, verification, pipeline
4	Spec review with barrel re-export	Agent proposed barrel re-export during spec review, recognized it was unnecessary when asked to explain. Spec review is cheaper than code review.	specification, principles
5	Pipeline variant comparison	Structured pipeline: correct on first attempt, $8.45. Raw prompt: wrong, $2.84. Sequential with context: correct, $14.30. With code map: wrong, $9.99.	pipeline
6	KB A/B test (generic vs structured)	Generic Sonnet said "give it a code map." KB-loaded agent flagged exploration-vs-exploitation paradox with prior session evidence before writing code.	index, principles
7	Edit tool investigation	Persistent error pattern traced to our own extension, not the platform. Post-fix benchmark: 7.1% errors, 1.1% data loss.	index
8	Model evaluation (thinking levels)	Thinking level acts as compliance-to-conviction dial. Soft sycophancy identified: agent says no while providing implementation.	index
9	Opus degradation incident (April 2026)	Read:Edit ratio dropped from 6.6 to 2.0, thinking depth fell 67%, costs spiked 80×. Three-day silent degradation with no API-side signal.	verification, principles
10	Opus 4.7 release analysis (April 16, 2026)	Tokenizer inflation 1.25–1.35× tokens per request, `budget_tokens` silently rerouted to `task_budget`, `xhigh` as new default, self-verification +~15% output tokens. Modelled workload $118K → $157.5K (+33.5%) at unchanged nominal pricing.	landscape, verification

External Sources

Large-Scale Studies

LinearB (2026). "The Real Impact of AI on Developer Productivity." 8.1M pull requests, 4,800 teams, 42 countries. AI-generated code: 1.7× more review revisions, 4.6× longer review wait, 32.7% acceptance rate (vs 84.4% human). Developers feel 20% faster; tasks take 19% longer end-to-end. 🟡
- Referenced in: index, verification, principles
Meta-Harness (Lee, Nair, Zhang, Lee, Khattab, Finn — Stanford + MIT, arxiv 2603.28052, March 2026). End-to-end optimization of model harnesses. 6× performance gap on the same benchmark from harness changes alone. Meta-Harness agentic proposer searches harness code via filesystem: +7.7 points on online text classification with 4× fewer context tokens, +4.7 points on retrieval-augmented math (200 IMO-level), 76.4% on TerminalBench-2 (top auto-optimized system). 🟡
- Referenced in: index, principles
ETH Zurich (Feb 2026). 138 real-world tasks across 3 models. Auto-generated context files reduced task success by 3%. Human-written boundaries improved success by 4%. 🟡
- Referenced in: specification, pipeline, principles
Sonar (2026). Survey of 1,000+ developers. Only 48% verify AI output before shipping. 🟡
- Referenced in: index
Microsoft Copilot Study (2026). 10-month study, 878 pull requests. "The bottleneck moved from typing speed to knowledge, judgment, and ability to articulate tasks." 🟡
- Referenced in: index
BSWEN (2026). 133 cycles, 42 development phases, four models in strict isolation. GPT caught Python security issues Claude missed. Claude caught architectural violations GPT normalized. Each model had different blind spots. 🟡
- Referenced in: verification, principles
Bamberg/Heidelberg (2026). Systematic analysis of 2,926 repositories across Claude Code, GitHub Copilot, Cursor, Gemini CLI, Pydantic AI. Converging on identical patterns independently. 🟡
- Referenced in: index

Research Papers

Tsinghua NLAH (March 2026). "Natural-Language Agent Harnesses." Harness behavior externalized as "a portable executable artifact in editable natural language." 🟡
- Referenced in: index
Microsoft Research RiSE (March 2026). Lahiri et al. "Intent Formalization" named as a grand challenge for 2026. Intent gap: the semantic distance between what a developer means and what the system does. 🟡
- Referenced in: index, specification
ERL — Experiential Reflective Learning (Allard, Teinturier, Xing, Viaud, arxiv 2603.24639, March 2026). Agents with heuristics extracted from prior trajectories outperformed ReAct baselines by +7.8% on the Gaia2 benchmark. Two-component framework: heuristic generation from task experience + retrieval-augmented execution for new tasks. "Heuristics provide more transferable abstractions than few-shot prompting." 🟡
- Referenced in: index
ExpeL (Andrew Zhao et al., 2023). Experience Learning: three-stage self-improvement (act → reflect → extract). On HotpotQA and ALFWorld, ExpeL agents improve with each batch of trajectories. 🟡
- Referenced in: self-improvement, landscape
Chroma 'Context Rot' (July 2025). 18 frontier models tested (GPT-4.1, Claude 4, Gemini 2.5, Qwen3, others). Universal degradation with input length: 20-50% accuracy drop between 10K and 100K input tokens (NIAH low-similarity), >30% accuracy loss in mid-window positions across all 18 models. Coined "Context Rot" as continuous decline, not overflow. 🟡
- Referenced in: context-engineering, principles

12b. Paulsen MECW (arxiv Oct 2025, pub Jan 2026). "Context Is What You Need: The Maximum Effective Context Window for Real World Limits of LLMs." Separate paper from Chroma; coined the term MECW. Effective context is task-dependent: simple retrieval tolerates ~3000-5000 tokens, complex operations (sort, summarize) collapse at 400-1200 tokens, some top models fail at as few as ~100 tokens on specific tasks. Effective window can be reduced "as much as 99%" of advertised MCW on worst-case structured tasks. 🟡 - Referenced in: context-engineering

Reflexion (NeurIPS 2023). Verbal reflection with episodic memory raised HumanEval from 80% to 91%. 🟡
- Referenced in: landscape
ADAS (ICLR 2025). Automated Design of Agentic Systems. The agent designs its own pipeline structure. 🟡
- Referenced in: landscape
Gödel Agent (ICLR 2025). Recursive self-modification via confidence-based logic. 🟡
- Referenced in: landscape
DKB (January 2026). Deterministic Knowledge Bases. AST graphs beat vector RAG and LLM-generated knowledge graphs for code navigation. 🟡
- Referenced in: landscape
AMBIG-SWE (ICLR 2026). Benchmark for ambiguity detection in software engineering tasks. 🟡
- Referenced in: specification

17b. Specification Gap (Chacón Sartori, ICN2 Barcelona, arxiv 2603.24284, March 2026). 51 class-generation tasks across four spec detail levels, single and multi-agent. Single-agent: 89% → 56% as spec details are removed. Multi-agent: 58% → 25%. 16pp coordination cost plus 11pp information asymmetry, additive. AST conflict detector at 97% precision: Δ = 0pp. Restoring full spec recovers 89% ceiling. 🟡 - Referenced in: specification, pipeline, landscape

17c. Context Engineering (Calboreanu, Swift North AI Lab, arxiv 2604.04258, April 2026). Five-role context package: Authority, Exemplar, Constraint, Rubric, Metadata. 200 documented interactions across four tools. Incomplete context triggered 72% of iteration cycles. Structured package: iterations 3.8 → 2.0, first-pass acceptance 32% → 55%. 🟡 - Referenced in: specification, context-engineering, landscape

17d. SLUMP — Specification Loss Under eMergent sPecification (Purdue, arxiv 2603.17104, March 2026). Specifications that emerge during a session drift from the original problem statement as conversation extends. ProjectGuard external state tracker recovers 90% of the faithfulness gap, cuts severe failures from 72 to 49 on benchmark. 🟡 - Referenced in: specification, landscape, self-improvement

17e. Intent Gap (tianpan.co, April 10, 2026). Intent misalignment accounts for ~32% of dissatisfactory LLM responses in production — largest single category. Four-layer user input model: immediate text, final goal, background desiderata, autonomy. Salesforce production: 58% single-turn success, 35% multi-turn. 67% resolution rate even after user correction. 🟡🟠 - Referenced in: specification, landscape

17f. Behavioral Drivers (Mehtiyev & Assunção, NCSU, arxiv 2604.02547, April 2026). 9,374 agent trajectories across 19 agents. Trajectory structure discriminates success: "gather context before editing, invest in validation" is agent-determined, not task-adaptive. Framework effect shrinks with each generation of base model. 🟡 - Referenced in: pipeline, principles, landscape

17g. Cognitive Companion (Khan & Khan, IBM Dublin, arxiv 2604.13759, April 2026). Four cognitive states: ON_TRACK, LOOPING, DRIFTING, STUCK. Two detector architectures: LLM-based companion (periodic structured prompt, −52–62% repetition, 11% overhead, API-accessible) and probe-based (linear classifier on hidden states layer 28, AUROC 0.84, requires open weights). 🟡 - Referenced in: verification, self-improvement

17h. AGENTS.md paper (arxiv 2602.11988). Context files actively reduce SWE-bench success rates past 500 lines. Cliff drop, not gradual. Sweet spot: 200–300 actionable lines. "Not giving the model a mental model, giving it a compliance checklist." 🟡 - Referenced in: specification, context-engineering, landscape

17i. Expectation-Realisation Gap (Lobentanzer et al., arxiv 2602.20292, February 2026). Extension of METR study. 16 developers expected +24% productivity from AI tools, measured −19%. 43-point calibration error. 🟡 - Referenced in: verification

17j. WebXSkill (Microsoft + UNC, arxiv 2604.13318, April 2026). Skill defined as parameterized action + natural-language guidance. +9.8 / +12.9 points on WebArena / WebVoyager against baseline. Concrete instance of Layer 2 intent form selection. 🟡 - Referenced in: index, self-improvement

ACE Framework. Agentic Context Engineering. Memory scoring: each unit carries a score that updates on use. Quality saturates at ~7 governed memories per entity across 500 adversarial queries. 🟡
- Referenced in: self-improvement
Pavlyshyn (Jan 2026). History of constrained natural language in programming: COBOL, SQL, Simula. 60-year progression. 🟡
- Referenced in: specification

Vendor Documentation & Posts

Anthropic — "Building Agents with Skills." Skills as zero-cost-until-invoked context units. Self-evaluation bias documented explicitly. Claude Code architecture: tiered context loading, compaction, coordinator mode. 🟡
- Referenced in: index, context-engineering, verification, principles
DSPy (Stanford, 33K stars). Prompts as learnable parameters. BootstrapFewShot, MIPROv2 search language space automatically. 🟡
- Referenced in: index, landscape
Pydantic (2026). Analyzed 4,668 pull request comments, extracted 150 AGENTS.md rules. Engineering taste compiled into agent instructions. 🟡
- Referenced in: specification
Promptfoo (acquired by OpenAI, March 2026, $86M). 350K developers. Trajectory assertions: tool-used, tool-args-match, tool-sequence, step-count, goal-success. 🟡
- Referenced in: verification, landscape
spec-kit (79K stars). GitHub's 5-command SDD workflow: constitution → specify → plan → tasks → implement. 20+ agents. 🟡
- Referenced in: specification, landscape
Kiro (Amazon). IDE built around requirements → design → tasks. Specs live in project root, evolve with codebase. 🟡
- Referenced in: specification, landscape
AGENTS.md. 60,000+ repositories. Linux Foundation / Agentic AI Foundation standard. Cross-tool coordination protocol with Anthropic, Google, Microsoft, Cursor backing. 🟡
- Referenced in: index, specification, landscape
Augment Code. Single-writer rule for hotspot files. Sequential merge strategy. 🟡
- Referenced in: pipeline
OpenTelemetry GenAI SIG. Semantic conventions: gen_ai.chat for LLM calls, agent.invoke for agent steps, tool.execute for tool calls. Datadog, Honeycomb, New Relic support natively. 🟡
- Referenced in: verification
Simon Willison (@simonw). Rigorous public reference on agentic engineering. Tests are free and mandatory. Agents follow existing code patterns. 🟡🟠
- Referenced in: context-engineering, landscape
Martin Fowler. Spec progression: spec-first → spec-anchored → spec-as-source. Maturity curve mapping. 🟡
- Referenced in: index

30b. Harrison Chase — "Continual learning for AI agents" (LangChain blog, April 2026). Three-layer framework: Model (weights), Harness (code + always-present instructions), Context (CLAUDE.md, skills, mcp.json). Hot-path vs offline memory update modes. Traces as shared substrate across all three layers. OpenClaw explicitly mapped as "Pi plus scaffolding" = harness layer. 🟡 - Referenced in: landscape, self-improvement, index

30c. GitLab AI-Assisted Development Playbook. Five autonomy levels: L1 Baseline (autocomplete), L2 Pair, L3 Conductor, L4 Orchestrator, L5 Harness. Five principles: failing test before every feature, fix the environment not the prompt, constraints are multipliers, repo is single source of truth, ask the agent to challenge you. Warning: skipping to L4/L5 without infrastructure amplifies technical debt. 🟡 - Referenced in: landscape, playbook, principles

30d. TechDebt.guru — 7 Copilot Anti-Patterns. Accept-and-Forget, Tab-Tab-Tab Syndrome (40% higher defect density), Context Blindness, Dependency Sprawl, Test Scaffolding Decay, Documentation Displacement, Style Drift. GitClear: 55% higher revert rate within two weeks for AI code. CodeRabbit (470 repos): 1.7× more bugs, 75% more logic errors, 8× more I/O performance bugs. 🟡🟠 - Referenced in: landscape, verification

30e. Cursor Official Best Practices (April 2026). Plan Mode (Shift+Tab) for research → questions → plan → approve → build. "Start over from a plan" preferred to mid-agent fixing. New conversation when context polluted. Save plans to .cursor/plans/ for team docs and resumption. 🟡 - Referenced in: landscape, pipeline

30f. Opus 4.7 Migration Analysis (Anthropic platform docs + @badlogicgames + ravoid.com + dev.to). Tokenizer inflation 1.0–1.35×, budget_tokens silently rerouted, xhigh as new default, self-verification +~15% output. Modelled workload: $118K/mo → $157.5K/mo (+33.5%) on identical behavior. Prompt cache cold for 2–4 weeks post-migration. 🟡 - Referenced in: landscape, verification

Community Reports

Amazon Kiro deployment (March 2026). 21,000 agents, 80% weekly usage mandate. 4 Sev-1 incidents in 90 days. March 5 outage: 6 hours, ~6.3M lost orders (99% US order drop). Internal Treadwell email cites "Gen-AI assisted changes" with "high blast radius." 90-day safety reset across 335 Tier-1 systems, mandatory two-person review. ~30,000 layoffs concurrent with AI scaling. Primary reporting: Financial Times (internal docs). Summary: The Register, 10 Mar 2026. 🟠
- Referenced in: index, verification, principles, landscape
Self-improvement tools (March 2026). Three independent projects (skill-loop, selfwrite, iterate) shipped in the same week without coordination. All focused on instruction improvement, not weight modification. 🟠
- Referenced in: self-improvement, landscape
Edit tool failure rate. Agents express edits as text replacements, which break on whitespace drift, formatting changes, and multi-cursor ambiguity. Documented across multiple tools. 🟠
- Referenced in: landscape
Spec sizing (900-1600 tokens). Community-reported sweet spot for structured quick-dev specs. Below 900: ambiguity risk. Above 1600: tail instruction degradation. 🟠
- Referenced in: specification
Rory Teehan. Structured error logging: what happened, why, what should have happened. 🟡
- Referenced in: self-improvement

35b. adelaidasofia/claude-performance (GitHub, April 2026). Measurement-driven CLAUDE.md: reads session JSONLs, computes six effectiveness metrics (one-shot edit rate, agent spawn distribution, model mix, hook fire rates, activity distribution, project allocation), writes behavioral rules when a metric falls below target, re-measures weekly, retires rules when metric stabilises. Rule lifecycle: add → measure → retire. 🟠 - Referenced in: self-improvement, playbook, landscape

35c. Homunculus plugin (Reddit r/ClaudeAI, April 2026). Observes user patterns, auto-writes skills/hooks/commands when repetitive behavior detected. Probabilistic skills (50–80% fire rate), deterministic commands. Per-project state in .claude/homunculus/. 🟠 - Referenced in: self-improvement, landscape

35d. u/thurn2 (Reddit r/ClaudeCode). "Agent teams = expensive subagents with better marketing." Community consensus across multiple threads: communication overhead overwhelms team leader's context, idle notifications consume context, no proven benefit over simple subagent spawning for current implementations. 🟠 - Referenced in: pipeline

35e. agent-lsp (blackwell-systems, May 2026). Go MCP server bridging Language Server Protocol to agents. 56 tools, 30 CI-verified language servers (TypeScript, Python, Go, Rust, Java, C/C++, Ruby, others). Reproducible benchmark across five real codebases (15K–319K LOC): 92–99% of grep matches on symbol references are false positives — HashiCorp Consul Close 1156 grep → 12 real refs (99% noise); Hono Close 15 → 1; FastAPI validate 64 → 2. Structured navigation 5–34× more token-efficient than grep-and-read, scaling with codebase size. Ships speculative execution (apply hypothetical edit in memory, get diagnostic delta, commit or discard) and phase enforcement (declared phase + structural rejection of out-of-phase tool calls = hooks-as-gates pattern shipped). Source: blackwell-systems/agent-lsp. Bench docs: docs/token-savings.md. 🟡 - Referenced in: landscape, verification, principles

35f. lopi (konjoai, May 2026). Rust + tokio orchestrator for Claude Code agents. Fourth independent shipped implementation of the Ralph-loop recovery pattern after Huntley's original blog, LoopTroop, and ralph-claude-code. Run loop: Plan → Implement → Test → Score → Fix-in-place → Retry with git reset --hard per failed attempt, hard turn limits, diff scope check, last-error injection into next attempt's plan, model routing cheap→Opus only after failure. On terminal failure: post-mortem stage distills failure into single imperative constraint (must / do not / always / never, ≤200 chars), Jaccard-on-keywords retrieval in SQLite, no embeddings. Quality gate: LESSON_QUALITY_GATE = 0.6 skips lesson write when score below threshold. Uses TOON format (toonformat.dev v3.0) for tabular pattern injection. Source: konjoai/lopi. 🟡 - Referenced in: landscape, self-improvement, pipeline, index

People to Follow

Andrej Karpathy (@karpathy). Autoresearch: 700 commits in two days, −11% validation loss. Memory should be tree-structured, not flat. 🟠
Mario Zechner (@badlogicgames). Built Pi. When agents self-praise, human review becomes the bottleneck. 🟠
Harrison Chase (@hwchase17). LangChain. Model/Harness/Context three-layer continual learning framework. What does production agent orchestration actually look like at scale. 🟡
Armin Ronacher (@mitsuhiko). Advocate for lat.md (knowledge graph in markdown). Shipped multi-edit tooling for Pi. Direct critic of agent anti-patterns. 🟠

This reference list is maintained alongside the documentation. Sources marked with evidence levels as used in the main text.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

References

Evidence Levels

Our Experiments

External Sources

Large-Scale Studies

Research Papers

Vendor Documentation & Posts

Community Reports

People to Follow

FilesExpand file tree

references.md

Latest commit

History

references.md

File metadata and controls

References

Evidence Levels

Our Experiments

External Sources

Large-Scale Studies

Research Papers

Vendor Documentation & Posts

Community Reports

People to Follow