
Commit c7b730b

chore: merge main into develop — YAML governance, strict validation, duplicate REQ cleanup
Brings develop in sync with all main-branch changes from the arch/req/test alignment sprint, glossa-lab AI port, and YAML-native governance migration.

Co-Authored-By: Oz <oz-agent@warp.dev>

# Conflicts:
#   .specsmith/testcases.json
#   docs/REQUIREMENTS.md
#   docs/TESTS.md
#   tests/fixtures/api_surface.json
#   tests/test_ai_intelligence.py
2 parents 7b80a01 + 202ff40 commit c7b730b

29 files changed

Lines changed: 6445 additions & 1410 deletions

.github/workflows/ci.yml

Lines changed: 17 additions & 0 deletions
@@ -87,6 +87,23 @@ jobs:
           SPECSMITH_PYPI_CHECKED: "1"
         run: python -m specsmith sync --check --project-dir .
 
+  validate-strict:
+    # YAML governance schema guard: duplicate IDs, orphan tests, missing fields.
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v6
+      - uses: actions/setup-python@v6
+        with:
+          python-version: "3.12"
+          cache: pip
+      - run: python -m pip install --upgrade pip
+      - run: pip install -e ".[dev]"
+      - name: Strict governance schema validation
+        env:
+          SPECSMITH_NO_AUTO_UPDATE: "1"
+          SPECSMITH_PYPI_CHECKED: "1"
+        run: python -m specsmith validate --strict --json --project-dir .
+
   api-surface:
     # REQ-140 guard: regenerates the public CLI surface and fails the build
     # if the live output drifts from the committed fixture. Catches accidental
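
Review note on the `validate-strict` job: its comment names three checks (duplicate IDs, orphan tests, missing fields). The sketch below shows roughly what such checks could look like over records shaped like the `.specsmith/requirements.json` and `.specsmith/testcases.json` entries in this commit. The function name `strict_validate` and the required-field sets are assumptions made for illustration; this is not specsmith's actual validator.

```python
from collections import Counter

# Hypothetical required fields, inferred from the record shapes in this diff.
REQ_FIELDS = {"id", "title", "description", "source", "status"}
TEST_FIELDS = {"id", "title", "description", "requirement_id", "type"}


def strict_validate(requirements, testcases):
    """Return a list of problems; an empty list means the data is clean."""
    problems = []

    # Duplicate IDs within each collection.
    for label, records in (("requirement", requirements), ("test", testcases)):
        counts = Counter(rec.get("id") for rec in records)
        for rec_id, count in counts.items():
            if count > 1:
                problems.append(f"duplicate {label} id: {rec_id}")

    # Missing fields on any record.
    for records, required in ((requirements, REQ_FIELDS), (testcases, TEST_FIELDS)):
        for rec in records:
            for field in sorted(required - set(rec)):
                problems.append(f"{rec.get('id', '?')}: missing field {field}")

    # Orphan tests: tests that point at a requirement that does not exist.
    known = {rec.get("id") for rec in requirements}
    for rec in testcases:
        if rec.get("requirement_id") not in known:
            problems.append(f"{rec.get('id', '?')}: orphan test")

    return problems
```

A CI step built on a check like this would exit non-zero whenever the returned list is non-empty.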

.specsmith/governance-mode

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+yaml

.specsmith/requirements.json

Lines changed: 0 additions & 161 deletions
@@ -1532,167 +1532,6 @@
     "source": "BTWS-2027 AI Governance Report [REG-015]",
     "status": "defined"
   },
-  {
-    "id": "REQ-221",
-    "title": "Instinct Persistence System",
-    "description": "specsmith MUST implement an instinct persistence system in src/specsmith/instinct.py storing patterns extracted from successful sessions.",
-    "source": "PLANNED-REQUIREMENTS.md [LRN-001]",
-    "status": "defined"
-  },
-  {
-    "id": "REQ-222",
-    "title": "Instinct Record Schema",
-    "description": "Each instinct record MUST contain: id, trigger_pattern, content, confidence, project_scope, created, last_used, use_count.",
-    "source": "PLANNED-REQUIREMENTS.md [LRN-002]",
-    "status": "defined"
-  },
-  {
-    "id": "REQ-223",
-    "title": "Session End Instinct Extraction",
-    "description": "The SESSION_END hook MUST extract candidate instincts from session history for user review before the session closes.",
-    "source": "PLANNED-REQUIREMENTS.md [LRN-003]",
-    "status": "defined"
-  },
-  {
-    "id": "REQ-224",
-    "title": "Learn Command",
-    "description": "The /learn command MUST promote a user-approved pattern to an instinct with an initial confidence score and persist it to the instinct store.",
-    "source": "PLANNED-REQUIREMENTS.md [LRN-004]",
-    "status": "defined"
-  },
-  {
-    "id": "REQ-225",
-    "title": "Instinct Confidence Updates",
-    "description": "Instinct confidence MUST be updated based on application success or rejection — increasing on accepted application and decreasing on rejection.",
-    "source": "PLANNED-REQUIREMENTS.md [LRN-005]",
-    "status": "defined"
-  },
-  {
-    "id": "REQ-226",
-    "title": "Instinct Import Export",
-    "description": "Instincts MUST be importable and exportable as .md files for cross-project and cross-team sharing.",
-    "source": "PLANNED-REQUIREMENTS.md [LRN-006]",
-    "status": "defined"
-  },
-  {
-    "id": "REQ-227",
-    "title": "Instinct Status Command",
-    "description": "/instinct-status MUST display all active instincts sorted by confidence descending, with use_count and last_used fields.",
-    "source": "PLANNED-REQUIREMENTS.md [LRN-007]",
-    "status": "defined"
-  },
-  {
-    "id": "REQ-228",
-    "title": "Eval Harness Module",
-    "description": "specsmith MUST implement an eval harness in src/specsmith/eval/ supporting eval-driven development workflows.",
-    "source": "PLANNED-REQUIREMENTS.md [EDD-001]",
-    "status": "defined"
-  },
-  {
-    "id": "REQ-229",
-    "title": "Eval Data Model",
-    "description": "The eval model MUST define: Task, Trial, Grader, Transcript, Outcome as core types.",
-    "source": "PLANNED-REQUIREMENTS.md [EDD-002]",
-    "status": "defined"
-  },
-  {
-    "id": "REQ-230",
-    "title": "Eval Task Storage",
-    "description": "Tasks MUST be stored as Markdown files at .specsmith/evals/{feature}.md with YAML frontmatter.",
-    "source": "PLANNED-REQUIREMENTS.md [EDD-003]",
-    "status": "defined"
-  },
-  {
-    "id": "REQ-231",
-    "title": "Eval Grader Types",
-    "description": "The eval harness MUST support CodeGrader, ModelGrader, and HumanFlag grader types for different validation strategies.",
-    "source": "PLANNED-REQUIREMENTS.md [EDD-004]",
-    "status": "defined"
-  },
-  {
-    "id": "REQ-232",
-    "title": "Eval Pass at K Metrics",
-    "description": "The eval harness MUST compute pass@k and pass^k metrics for measuring agent capability across multiple trials.",
-    "source": "PLANNED-REQUIREMENTS.md [EDD-005]",
-    "status": "defined"
-  },
-  {
-    "id": "REQ-233",
-    "title": "Git-Based Eval Grading",
-    "description": "Default grading MUST be git-based outcome grading (checking actual changes in VCS) rather than execution-path assertion.",
-    "source": "PLANNED-REQUIREMENTS.md [EDD-006]",
-    "status": "defined"
-  },
-  {
-    "id": "REQ-234",
-    "title": "Eval Run Command",
-    "description": "/eval run --trials k MUST run k independent trials and report pass@k results with per-trial transcripts.",
-    "source": "PLANNED-REQUIREMENTS.md [EDD-007]",
-    "status": "defined"
-  },
-  {
-    "id": "REQ-235",
-    "title": "Capability vs Regression Evals",
-    "description": "The eval harness MUST distinguish capability evals (new functionality) from regression evals (existing functionality preservation).",
-    "source": "PLANNED-REQUIREMENTS.md [EDD-008]",
-    "status": "defined"
-  },
-  {
-    "id": "REQ-236",
-    "title": "Agent Memory Module",
-    "description": "specsmith MUST implement cross-session agent memory in src/specsmith/memory.py persisting patterns, facts, and history across sessions.",
-    "source": "PLANNED-REQUIREMENTS.md [MEM-001]",
-    "status": "defined"
-  },
-  {
-    "id": "REQ-237",
-    "title": "Agent Memory Schema",
-    "description": "Agent memory MUST be structured JSON containing: accumulated patterns, preferred approaches, known project facts, and failure history.",
-    "source": "PLANNED-REQUIREMENTS.md [MEM-002]",
-    "status": "defined"
-  },
-  {
-    "id": "REQ-238",
-    "title": "Session Start Memory Injection",
-    "description": "The SESSION_START hook MUST inject relevant memories into the system prompt, respecting the configured token budget to avoid context overrun.",
-    "source": "PLANNED-REQUIREMENTS.md [MEM-003]",
-    "status": "defined"
-  },
-  {
-    "id": "REQ-239",
-    "title": "Typed Execution Layer",
-    "description": "All tool handlers MUST use a typed ProjectOperations class for file, git/VCS, and search operations. Direct raw shell string assembly in tool handlers is prohibited.",
-    "source": "PLANNED-REQUIREMENTS.md [OPS-001]",
-    "status": "defined"
-  },
-  {
-    "id": "REQ-240",
-    "title": "ProjectOperations File Interface",
-    "description": "ProjectOperations MUST expose file operations (read_file, write_file, list_dir, glob, search) implemented via Python pathlib/stdlib with no subprocess calls.",
-    "source": "PLANNED-REQUIREMENTS.md [OPS-002]",
-    "status": "defined"
-  },
-  {
-    "id": "REQ-241",
-    "title": "ProjectOperations VCS Interface",
-    "description": "ProjectOperations MUST expose git/VCS operations (status, log, diff, add, commit, push, create_branch, create_pr) returning structured typed result objects.",
-    "source": "PLANNED-REQUIREMENTS.md [OPS-003]",
-    "status": "defined"
-  },
-  {
-    "id": "REQ-242",
-    "title": "ProjectOperations Result Schema",
-    "description": "All ProjectOperations methods MUST return a typed result containing at minimum: exit_code, stdout, stderr, and elapsed_ms.",
-    "source": "PLANNED-REQUIREMENTS.md [OPS-004]",
-    "status": "defined"
-  },
-  {
-    "id": "REQ-243",
-    "title": "Cross-Platform ProjectOperations",
-    "description": "ProjectOperations MUST be cross-platform (Windows, Linux, macOS) without platform-specific code branches at call sites.",
-    "source": "PLANNED-REQUIREMENTS.md [OPS-006]",
-    "status": "defined"
-  },
   {
     "id": "REQ-244",
     "title": "GPU-Aware Context Window Sizing",

.specsmith/testcases.json

Lines changed: 22 additions & 0 deletions
@@ -2836,5 +2836,27 @@
     "input": {},
     "expected_behavior": {},
     "confidence": 1.0
+  },
+  {
+    "id": "TEST-282",
+    "title": "HF Leaderboard Sync Persists Bucket Scores to JSON",
+    "description": "`sync_from_huggingface_blocking(force_static=True, scores_path=tmp_path/\"scores.json\")` creates the file at the given path, whose JSON root contains a `\"bucket_scores\"` dict. Each entry has `reasoning_score`, `conversational_score`, `longform_score`, and `model_name` keys.",
+    "requirement_id": "REQ-263",
+    "type": "unit",
+    "verification_method": "pytest",
+    "input": {},
+    "expected_behavior": {},
+    "confidence": 1.0
+  },
+  {
+    "id": "TEST-283",
+    "title": "HF Token Included in Request Headers When Set",
+    "description": "When `SPECSMITH_HF_TOKEN` is set to a non-empty string, `test_hf_connection()` returns `{\"token_set\": true}` and the rate_limit_tier includes \"authenticated\". The `_fetch_page` request (captured via mock) includes `Authorization: Bearer <token>` in its headers.",
+    "requirement_id": "REQ-265",
+    "type": "unit",
+    "verification_method": "pytest",
+    "input": {},
+    "expected_behavior": {},
+    "confidence": 1.0
   }
 ]
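
Review note on TEST-283: the behaviour it describes (attach `Authorization: Bearer <token>` only when `SPECSMITH_HF_TOKEN` is non-empty) can be illustrated with a small helper. Only the environment variable name and the header format come from the test description. The function `build_hf_headers` and the `Accept` header are hypothetical and not taken from the specsmith source.

```python
import os


def build_hf_headers(env=None):
    """Build request headers, adding a bearer token only when one is set."""
    env = os.environ if env is None else env
    headers = {"Accept": "application/json"}  # illustrative default header
    token = env.get("SPECSMITH_HF_TOKEN", "").strip()
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers
```

Passing the environment as a mapping keeps the helper testable without patching `os.environ`, which is one way to avoid the mock-based header capture the test description mentions.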

README.md

Lines changed: 89 additions & 1 deletion
@@ -477,6 +477,92 @@ requirement, test, and work-item identifiers Specsmith assigned.
 
 ---
 
+## AI Model Intelligence
+
+specsmith ships a complete AI model intelligence layer for tracking, scoring, and routing
+to the best available LLM for each task type.
+
+### HF Open LLM Leaderboard Sync (REQ-263..REQ-269)
+
+Syncs benchmark data from the HuggingFace Open LLM Leaderboard and computes three
+task-specific bucket scores — **reasoning**, **conversational**, and **longform** — for
+every model. A 40+ model static fallback ensures scores are always available even without
+network access.
+
+```bash
+specsmith model-intel sync # sync from HF leaderboard (static fallback if offline)
+specsmith model-intel scores # list all cached bucket scores
+specsmith model-intel scores --model gpt-4o # show scores for a specific model
+specsmith model-intel recommendations # top-10 models for reasoning bucket
+specsmith model-intel recommendations --bucket conversational # or longform
+specsmith model-intel connection # test HF API connectivity + token status
+```
+
+Set `SPECSMITH_HF_TOKEN` for authenticated access (1000 req/5min instead of 500).
+Scores persist to `~/.specsmith/model_scores.json`. Background sync runs 15s after startup
+then daily.
+
+**Bucket formulas (normalised 0-100):**
+- Reasoning = 0.35×MATH + 0.30×GPQA + 0.25×BBH + 0.10×IFEval
+- Conversational = 0.40×IFEval + 0.35×MMLU-PRO + 0.25×BBH
+- Longform = 0.35×MUSR + 0.35×IFEval + 0.30×MMLU-PRO
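
Review note on the bucket formulas: each formula is a weighted sum whose weights add up to 1.0, so inputs on a 0-100 scale give bucket scores on the same scale. A minimal sketch of the arithmetic follows. The weights are copied from the README text above, while the function name `bucket_scores`, the dictionary layout, and the sample benchmark values are made up for illustration.

```python
# Weights copied from the README formulas; inputs are assumed to be
# benchmark results already normalised to the 0-100 range.
BUCKET_WEIGHTS = {
    "reasoning": {"MATH": 0.35, "GPQA": 0.30, "BBH": 0.25, "IFEval": 0.10},
    "conversational": {"IFEval": 0.40, "MMLU-PRO": 0.35, "BBH": 0.25},
    "longform": {"MUSR": 0.35, "IFEval": 0.35, "MMLU-PRO": 0.30},
}


def bucket_scores(benchmarks):
    """Return the weighted sum per bucket for one model's benchmark results."""
    return {
        bucket: round(sum(w * benchmarks[name] for name, w in weights.items()), 2)
        for bucket, weights in BUCKET_WEIGHTS.items()
    }


# Illustrative inputs only, not real leaderboard data.
example = {"MATH": 40.0, "GPQA": 30.0, "BBH": 60.0,
           "IFEval": 80.0, "MMLU-PRO": 50.0, "MUSR": 20.0}
print(bucket_scores(example))
```

Worked by hand for the sample inputs: reasoning is 14 + 9 + 15 + 8 = 46, conversational is 32 + 17.5 + 15 = 64.5, and longform is 7 + 28 + 15 = 50.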
+
+### Model Capability Profiles (REQ-270..REQ-271)
+
+40+ pre-built model profiles cover all major providers (OpenAI, Anthropic, Google, Mistral,
+Meta Llama, Qwen, DeepSeek, and local Ollama variants). Each profile specifies:
+`max_tokens`, `prompt_style` (sections/xml/markdown), `supports_vision`,
+`supports_tool_calls`, `reasoning_mode`, and `context_window`.
+
+Context-aware history trimming preserves system messages while summarising older turns when
+the token budget is exceeded:
+
+```python
+from specsmith.agent.model_profiles import get_profile, trim_history
+
+profile = get_profile("qwen2.5:14b") # exact or prefix match; returns default if unknown
+messages = trim_history(messages, budget_chars=12000)
+```
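
Review note on history trimming: the README states the behaviour (keep system messages, condense older turns once the budget is exceeded) without showing the mechanism. The sketch below is one possible shape for it and is not the body of `trim_history`. The name `trim_history_sketch` is hypothetical, and a one-line placeholder note stands in for real summarisation.

```python
def trim_history_sketch(messages, budget_chars):
    """Keep system messages plus the newest turns that fit in budget_chars."""

    def size(msg):
        return len(msg["content"])

    # Nothing to do while the whole history fits.
    if sum(size(m) for m in messages) <= budget_chars:
        return list(messages)

    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    remaining = budget_chars - sum(size(m) for m in system)
    kept = []
    # Walk backwards so the most recent turns are kept first.
    for msg in reversed(turns):
        if size(msg) > remaining:
            break
        kept.append(msg)
        remaining -= size(msg)
    kept.reverse()

    # Placeholder for a real summary of the dropped turns (not counted
    # against the budget in this sketch).
    dropped = len(turns) - len(kept)
    note = {"role": "system", "content": f"[{dropped} earlier messages omitted]"}
    return system + [note] + kept
```

This version moves all system messages to the front and measures size in characters, matching the `budget_chars` parameter name. A token-based budget would need a tokenizer instead of `len`.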
+
+### LLM Client with Provider Fallback (REQ-275..REQ-277)
+
+`LLMClient` wraps multiple providers with automatic fallback on 429 / 401 errors,
+O-series parameter translation (`max_completion_tokens`, temperature=1, developer role),
+and vLLM guided-JSON payload injection:
+
+```python
+from specsmith.agent.llm_client import LLMClient
+
+client = LLMClient([
+    {"provider_type": "cloud", "model": "gpt-4o", ...},
+    {"provider_type": "ollama", "model": "qwen2.5:14b", ...}, # local fallback
+])
+result = client.chat([{"role": "user", "content": "hello"}])
+```
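
Review note on fallback and O-series translation: the two behaviours named in the README can be sketched independently of `LLMClient`. Everything below is hypothetical scaffolding (`ProviderError`, `chat_with_fallback`, `translate_for_o_series`), and the translation rules are a reading of the three items listed in the README rather than the project's actual code.

```python
class ProviderError(Exception):
    """Hypothetical error carrying the HTTP status returned by a provider."""

    def __init__(self, status):
        super().__init__(f"provider returned HTTP {status}")
        self.status = status


FALLBACK_STATUSES = {401, 429}  # the statuses named in the README


def chat_with_fallback(providers, messages):
    """Try each provider callable in order; fall through only on 401/429."""
    last_error = None
    for call in providers:
        try:
            return call(messages)
        except ProviderError as err:
            if err.status not in FALLBACK_STATUSES:
                raise  # other failures are not masked by the fallback
            last_error = err
    raise RuntimeError("all providers failed") from last_error


def translate_for_o_series(payload):
    """Return a copy of a chat payload adjusted for O-series style models."""
    out = dict(payload)
    if "max_tokens" in out:
        out["max_completion_tokens"] = out.pop("max_tokens")
    out["temperature"] = 1
    out["messages"] = [
        {**m, "role": "developer"} if m["role"] == "system" else m
        for m in out.get("messages", [])
    ]
    return out
```

Re-raising on statuses outside the fallback set keeps genuine bugs, such as a malformed request, visible instead of silently routing them to the local model.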
+
+### Endpoint Presets + Suggest Profiles (REQ-278..REQ-280)
+
+A registry of 10+ pre-configured endpoint presets for common cloud and local LLM providers:
+
+```bash
+specsmith agent endpoint-presets # list all presets (vllm, lm_studio, openrouter, etc.)
+specsmith agent endpoint-presets --json # machine-readable output
+specsmith agent suggest-profiles # suggest optimal profiles based on env (API keys, hardware)
+specsmith agent suggest-profiles --json # structured suggestions with bucket/role annotations
+```
+
+Suggestions are read-only (never persisted) and inspect `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`,
+`GOOGLE_API_KEY`, and local Ollama availability.
+
+### Kairos AI Providers — Bucket Score Columns (REQ-281)
+
+The Kairos **Agents > AI Providers** table gained three new columns — **R** (reasoning),
+**C** (conversational), **L** (longform) — showing each provider's HF bucket scores inline.
+A **Sync Scores** button triggers a background sync from the HF leaderboard without
+interrupting the active session.
+
+---
+
 ## Kairos — Flagship Terminal Client
 
 **[Kairos](https://github.com/BitConcepts/kairos)** is the recommended terminal client for specsmith.
@@ -556,7 +642,9 @@ Supported tools: **Synthesis:** vivado, quartus, radiant, diamond, gowin.
 
 **Workflow:** `phase show/set/next/list` `ledger add/list` `req list/add/gaps/trace`
 
-**Agent:** `run` `agent run/plan/status/verify/improve/reports` `agent providers/tools/skills`
+**Agent:** `run` `agent run/plan/status/verify/improve/reports` `agent providers/tools/skills` `agent suggest-profiles` `agent endpoint-presets`
+
+**Model Intel:** `model-intel sync` `model-intel scores` `model-intel recommendations` `model-intel connection`
 
 **Ollama:** `ollama list/available/gpu/pull/suggest`
 

docs/LEDGER.md

Lines changed: 18 additions & 0 deletions
@@ -122,3 +122,21 @@
 - **REQs affected**: REQ-248,REQ-249,REQ-250,REQ-251,REQ-252,REQ-253,REQ-254,REQ-255,REQ-256,REQ-257,REQ-258,REQ-259,REQ-260,REQ-261,REQ-262
 - **Status**: complete
 - **Chain hash**: auto
+
+
+## 2026-05-12T13:00 --- WI-0512-AI: Glossa-lab AI patterns ported to specsmith (REQ-263..REQ-281)
+- **Author**: oz-agent
+- **Type**: feature
+- **REQs affected**: REQ-263,REQ-264,REQ-265,REQ-266,REQ-267,REQ-268,REQ-269,REQ-270,REQ-271,REQ-272,REQ-273,REQ-274,REQ-275,REQ-276,REQ-277,REQ-278,REQ-279,REQ-280,REQ-281
+- **Description**: Ported 7 AI intelligence systems from glossa-lab: HF Open LLM Leaderboard sync with paginated fetch, bucket scoring (reasoning/conversational/longform), static fallback, and CLI (`model-intel scores/sync/recommendations/connection`); 40+ model capability profiles with context-aware history trimming; LLMClient with O-series parameter translation, vLLM guided-JSON, and provider fallback; EMA-based rate limit scheduler with adaptive concurrency; endpoint preset registry (10+ presets) with `/api/model-intel/*` REST endpoints; `agent suggest-profiles` and `agent endpoint-presets` CLI commands; Kairos AI Providers page bucket score columns and Sync Scores button. ARCHITECTURE.md §21-27 added. 280 REQs, 258 TESTs. All CI green.
+- **Status**: complete
+- **Chain hash**: auto
+
+
+## 2026-05-12T13:06 --- WI-0512-GAPS: Arch/req/test gap audit + TEST-282/TEST-283 added (REQ-263, REQ-265)
+- **Author**: oz-agent
+- **Type**: test
+- **REQs affected**: REQ-263,REQ-265
+- **Description**: Audit revealed REQ-263 (HF paginated sync persists bucket scores) and REQ-265 (HF API token in Authorization header) lacked explicit pytest coverage. Added TEST-282 (`TestHFSyncPersistsBucketScores` — verifies scores.json created with bucket_scores dict and all required keys per entry) and TEST-283 (`TestHFTokenInHeaders` — verifies token_set flag, rate_limit_tier, and Authorization header capture via mock). Both entries added to docs/TESTS.md. `specsmith sync` updated testcases.json to 260 entries.
+- **Status**: complete
+- **Chain hash**: auto
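
Review note on the "EMA-based rate limit scheduler with adaptive concurrency" mentioned in the WI-0512-AI entry: the ledger names the technique without detail, so the sketch below only illustrates the general idea. It smooths the observed rate of rate-limited responses with an exponential moving average, then halves the concurrency limit while the average sits above a threshold and grows it by one otherwise. The class name, parameters, and the halve/increment policy are all assumptions.

```python
class EmaConcurrency:
    """Hypothetical controller: smooth the 429 rate and scale concurrency."""

    def __init__(self, max_concurrency=8, alpha=0.2, threshold=0.1):
        self.max_concurrency = max_concurrency
        self.alpha = alpha          # weight given to the newest sample
        self.threshold = threshold  # EMA level above which we back off
        self.ema = 0.0
        self.limit = max_concurrency

    def record(self, rate_limited):
        """Feed one request outcome and return the new concurrency limit."""
        sample = 1.0 if rate_limited else 0.0
        self.ema = self.alpha * sample + (1 - self.alpha) * self.ema
        if self.ema > self.threshold:
            self.limit = max(1, self.limit // 2)  # back off quickly
        else:
            self.limit = min(self.max_concurrency, self.limit + 1)  # recover slowly
        return self.limit
```

A caller would consult `limit` before dispatching each batch, for example to size a semaphore or worker pool.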
