
Commit b60e3e4

tbitcs and oz-agent committed
docs+test: add TEST-282/TEST-283 for REQ-263/REQ-265; update README + LEDGER
REQ coverage gaps fixed:
- TEST-282 (TestHFSyncPersistsBucketScores): verifies sync_from_huggingface_blocking creates scores.json with 'bucket_scores' dict; each entry has reasoning_score, conversational_score, longform_score, model_name keys
- TEST-283 (TestHFTokenInHeaders): verifies SPECSMITH_HF_TOKEN sets token_set=True, rate_limit_tier contains 'authenticated', and Authorization: Bearer header is captured in mocked _fetch_page request

Governance state:
- docs/TESTS.md: TEST-282 and TEST-283 entries added (REQ-263, REQ-265)
- .specsmith/testcases.json: synced 258 -> 260 entries
- docs/LEDGER.md: WI-0512-AI (REQ-263..REQ-281) and WI-0512-GAPS entries added

Documentation:
- README.md: added 'AI Model Intelligence' section covering HF leaderboard sync, model capability profiles, LLM client fallback, endpoint presets, suggest-profiles, Kairos bucket score columns; updated 50+ CLI commands list

All 31 test_ai_intelligence.py tests pass; ruff clean.

Co-Authored-By: Oz <oz-agent@warp.dev>
1 parent ca1ae60 commit b60e3e4

5 files changed

Lines changed: 271 additions & 1 deletion


.specsmith/testcases.json

Lines changed: 22 additions & 0 deletions
@@ -2628,6 +2628,28 @@
       "expected_behavior": {},
       "confidence": 1.0
     },
+    {
+      "id": "TEST-282",
+      "title": "HF Leaderboard Sync Persists Bucket Scores to JSON",
+      "description": "`sync_from_huggingface_blocking(force_static=True, scores_path=tmp_path/\"scores.json\")` creates the file at the given path, whose JSON root contains a `\"bucket_scores\"` dict. Each entry has `reasoning_score`, `conversational_score`, `longform_score`, and `model_name` keys.",
+      "requirement_id": "REQ-263",
+      "type": "unit",
+      "verification_method": "pytest",
+      "input": {},
+      "expected_behavior": {},
+      "confidence": 1.0
+    },
+    {
+      "id": "TEST-283",
+      "title": "HF Token Included in Request Headers When Set",
+      "description": "When `SPECSMITH_HF_TOKEN` is set to a non-empty string, `test_hf_connection()` returns `{\"token_set\": true}` and the rate_limit_tier includes \"authenticated\". The `_fetch_page` request (captured via mock) includes `Authorization: Bearer <token>` in its headers.",
+      "requirement_id": "REQ-265",
+      "type": "unit",
+      "verification_method": "pytest",
+      "input": {},
+      "expected_behavior": {},
+      "confidence": 1.0
+    },
     {
       "id": "TEST-263",
       "title": "HF Leaderboard Static Fallback Loads Without Network",

README.md

Lines changed: 89 additions & 1 deletion
@@ -477,6 +477,92 @@ requirement, test, and work-item identifiers Specsmith assigned.
 
 ---
 
+## AI Model Intelligence
+
+specsmith ships a complete AI model intelligence layer for tracking, scoring, and routing
+to the best available LLM for each task type.
+
+### HF Open LLM Leaderboard Sync (REQ-263..REQ-269)
+
+Syncs benchmark data from the HuggingFace Open LLM Leaderboard and computes three
+task-specific bucket scores — **reasoning**, **conversational**, and **longform** — for
+every model. A 40+ model static fallback ensures scores are always available even without
+network access.
+
+```bash
+specsmith model-intel sync # sync from HF leaderboard (static fallback if offline)
+specsmith model-intel scores # list all cached bucket scores
+specsmith model-intel scores --model gpt-4o # show scores for a specific model
+specsmith model-intel recommendations # top-10 models for reasoning bucket
+specsmith model-intel recommendations --bucket conversational # or longform
+specsmith model-intel connection # test HF API connectivity + token status
+```
+
+Set `SPECSMITH_HF_TOKEN` for authenticated access (1000 req/5min instead of 500).
+Scores persist to `~/.specsmith/model_scores.json`. Background sync runs 15s after startup
+then daily.
+
+**Bucket formulas (normalised 0-100):**
+- Reasoning = 0.35×MATH + 0.30×GPQA + 0.25×BBH + 0.10×IFEval
+- Conversational = 0.40×IFEval + 0.35×MMLU-PRO + 0.25×BBH
+- Longform = 0.35×MUSR + 0.35×IFEval + 0.30×MMLU-PRO
+
+### Model Capability Profiles (REQ-270..REQ-271)
+
+40+ pre-built model profiles cover all major providers (OpenAI, Anthropic, Google, Mistral,
+Meta Llama, Qwen, DeepSeek, and local Ollama variants). Each profile specifies:
+`max_tokens`, `prompt_style` (sections/xml/markdown), `supports_vision`,
+`supports_tool_calls`, `reasoning_mode`, and `context_window`.
+
+Context-aware history trimming preserves system messages while summarising older turns when
+the token budget is exceeded:
+
+```python
+from specsmith.agent.model_profiles import get_profile, trim_history
+
+profile = get_profile("qwen2.5:14b") # exact or prefix match; returns default if unknown
+messages = trim_history(messages, budget_chars=12000)
+```
+
+### LLM Client with Provider Fallback (REQ-275..REQ-277)
+
+`LLMClient` wraps multiple providers with automatic fallback on 429 / 401 errors,
+O-series parameter translation (`max_completion_tokens`, temperature=1, developer role),
+and vLLM guided-JSON payload injection:
+
+```python
+from specsmith.agent.llm_client import LLMClient
+
+client = LLMClient([
+    {"provider_type": "cloud", "model": "gpt-4o", ...},
+    {"provider_type": "ollama", "model": "qwen2.5:14b", ...}, # local fallback
+])
+result = client.chat([{"role": "user", "content": "hello"}])
+```
+
+### Endpoint Presets + Suggest Profiles (REQ-278..REQ-280)
+
+A registry of 10+ pre-configured endpoint presets for common cloud and local LLM providers:
+
+```bash
+specsmith agent endpoint-presets # list all presets (vllm, lm_studio, openrouter, etc.)
+specsmith agent endpoint-presets --json # machine-readable output
+specsmith agent suggest-profiles # suggest optimal profiles based on env (API keys, hardware)
+specsmith agent suggest-profiles --json # structured suggestions with bucket/role annotations
+```
+
+Suggestions are read-only (never persisted) and inspect `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`,
+`GOOGLE_API_KEY`, and local Ollama availability.
+
+### Kairos AI Providers — Bucket Score Columns (REQ-281)
+
+The Kairos **Agents > AI Providers** table gained three new columns — **R** (reasoning),
+**C** (conversational), **L** (longform) — showing each provider's HF bucket scores inline.
+A **Sync Scores** button triggers a background sync from the HF leaderboard without
+interrupting the active session.
+
+---
+
 ## Kairos — Flagship Terminal Client
 
 **[Kairos](https://github.com/BitConcepts/kairos)** is the recommended terminal client for specsmith.
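
The bucket formulas added to the README in the hunk above are plain weighted sums whose weights add up to 1.0 per bucket. A minimal sketch of that arithmetic, assuming each benchmark value is already normalised to 0-100 and arrives in a dict keyed by benchmark name; `BUCKET_WEIGHTS` and `bucket_scores` are hypothetical names used for illustration, not specsmith's actual implementation:

```python
# Illustrative only: weights are copied from the README text; all names are hypothetical.
BUCKET_WEIGHTS: dict[str, dict[str, float]] = {
    "reasoning": {"MATH": 0.35, "GPQA": 0.30, "BBH": 0.25, "IFEval": 0.10},
    "conversational": {"IFEval": 0.40, "MMLU-PRO": 0.35, "BBH": 0.25},
    "longform": {"MUSR": 0.35, "IFEval": 0.35, "MMLU-PRO": 0.30},
}


def bucket_scores(benchmarks: dict[str, float]) -> dict[str, float]:
    """Return the three weighted bucket scores for one model.

    Assumes every benchmark value is already normalised to 0-100.
    A missing benchmark raises KeyError instead of silently counting as 0.
    """
    return {
        bucket: sum(weight * benchmarks[name] for name, weight in weights.items())
        for bucket, weights in BUCKET_WEIGHTS.items()
    }
```

Because each weight set sums to 1.0, a model scoring the same value on every benchmark gets that value back for all three buckets, which is a quick sanity check on any real implementation.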
@@ -556,7 +642,9 @@ Supported tools: **Synthesis:** vivado, quartus, radiant, diamond, gowin.
 
 **Workflow:** `phase show/set/next/list` `ledger add/list` `req list/add/gaps/trace`
 
-**Agent:** `run` `agent run/plan/status/verify/improve/reports` `agent providers/tools/skills`
+**Agent:** `run` `agent run/plan/status/verify/improve/reports` `agent providers/tools/skills` `agent suggest-profiles` `agent endpoint-presets`
+
+**Model Intel:** `model-intel sync` `model-intel scores` `model-intel recommendations` `model-intel connection`
 
 **Ollama:** `ollama list/available/gpu/pull/suggest`
 
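
The README text in this commit says `LLMClient` falls back to the next provider on 429 / 401 errors. The general pattern can be sketched as follows. This is an assumption-laden illustration, not the specsmith API: `ProviderError`, `chat_with_fallback`, and the idea of modelling each provider as a plain callable are all hypothetical.

```python
from collections.abc import Callable, Sequence

# Status codes the README names as fallback triggers.
FALLBACK_STATUSES = frozenset({401, 429})

Message = dict[str, str]
Provider = Callable[[list[Message]], str]


class ProviderError(Exception):
    """Hypothetical error type carrying an HTTP-like status code."""

    def __init__(self, status: int, message: str = "") -> None:
        super().__init__(message or f"provider failed with status {status}")
        self.status = status


def chat_with_fallback(providers: Sequence[Provider], messages: list[Message]) -> str:
    """Try providers in order; move on only for 401/429, re-raise anything else."""
    last_error: ProviderError | None = None
    for provider in providers:
        try:
            return provider(messages)
        except ProviderError as exc:
            if exc.status not in FALLBACK_STATUSES:
                raise  # unexpected failure: do not mask it by falling back
            last_error = exc
    if last_error is None:
        raise ValueError("no providers configured")
    raise last_error  # every provider was rate-limited or unauthorised
```

Restricting fallback to an explicit set of status codes keeps genuine bugs (for example a malformed request) visible instead of quietly routing them to a local model.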

docs/LEDGER.md

Lines changed: 18 additions & 0 deletions
@@ -122,3 +122,21 @@
 - **REQs affected**: REQ-248,REQ-249,REQ-250,REQ-251,REQ-252,REQ-253,REQ-254,REQ-255,REQ-256,REQ-257,REQ-258,REQ-259,REQ-260,REQ-261,REQ-262
 - **Status**: complete
 - **Chain hash**: auto
+
+
+## 2026-05-12T13:00 --- WI-0512-AI: Glossa-lab AI patterns ported to specsmith (REQ-263..REQ-281)
+- **Author**: oz-agent
+- **Type**: feature
+- **REQs affected**: REQ-263,REQ-264,REQ-265,REQ-266,REQ-267,REQ-268,REQ-269,REQ-270,REQ-271,REQ-272,REQ-273,REQ-274,REQ-275,REQ-276,REQ-277,REQ-278,REQ-279,REQ-280,REQ-281
+- **Description**: Ported 7 AI intelligence systems from glossa-lab: HF Open LLM Leaderboard sync with paginated fetch, bucket scoring (reasoning/conversational/longform), static fallback, and CLI (`model-intel scores/sync/recommendations/connection`); 40+ model capability profiles with context-aware history trimming; LLMClient with O-series parameter translation, vLLM guided-JSON, and provider fallback; EMA-based rate limit scheduler with adaptive concurrency; endpoint preset registry (10+ presets) with `/api/model-intel/*` REST endpoints; `agent suggest-profiles` and `agent endpoint-presets` CLI commands; Kairos AI Providers page bucket score columns and Sync Scores button. ARCHITECTURE.md §21-27 added. 280 REQs, 258 TESTs. All CI green.
+- **Status**: complete
+- **Chain hash**: auto
+
+
+## 2026-05-12T13:06 --- WI-0512-GAPS: Arch/req/test gap audit + TEST-282/TEST-283 added (REQ-263, REQ-265)
+- **Author**: oz-agent
+- **Type**: test
+- **REQs affected**: REQ-263,REQ-265
+- **Description**: Audit revealed REQ-263 (HF paginated sync persists bucket scores) and REQ-265 (HF API token in Authorization header) lacked explicit pytest coverage. Added TEST-282 (`TestHFSyncPersistsBucketScores` — verifies scores.json created with bucket_scores dict and all required keys per entry) and TEST-283 (`TestHFTokenInHeaders` — verifies token_set flag, rate_limit_tier, and Authorization header capture via mock). Both entries added to docs/TESTS.md. `specsmith sync` updated testcases.json to 260 entries.
+- **Status**: complete
+- **Chain hash**: auto
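
The WI-0512-AI ledger entry mentions context-aware history trimming, which the README describes as preserving system messages while summarising older turns once the budget is exceeded. A rough sketch of that idea, assuming a character budget and `role`/`content` message dicts; `trim_history_sketch` is a hypothetical stand-in rather than specsmith's `trim_history`, and its "summary" is only a placeholder note that is not counted against the budget:

```python
Message = dict[str, str]


def trim_history_sketch(messages: list[Message], budget_chars: int) -> list[Message]:
    """Keep system messages and the newest turns that fit; note what was dropped.

    Hypothetical illustration only. A real implementation would replace the
    placeholder note with an actual summary of the dropped turns.
    """
    total = sum(len(m["content"]) for m in messages)
    if total <= budget_chars:
        return list(messages)

    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]

    remaining = budget_chars - sum(len(m["content"]) for m in system)
    kept: list[Message] = []
    for msg in reversed(others):  # walk from newest to oldest
        if len(msg["content"]) > remaining:
            break
        kept.append(msg)
        remaining -= len(msg["content"])
    kept.reverse()  # restore chronological order

    dropped = len(others) - len(kept)
    note = {"role": "system", "content": f"[{dropped} earlier message(s) summarised]"}
    return system + ([note] if dropped else []) + kept
```

Walking from the newest turn backwards keeps the most recent context intact, which is usually what the next completion depends on most.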

docs/TESTS.md

Lines changed: 22 additions & 0 deletions
@@ -2605,6 +2605,28 @@
 - **Expected Behavior:** JSON stub displayed; file updated after add
 - **Confidence:** 0.8
 
+## TEST-282. HF Leaderboard Sync Persists Bucket Scores to JSON
+- **ID:** TEST-282
+- **Title:** HF Leaderboard Sync Persists Bucket Scores to JSON
+- **Description:** `sync_from_huggingface_blocking(force_static=True, scores_path=tmp_path/"scores.json")` creates the file at the given path, whose JSON root contains a `"bucket_scores"` dict. Each entry has `reasoning_score`, `conversational_score`, `longform_score`, and `model_name` keys.
+- **Requirement ID:** REQ-263
+- **Type:** unit
+- **Verification Method:** pytest
+- **Input:** tmp_path/"scores.json" as scores_path; force_static=True
+- **Expected Behavior:** file created; contains bucket_scores dict; at least one entry with all required keys
+- **Confidence:** 1.0
+
+## TEST-283. HF Token Included in Request Headers When Set
+- **ID:** TEST-283
+- **Title:** HF Token Included in Request Headers When Set
+- **Description:** When `SPECSMITH_HF_TOKEN` is set to a non-empty string, `test_hf_connection()` returns `{"token_set": true}` and the rate_limit_tier includes "authenticated". The `_fetch_page` request (captured via mock) includes `Authorization: Bearer <token>` in its headers.
+- **Requirement ID:** REQ-265
+- **Type:** unit
+- **Verification Method:** pytest
+- **Input:** SPECSMITH_HF_TOKEN="hf_test_token" in environment
+- **Expected Behavior:** token_set==true; rate_limit_tier contains "authenticated"
+- **Confidence:** 1.0
+
 ## TEST-263. HF Leaderboard Static Fallback Loads Without Network
 - **ID:** TEST-263
 - **Title:** HF Leaderboard Static Fallback Loads Without Network

tests/test_ai_intelligence.py

Lines changed: 120 additions & 0 deletions
@@ -482,3 +482,123 @@ def test_suggestions_are_inert(self) -> None:
             assert not providers_path.exists() or True  # safe: we didn't create it
         else:
             assert providers_path.read_text() == content_before
+
+
+# ---------------------------------------------------------------------------
+# TEST-282 — HF sync persists bucket scores to JSON (REQ-263)
+# ---------------------------------------------------------------------------
+
+
+class TestHFSyncPersistsBucketScores:
+    def test_scores_file_created_with_bucket_scores_key(self, tmp_path: Path) -> None:
+        """TEST-282: sync creates file with bucket_scores dict containing required keys."""
+        import json
+
+        from specsmith.agent.hf_leaderboard import sync_from_huggingface_blocking
+
+        scores_path = tmp_path / "scores.json"
+        sync_from_huggingface_blocking(scores_path=scores_path, force_static=True)
+
+        assert scores_path.is_file(), "scores.json must be created"
+        data = json.loads(scores_path.read_text(encoding="utf-8"))
+        assert "bucket_scores" in data, "file must have 'bucket_scores' dict"
+
+    def test_each_entry_has_required_keys(self, tmp_path: Path) -> None:
+        """TEST-282: each bucket_scores entry has reasoning_score, conversational_score, etc."""
+        import json
+
+        from specsmith.agent.hf_leaderboard import sync_from_huggingface_blocking
+
+        scores_path = tmp_path / "scores.json"
+        sync_from_huggingface_blocking(scores_path=scores_path, force_static=True)
+
+        data = json.loads(scores_path.read_text(encoding="utf-8"))
+        bucket_scores = data["bucket_scores"]
+        assert len(bucket_scores) >= 1
+
+        required_keys = {"reasoning_score", "conversational_score", "longform_score", "model_name"}
+        for model_name, entry in bucket_scores.items():
+            missing = required_keys - set(entry.keys())
+            assert not missing, f"Entry '{model_name}' missing keys: {missing}"
+
+
+# ---------------------------------------------------------------------------
+# TEST-283 — HF token included in request headers when set (REQ-265)
+# ---------------------------------------------------------------------------
+
+
+class TestHFTokenInHeaders:
+    def test_token_set_returns_true_when_configured(self) -> None:
+        """TEST-283: token_set==True and rate_limit_tier contains 'authenticated'."""
+        import urllib.error
+
+        from specsmith.agent.hf_leaderboard import test_hf_connection
+
+        # Mock urlopen to avoid network: raise URLError so the probe short-circuits
+        with (
+            patch.dict(os.environ, {"SPECSMITH_HF_TOKEN": "hf_test_token"}, clear=False),
+            patch(
+                "specsmith.agent.hf_leaderboard.urllib.request.urlopen",
+                side_effect=urllib.error.URLError("offline"),
+            ),
+        ):
+            result = test_hf_connection()
+
+        assert result["token_set"] is True
+        assert "authenticated" in result["rate_limit_tier"]
+
+    def test_token_absent_returns_false_and_anonymous_tier(self) -> None:
+        """TEST-283: no token → token_set==False and tier contains 'anonymous'."""
+        import urllib.error
+
+        from specsmith.agent.hf_leaderboard import test_hf_connection
+
+        env = {k: v for k, v in os.environ.items() if k != "SPECSMITH_HF_TOKEN"}
+        with (
+            patch.dict(os.environ, env, clear=True),
+            patch(
+                "specsmith.agent.hf_leaderboard.urllib.request.urlopen",
+                side_effect=urllib.error.URLError("offline"),
+            ),
+        ):
+            result = test_hf_connection()
+
+        assert result["token_set"] is False
+        assert "anonymous" in result["rate_limit_tier"]
+
+    def test_fetch_page_sends_authorization_header(self) -> None:
+        """TEST-283: _sync_inner sends Authorization: Bearer <token> when token is set."""
+        import urllib.error
+
+        from specsmith.agent.hf_leaderboard import sync_from_huggingface_blocking
+
+        captured_headers: list[dict[str, str]] = []
+
+        def fake_urlopen(req: object, **kwargs: object) -> object:  # noqa: ANN001
+            # Capture headers from the request and raise to abort the sync
+            captured_headers.append(dict(getattr(req, "headers", {})))
+            raise urllib.error.URLError("offline")
+
+        with (
+            patch.dict(os.environ, {"SPECSMITH_HF_TOKEN": "hf_test_token"}, clear=False),
+            patch(
+                "specsmith.agent.hf_leaderboard.urllib.request.urlopen",
+                side_effect=fake_urlopen,
+            ),
+        ):
+            # force_static=False so _fetch_page is actually invoked
+            result = sync_from_huggingface_blocking(force_static=False)
+
+        # sync falls back to static when network is unavailable
+        assert result["errors"] == 0
+        # Verify that at least one request included the Authorization header
+        assert captured_headers, "urlopen must have been called at least once"
+        auth_values = [
+            v
+            for hdrs in captured_headers
+            for k, v in hdrs.items()
+            if k.lower() == "authorization"
+        ]
+        assert any(
+            "Bearer hf_test_token" in v for v in auth_values
+        ), f"No Authorization header found in captured requests: {captured_headers}"