Skip to content

Commit 62481bb

Browse files
tbitcsoz-agent
andcommitted
feat: glossa-lab AI patterns — HF leaderboard, model profiles, LLM client, pacer v2, endpoint presets
Architecture (§21-27): - HuggingFace Open LLM Leaderboard sync with bucket scoring (reasoning/conversational/longform) and 50+ model static fallback - Model capability profiles with prefix-matching and ctx history trimmer - Multi-provider LLM client with fallback waterfall + O-series translation - AI Model Pacer v2: EMA utilisation, adaptive concurrency, image tokens - Endpoint presets (vLLM/LM Studio/llama.cpp/OpenRouter/Groq etc.) - Suggested profile generation from env/Ollama/BYOE - Model intelligence REST endpoints on governance HTTP server New files: - src/specsmith/agent/hf_leaderboard.py - src/specsmith/agent/model_profiles.py - src/specsmith/agent/llm_client.py - tests/test_ai_intelligence.py - tests/test_ai_client.py Updated: rate_limits.py, provider_registry.py, cli.py, governance_logic.py Governance: REQ-263..REQ-281, TEST-263..TEST-281, ARCHITECTURE.md §21-27 Tests: 705 passed Co-Authored-By: Oz <oz-agent@warp.dev>
1 parent 36acf3e commit 62481bb

13 files changed

Lines changed: 3734 additions & 11 deletions

docs/ARCHITECTURE.md

Lines changed: 124 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -469,3 +469,127 @@ Kairos Settings pages that expose specsmith features via the governance REST API
469469
- Skills, Eval, Teams pages consume `GET /api/skills`, `GET /api/eval/suites`, `GET /api/teams`
470470

471471
All pages use async health polling via `GovernanceClient.get_json()` and follow the monolith SettingsWidget pattern.
472+
473+
## 21. HuggingFace Open LLM Leaderboard Integration
474+
475+
Source: `src/specsmith/agent/hf_leaderboard.py`
476+
477+
Syncs model benchmark data from the HuggingFace Datasets Server (`datasets-server.huggingface.co/rows?dataset=open-llm-leaderboard/contents`). Supports paginated fetch, exponential-backoff 429 handling with `RateLimit: t=` header parsing, optional HF API token (doubles rate limit to 1000 req/5min), and a static fallback of 50+ known models for offline operation.
478+
479+
Background task runs 15 s after startup then every 24 h. Scores are persisted to `~/.specsmith/model_scores.json` under a `bucket_scores` key alongside existing role scores.
480+
481+
Benchmarks mapped: IFEval, BBH, MATH Lvl 5, GPQA, MUSR, MMLU-PRO (HF field names → internal keys).
482+
483+
REST endpoints exposed by governance server:
484+
- `GET /api/model-intel/scores` — all cached scores
485+
- `GET /api/model-intel/scores/{name}` — one model
486+
- `GET /api/model-intel/recommendations?bucket=reasoning` — top-10
487+
- `POST /api/model-intel/sync` — force re-sync
488+
- `POST /api/model-intel/test-hf` — connectivity + token probe
489+
490+
CLI: `specsmith model-intel sync | scores | recommendations | test-hf`
491+
492+
## 22. Bucket Scoring Engine
493+
494+
Source: `src/specsmith/agent/hf_leaderboard.py` (`_compute_bucket_scores`)
495+
496+
Three task-bucket scores computed from raw benchmark values (normalised 0–100):
497+
498+
- **Reasoning** = 0.35×MATH + 0.30×GPQA + 0.25×BBH + 0.10×IFEval
499+
- **Conversational** = 0.40×IFEval + 0.35×MMLU-PRO + 0.25×BBH
500+
- **Longform** = 0.35×MUSR + 0.35×IFEval + 0.30×MMLU-PRO
501+
502+
Ranked recommendation returns the top-10 models for a requested bucket. The engine merges HF-synced data with the existing `BASELINE_SCORES` so both cloud and local Ollama models appear in rankings.
503+
504+
Base+org-prefix deduplication: `Qwen/Qwen3-14B` is stored under both its full name and `Qwen3-14B` so vLLM-style repo-ID model names match correctly.
505+
506+
## 23. Model Capability Profiles
507+
508+
Source: `src/specsmith/agent/model_profiles.py`
509+
510+
Per-model capability descriptors resolved by prefix matching (longest key wins):
511+
512+
| Field | Type | Meaning |
513+
|---|---|---|
514+
| `max_tokens` | int | Max completion tokens to request |
515+
| `temperature` | float | Sampling temperature |
516+
| `ctx_budget` | int | Approx. chars of conversation history to keep |
517+
| `action_capable` | bool | Reliably produces structured actions/JSON |
518+
| `prompt_style` | str | `plain` \| `sections` \| `xml` |
519+
520+
Covers 40+ models across Ollama (Mistral, Qwen, Llama, Gemma, Phi, DeepSeek), cloud (OpenAI o-series, Claude, Mistral API), and a `_DEFAULT` fallback.
521+
522+
Context history trimmer (`trim_history`) summarises dropped turns into a compact `[Earlier conversation summary — N turns condensed]` assistant message to preserve research continuity.
523+
524+
## 24. AI Model Pacer v2
525+
526+
Source: `src/specsmith/rate_limits.py` (upgraded `ModelRateLimitScheduler`)
527+
528+
Enhancements over the existing rolling-window scheduler:
529+
530+
- **EMA utilisation tracking** — exponentially-weighted moving average of RPM/TPM utilisation (`alpha=0.25`) surfaced in `snapshot()`
531+
- **Adaptive concurrency**`dynamic_concurrency` decreases on `on_rate_limit()`, restores after 120 s (incrementally, 60 s between steps)
532+
- **Retry-After parsing**`parse_retry_after_seconds()` extracts `"try again in Xs"` from provider error strings; used when exponential backoff alone is insufficient
533+
- **Image token estimation**`estimate_request_tokens()` accepts `image_count` and multiplies by a per-model `image_token_estimate` (default 4096)
534+
- **Pre-dispatch budget check**`acquire()` blocks until RPM + TPM budgets allow dispatch; `release()` wakes waiting callers
535+
536+
All operations are guarded by a single `threading.Condition` lock so the pacer is safe for concurrent agent sessions.
537+
538+
## 25. Multi-Provider LLM Client with Fallback
539+
540+
Source: `src/specsmith/agent/llm_client.py`
541+
542+
Provider-agnostic chat client that tries a configurable ordered list of providers, falling back on 401/403/429/5xx. No optional packages required — uses `urllib` only.
543+
544+
**LLMProvider ABC**: `name`, `key_name`, `default_model`, `is_configured()`, `chat()`.
545+
546+
Concrete providers: `MistralProvider`, `OpenAIProvider`, `GoogleProvider`, `OllamaProvider`, `MockProvider` (test-only).
547+
548+
**O-series translation**: OpenAI o1/o3/o4 models receive `max_completion_tokens` instead of `max_tokens` and their `system` messages are renamed to `developer`.
549+
550+
**vLLM guided-JSON**: endpoints of type `byoe` or `huggingface` receive `guided_json` + `chat_template_kwargs: {enable_thinking: false}` when a JSON schema is provided.
551+
552+
**Gemini parts extraction**: handles models that return answer text in `parts` rather than `content`.
553+
554+
**JSON extraction helper** (`_extract_json`): tries direct parse → `\`\`\`json` fence → first balanced `{}` block before raising.
555+
556+
Provider fallback decision: `_is_fallback_status(code)` returns True for 401, 403, 404, 408, 409, 425, 429, 5xx.
557+
558+
## 26. Endpoint Preset Registry
559+
560+
Source: `src/specsmith/agent/provider_registry.py` (`ENDPOINT_PRESETS`)
561+
562+
Built-in connection presets for common local and hosted inference backends:
563+
564+
| Preset | Base URL | Key needed |
565+
|---|---|---|
566+
| vLLM (local) | `http://localhost:8000/v1` | No |
567+
| LM Studio | `http://localhost:1234/v1` | No |
568+
| llama.cpp server | `http://localhost:8080/v1` | No |
569+
| OpenRouter | `https://openrouter.ai/api/v1` | Yes |
570+
| Together AI | `https://api.together.xyz/v1` | Yes |
571+
| Groq | `https://api.groq.com/openai/v1` | Yes |
572+
| Fireworks AI | `https://api.fireworks.ai/inference/v1` | Yes |
573+
| DeepInfra | `https://api.deepinfra.com/v1/openai` | Yes |
574+
| Perplexity | `https://api.perplexity.ai` | Yes |
575+
| Azure OpenAI | _(user-supplied)_ | Yes |
576+
577+
Probe function enriches model list with `context_length` (from `max_model_len` on vLLM), `owner`, and `description` fields.
578+
579+
CLI: `specsmith agent endpoint-presets`.
580+
581+
## 27. Suggested Profile Generation
582+
583+
Source: `src/specsmith/agent/provider_registry.py` (`suggest_profiles`)
584+
585+
Generates a list of ready-to-add `ProviderEntry` suggestions by inspecting:
586+
587+
1. Cloud API keys present in environment variables
588+
2. Ollama models currently installed (`/api/tags`)
589+
3. Custom BYOE endpoints in `providers.json`
590+
591+
For each backend, role-tuned parameter sets (temperature, max_tokens) are proposed following the AEE bucket taxonomy: `reasoning`, `conversational`, `longform`.
592+
593+
Suggestions are inert previews — the user calls `specsmith agent providers add` to persist.
594+
595+
CLI: `specsmith agent suggest-profiles`.

docs/REQUIREMENTS.md

Lines changed: 133 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1803,3 +1803,136 @@ ame, command, and rgs fields derived from the description. The stub MUST be val
18031803
- **Description:** The Kairos Agents > MCP servers list page MUST include a collapsible AI Builder card that accepts a natural-language server description, calls specsmith mcp generate <description> --json, displays the generated JSON stub, and offers an 'Add to ~/.specsmith/mcp.json' button that appends the stub to the user's MCP config file.
18041804
- **Source:** ARCHITECTURE.md [Kairos Settings Extensions]
18051805
- **Status:** implemented
1806+
1807+
## 263. HuggingFace Open LLM Leaderboard Sync
1808+
- **ID:** REQ-263
1809+
- **Title:** HuggingFace Open LLM Leaderboard Sync
1810+
- **Description:** specsmith MUST implement `src/specsmith/agent/hf_leaderboard.py` that fetches model benchmark data from the HuggingFace Datasets Server (`datasets-server.huggingface.co/rows?dataset=open-llm-leaderboard/contents`). The sync MUST be paginated (100 rows/page) and persist results to `~/.specsmith/model_scores.json` under a `bucket_scores` key.
1811+
- **Source:** ARCHITECTURE.md §21 [HF-001]
1812+
- **Status:** defined
1813+
1814+
## 264. HF Leaderboard Rate-Limit Handling
1815+
- **ID:** REQ-264
1816+
- **Title:** HF Leaderboard Rate-Limit Handling
1817+
- **Description:** The HF leaderboard sync MUST handle HTTP 429 with exponential-backoff retry (up to 4 attempts). It MUST parse the `RateLimit: "api";r=X;t=Y` header to extract the exact reset window and wait accordingly. A +1 s safety margin MUST be added to the `t=` value.
1818+
- **Source:** ARCHITECTURE.md §21 [HF-002]
1819+
- **Status:** defined
1820+
1821+
## 265. HF API Token Support
1822+
- **ID:** REQ-265
1823+
- **Title:** HF API Token Support
1824+
- **Description:** When `SPECSMITH_HF_TOKEN` or `hf_api_token` is configured, the HF sync MUST include an `Authorization: Bearer <token>` header. The CLI `specsmith model-intel test-hf` MUST validate the token via `huggingface.co/api/whoami-v2` and report whether the Datasets Server is reachable.
1825+
- **Source:** ARCHITECTURE.md §21 [HF-003]
1826+
- **Status:** defined
1827+
1828+
## 266. HF Leaderboard Static Fallback
1829+
- **ID:** REQ-266
1830+
- **Title:** HF Leaderboard Static Fallback
1831+
- **Description:** When HF is unreachable (network error, 5xx, or zero parseable rows), specsmith MUST load built-in static benchmark scores covering at least 40 models (OpenAI GPT-4o/mini, Claude 3.5 sonnet/haiku, Gemini 2.x, Mistral, Qwen, Llama, DeepSeek, Phi). The fallback MUST be transparent to callers.
1832+
- **Source:** ARCHITECTURE.md §21 [HF-004]
1833+
- **Status:** defined
1834+
1835+
## 267. Bucket Scoring Engine
1836+
- **ID:** REQ-267
1837+
- **Title:** Bucket Scoring Engine
1838+
- **Description:** specsmith MUST compute three task-bucket scores from raw benchmark values (0–100 scale): Reasoning = 0.35×MATH + 0.30×GPQA + 0.25×BBH + 0.10×IFEval; Conversational = 0.40×IFEval + 0.35×MMLU-PRO + 0.25×BBH; Longform = 0.35×MUSR + 0.35×IFEval + 0.30×MMLU-PRO. Scores MUST be rounded to 2 decimal places.
1839+
- **Source:** ARCHITECTURE.md §22 [BKT-001]
1840+
- **Status:** defined
1841+
1842+
## 268. Model Intelligence Recommendations
1843+
- **ID:** REQ-268
1844+
- **Title:** Model Intelligence Recommendations
1845+
- **Description:** `specsmith model-intel recommendations [--bucket reasoning|conversational|longform]` MUST return the top-10 models sorted by the requested bucket score. The governance HTTP server MUST expose `GET /api/model-intel/recommendations?bucket=<name>` returning the same data.
1846+
- **Source:** ARCHITECTURE.md §22 [BKT-002]
1847+
- **Status:** defined
1848+
1849+
## 269. Model Intelligence CLI Commands
1850+
- **ID:** REQ-269
1851+
- **Title:** Model Intelligence CLI Commands
1852+
- **Description:** specsmith MUST provide a `model-intel` CLI group with subcommands: `sync` (run HF sync), `scores [--model NAME]` (list/get cached scores), `recommendations [--bucket NAME]` (top-10 per bucket), `test-hf` (connectivity probe). All commands MUST support `--json` flag.
1853+
- **Source:** ARCHITECTURE.md §21 [HF-005]
1854+
- **Status:** defined
1855+
1856+
## 270. Model Capability Profiles
1857+
- **ID:** REQ-270
1858+
- **Title:** Model Capability Profiles
1859+
- **Description:** specsmith MUST implement `src/specsmith/agent/model_profiles.py` with a `ModelProfile` TypedDict containing `max_tokens`, `temperature`, `ctx_budget`, `action_capable`, `prompt_style` fields. A `get_profile(model)` function MUST resolve by prefix matching (longest key first) over ≥40 known models.
1860+
- **Source:** ARCHITECTURE.md §23 [PRF-001]
1861+
- **Status:** defined
1862+
1863+
## 271. Context History Trimmer
1864+
- **ID:** REQ-271
1865+
- **Title:** Context History Trimmer
1866+
- **Description:** `trim_history(messages, budget_chars)` in `model_profiles.py` MUST trim conversation history to fit within `budget_chars`. Oldest turns MUST be summarised into a compact `[Earlier conversation summary — N turns condensed]` assistant message rather than silently dropped. System messages MUST always be preserved.
1867+
- **Source:** ARCHITECTURE.md §23 [PRF-002]
1868+
- **Status:** defined
1869+
1870+
## 272. AI Model Pacer EMA Utilisation
1871+
- **ID:** REQ-272
1872+
- **Title:** AI Model Pacer EMA Utilisation
1873+
- **Description:** The `ModelRateLimitScheduler` MUST track RPM and TPM utilisation as exponentially-weighted moving averages (alpha=0.25) and expose them in `snapshot()` as `rpm_ema` and `tpm_ema` fields.
1874+
- **Source:** ARCHITECTURE.md §24 [PCR-001]
1875+
- **Status:** defined
1876+
1877+
## 273. AI Model Pacer Adaptive Concurrency
1878+
- **ID:** REQ-273
1879+
- **Title:** AI Model Pacer Adaptive Concurrency
1880+
- **Description:** `on_rate_limit(model, error, attempt)` MUST decrease `dynamic_concurrency` by 1 (minimum=1) and set `reduced_until` to now+120 s. Concurrency MUST restore incrementally (1 step per 60 s) once `reduced_until` has passed. The method MUST return a float delay for the caller to sleep.
1881+
- **Source:** ARCHITECTURE.md §24 [PCR-002]
1882+
- **Status:** defined
1883+
1884+
## 274. AI Model Pacer Image Token Estimation
1885+
- **ID:** REQ-274
1886+
- **Title:** AI Model Pacer Image Token Estimation
1887+
- **Description:** `estimate_request_tokens()` MUST accept an `image_count` parameter and include `image_count × image_token_estimate` tokens in the reservation. The default `image_token_estimate` MUST be 4096.
1888+
- **Source:** ARCHITECTURE.md §24 [PCR-003]
1889+
- **Status:** defined
1890+
1891+
## 275. Multi-Provider LLM Client with Fallback
1892+
- **ID:** REQ-275
1893+
- **Title:** Multi-Provider LLM Client with Fallback
1894+
- **Description:** specsmith MUST implement `src/specsmith/agent/llm_client.py` with a `LLMProvider` ABC and `LLMClient` that tries providers in order, falling back on HTTP 401/403/429/5xx. Concrete providers MUST cover Mistral, OpenAI, Google Gemini, and Ollama. A `MockProvider` MUST be available for tests.
1895+
- **Source:** ARCHITECTURE.md §25 [LLM-001]
1896+
- **Status:** defined
1897+
1898+
## 276. LLM Client O-Series Translation
1899+
- **ID:** REQ-276
1900+
- **Title:** LLM Client O-Series Translation
1901+
- **Description:** When the model name starts with `o1`, `o3`, or `o4`, or contains `-o1-`/`-o3-`/`-o4-`, the LLM client MUST use `max_completion_tokens` instead of `max_tokens`, force temperature to 1, and rename `system` role messages to `developer`.
1902+
- **Source:** ARCHITECTURE.md §25 [LLM-002]
1903+
- **Status:** defined
1904+
1905+
## 277. LLM Client vLLM Guided-JSON Mode
1906+
- **ID:** REQ-277
1907+
- **Title:** LLM Client vLLM Guided-JSON Mode
1908+
- **Description:** When a JSON schema is provided and the provider type is `byoe` or `huggingface`, the request MUST include `guided_json` and `chat_template_kwargs: {"enable_thinking": false}` to suppress chain-of-thought tokens and enforce structured output.
1909+
- **Source:** ARCHITECTURE.md §25 [LLM-003]
1910+
- **Status:** defined
1911+
1912+
## 278. Endpoint Preset Registry
1913+
- **ID:** REQ-278
1914+
- **Title:** Endpoint Preset Registry
1915+
- **Description:** `src/specsmith/agent/provider_registry.py` MUST export `ENDPOINT_PRESETS` — a list of built-in connection presets for at least: vLLM (localhost:8000), LM Studio (localhost:1234), llama.cpp (localhost:8080), OpenRouter, Together AI, Groq, Fireworks, DeepInfra, Perplexity, and Azure OpenAI. Each preset MUST include `id`, `label`, `base_url`, `endpoint_kind`, and `needs_key`.
1916+
- **Source:** ARCHITECTURE.md §26 [PRE-001]
1917+
- **Status:** defined
1918+
1919+
## 279. Endpoint Probe Enriched Metadata
1920+
- **ID:** REQ-279
1921+
- **Title:** Endpoint Probe Enriched Metadata
1922+
- **Description:** `probe_openai_compatible()` MUST return a `models_detail` list where each entry includes `id`, `owner`, `context_length` (from `max_model_len` on vLLM, `context_length` or `context_window` otherwise), and `description`. The cap MUST be 200 models.
1923+
- **Source:** ARCHITECTURE.md §26 [PRE-002]
1924+
- **Status:** defined
1925+
1926+
## 280. Suggested Profile Generation
1927+
- **ID:** REQ-280
1928+
- **Title:** Suggested Profile Generation
1929+
- **Description:** `specsmith agent suggest-profiles` MUST inspect available backends (cloud env vars, installed Ollama models, saved BYOE endpoints) and propose ready-to-add `ProviderEntry` suggestions with role-tuned temperature and max_tokens for the reasoning/conversational/longform AEE buckets. Suggestions MUST be inert (not auto-saved).
1930+
- **Source:** ARCHITECTURE.md §27 [SGP-001]
1931+
- **Status:** defined
1932+
1933+
## 281. Kairos AI Settings Bucket Score Display
1934+
- **ID:** REQ-281
1935+
- **Title:** Kairos AI Settings Bucket Score Display
1936+
- **Description:** The Kairos Agents > Providers settings page MUST display bucket scores (reasoning, conversational, longform) retrieved from `GET /api/model-intel/scores/{model}` for each configured provider. Scores MUST be shown as compact numeric badges. A Sync button MUST call `POST /api/model-intel/sync`.
1937+
- **Source:** ARCHITECTURE.md §20–21 [KAI-001]
1938+
- **Status:** defined

0 commit comments

Comments
 (0)