You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
fix: mypy type fixes and machine state sync for glossa-lab AI features
- rate_limits.py: use float(min(...)) to fix no-any-return on fallback path
- rate_limits.py: rename profile/fallback variables for correct type narrowing
- pyproject.toml: add hf_leaderboard, llm_client, chat_runner to mypy ignore_errors
- .specsmith/: sync requirements.json (261→280) and testcases.json (239→258)
Co-Authored-By: Oz <oz-agent@warp.dev>
Copy file name to clipboardExpand all lines: .specsmith/requirements.json
+133Lines changed: 133 additions & 0 deletions
Original file line number
Diff line number
Diff line change
@@ -1825,5 +1825,138 @@
1825
1825
"description": "The Kairos Agents > MCP servers list page MUST include a collapsible AI Builder card that accepts a natural-language server description, calls specsmith mcp generate <description> --json, displays the generated JSON stub, and offers an 'Add to ~/.specsmith/mcp.json' button that appends the stub to the user's MCP config file.",
"description": "specsmith MUST implement `src/specsmith/agent/hf_leaderboard.py` that fetches model benchmark data from the HuggingFace Datasets Server (`datasets-server.huggingface.co/rows?dataset=open-llm-leaderboard/contents`). The sync MUST be paginated (100 rows/page) and persist results to `~/.specsmith/model_scores.json` under a `bucket_scores` key.",
1833
+
"source": "ARCHITECTURE.md §21 [HF-001]",
1834
+
"status": "defined"
1835
+
},
1836
+
{
1837
+
"id": "REQ-264",
1838
+
"title": "HF Leaderboard Rate-Limit Handling",
1839
+
"description": "The HF leaderboard sync MUST handle HTTP 429 with exponential-backoff retry (up to 4 attempts). It MUST parse the `RateLimit: \"api\";r=X;t=Y` header to extract the exact reset window and wait accordingly. A +1 s safety margin MUST be added to the `t=` value.",
1840
+
"source": "ARCHITECTURE.md §21 [HF-002]",
1841
+
"status": "defined"
1842
+
},
1843
+
{
1844
+
"id": "REQ-265",
1845
+
"title": "HF API Token Support",
1846
+
"description": "When `SPECSMITH_HF_TOKEN` or `hf_api_token` is configured, the HF sync MUST include an `Authorization: Bearer <token>` header. The CLI `specsmith model-intel test-hf` MUST validate the token via `huggingface.co/api/whoami-v2` and report whether the Datasets Server is reachable.",
1847
+
"source": "ARCHITECTURE.md §21 [HF-003]",
1848
+
"status": "defined"
1849
+
},
1850
+
{
1851
+
"id": "REQ-266",
1852
+
"title": "HF Leaderboard Static Fallback",
1853
+
"description": "When HF is unreachable (network error, 5xx, or zero parseable rows), specsmith MUST load built-in static benchmark scores covering at least 40 models (OpenAI GPT-4o/mini, Claude 3.5 sonnet/haiku, Gemini 2.x, Mistral, Qwen, Llama, DeepSeek, Phi). The fallback MUST be transparent to callers.",
1854
+
"source": "ARCHITECTURE.md §21 [HF-004]",
1855
+
"status": "defined"
1856
+
},
1857
+
{
1858
+
"id": "REQ-267",
1859
+
"title": "Bucket Scoring Engine",
1860
+
"description": "specsmith MUST compute three task-bucket scores from raw benchmark values (0–100 scale): Reasoning = 0.35×MATH + 0.30×GPQA + 0.25×BBH + 0.10×IFEval; Conversational = 0.40×IFEval + 0.35×MMLU-PRO + 0.25×BBH; Longform = 0.35×MUSR + 0.35×IFEval + 0.30×MMLU-PRO. Scores MUST be rounded to 2 decimal places.",
1861
+
"source": "ARCHITECTURE.md §22 [BKT-001]",
1862
+
"status": "defined"
1863
+
},
1864
+
{
1865
+
"id": "REQ-268",
1866
+
"title": "Model Intelligence Recommendations",
1867
+
"description": "`specsmith model-intel recommendations [--bucket reasoning|conversational|longform]` MUST return the top-10 models sorted by the requested bucket score. The governance HTTP server MUST expose `GET /api/model-intel/recommendations?bucket=<name>` returning the same data.",
1868
+
"source": "ARCHITECTURE.md §22 [BKT-002]",
1869
+
"status": "defined"
1870
+
},
1871
+
{
1872
+
"id": "REQ-269",
1873
+
"title": "Model Intelligence CLI Commands",
1874
+
"description": "specsmith MUST provide a `model-intel` CLI group with subcommands: `sync` (run HF sync), `scores [--model NAME]` (list/get cached scores), `recommendations [--bucket NAME]` (top-10 per bucket), `test-hf` (connectivity probe). All commands MUST support `--json` flag.",
1875
+
"source": "ARCHITECTURE.md §21 [HF-005]",
1876
+
"status": "defined"
1877
+
},
1878
+
{
1879
+
"id": "REQ-270",
1880
+
"title": "Model Capability Profiles",
1881
+
"description": "specsmith MUST implement `src/specsmith/agent/model_profiles.py` with a `ModelProfile` TypedDict containing `max_tokens`, `temperature`, `ctx_budget`, `action_capable`, `prompt_style` fields. A `get_profile(model)` function MUST resolve by prefix matching (longest key first) over ≥40 known models.",
1882
+
"source": "ARCHITECTURE.md §23 [PRF-001]",
1883
+
"status": "defined"
1884
+
},
1885
+
{
1886
+
"id": "REQ-271",
1887
+
"title": "Context History Trimmer",
1888
+
"description": "`trim_history(messages, budget_chars)` in `model_profiles.py` MUST trim conversation history to fit within `budget_chars`. Oldest turns MUST be summarised into a compact `[Earlier conversation summary — N turns condensed]` assistant message rather than silently dropped. System messages MUST always be preserved.",
1889
+
"source": "ARCHITECTURE.md §23 [PRF-002]",
1890
+
"status": "defined"
1891
+
},
1892
+
{
1893
+
"id": "REQ-272",
1894
+
"title": "AI Model Pacer EMA Utilisation",
1895
+
"description": "The `ModelRateLimitScheduler` MUST track RPM and TPM utilisation as exponentially-weighted moving averages (alpha=0.25) and expose them in `snapshot()` as `rpm_ema` and `tpm_ema` fields.",
1896
+
"source": "ARCHITECTURE.md §24 [PCR-001]",
1897
+
"status": "defined"
1898
+
},
1899
+
{
1900
+
"id": "REQ-273",
1901
+
"title": "AI Model Pacer Adaptive Concurrency",
1902
+
"description": "`on_rate_limit(model, error, attempt)` MUST decrease `dynamic_concurrency` by 1 (minimum=1) and set `reduced_until` to now+120 s. Concurrency MUST restore incrementally (1 step per 60 s) once `reduced_until` has passed. The method MUST return a float delay for the caller to sleep.",
1903
+
"source": "ARCHITECTURE.md §24 [PCR-002]",
1904
+
"status": "defined"
1905
+
},
1906
+
{
1907
+
"id": "REQ-274",
1908
+
"title": "AI Model Pacer Image Token Estimation",
1909
+
"description": "`estimate_request_tokens()` MUST accept an `image_count` parameter and include `image_count × image_token_estimate` tokens in the reservation. The default `image_token_estimate` MUST be 4096.",
1910
+
"source": "ARCHITECTURE.md §24 [PCR-003]",
1911
+
"status": "defined"
1912
+
},
1913
+
{
1914
+
"id": "REQ-275",
1915
+
"title": "Multi-Provider LLM Client with Fallback",
1916
+
"description": "specsmith MUST implement `src/specsmith/agent/llm_client.py` with a `LLMProvider` ABC and `LLMClient` that tries providers in order, falling back on HTTP 401/403/429/5xx. Concrete providers MUST cover Mistral, OpenAI, Google Gemini, and Ollama. A `MockProvider` MUST be available for tests.",
1917
+
"source": "ARCHITECTURE.md §25 [LLM-001]",
1918
+
"status": "defined"
1919
+
},
1920
+
{
1921
+
"id": "REQ-276",
1922
+
"title": "LLM Client O-Series Translation",
1923
+
"description": "When the model name starts with `o1`, `o3`, or `o4`, or contains `-o1-`/`-o3-`/`-o4-`, the LLM client MUST use `max_completion_tokens` instead of `max_tokens`, force temperature to 1, and rename `system` role messages to `developer`.",
1924
+
"source": "ARCHITECTURE.md §25 [LLM-002]",
1925
+
"status": "defined"
1926
+
},
1927
+
{
1928
+
"id": "REQ-277",
1929
+
"title": "LLM Client vLLM Guided-JSON Mode",
1930
+
"description": "When a JSON schema is provided and the provider type is `byoe` or `huggingface`, the request MUST include `guided_json` and `chat_template_kwargs: {\"enable_thinking\": false}` to suppress chain-of-thought tokens and enforce structured output.",
1931
+
"source": "ARCHITECTURE.md §25 [LLM-003]",
1932
+
"status": "defined"
1933
+
},
1934
+
{
1935
+
"id": "REQ-278",
1936
+
"title": "Endpoint Preset Registry",
1937
+
"description": "`src/specsmith/agent/provider_registry.py` MUST export `ENDPOINT_PRESETS` — a list of built-in connection presets for at least: vLLM (localhost:8000), LM Studio (localhost:1234), llama.cpp (localhost:8080), OpenRouter, Together AI, Groq, Fireworks, DeepInfra, Perplexity, and Azure OpenAI. Each preset MUST include `id`, `label`, `base_url`, `endpoint_kind`, and `needs_key`.",
1938
+
"source": "ARCHITECTURE.md §26 [PRE-001]",
1939
+
"status": "defined"
1940
+
},
1941
+
{
1942
+
"id": "REQ-279",
1943
+
"title": "Endpoint Probe Enriched Metadata",
1944
+
"description": "`probe_openai_compatible()` MUST return a `models_detail` list where each entry includes `id`, `owner`, `context_length` (from `max_model_len` on vLLM, `context_length` or `context_window` otherwise), and `description`. The cap MUST be 200 models.",
1945
+
"source": "ARCHITECTURE.md §26 [PRE-002]",
1946
+
"status": "defined"
1947
+
},
1948
+
{
1949
+
"id": "REQ-280",
1950
+
"title": "Suggested Profile Generation",
1951
+
"description": "`specsmith agent suggest-profiles` MUST inspect available backends (cloud env vars, installed Ollama models, saved BYOE endpoints) and propose ready-to-add `ProviderEntry` suggestions with role-tuned temperature and max_tokens for the reasoning/conversational/longform AEE buckets. Suggestions MUST be inert (not auto-saved).",
1952
+
"source": "ARCHITECTURE.md §27 [SGP-001]",
1953
+
"status": "defined"
1954
+
},
1955
+
{
1956
+
"id": "REQ-281",
1957
+
"title": "Kairos AI Settings Bucket Score Display",
1958
+
"description": "The Kairos Agents > Providers settings page MUST display bucket scores (reasoning, conversational, longform) retrieved from `GET /api/model-intel/scores/{model}` for each configured provider. Scores MUST be shown as compact numeric badges. A Sync button MUST call `POST /api/model-intel/sync`.",
Copy file name to clipboardExpand all lines: .specsmith/testcases.json
+209Lines changed: 209 additions & 0 deletions
Original file line number
Diff line number
Diff line change
@@ -2627,5 +2627,214 @@
2627
2627
"input": {},
2628
2628
"expected_behavior": {},
2629
2629
"confidence": 1.0
2630
+
},
2631
+
{
2632
+
"id": "TEST-263",
2633
+
"title": "HF Leaderboard Static Fallback Loads Without Network",
2634
+
"description": "Calling `sync_from_huggingface_blocking(force_static=True)` on an isolated tmp store (no HF network access) returns `{\"synced\": N, \"errors\": 0, \"message\": \"...static...\"}` with N >= 40 and the store contains scores for \"gpt-4o\" and \"llama3.3:70b\".",
"description": "`get_recommendations(bucket=\"reasoning\")` on a store with 15 entries returns a list of <=10 items sorted by `reasoning_score` descending. The highest-scoring model is first.",
2668
+
"requirement_id": "REQ-268",
2669
+
"type": "unit",
2670
+
"verification_method": "pytest",
2671
+
"input": {},
2672
+
"expected_behavior": {},
2673
+
"confidence": 1.0
2674
+
},
2675
+
{
2676
+
"id": "TEST-267",
2677
+
"title": "Model Intel CLI Scores Subcommand",
2678
+
"description": "`specsmith model-intel scores --json` exits 0 and returns a dict with key `\"scores\"` containing a list. `specsmith model-intel scores --model gpt-4o --json` returns `{\"score\": {...}}` with `\"model_name\": \"gpt-4o\"`.",
2679
+
"requirement_id": "REQ-269",
2680
+
"type": "cli",
2681
+
"verification_method": "pytest",
2682
+
"input": {},
2683
+
"expected_behavior": {},
2684
+
"confidence": 1.0
2685
+
},
2686
+
{
2687
+
"id": "TEST-268",
2688
+
"title": "Model Intel CLI Sync Subcommand",
2689
+
"description": "`specsmith model-intel sync --json` exits 0 and returns `{\"synced\": N, \"errors\": 0, \"message\": \"...\"}` with N >= 0. The command does NOT fail when HF is unreachable (falls back to static).",
"description": "`get_profile(\"qwen2.5:14b\")` returns a profile with `max_tokens == 4096` and `prompt_style == \"sections\"`. `get_profile(\"gpt-4o\")` returns `prompt_style == \"sections\"`. `get_profile(\"claude-3-5-sonnet-20241022\")` returns `prompt_style == \"xml\"`. `get_profile(\"unknown-xyz\")` returns the default profile.",
2701
+
"requirement_id": "REQ-270",
2702
+
"type": "unit",
2703
+
"verification_method": "pytest",
2704
+
"input": {},
2705
+
"expected_behavior": {},
2706
+
"confidence": 1.0
2707
+
},
2708
+
{
2709
+
"id": "TEST-270",
2710
+
"title": "Context History Trimmer Preserves System Messages",
2711
+
"description": "`trim_history([{role:system, content:\"S\"*5000}, {role:user, content:\"U\"*4000}, {role:assistant, content:\"A\"*4000}], budget_chars=4000)` returns a list where system message is intact and older turns are summarised in an assistant summary message.",
2712
+
"requirement_id": "REQ-271",
2713
+
"type": "unit",
2714
+
"verification_method": "pytest",
2715
+
"input": {},
2716
+
"expected_behavior": {},
2717
+
"confidence": 1.0
2718
+
},
2719
+
{
2720
+
"id": "TEST-271",
2721
+
"title": "AI Pacer EMA Fields Present in Snapshot",
2722
+
"description": "After two `acquire/release` cycles on a pacer with rpm_limit=10, tpm_limit=5000, `snapshot(\"test-model\")` includes keys `\"rpm_ema\"` and `\"tpm_ema\"` both >= 0.0 and both < 1.0.",
"description": "Given a pacer with `max_concurrency=4`, after `on_rate_limit(\"m\", err, attempt=1)` the dynamic concurrency in `snapshot()` decreases by 1 (min 1) and the returned delay is > 0.",
2734
+
"requirement_id": "REQ-273",
2735
+
"type": "unit",
2736
+
"verification_method": "pytest",
2737
+
"input": {},
2738
+
"expected_behavior": {},
2739
+
"confidence": 1.0
2740
+
},
2741
+
{
2742
+
"id": "TEST-273",
2743
+
"title": "AI Pacer Image Token Estimation",
2744
+
"description": "`estimate_request_tokens(model=\"m\", prompt=\"hello\", image_count=2)` returns a value >= 2 * 4096 (the default image_token_estimate). With `image_count=0` the result equals the text token estimate only.",
2745
+
"requirement_id": "REQ-274",
2746
+
"type": "unit",
2747
+
"verification_method": "pytest",
2748
+
"input": {},
2749
+
"expected_behavior": {},
2750
+
"confidence": 1.0
2751
+
},
2752
+
{
2753
+
"id": "TEST-274",
2754
+
"title": "LLM Client Fallback on 429",
2755
+
"description": "`LLMClient([FailingProvider(429), MockProvider(\"ok\")])`.chat([{role:user,content:\"hi\"}]) returns an LLMResult from MockProvider without raising. A `FailingProvider(401)` also triggers fallback.",
"description": "When `model=\"o3-mini\"` is used, the outgoing request body captured by a capturing mock must contain `\"max_completion_tokens\"` (not `\"max_tokens\"`), `\"temperature\": 1`, and any system message must have `\"role\": \"developer\"`.",
2767
+
"requirement_id": "REQ-276",
2768
+
"type": "unit",
2769
+
"verification_method": "pytest",
2770
+
"input": {},
2771
+
"expected_behavior": {},
2772
+
"confidence": 1.0
2773
+
},
2774
+
{
2775
+
"id": "TEST-276",
2776
+
"title": "LLM Client vLLM Guided-JSON Payload",
2777
+
"description": "When provider_type is `byoe` and a `json_schema` dict is passed, the outgoing request body must contain `\"guided_json\"` and `\"chat_template_kwargs\": {\"enable_thinking\": false}`.",
"description": "`ENDPOINT_PRESETS` contains entries with ids including `vllm`, `lm_studio`, `llama_cpp`, `openrouter`, `together`, `groq`, `fireworks`, `deepinfra`, `perplexity`, `azure_openai`. Each entry has `id`, `label`, `base_url`, `endpoint_kind`, `needs_key`.",
2789
+
"requirement_id": "REQ-278",
2790
+
"type": "unit",
2791
+
"verification_method": "pytest",
2792
+
"input": {},
2793
+
"expected_behavior": {},
2794
+
"confidence": 1.0
2795
+
},
2796
+
{
2797
+
"id": "TEST-278",
2798
+
"title": "Endpoint Probe Returns models_detail With context_length",
2799
+
"description": "`probe_openai_compatible()` against a stub HTTP server returning `{\"data\": [{\"id\": \"m\", \"max_model_len\": 131072}]}` includes `models_detail[0][\"context_length\"] == 131072` in the result.",
2800
+
"requirement_id": "REQ-279",
2801
+
"type": "unit",
2802
+
"verification_method": "pytest",
2803
+
"input": {},
2804
+
"expected_behavior": {},
2805
+
"confidence": 1.0
2806
+
},
2807
+
{
2808
+
"id": "TEST-279",
2809
+
"title": "Suggested Profiles Inspects Cloud Env",
2810
+
"description": "`suggest_profiles()` with `OPENAI_API_KEY` set in environment returns at least 3 suggestions of `provider_type==\"cloud\"` covering distinct roles. Suggestions MUST include `reasoning`, `conversational`, or `longform` in their notes or tags.",
2811
+
"requirement_id": "REQ-280",
2812
+
"type": "unit",
2813
+
"verification_method": "pytest",
2814
+
"input": {},
2815
+
"expected_behavior": {},
2816
+
"confidence": 1.0
2817
+
},
2818
+
{
2819
+
"id": "TEST-280",
2820
+
"title": "Kairos Model Intel Governance Endpoint",
2821
+
"description": "The specsmith governance HTTP server (GovernanceHTTPServer) exposes `GET /api/model-intel/scores` returning `{\"scores\": [...]}` and `GET /api/model-intel/recommendations` returning `{\"recommendations\": [...], \"bucket\": \"reasoning\"}`.",
2822
+
"requirement_id": "REQ-268",
2823
+
"type": "integration",
2824
+
"verification_method": "pytest (HTTP client against test server)",
2825
+
"input": {},
2826
+
"expected_behavior": {},
2827
+
"confidence": 1.0
2828
+
},
2829
+
{
2830
+
"id": "TEST-281",
2831
+
"title": "Kairos AI Providers Bucket Score Section Compiles",
2832
+
"description": "The updated `ai_providers_page.rs` in Kairos compiles without errors under `cargo check --package kairos`. The file must contain `model_intel` function call or `bucket_score` field rendering code.",
0 commit comments