
Commit a8fb83b

docs update
1 parent 3cee669 commit a8fb83b

1 file changed

Lines changed: 92 additions & 89 deletions

docs/changelog.mdx

Original file line number | Diff line number | Diff line change
@@ -3,144 +3,147 @@ title: "Changelog"
33
description: "Product updates and release notes for HUD SDK and Platform."
44
---
55

6-
<Update label="March 16, 2026" description="v0.5.29 – v0.5.33">
6+
<Update label="May 6, 2026">
7+
## Models, Tasksets, Templates & Sharing
8+
9+
### Platform
10+
11+
- **Models directory refresh** — `/models` is a single unified list with Private and Trainable filters and a live usage column on every row.
12+
- **Taskset analytics tab** — dedicated analytics view on tasksets with charts and richer summaries.
13+
- **Multi-environment taskset selection** — pick multiple environments at once when configuring a taskset run.
14+
- **Run from suggested tasksets** — kick off an evaluation from a model's suggested-taskset row with the model already locked in.
15+
- **Templates and workflow orchestration** — templates settings page and a right-click workflow entry point for repeatable runs.
16+
- **Resource sharing** — invite users or whole teams to traces, jobs, evalsets, models, registry items, and collections with a unified accept flow.
17+
- **Trace grader info** — evaluation cards on traces show the grader that produced each result.
18+
</Update>
19+
20+
<Update label="March 16, 2026">
721
## A2A Chat, Citations, GPT-5 & CLI Sync
822

9-
- **A2A chat orchestrator** — agent-to-agent communication for multi-agent workflows with input handling and follow-up turns
10-
- **`hud sync tasks`** — new CLI command to sync task definitions from Python files or directories to the platform
11-
- **`hud sync env`** — new CLI command replacing `hud link`, syncing local environment configs with collision detection
12-
- **`hud eval` accepts Python files** — run evaluations directly from `.py` files and directories containing `Task` objects
13-
- **Chat class** — new `Chat` abstraction in the SDK for managing multi-turn agent conversations
14-
- **GPT-5 support** — `ResponseAgent` defaults to `gpt-5`, with ToolSearch tool support
15-
- **Citations** — citation support for Claude, Gemini, and OpenAI responses in chat and agent traces
16-
- **JPEG compression for screenshots** — reduces token usage for Anthropic computer use with configurable quality
17-
- **Interactive deploy collision handling** — `hud deploy` now prompts when environment names collide instead of silently overwriting
18-
- **Configurable bash timeout** — computer tool bash sessions support custom timeout values (previously hardcoded)
23+
- **A2A chat orchestrator** — agent-to-agent communication for multi-agent workflows with input handling and follow-up turns.
24+
- **`hud sync tasks`** — sync task definitions from Python files or directories to the platform.
25+
- **`hud sync env`** — sync local environment configs with collision detection (replaces `hud link`).
26+
- **`hud eval` accepts Python files** — run evaluations directly from `.py` files and directories containing `Task` objects.
27+
- **Chat class** — manage multi-turn agent conversations from a single SDK abstraction.
28+
- **GPT-5 support** — `ResponseAgent` defaults to `gpt-5`, with ToolSearch tool support.
29+
- **Citations** — citation support for Claude, Gemini, and OpenAI responses in chat and agent traces.
1930

2031
### Platform
2132

22-
- **Click & scroll coordinate overlays** — computer use traces render click coordinates and scroll actions directly on screenshots
23-
- **Trace-level QA workflows** — run QA workflows across all tasks from the trace table, with screenshot input and per-task status tracking
24-
- **Evalset environment filtering** — filter results by environment version, with earliest-version-only toggle
25-
- **EvaluationResult info viewer** — inspect the full `info` field of evaluation results directly in the UI
26-
- **Individual user spend** — usage page now shows per-user spend alongside team totals
27-
- **Inline job renaming** — rename jobs directly from the jobs page
28-
- **Resizable task name column** — longer task slugs visible with a resizable column and higher character limit
29-
- **Vendor portal** — new vendor-facing site for RFP intake and bid management
30-
- **Modal integration** — run environments on Modal compute infrastructure
31-
- **Resources section** — new `/resources` page with published articles
33+
- **Click & scroll coordinate overlays** — computer use traces render click coordinates and scroll actions directly on screenshots.
34+
- **Trace-level QA workflows** — run QA workflows across all tasks from the trace table, with screenshot input and per-task status.
35+
- **Evalset environment filtering** — filter results by environment version, with an earliest-version-only toggle.
36+
- **EvaluationResult info viewer** — inspect the full `info` field of evaluation results directly in the UI.
37+
- **Individual user spend** — usage page shows per-user spend alongside team totals.
38+
- **Inline job renaming** — rename jobs directly from the jobs page.
39+
- **Modal integration** — run environments on Modal compute infrastructure.
40+
- **Resources section** — new `/resources` page with published articles.
3241
</Update>
3342

34-
<Update label="February 16, 2026" description="v0.5.18 – v0.5.28">
43+
<Update label="February 16, 2026">
3544
## Opus 4.6 Computer Use, Streaming & Deploy Improvements
3645

37-
- **Opus 4.6 computer tool** — native support for Claude Opus 4.6 computer use with zoom and screenshot gating
38-
- **Fine-grained tool streaming** — opt-in streaming for individual tool results during agent execution
39-
- **`hud deploy` build args & secrets** — pass build arguments and secrets to environment container builds
40-
- **`allowed_tools` in `@env.scenario`** — scope tool access per evaluation scenario via the decorator
41-
- **Retry logic for MCP errors** — automatic retry with backoff for 5xx errors from `mcp.hud.ai`
42-
- **Checkpoint configs** — configure checkpoint behavior for long-running evaluations
43-
- **Subagent instrumentation** — telemetry now captures subagent spans for nested agent workflows
46+
- **Opus 4.6 computer tool** — native support for Claude Opus 4.6 computer use with zoom and screenshot gating.
47+
- **Fine-grained tool streaming** — opt-in streaming for individual tool results during agent execution.
48+
- **`hud deploy` build args & secrets** — pass build arguments and secrets to environment container builds.
49+
- **`allowed_tools` in `@env.scenario`** — scope tool access per evaluation scenario via the decorator.
50+
- **Checkpoint configs** — configure checkpoint behavior for long-running evaluations.
4451

4552
### Platform
4653

47-
- **Billing refactor** — auto top-up, redesigned billing page, and per-key pricing for HUD-managed API keys
48-
- **Trace viewer enhancements** — strip review mode, inline run switching, file attachment display
49-
- **System prompt in trace viewer** — system prompt visible (collapsed by default) in the trace sidebar
50-
- **Trace comments** — add and edit comments on individual traces, visible as a dedicated column in taskset view
51-
- **Training jobs dashboard** — dedicated section for RL training jobs with detail pages
52-
- **Native binarization toggle** — pass/fail binarization for taskset evaluations, built into the platform
53-
- **Column ordering** — reorder columns in the taskset table view
54-
- **Model & environment sorting** — sort taskset results by model, environment, and environment version
54+
- **Billing refactor** — auto top-up, redesigned billing page, and per-key pricing for HUD-managed API keys.
55+
- **Trace viewer enhancements** — strip review mode, inline run switching, and file attachment display.
56+
- **Trace comments** — add and edit comments on individual traces, with a dedicated column in taskset view.
57+
- **Training jobs dashboard** — dedicated section for RL training jobs with detail pages.
58+
- **Native binarization toggle** — pass/fail binarization for taskset evaluations, built into the platform.
59+
- **Column ordering** — reorder columns in the taskset table view.
60+
- **Model & environment sorting** — sort taskset results by model, environment, and environment version.
5561
</Update>
5662

57-
<Update label="January 12, 2026" description="v0.5.5 – v0.5.17">
63+
<Update label="January 12, 2026">
5864
## CLI Refinements & Leaderboard Redesign
5965

60-
- **Build args for `hud deploy`** — pass custom build arguments to environment container builds
61-
- **Subagent telemetry** — telemetry instrumentation for subagent spans within nested workflows
62-
- **Server output validation** — runtime validation of MCP server responses
63-
- **Wildcard tools** — environments can expose `*` to allow all tools without explicit registration
64-
- **CLI mode distinction** — `hud build` and `hud analyze` distinguish between HTTP and stdio modes
66+
- **Build args for `hud deploy`** — pass custom build arguments to environment container builds.
67+
- **Wildcard tools** — environments can expose `*` to allow all tools without explicit registration.
68+
- **CLI mode distinction** — `hud build` and `hud analyze` distinguish between HTTP and stdio modes.
6569

6670
### Platform
6771

68-
- **Leaderboard redesign** — redesigned leaderboards with publishing flow, public visibility, and embedding support
69-
- **Slack bot** — Slack integration for job notifications and external integration provider support
70-
- **Trace compact view** — compact trace view with column reorder, inline comments, and truncated task names
71-
- **BYOK API keys** — bring-your-own-key support with `use_hud_key` option for user-managed API keys
72-
- **Per-key pricing** — individual pricing tiers for HUD-managed API keys
73-
- **Jobs page improvements** — compact job list view, stats section updates
72+
- **Leaderboard redesign** — redesigned leaderboards with publishing flow, public visibility, and embedding support.
73+
- **Slack bot** — Slack integration for job notifications and external integration providers.
74+
- **Trace compact view** — compact trace view with column reorder, inline comments, and truncated task names.
75+
- **BYOK API keys** — bring-your-own-key support with a `use_hud_key` option for user-managed API keys.
76+
- **Per-key pricing** — individual pricing tiers for HUD-managed API keys.
77+
- **Jobs page improvements** — compact job list view and refreshed stats.
7478
</Update>
7579

76-
<Update label="December 17, 2025" description="v0.5.0 – v0.5.4">
80+
<Update label="December 17, 2025">
7781
## v0.5.0: MCP-First Architecture
7882

79-
- **Environments decoupled** — environment definitions moved to separate repos, enabling independent versioning and community contributions
80-
- **Unified scenario/tool/prompt/resource handling** — single abstraction layer for MCP servers and client-side tools, with caching and hot-reload
81-
- **New telemetry** — OpenTelemetry-based instrumentation with trace IDs, subagent spans, and structured logging
82-
- **Scenario decorator** — `@env.scenario` for defining evaluation scenarios with typed configuration
83-
- **RL training** — initial support for reinforcement learning training via the CLI
83+
- **Environments decoupled** — environment definitions moved to separate repos, enabling independent versioning and community contributions.
84+
- **Unified scenario/tool/prompt/resource handling** — single abstraction layer for MCP servers and client-side tools, with caching and hot-reload.
85+
- **Telemetry** — trace IDs, subagent spans, and structured logging for agent runs.
86+
- **Scenario decorator** — `@env.scenario` for defining evaluation scenarios with typed configuration.
87+
- **RL training** — initial support for reinforcement learning training via the CLI.
8488

8589
### Platform
8690

87-
- **Inference API usage tracking** — track inference API usage on the usage page
88-
- **HUD-managed API keys** — platform-side API key management with `set api_key` support
91+
- **Inference API usage tracking** — track inference API usage on the usage page.
92+
- **HUD-managed API keys** — platform-side API key management with `set api_key` support.
8993
</Update>
9094

91-
<Update label="October 1, 2025" description="v0.4.49 – v0.4.74">
95+
<Update label="October 1, 2025">
9296
## Bedrock, Gemini & Expanded Model Support
9397

94-
- **AWS Bedrock** — `hud-python[bedrock]` extra for running Claude agents via AWS Bedrock
95-
- **Gemini CUA** — Gemini computer use agent support with checkpoint management
96-
- **Qwen computer tool** — QwenComputerTool for Qwen-series models
97-
- **MCP server support** — use HUD environments as MCP servers, integrating with any MCP-compatible client
98-
- **Telemetry tracing** — structured telemetry for agent runs with trace export
98+
- **AWS Bedrock** — `hud-python[bedrock]` extra for running Claude agents via AWS Bedrock.
99+
- **Gemini CUA** — Gemini computer use agent support with checkpoint management.
100+
- **Qwen computer tool** — QwenComputerTool for Qwen-series models.
101+
- **MCP server support** — use HUD environments as MCP servers, integrating with any MCP-compatible client.
102+
- **Telemetry tracing** — structured telemetry for agent runs with trace export.
99103

100104
### Platform
101105

102-
- **Text trace viewer** — view text-only agent traces with dedicated viewer
103-
- **Leaderboard embeds** — embed leaderboards in external pages
104-
- **Versioned models** — unified evalsets and leaderboards with versioned model support
105-
- **Usage tracking & billing** — Stripe integration, subscription management, and usage analytics
106+
- **Text trace viewer** — view text-only agent traces with a dedicated viewer.
107+
- **Leaderboard embeds** — embed leaderboards in external pages.
108+
- **Versioned models** — unified evalsets and leaderboards with versioned model support.
109+
- **Usage tracking & billing** — usage analytics and subscription management.
106110
</Update>
107111

108-
<Update label="August 23, 2025" description="v0.3.0 – v0.4.48">
112+
<Update label="August 23, 2025">
109113
## CLI & Claude Agent
110114

111-
- **`hud` CLI** — full CLI for the development lifecycle: `init`, `dev`, `build`, `deploy`, `eval`, `analyze`, `debug`
112-
- **Claude agent with prompt caching** — built-in Claude agent with Anthropic prompt caching for reduced latency and cost
113-
- **Pre-filtered tools** — agents receive only the tools relevant to their current scenario
114-
- **User-provided system prompts** — custom system prompts for tasksets and individual tasks
115+
- **`hud` CLI** — full CLI for the development lifecycle: `init`, `dev`, `build`, `deploy`, `eval`, `analyze`, `debug`.
116+
- **Claude agent with prompt caching** — built-in Claude agent with reduced latency and cost.
117+
- **Pre-filtered tools** — agents receive only the tools relevant to their current scenario.
118+
- **User-provided system prompts** — custom system prompts for tasksets and individual tasks.
115119

116120
### Platform
117121

118-
- **Trace viewer** — full trace exploration UI with step-by-step replay of agent actions and screenshots
119-
- **Leaderboards & scorecards** — evalset leaderboards with scorecard breakdowns
120-
- **Jobs & runs display** — view agent runs with step-by-step screenshots and action metadata
121-
- **Public trace sharing** — publish and share individual traces publicly
122+
- **Trace viewer** — full trace exploration UI with step-by-step replay of agent actions and screenshots.
123+
- **Leaderboards & scorecards** — evalset leaderboards with scorecard breakdowns.
124+
- **Jobs & runs display** — view agent runs with step-by-step screenshots and action metadata.
125+
- **Public trace sharing** — publish and share individual traces publicly.
122126
</Update>
123127

124-
<Update label="April 18, 2025" description="v0.1.5 – v0.2.0">
128+
<Update label="April 18, 2025">
125129
## Environment Controllers & Docker Support
126130

127-
- **Client-side environment management** — local Docker-based environment execution with copy-to/from support
128-
- **Claude adapter** — built-in adapter for Anthropic Claude computer use and Operator
129-
- **Gymnasium wrapper** — `gym.make()` compatibility for RL-style agent training loops
130-
- **Evaluator framework** — pluggable evaluators with structured logging and result export
131+
- **Client-side environment management** — local Docker-based environment execution with copy-to/from support.
132+
- **Claude adapter** — built-in adapter for Anthropic Claude computer use and Operator.
133+
- **Gymnasium wrapper** — `gym.make()` compatibility for RL-style agent training loops.
134+
- **Evaluator framework** — pluggable evaluators with structured logging and result export.
131135

132136
### Platform
133137

134-
- **Platform launch** — dashboard at hud.ai with authentication and evalset browsing
135-
- **API keys management** — create and manage API keys from the dashboard
136-
- **Profile & team pages** — user profiles with team membership and settings
138+
- **Platform launch** — dashboard at hud.ai with authentication and evalset browsing.
139+
- **API keys management** — create and manage API keys from the dashboard.
140+
- **Profile & team pages** — user profiles with team membership and settings.
137141
</Update>
138142

139-
<Update label="March 3, 2025" description="v0.1.0">
143+
<Update label="March 3, 2025">
140144
## Initial Release
141145

142-
- **Open-source SDK** — `pip install hud-python` for AI agent evaluation and RL environments
143-
- **Core primitives** — environments, tasks, evaluators, and runs as first-class objects
144-
- **Computer use actions** — keyboard, mouse, scroll, keyup/keydown, and hold-key actions for desktop environments
145-
- **Mintlify docs** — documentation site at docs.hud.ai
146+
- **Open-source SDK** — `pip install hud-python` for AI agent evaluation and RL environments.
147+
- **Core primitives** — environments, tasks, evaluators, and runs as first-class objects.
148+
- **Computer use actions** — keyboard, mouse, scroll, keyup/keydown, and hold-key actions for desktop environments.
146149
</Update>
