Agenta-AI
diff --git a/api/pyproject.toml b/api/pyproject.toml
Lines changed: 1 addition & 1 deletion
diff --git a/api/uv.lock b/api/uv.lock
Lines changed: 3 additions & 3 deletions
diff --git a/clients/python/pyproject.toml b/clients/python/pyproject.toml
Lines changed: 1 addition & 1 deletion
diff --git a/clients/python/uv.lock b/clients/python/uv.lock
Lines changed: 1 addition & 1 deletion
diff --git a/docs/design/annotation-queue-v2/rfc-v2.md b/docs/design/annotation-queue-v2/rfc-v2.md
Lines changed: 54 additions & 22 deletions
diff --git a/docs/designs/observability-cell-preview/rfc.md b/docs/designs/observability-cell-preview/rfc.md
Lines changed: 13 additions & 13 deletions
diff --git a/hosting/kubernetes/helm/Chart.yaml b/hosting/kubernetes/helm/Chart.yaml
Lines changed: 2 additions & 2 deletions
diff --git a/sdks/python/pyproject.toml b/sdks/python/pyproject.toml
Lines changed: 1 addition & 1 deletion
diff --git a/sdks/python/uv.lock b/sdks/python/uv.lock
Lines changed: 2 additions & 2 deletions
diff --git a/services/pyproject.toml b/services/pyproject.toml
Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [project]
 name = "api"
-version = "0.101.0"
+version = "0.101.1"
 description = "Agenta API"
 requires-python = ">=3.11,<3.14"
 authors = [
 
@@ -1,6 +1,6 @@
 [project]
 name = "agenta-client"
-version = "0.101.0"
+version = "0.101.1"
 description = "Fern-generated Python client for the Agenta API."
 requires-python = ">=3.11,<3.14"
 authors = [
 
@@ -129,7 +129,7 @@ The consumer layer is a **convenience API** and a **set of UI views** that orche
    - An EvaluationQueue with optional user assignments
 4. Annotator works through rows → fills in labels → submits
 5. On submit: same annotation creation + result linking as today
-6. **Write-back step** (separate action): User clicks "Save annotations to test set" → creates a new test set revision with annotation values as new columns
+6. **Write-back step** (separate action): User clicks "Add to Testset" → each annotated row is matched to its existing test-set row (by testcase id, falling back to `testcase_dedup_id`), updated **in place** with the annotation columns, and committed as a new revision. See [Write Back / Save as Test Set](#write-back--save-as-test-set) for the identity model and matching rules.
 
 **Key design choice: annotating ≠ modifying the test set.** The annotation step creates annotation traces (OTel spans). These reference the test cases but don't modify them. Writing back to the test set is a separate, explicit action that creates a new revision. This preserves test case immutability and versioning.
 
@@ -264,26 +264,58 @@ Uses existing endpoints — no change needed:
 
 ### Write Back / Save as Test Set
 
-```
-POST /annotation-queues/{queue_id}/export
-{
-  // For testset-sourced queues: create new revision with annotation columns
-  "target": "testset_revision",
-  "column_mapping": {
-    "correctness": "is_correct",
-    "quality": "quality_score"
-  }
-
-  // For trace-sourced queues: create new test set from annotated traces
-  // "target": "new_testset",
-  // "name": "Curated Q1 traces",
-  // "include_annotations_as_columns": true
-}
-```
-
-The endpoint name is `export` rather than `write-back` to better reflect that it works for both directions: writing annotations back to an existing test set (new revision) or creating an entirely new test set from annotated traces.
-
-**Who triggers this:** The queue creator/admin, not individual annotators. It's a one-time action available on the queue detail page.
+The user clicks **"Add to Testset"** on the queue and either appends to an
+existing test set (new revision) or creates a new one. **As implemented this is a
+client-side operation**, not a backend export: the FE resolves the target's
+latest revision, computes a row delta, and commits a new revision via
+`POST /testsets/revisions/commit`. (The originally-proposed
+`POST /annotation-queues/{queue_id}/export` endpoint was not built — the FE owns
+the delta.)
+
+**Identity model.** FE-created annotation queues are **testcase-id-backed**, not
+testset-revision-backed: the queue references each row by its testcase id and the
+testcase blob's stable `testcase_dedup_id`. Test cases are immutable, so any
+update mints a new testcase id — `testcase_dedup_id` is the only key that survives
+across revisions, and only if it is preserved on every write.
+
+**Behavior (existing test set):**
+
+1. Base the commit on the test set's **latest _non-archived_ revision**.
+2. Match each annotated row to an existing row by **testcase id, falling back to
+   `testcase_dedup_id`** — the id match works on the first save; the dedup
+   fallback carries the match after a prior save reassigned the id.
+3. **Replace** on match, **add** on miss, and **preserve `testcase_dedup_id`** on
+   every replaced row so the lineage stays matchable for the next save.
+4. **Skip unchanged rows** (deep-equal vs the base row), and skip the commit
+   entirely when the resulting delta is empty — re-saving with nothing changed is
+   a no-op (no churn revision).
+
+The annotated row is updated **in place** (new testcase id, same dedup); the row
+count stays stable instead of growing.
+
+**For trace-sourced queues** there is no source test set, so the action always
+creates a new test set from the annotated rows.
+
+**Who triggers this:** the queue creator/admin, not individual annotators. It's a
+one-time action available on the queue detail page.
+
+#### Status & known constraints (AGE-3761)
+
+- **Fixed.** The first implementation committed with blind `add`, appending every
+  annotated row → duplicates. Two further traps were fixed: base rows were read
+  through `normalizeRevision`, which strips `testcase_dedup_id` (so the dedup
+  fallback silently never fired and the second save duplicated) — base rows are
+  now read **raw**; and "latest" was resolved via `retrieve {testset_ref}`, which
+  returns **archived** revisions — it's now resolved via the archived-excluding
+  `query` path.
+- **Not FE-fixable.** A test set whose rows already carry **duplicate or missing
+  `testcase_dedup_id`s** from earlier corruption cannot be cleaned by FE matching
+  (the dedup→row mapping is ambiguous). The durable fix is backend-owned: a stable,
+  unique testcase identity preserved on every write, or an upsert-by-stable-key
+  primitive so the FE never computes the delta.
+- **Deferred — multi-testset queues.** When a queue's rows originate from more than
+  one test set, routing each annotated row back to its source test set is future
+  work; the current single-target modal is kept.
 
 ---
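
The matching rules in the "Write Back / Save as Test Set" section above (match by testcase id, fall back to `testcase_dedup_id`, replace on match, add on miss, skip unchanged rows, skip the commit on an empty delta) can be sketched roughly as below. This is a minimal illustration, not the shipped FE code: the `Row` and `Delta` shapes, `deepEqual`, and `computeDelta` are hypothetical names, and it assumes the base rows are read raw so `testcase_dedup_id` is still present. It also assumes a replaced row is addressed by the base row's current id, with the backend minting the new one.

```typescript
// Hypothetical row shape; only `testcase_dedup_id` is a name taken from the RFC.
type Row = {
  id: string
  testcase_dedup_id: string | null
  data: Record<string, unknown>
}

// Hypothetical delta shape: rows to update in place and rows to append.
type Delta = { replace: Row[]; add: Row[] }

// Structural equality for JSON-like values (stand-in for the deep-equal check).
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) return false
  if (Array.isArray(a) !== Array.isArray(b)) return false
  const keysA = Object.keys(a)
  const keysB = Object.keys(b)
  if (keysA.length !== keysB.length) return false
  return keysA.every(
    (k) =>
      Object.prototype.hasOwnProperty.call(b, k) &&
      deepEqual((a as Record<string, unknown>)[k], (b as Record<string, unknown>)[k]),
  )
}

// Returns null when nothing changed, so the caller can skip the commit entirely.
function computeDelta(baseRows: Row[], annotatedRows: Row[]): Delta | null {
  const byId = new Map<string, Row>()
  const byDedup = new Map<string, Row>()
  for (const row of baseRows) {
    byId.set(row.id, row)
    if (row.testcase_dedup_id !== null) byDedup.set(row.testcase_dedup_id, row)
  }

  const delta: Delta = { replace: [], add: [] }
  for (const row of annotatedRows) {
    // The id match works on the first save; the dedup fallback covers later saves.
    const match =
      byId.get(row.id) ??
      (row.testcase_dedup_id !== null ? byDedup.get(row.testcase_dedup_id) : undefined)

    if (!match) {
      delta.add.push(row) // miss: append as a new row
      continue
    }
    if (deepEqual(match.data, row.data)) continue // unchanged: skip

    // Replace in place and keep the base row's dedup id so the lineage stays matchable.
    delta.replace.push({
      ...row,
      id: match.id,
      testcase_dedup_id: match.testcase_dedup_id ?? row.testcase_dedup_id,
    })
  }
  return delta.replace.length + delta.add.length === 0 ? null : delta
}
```

A caller would pass the raw rows of the latest non-archived revision as `baseRows` and only issue the commit when the result is non-null. Note that the two lookup maps assume each id and each dedup id appears at most once in the base; with duplicate dedup ids the later row silently wins, which is the ambiguity the "Not FE-fixable" item describes.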
 
@@ -442,7 +474,7 @@ The metadata-on-traces approach (tagging spans with review status) was considere
 
 3. **Queue visibility in eval runs:** When an eval run has human evaluators, is the auto-created queue visible in the eval run detail view? Or is it fully hidden?
 
-4. **Write-back granularity:** When writing annotations back to a test set, does the user choose which annotation fields become columns? Or do all fields from all evaluators get written back?
+4. **Write-back granularity:** ~~When writing annotations back to a test set, does the user choose which annotation fields become columns?~~ **Resolved (as shipped):** all annotation outputs from the queue-scoped evaluators are written back as columns (keyed per evaluator, e.g. `quality-rating`); there is no per-field picker. The export is scoped to the active queue's annotations so other queues' annotations on the same testcase don't bleed in.
 
 5. **Queue lifecycle:** Do annotation queues have a lifecycle (draft → active → completed)? Or are they always active and implicitly complete when all items are annotated?
 
 
@@ -16,7 +16,7 @@ workflow state. The user has to open the drawer to see anything useful.
 This RFC adds a second heuristic layer between chat detection and the raw JSON
 fallback. For span values whose shape we recognize, an extractor pulls a small
 subset of fields and the cell renders only that subset using the existing
-beautified key/value view. Everything else still falls through to raw JSON.
+pretty key/value view. Everything else still falls through to raw JSON.
 
 The two existing detector calls (one for chat, one implicit for JSON) become
 internal rules of a single dispatcher. The cell stops chaining nullable checks
@@ -44,9 +44,9 @@ both the data to render and the renderer to use.
 
 ```ts
 type Preview =
-  | { renderer: "chat";       data: unknown[];                  source: string }
-  | { renderer: "beautified"; data: Record<string, unknown>;    source: string }
-  | { renderer: "json";       data: unknown;                    source: string }
+  | { renderer: "chat";   data: unknown[];               source: string }
+  | { renderer: "pretty"; data: Record<string, unknown>; source: string }
+  | { renderer: "json";   data: unknown;                 source: string }
 
 export function extractPreview(
   value: unknown,
@@ -60,9 +60,9 @@ The cell becomes a single switch.
 function SmartCellContent({value}: {value: unknown}) {
   const preview = extractPreview(value)
   switch (preview.renderer) {
-    case "chat":       return <ChatCell value={preview.data} />
-    case "beautified": return <BeautifiedJsonCell value={preview.data} />
-    case "json":       return <JsonCell value={preview.data} />
+    case "chat":   return <ChatCell value={preview.data} />
+    case "pretty": return <PrettyJsonCell value={preview.data} />
+    case "json":   return <JsonCell value={preview.data} />
   }
 }
 ```
@@ -84,7 +84,7 @@ type Rule =
       extract: (v: unknown, ctx: { side?: "input" | "output" }) => unknown[] | null
     }
   | {
-      kind: "beautified"
+      kind: "pretty"
       name: string
       extract: (v: unknown, ctx: { side?: "input" | "output" }) => Record<string, unknown> | null
     }
@@ -158,10 +158,10 @@ behavior automatically because they go through `SmartCellContent`.
 `extractChatMessages` stays as an internal helper that the chat rule wraps.
 External callers that import it directly keep working unchanged.
 
-`JsonCellContent.beautified` already exists. The dispatcher uses it as the
-"beautified" renderer. The `SmartCellContent.beautifyJson` prop remains as a
+`JsonCellContent.pretty` already exists. The dispatcher uses it as the
+"pretty" renderer. The `SmartCellContent.prettyJson` prop remains as a
 caller opt-in for the JSON fallback path (used by `ScenarioListView` to force
-beautified rendering on the raw-JSON branch).
+pretty rendering on the raw-JSON branch).
 
 `LastInputMessageCell` uses the dispatcher and renders only the last message
 when the chat rule matches. For non-chat values it delegates to
@@ -173,8 +173,8 @@ when the chat rule matches. For non-chat values it delegates to
   hookable comes later.
 - Rules that need span type or other span context. Shape-only for the first
   pass.
-- Aligning the cell's beautified styling with the drawer's
-  `BeautifiedJsonView`. The cell version is the lightweight one we already
+- Aligning the cell's pretty styling with the drawer's
+  `PrettyJsonView`. The cell version is the lightweight one we already
   have. Visual parity with the drawer is a separate piece of work.
 - Sharing rules with the drawer. The drawer has its own structure today. If
   the rule set proves valuable, we lift it out and reuse it.
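
The dispatcher described in this RFC can be sketched as an ordered rule list where the first rule returning a non-null value wins and raw JSON is the fallback. This is only an illustration under stated assumptions: the two example rules (`messages-array`, `status-fields`), the `isRecord` helper, the `Ctx` alias, and the `"fallback"` source label are hypothetical, and the sketch assumes `source` carries the name of the rule that matched. Only the `Preview` and `Rule` shapes follow the snippets in the diff.

```typescript
type Ctx = { side?: "input" | "output" }

type Preview =
  | { renderer: "chat"; data: unknown[]; source: string }
  | { renderer: "pretty"; data: Record<string, unknown>; source: string }
  | { renderer: "json"; data: unknown; source: string }

type Rule =
  | { kind: "chat"; name: string; extract: (v: unknown, ctx: Ctx) => unknown[] | null }
  | {
      kind: "pretty"
      name: string
      extract: (v: unknown, ctx: Ctx) => Record<string, unknown> | null
    }

const isRecord = (v: unknown): v is Record<string, unknown> =>
  typeof v === "object" && v !== null && !Array.isArray(v)

// Hypothetical rules for illustration; the real rule set is defined by the RFC.
const rules: Rule[] = [
  {
    kind: "chat",
    name: "messages-array",
    // Matches a non-empty array whose items all look like chat messages.
    extract: (v: unknown) =>
      Array.isArray(v) && v.length > 0 && v.every((m: unknown) => isRecord(m) && "role" in m)
        ? v
        : null,
  },
  {
    kind: "pretty",
    name: "status-fields",
    // Pulls a small subset of fields out of a recognized object shape.
    extract: (v: unknown) => (isRecord(v) && "status" in v ? { status: v["status"] } : null),
  },
]

// First matching rule wins; anything unrecognized falls through to raw JSON.
function extractPreview(value: unknown, ctx: Ctx = {}): Preview {
  for (const rule of rules) {
    if (rule.kind === "chat") {
      const data = rule.extract(value, ctx)
      if (data) return { renderer: "chat", data, source: rule.name }
    } else {
      const data = rule.extract(value, ctx)
      if (data) return { renderer: "pretty", data, source: rule.name }
    }
  }
  return { renderer: "json", data: value, source: "fallback" }
}
```

Because `extractPreview` always returns a `Preview`, the cell's `switch` on `preview.renderer` needs no nullable checks, which is the simplification the RFC is after.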
 
@@ -2,8 +2,8 @@ apiVersion: v2
 name: agenta
 description: A Helm chart for deploying Agenta (OSS or EE) on Kubernetes
 type: application
-version: 0.101.0
-appVersion: "v0.101.0"
+version: 0.101.1
+appVersion: "v0.101.1"
 keywords:
   - agenta
   - llm
 
@@ -1,6 +1,6 @@
 [project]
 name = "agenta"
-version = "0.101.0"
+version = "0.101.1"
 description = "The SDK for agenta is an open-source LLMOps platform."
 readme = "README.md"
 requires-python = ">=3.11,<3.14"
 
@@ -1,6 +1,6 @@
 [project]
 name = "services"
-version = "0.101.0"
+version = "0.101.1"
 description = "Agenta Services (Chat & Completion)"
 requires-python = ">=3.11,<3.14"
 authors = [