docs/design/annotation-queue-v2/rfc-v2.md
Lines changed: 54 additions & 22 deletions
Original file line number
Diff line number
Diff line change
@@ -129,7 +129,7 @@ The consumer layer is a **convenience API** and a **set of UI views** that orche
129
129
- An EvaluationQueue with optional user assignments
130
130
4. Annotator works through rows → fills in labels → submits
131
131
5. On submit: same annotation creation + result linking as today
132
-
6. **Write-back step** (separate action): User clicks "Save annotations to test set" → creates a new test set revision with annotation values as new columns
132
+
6. **Write-back step** (separate action): User clicks "Add to Testset" → each annotated row is matched to its existing test-set row (by testcase id, falling back to `testcase_dedup_id`), updated **in place** with the annotation columns, and committed as a new revision. See [Write Back / Save as Test Set](#write-back--save-as-test-set) for the identity model and matching rules.
133
133
134
134
**Key design choice: annotating ≠ modifying the test set.** The annotation step creates annotation traces (OTel spans). These reference the test cases but don't modify them. Writing back to the test set is a separate, explicit action that creates a new revision. This preserves test case immutability and versioning.
// For testset-sourced queues: create new revision with annotation columns
271
-
"target": "testset_revision",
272
-
"column_mapping": {
273
-
"correctness": "is_correct",
274
-
"quality": "quality_score"
275
-
}
276
-
277
-
// For trace-sourced queues: create new test set from annotated traces
278
-
// "target": "new_testset",
279
-
// "name": "Curated Q1 traces",
280
-
// "include_annotations_as_columns": true
281
-
}
282
-
```
283
-
284
-
The endpoint name is `export` rather than `write-back` to better reflect that it works for both directions: writing annotations back to an existing test set (new revision) or creating an entirely new test set from annotated traces.
285
-
286
-
**Who triggers this:** The queue creator/admin, not individual annotators. It's a one-time action available on the queue detail page.
267
+
The user clicks **"Add to Testset"** on the queue and either appends to an
268
+
existing test set (new revision) or creates a new one. **As implemented this is a
269
+
client-side operation**, not a backend export: the FE resolves the target's
270
+
latest revision, computes a row delta, and commits a new revision via
271
+
`POST /testsets/revisions/commit`. (The originally-proposed
272
+
`POST /annotation-queues/{queue_id}/export` endpoint was not built — the FE owns
273
+
the delta.)
274
+
275
+
**Identity model.** FE-created annotation queues are **testcase-id-backed**, not
276
+
testset-revision-backed: the queue references each row by its testcase id and the
277
+
testcase blob's stable `testcase_dedup_id`. Test cases are immutable, so any
278
+
update mints a new testcase id — `testcase_dedup_id` is the only key that survives
279
+
across revisions, and only if it is preserved on every write.
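A minimal sketch of the identity model, assuming a hypothetical `Testcase` shape and a stand-in `mintTestcaseId` helper (neither is taken from the codebase; only `testcase_dedup_id` comes from the RFC). It illustrates that an update yields a new testcase id while the dedup key is carried over unchanged:

```typescript
// Hypothetical shapes for illustration; only the `testcase_dedup_id` key is from the RFC.
const DEDUP_KEY = "testcase_dedup_id";

type TestcaseData = Record<string, unknown>;

interface Testcase {
  id: string; // replaced on every write, because test cases are immutable
  data: TestcaseData; // the testcase blob, which carries the dedup key
}

// Stand-in for id minting (assumed here to happen wherever test cases are persisted).
let nextId = 0;
function mintTestcaseId(): string {
  nextId += 1;
  return `tc-${nextId}`;
}

// "Updating" a test case produces a new object with a new id.
// The dedup key of the previous version always wins, even if the patch tries to set it.
function updateTestcase(prev: Testcase, patch: TestcaseData): Testcase {
  const data: TestcaseData = { ...prev.data, ...patch };
  if (prev.data[DEDUP_KEY] !== undefined) {
    data[DEDUP_KEY] = prev.data[DEDUP_KEY];
  }
  return { id: mintTestcaseId(), data };
}
```

If any write path drops the key instead of copying it, the lineage can no longer be followed across revisions.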
280
+
281
+
**Behavior (existing test set):**
282
+
283
+
1. Base the commit on the test set's **latest _non-archived_ revision**.
284
+
2. Match each annotated row to an existing row by **testcase id, falling back to
285
+
`testcase_dedup_id`** — the id match works on the first save; the dedup
286
+
fallback carries the match after a prior save reassigned the id.
287
+
3. **Replace** on match, **add** on miss, and **preserve `testcase_dedup_id`** on
288
+
every replaced row so the lineage stays matchable for the next save.
289
+
4. **Skip unchanged rows** (deep-equal vs the base row), and skip the commit
290
+
entirely when the resulting delta is empty — re-saving with nothing changed is
291
+
a no-op (no churn revision).
292
+
293
+
The annotated row is updated **in place** (new testcase id, same dedup); the row
294
+
count stays stable instead of growing.
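The matching steps above can be sketched as follows. This is an illustration under stated assumptions, not the shipped implementation: the `Row` and `Delta` shapes, `computeDelta`, and `deepEqual` are hypothetical names, and the real payload of `POST /testsets/revisions/commit` is not reproduced here.

```typescript
// Hypothetical shapes; the actual commit payload may differ.
type RowData = Record<string, unknown>;
interface Row { id: string; data: RowData; }
interface Delta {
  replace: { baseId: string; data: RowData }[];
  add: RowData[];
}

const DEDUP_KEY = "testcase_dedup_id";

// Structural equality for plain JSON-like values.
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) return false;
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const ka = Object.keys(a as object);
  const kb = Object.keys(b as object);
  if (ka.length !== kb.length) return false;
  return ka.every(
    (k) =>
      Object.prototype.hasOwnProperty.call(b, k) &&
      deepEqual((a as RowData)[k], (b as RowData)[k]),
  );
}

// Returns null when nothing changed, so the caller can skip the commit entirely.
function computeDelta(baseRows: Row[], annotated: Row[]): Delta | null {
  const byId = new Map<string, Row>();
  const byDedup = new Map<string, Row>();
  for (const r of baseRows) {
    byId.set(r.id, r);
    const d = r.data[DEDUP_KEY];
    if (typeof d === "string") byDedup.set(d, r);
  }

  const delta: Delta = { replace: [], add: [] };
  for (const row of annotated) {
    const dedup = row.data[DEDUP_KEY];
    // Step 2: match by testcase id first, then fall back to the dedup id.
    const match =
      byId.get(row.id) ?? (typeof dedup === "string" ? byDedup.get(dedup) : undefined);
    if (!match) {
      delta.add.push(row.data); // Step 3: add on miss.
      continue;
    }
    // Step 3: replace on match, keeping the base row's dedup id.
    const next: RowData = { ...row.data };
    if (match.data[DEDUP_KEY] !== undefined) next[DEDUP_KEY] = match.data[DEDUP_KEY];
    // Step 4: skip rows that are deep-equal to the base row.
    if (deepEqual(next, match.data)) continue;
    delta.replace.push({ baseId: match.id, data: next });
  }
  return delta.replace.length + delta.add.length === 0 ? null : delta;
}
```

Returning `null` for an empty delta keeps the no-op case explicit at the call site, which is where the commit request would be skipped.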
295
+
296
+
**For trace-sourced queues** there is no source test set, so the action always
297
+
creates a new test set from the annotated rows.
298
+
299
+
**Who triggers this:** the queue creator/admin, not individual annotators. It's a
300
+
one-time action available on the queue detail page.
301
+
302
+
#### Status & known constraints (AGE-3761)
303
+
304
+
- **Fixed.** The first implementation committed with blind `add`, appending every
305
+
annotated row → duplicates. Two further traps were fixed: base rows were read
306
+
through `normalizeRevision`, which strips `testcase_dedup_id` (so the dedup
307
+
fallback silently never fired and the second save duplicated) — base rows are
308
+
now read **raw**; and "latest" was resolved via `retrieve {testset_ref}`, which
309
+
returns **archived** revisions — it's now resolved via the archived-excluding
310
+
`query` path.
311
+
- **Not FE-fixable.** A test set whose rows already carry **duplicate or missing
312
+
`testcase_dedup_id`s** from earlier corruption cannot be cleaned by FE matching
313
+
(the dedup→row mapping is ambiguous). The durable fix is backend-owned: a stable,
314
+
unique testcase identity preserved on every write, or an upsert-by-stable-key
315
+
primitive so the FE never computes the delta.
316
+
- **Deferred: multi-testset queues.** When a queue's rows originate from more than
317
+
one test set, routing each annotated row back to its source test set is future
318
+
work; the current single-target modal is kept.
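To make the "not FE-fixable" case concrete, here is a small sketch of how such rows could be detected before attempting a match. The `Row` shape and the `auditDedupIds` helper are hypothetical and only illustrate why the dedup-to-row mapping becomes ambiguous; they are not part of the shipped code.

```typescript
// Hypothetical row shape; only the `testcase_dedup_id` key is from the RFC.
interface Row { id: string; data: Record<string, unknown>; }

interface DedupReport {
  missing: string[]; // testcase ids of rows with no usable dedup id
  duplicated: Record<string, string[]>; // dedup id -> testcase ids that share it
}

function auditDedupIds(rows: Row[]): DedupReport {
  const seen = new Map<string, string[]>();
  const missing: string[] = [];
  for (const row of rows) {
    const d = row.data["testcase_dedup_id"];
    if (typeof d !== "string" || d === "") {
      missing.push(row.id);
      continue;
    }
    const ids = seen.get(d) ?? [];
    ids.push(row.id);
    seen.set(d, ids);
  }
  const duplicated: Record<string, string[]> = {};
  seen.forEach((ids, d) => {
    // More than one row per dedup id means a fallback match cannot pick a unique target.
    if (ids.length > 1) duplicated[d] = ids;
  });
  return { missing, duplicated };
}
```

A non-empty report does not repair anything; it only signals that client-side matching cannot be trusted for those rows.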
287
319
288
320
---
289
321
@@ -442,7 +474,7 @@ The metadata-on-traces approach (tagging spans with review status) was considere
442
474
443
475
3. **Queue visibility in eval runs:** When an eval run has human evaluators, is the auto-created queue visible in the eval run detail view? Or is it fully hidden?
444
476
445
-
4. **Write-back granularity:** When writing annotations back to a test set, does the user choose which annotation fields become columns? Or do all fields from all evaluators get written back?
477
+
4. **Write-back granularity:** ~~When writing annotations back to a test set, does the user choose which annotation fields become columns?~~ **Resolved (as shipped):** all annotation outputs from the queue-scoped evaluators are written back as columns (keyed per evaluator, e.g. `quality-rating`); there is no per-field picker. The export is scoped to the active queue's annotations so other queues' annotations on the same testcase don't bleed in.
446
478
447
479
5. **Queue lifecycle:** Do annotation queues have a lifecycle (draft → active → completed)? Or are they always active and implicitly complete when all items are annotated?
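The resolution of the write-back granularity question (item 4) can be sketched as below. The `Annotation` shape and its field names (`queueId`, `testcaseId`, `evaluatorSlug`, `output`) are assumptions made for illustration; the sketch only shows queue scoping and per-evaluator column keys.

```typescript
// Hypothetical annotation shape; field names are assumptions, not the real schema.
interface Annotation {
  queueId: string;
  testcaseId: string;
  evaluatorSlug: string; // e.g. "quality-rating"
  output: unknown;
}

// Collects one column per evaluator for a single testcase,
// ignoring annotations that belong to other queues.
function annotationColumns(
  annotations: Annotation[],
  queueId: string,
  testcaseId: string,
): Record<string, unknown> {
  const columns: Record<string, unknown> = {};
  for (const a of annotations) {
    if (a.queueId !== queueId || a.testcaseId !== testcaseId) continue;
    columns[a.evaluatorSlug] = a.output;
  }
  return columns;
}
```

Filtering on the queue id first is what keeps another queue's `quality-rating` on the same testcase from overwriting the active queue's value.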