Commit 548a66a

rstrahan and claude authored
fix(test-studio): return partial result instead of raising when metrics not cached (#358) (#368)
get_test_results() in the test_results_resolver Lambda raised an unhandled ValueError ("Test run ... processing completed, evaluating results") when a run reached a terminal state but the evaluation aggregation never wrote testRunResult, i.e. aggregation is still running, timed out, or failed silently (reproduced on a 3463-document run). The exception surfaced as an opaque error and Test Studio spun on "Loading..." indefinitely.

Return a structured partial TestRun (true status, file counts, and metadata; metric fields omitted) instead of raising. The GraphQL TestRun type already makes every metric field nullable, so the partial response is schema-valid. This also prevents a single not-yet-aggregated run from failing an entire compareTestRuns request.

The deeper question of why aggregation can stall on very large runs is left as a documented follow-up.

Add a regression unit test covering the terminal-status / no-cached-metrics path.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
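The effect on batch callers can be illustrated with a small sketch. This is not the repository's compareTestRuns implementation: `accuracy_summary` is a hypothetical helper, and the run dicts are made-up stand-ins for what get_test_results returns. It only shows that a consumer reading metric fields with `.get()` sees `None` for a not-yet-aggregated run instead of the whole request failing.

```python
from typing import Any, Dict, List, Optional


def accuracy_summary(runs: List[Dict[str, Any]]) -> Dict[str, Optional[float]]:
    """Hypothetical helper: map each run ID to its overall accuracy.

    Runs whose metrics have not been aggregated yet map to None.
    """
    return {run["testRunId"]: run.get("overallAccuracy") for run in runs}


# Illustrative data: one fully aggregated run and one partial result
# (status and file counts present, metric fields omitted).
runs = [
    {"testRunId": "run-a", "status": "COMPLETE", "overallAccuracy": 0.93},
    {"testRunId": "run-b", "status": "COMPLETE", "filesCount": 3463},
]

summary = accuracy_summary(runs)
# Runs still waiting on aggregation can be reported separately.
pending = [run_id for run_id, accuracy in summary.items() if accuracy is None]
```

Under the previous behavior, the equivalent of `run-b` would have raised while being fetched, so no summary could be built at all.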
1 parent 52cd78e commit 548a66a

3 files changed

Lines changed: 71 additions & 5 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -32,6 +32,7 @@ SPDX-License-Identifier: MIT-0
 
 ### Fixed
 
+- **Test Studio results error for runs stuck in evaluation (#358)**`getTestRun` (the `test_results_resolver` Lambda) raised an unhandled `ValueError` ("Test run … processing completed, evaluating results") when a run reached a terminal state but the evaluation-aggregation step never cached `testRunResult` — e.g. when aggregation is still running, timed out, or failed silently on a large run (the reporter hit this with 3 463 documents). The exception surfaced as an opaque error and the run spun on "Loading…" forever in Test Studio. The resolver now returns a structured partial `TestRun` (true status plus file counts and metadata, metric fields omitted) instead of raising, so the UI renders the in-progress/terminal state gracefully. This also stops a single not-yet-aggregated run from failing an entire `compareTestRuns` request. (The separate question of *why* aggregation can stall on very large runs is tracked as a follow-up.)
 - **Configuration version list silently truncated past the first page (#354)**`ConfigurationManager.list_config_versions()` performed a single unpaginated `table.scan()` on the ConfigurationTable. Because a DynamoDB scan returns at most 1 MB per call, deployments with many config versions (e.g. 230+) only ever saw the ~58 that fit on the first page — uploaded-via-CLI and autotune-agent configs were invisible in the UI's View/Edit Configuration page and the upload-document config-version dropdown (the configs still worked when referenced by name). The method now paginates through `LastEvaluatedKey` so every version is returned. Fixes all callers (`update_configuration`, the AppSync `configuration_resolver`, `rules_discovery`, and the SDK).
 
 - **Build Info "update available" indicator broke against the public release bucket** — The `getLatestPublishedVersion` resolver discovered the newest published version by calling `ListObjectsV2` on the public artifacts bucket and parsing `idp-main_<version>.yaml` keys. That bucket grants `GetObject` only (no listing), so the check failed on real public deployments. `idp-cli publish` now writes a small pointer object — `<prefix>/idp-main-latest.json` (`{version, templateUrl}`) — at the version-stripped prefix on every release, and the resolver reads that one known key with a single `GetObject` (unsigned, falling back to signed), with a conventional `idp-main_<version>.yaml` URL fallback if the pointer omits one. No version parsing or `ListObjectsV2`. The check stays disabled when `PUBLIC_ARTIFACTS_BUCKET` is unset.
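The #354 entry in the context lines above describes paginating a scan through `LastEvaluatedKey`. A minimal sketch of that loop follows, assuming only the documented shape of a DynamoDB scan response (`Items`, optional `LastEvaluatedKey`) and the `ExclusiveStartKey` request parameter. `FakeTable` and `scan_all` are hypothetical names used so the example runs without AWS access; they are not the code in `ConfigurationManager`, and the fake's offset-based key is a simplification of a real primary-key dict.

```python
from typing import Any, Dict, List


class FakeTable:
    """Hypothetical stand-in for a boto3 DynamoDB Table; pages its items."""

    def __init__(self, items: List[Dict[str, Any]], page_size: int = 2) -> None:
        self._items = items
        self._page_size = page_size

    def scan(self, **kwargs: Any) -> Dict[str, Any]:
        # A real LastEvaluatedKey is a primary-key dict; an offset is used
        # here purely for illustration.
        start = kwargs.get("ExclusiveStartKey", {}).get("offset", 0)
        end = start + self._page_size
        page: Dict[str, Any] = {"Items": self._items[start:end]}
        if end < len(self._items):
            page["LastEvaluatedKey"] = {"offset": end}
        return page


def scan_all(table: Any) -> List[Dict[str, Any]]:
    """Collect every item by following LastEvaluatedKey until it is absent."""
    items: List[Dict[str, Any]] = []
    kwargs: Dict[str, Any] = {}
    while True:
        page = table.scan(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return items
        # Resume the next scan where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key


versions = scan_all(FakeTable([{"Version": f"v{i}"} for i in range(5)]))
```

A single `table.scan()` call against the fake would return only the first two items, which mirrors the truncation the changelog entry describes.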

lib/idp_common_pkg/tests/unit/test_results_resolver.py

Lines changed: 45 additions & 0 deletions
@@ -213,6 +213,51 @@ def test_build_config_comparison():
     assert "temperature" in [item["setting"] for item in config_diff]
 
 
+@pytest.mark.unit
+def test_get_test_results_missing_metrics_returns_partial_not_raises():
+    """When processing reached a terminal state but the evaluation aggregation
+    never cached testRunResult (timed out / failed silently on a large run),
+    get_test_results returns a structured partial TestRun instead of raising an
+    opaque ValueError that leaves the UI spinning on "Loading..." (issue #358)."""
+    test_run_id = "TEST-SET-ID"
+    metadata = {
+        "PK": f"testrun#{test_run_id}",
+        "SK": "metadata",
+        # Already terminal, so the status-refresh branch is skipped and we fall
+        # straight through to the "no cached metrics" else branch.
+        "Status": "COMPLETE",
+        "TestSetId": "set-1",
+        "TestSetName": "big-classification-set",
+        "FilesCount": 3463,
+        "CompletedFiles": 3460,
+        "FailedFiles": 3,
+        "CreatedAt": "2025-01-01T00:00:00Z",
+        "Context": "ctx",
+        "ConfigVersion": "v7",
+        # No "testRunResult" key -> aggregation hasn't written metrics yet.
+    }
+
+    mock_table = Mock()
+    mock_table.get_item.return_value = {"Item": metadata}
+
+    with (
+        patch.dict(os.environ, {"TRACKING_TABLE": "tracking"}),
+        patch.object(index.dynamodb, "Table", return_value=mock_table),
+    ):
+        result = index.get_test_results(test_run_id)
+
+    assert result["testRunId"] == test_run_id
+    # Reports the true terminal status rather than fabricating one.
+    assert result["status"] == "COMPLETE"
+    assert result["filesCount"] == 3463
+    assert result["completedFiles"] == 3460
+    assert result["failedFiles"] == 3
+    assert result["testSetId"] == "set-1"
+    assert result["configVersion"] == "v7"
+    # Metric fields are absent (not yet computed) but must not be required.
+    assert "overallAccuracy" not in result or result["overallAccuracy"] is None
+
+
 @pytest.mark.unit
 def test_handler_field_routing():
     """Test GraphQL field routing"""

nested/appsync/src/lambda/test_results_resolver/index.py

Lines changed: 25 additions & 5 deletions
@@ -408,16 +408,36 @@ def get_test_results(test_run_id):
             "config": _get_test_run_config(test_run_id),
         }
     else:
-        # Provide more specific message for ABORTED status
+        # No aggregate metrics have been cached yet. This happens when all
+        # files finished processing but the evaluation aggregation step hasn't
+        # written testRunResult (still running, or it timed out / failed on a
+        # large run). Don't raise — that surfaces as an opaque error and the UI
+        # spins on "Loading..." forever. Return a structured partial TestRun so
+        # the UI can render the in-progress status instead.
         if current_status == "ABORTED":
-            raise ValueError(
-                f"Test run {test_run_id} aborted, evaluating results for completed documents"
+            logger.info(
+                f"Test run {test_run_id} aborted; aggregate metrics not yet available"
             )
         else:
-            raise ValueError(
-                f"Test run {test_run_id} processing completed, evaluating results"
+            logger.info(
+                f"Test run {test_run_id} processing complete; "
+                "aggregate metrics not yet available (evaluation in progress)"
             )
 
+        return {
+            "testRunId": test_run_id,
+            "testSetId": metadata.get("TestSetId"),
+            "testSetName": metadata.get("TestSetName"),
+            "status": current_status,
+            "filesCount": metadata.get("FilesCount", 0),
+            "completedFiles": metadata.get("CompletedFiles", 0),
+            "failedFiles": metadata.get("FailedFiles", 0),
+            "createdAt": _format_datetime(metadata.get("CreatedAt")),
+            "completedAt": _format_datetime(metadata.get("CompletedAt")),
+            "context": metadata.get("Context"),
+            "configVersion": metadata.get("ConfigVersion"),
+        }
+
 
 def _query_test_runs_from_gsi(table, start_iso, end_iso):
     """Query test runs from TypeDateIndex GSI instead of scanning the full table.
