add Optimize-experiment skill for Claude by yangm2 · Pull Request #375 · codeforpdx/tenantfirstaid

yangm2 · 2026-06-30T08:10:01Z

What type of PR is this? (check all applicable)

Description

This adds an /optimize-experiment <experiment-name> skill with the associated tools so that Claude can do a static analysis of traces from an experiment. It can also compare the tool calls/results from an experiment on a new datastore to determine whether hacks/workarounds in the system prompt can be removed.

Related Tickets & Documents

Related Issue weak dependency on remove some STOPGAPs #373
Closes #

QA Instructions, Screenshots, Recordings

Please replace this line with instructions on how to test your changes, a note on the devices and browsers this has been tested on, as well as any relevant images for UI changes.

Added/updated tests?

Yes
No, and this is why: please replace this line with details on why tests have not been included
I need help with writing tests

Documentation

If this PR changes the system architecture, Architecture.md has been updated

[optional] Are there any post deployment tasks we need to perform?

…dule import to first use This fixes proxy issues seen when Claude runs tests in it's sandbox

yangm2 · 2026-06-30T08:16:31Z

@claude code-review

claude · 2026-06-30T08:17:11Z

Claude finished @yangm2's task in 2m 36s —— View job

Code Review

Read PR diff and changed files
Review langsmith_dataset.py (main additions)
Review evaluators / variance / runner changes
Review SKILL.md and skill docs
Review tests

Overall this is a well-structured, well-documented, and thoroughly-tested PR. The lazy-evaluator refactor is a clean fix for the import-time network side effect, and the STOPGAP retrieval-check tooling is carefully reasoned (the trace-check vs. live-requery short-circuit logic is genuinely thoughtful). Below are mostly minor observations — nothing blocking.

Lazy evaluator construction (`langsmith_evaluators.py`, `measure_evaluator_variance.py`, `run_langsmith_evaluation.py`)

✅ Converting _EVALUATOR_JUDGE and the three LLM-judge evaluators into @cached factory functions correctly defers the httpx client construction to first use. The measure_evaluator_variance.py change to build evaluators only after confirming there's work to do (after the no-runs / no-matching-scenario early returns) is a nice touch, and the comments explain the why well.

Minor: the active evaluator list in run_langsmith_evaluation.py now mixes call-style (legal_correctness_evaluator()) with the still-commented object-style heuristic evaluators (# citation_format_evaluator,). That's intentional (heuristics have no network side effect), but a future maintainer uncommenting one might be unsure whether to add (). A one-line comment noting "LLM judges are factories, heuristics are plain objects" would prevent the confusion.

`_datastore_last_update_time` — silent broad exception (`langsmith_dataset.py`)

    except Exception:
        return None

Falling open to None (and thus re-querying) is the right safety behavior, but swallowing every exception with no logging means a real misconfiguration (bad import path, auth failure, API change) silently disables the skip optimization forever with no diagnostic — the command just quietly does extra live Vertex calls. Consider logging at debug/warning so the degradation is observable. Note this function doesn't receive the log param that its sibling _datastore_unchanged_since_experiment has.

`_parse_stopgaps` — tight coupling to prompt formatting (`langsmith_dataset.py`)

The parser depends on exact markdown: the label must end with .**, and continuation targets are only picked up from lines starting with exactly two spaces (lines[j].startswith(" ")). If system_prompt.md is ever reformatted (tabs, different indent, or a label with no trailing period), STOPGAPs would be missed. The zero-match case fails loudly (sys.exit(1)), but a partial parse (label found, continuation targets silently dropped) would fail quietly and skew hit rates. Given the format is controlled in-repo this is acceptable, but a brief note in the docstring that the parser assumes the canonical STOPGAP format would help.

ORS relevance filter is a loose substring match (`_check_retrieval_from_traces`)

ors_match = re.search(r"\b(\d+\.\d+)", sg["label"])
...
if any(ors_num in fact for fact in facts)

ors_num (e.g. 90.425) is matched as a bare substring of each fact, so a fact mentioning 90.4251 would also match 90.425. Low risk given real ORS numbers, but a word-boundary match (e.g. re.search(rf"\b{re.escape(ors_num)}\b", fact)) would be more precise.

Tests

✅ Excellent coverage — the _collect_experiment_traces, _check_retrieval_from_traces, and datastore short-circuit tests cover the meaningful branches (repetition grouping, orphan trace_ids, empty outputs, content-key fallback, whitespace normalization, coverage-gap signaling, naive/aware UTC, fall-open). The lambda: mock_evaluator updates in test_measure_evaluator_variance.py correctly adapt to the new factory contract.

Note: I was unable to execute the test suite (sandbox blocked uv run pytest), so the above is from static review only.

These are all minor/non-blocking. Nice work — the documentation quality (both code comments and the SKILL.md) is notably high, and the shmoo / search references in the skill check out against scripts/vertex_ai_search.py.
· optimize-experiment-skill

yangm2 self-assigned this Jun 30, 2026

yangm2 added enhancement New feature or request backend Bot implementation and other backend concerns performance Improvements to performance labels Jun 30, 2026

yangm2 added 5 commits June 30, 2026 01:12

add a skill for optimization based on LangSmith experiment

9a5c686

code review fixes

5a84055

defer construction of ChatGoogleGenerativeAI (network client) from mo…

70d291a

…dule import to first use This fixes proxy issues seen when Claude runs tests in it's sandbox

code review fixes and refactoring

4a5e75f

formatting after rebase/merge

32429a9

yangm2 force-pushed the optimize-experiment-skill branch from 0435f58 to 32429a9 Compare June 30, 2026 08:14

github-advanced-security AI found potential problems Jun 30, 2026

View reviewed changes

Comment thread backend/tests/conftest.py Dismissed

Comment thread backend/tests/conftest.py Dismissed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

add Optimize-experiment skill for Claude#375

add Optimize-experiment skill for Claude#375
yangm2 wants to merge 5 commits into
codeforpdx:mainfrom
yangm2:optimize-experiment-skill

yangm2 commented Jun 30, 2026 •

edited

Loading

Uh oh!

yangm2 commented Jun 30, 2026

Uh oh!

Uh oh!

Uh oh!

claude Bot commented Jun 30, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

yangm2 commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What type of PR is this? (check all applicable)

Description

Related Tickets & Documents

QA Instructions, Screenshots, Recordings

Added/updated tests?

Documentation

[optional] Are there any post deployment tasks we need to perform?

Uh oh!

yangm2 commented Jun 30, 2026

Uh oh!

Uh oh!

Uh oh!

claude Bot commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review

Lazy evaluator construction (langsmith_evaluators.py, measure_evaluator_variance.py, run_langsmith_evaluation.py)

_datastore_last_update_time — silent broad exception (langsmith_dataset.py)

_parse_stopgaps — tight coupling to prompt formatting (langsmith_dataset.py)

ORS relevance filter is a loose substring match (_check_retrieval_from_traces)

Tests

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

yangm2 commented Jun 30, 2026 •

edited

Loading

claude Bot commented Jun 30, 2026 •

edited

Loading

Lazy evaluator construction (`langsmith_evaluators.py`, `measure_evaluator_variance.py`, `run_langsmith_evaluation.py`)

`_datastore_last_update_time` — silent broad exception (`langsmith_dataset.py`)

`_parse_stopgaps` — tight coupling to prompt formatting (`langsmith_dataset.py`)

ORS relevance filter is a loose substring match (`_check_retrieval_from_traces`)