Skip to content

add Optimize-experiment skill for Claude#375

Open
yangm2 wants to merge 5 commits into
codeforpdx:mainfrom
yangm2:optimize-experiment-skill
Open

add Optimize-experiment skill for Claude#375
yangm2 wants to merge 5 commits into
codeforpdx:mainfrom
yangm2:optimize-experiment-skill

Conversation

@yangm2

@yangm2 yangm2 commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update
  • Infrastructure
  • Maintenance

Description

This adds an /optimize-experiment <experiment-name> skill with the associated tools so that Claude can do a static analysis of traces from an experiment. It can also compare the tool calls/results from an experiment on a new datastore to determine whether hacks/workarounds in the system prompt can be removed.

Related Tickets & Documents

QA Instructions, Screenshots, Recordings

Please replace this line with instructions on how to test your changes, a note on the devices and browsers this has been tested on, as well as any relevant images for UI changes.

Added/updated tests?

  • Yes
  • No, and this is why: please replace this line with details on why tests have not been included
  • I need help with writing tests

Documentation

  • If this PR changes the system architecture, Architecture.md has been updated

[optional] Are there any post deployment tasks we need to perform?

@yangm2 yangm2 self-assigned this Jun 30, 2026
@yangm2 yangm2 added enhancement New feature or request backend Bot implementation and other backend concerns performance Improvements to performance labels Jun 30, 2026
@yangm2 yangm2 force-pushed the optimize-experiment-skill branch from 0435f58 to 32429a9 Compare June 30, 2026 08:14
@yangm2

yangm2 commented Jun 30, 2026

Copy link
Copy Markdown
Contributor Author

@claude code-review

Comment thread backend/tests/conftest.py Dismissed
Comment thread backend/tests/conftest.py Dismissed
@claude

claude Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Claude finished @yangm2's task in 2m 36s —— View job


Code Review

  • Read PR diff and changed files
  • Review langsmith_dataset.py (main additions)
  • Review evaluators / variance / runner changes
  • Review SKILL.md and skill docs
  • Review tests

Overall this is a well-structured, well-documented, and thoroughly-tested PR. The lazy-evaluator refactor is a clean fix for the import-time network side effect, and the STOPGAP retrieval-check tooling is carefully reasoned (the trace-check vs. live-requery short-circuit logic is genuinely thoughtful). Below are mostly minor observations — nothing blocking.


Lazy evaluator construction (langsmith_evaluators.py, measure_evaluator_variance.py, run_langsmith_evaluation.py)

✅ Converting _EVALUATOR_JUDGE and the three LLM-judge evaluators into @cached factory functions correctly defers the httpx client construction to first use. The measure_evaluator_variance.py change to build evaluators only after confirming there's work to do (after the no-runs / no-matching-scenario early returns) is a nice touch, and the comments explain the why well.

Minor: the active evaluator list in run_langsmith_evaluation.py now mixes call-style (legal_correctness_evaluator()) with the still-commented object-style heuristic evaluators (# citation_format_evaluator,). That's intentional (heuristics have no network side effect), but a future maintainer uncommenting one might be unsure whether to add (). A one-line comment noting "LLM judges are factories, heuristics are plain objects" would prevent the confusion.

_datastore_last_update_time — silent broad exception (langsmith_dataset.py)

    except Exception:
        return None

Falling open to None (and thus re-querying) is the right safety behavior, but swallowing every exception with no logging means a real misconfiguration (bad import path, auth failure, API change) silently disables the skip optimization forever with no diagnostic — the command just quietly does extra live Vertex calls. Consider logging at debug/warning so the degradation is observable. Note this function doesn't receive the log param that its sibling _datastore_unchanged_since_experiment has.

_parse_stopgaps — tight coupling to prompt formatting (langsmith_dataset.py)

The parser depends on exact markdown: the label must end with .**, and continuation targets are only picked up from lines starting with exactly two spaces (lines[j].startswith(" ")). If system_prompt.md is ever reformatted (tabs, different indent, or a label with no trailing period), STOPGAPs would be missed. The zero-match case fails loudly (sys.exit(1)), but a partial parse (label found, continuation targets silently dropped) would fail quietly and skew hit rates. Given the format is controlled in-repo this is acceptable, but a brief note in the docstring that the parser assumes the canonical STOPGAP format would help.

ORS relevance filter is a loose substring match (_check_retrieval_from_traces)

ors_match = re.search(r"\b(\d+\.\d+)", sg["label"])
...
if any(ors_num in fact for fact in facts)

ors_num (e.g. 90.425) is matched as a bare substring of each fact, so a fact mentioning 90.4251 would also match 90.425. Low risk given real ORS numbers, but a word-boundary match (e.g. re.search(rf"\b{re.escape(ors_num)}\b", fact)) would be more precise.

Tests

✅ Excellent coverage — the _collect_experiment_traces, _check_retrieval_from_traces, and datastore short-circuit tests cover the meaningful branches (repetition grouping, orphan trace_ids, empty outputs, content-key fallback, whitespace normalization, coverage-gap signaling, naive/aware UTC, fall-open). The lambda: mock_evaluator updates in test_measure_evaluator_variance.py correctly adapt to the new factory contract.

Note: I was unable to execute the test suite (sandbox blocked uv run pytest), so the above is from static review only.


These are all minor/non-blocking. Nice work — the documentation quality (both code comments and the SKILL.md) is notably high, and the shmoo / search references in the skill check out against scripts/vertex_ai_search.py.
· optimize-experiment-skill

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend Bot implementation and other backend concerns enhancement New feature or request performance Improvements to performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants