evo2 SAE steering analysis: dose-response / selectivity harness + tested metrics#1635
evo2 SAE steering analysis: dose-response / selectivity harness + tested metrics#1635polinabinder1 wants to merge 5 commits into
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR introduces a complete SAE steering harness for Evo2: a pure metrics module for quantifying feature-driven text changes, a CLI script that orchestrates encoding, generation sweeps, and causal analysis, and corresponding unit tests. The workflow measures dose-response and selectivity of target feature clamping against baseline and control features. ChangesSAE Steering Harness Implementation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (2)
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_steer_analysis.py (1)
27-50: ⚡ Quick winConsider adding edge case tests.
The current tests cover happy paths well. Consider adding tests for edge cases such as:
- Empty strings in
divergence(e.g.,divergence("", "")ordivergence("A", ""))- Empty
control_steereddict inselectivity(verifies the edge case flagged insteer_analysis.py)dose_responsewith emptysteered_by_strengthdict🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_steer_analysis.py` around lines 27 - 50, Add unit tests covering edge cases: call divergence with empty strings and with one empty vs non-empty (e.g., divergence("", "") and divergence("A", "")) and assert expected behavior; add a test for selectivity that passes an empty control_steered dict to ensure the function handles/raises as intended; and add a test for dose_response with an empty steered_by_strength dict to verify it returns an empty/appropriate response. Place these new tests in the existing test_steer_analysis.py alongside the other tests and reference the same functions divergence, selectivity, and dose_response so edge behaviors are validated.bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/steer_analysis.py (1)
33-68: ⚡ Quick winExpand docstrings to Google-style format.
The coding guidelines require Google-style docstrings with Args, Returns, and (if applicable) Raises sections. The current brief docstrings are functional but could be more explicit about parameter types, return structure, and edge cases.
📚 Example expansion for `divergence`
def divergence(a: str, b: str) -> tuple[int, float]: """Return divergence metrics between two strings over their shared prefix. Args: a: First string to compare. b: Second string to compare. Returns: A tuple of (first_differing_index, fraction_differing) where: - first_differing_index is the index of the first character that differs, or the shared prefix length if strings are identical up to that point. - fraction_differing is the fraction of characters that differ within the shared prefix length (min of the two string lengths). """🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/steer_analysis.py` around lines 33 - 68, Update the short docstrings for divergence, dose_response, and selectivity to full Google-style docstrings: for divergence, add an Args section documenting a: str and b: str, a Returns section describing the tuple (first_differing_index: int, fraction_differing: float) and explain edge cases (identical prefix => index == shared prefix length, fraction computed over min(len(a), len(b)) and uses max(1, n) to avoid division by zero); for dose_response, document Args (baseline: str, steered_by_strength: dict[float, str]) and Returns (list[dict] with keys "strength": float, "first_divergence": int, "frac_changed": float) and note rows are sorted by ascending strength; for selectivity, document Args (baseline: str, target_steered: str, control_steered: dict[int, str]) and Returns (dict with keys "target_frac_changed", "control_frac_changed", "mean_control_frac_changed", "selectivity_ratio") and mention handling of empty controls (mean_control_frac_changed==0 and selectivity_ratio uses denominator max(mean_c, 1e-9)). Ensure each docstring uses Google-style sections (Args, Returns, and mention Raises if applicable) and references the existing function names divergence, dose_response, and selectivity so reviewers can locate the changes.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/steer_analysis.py`:
- Around line 54-68: The selectivity() function currently divides by a tiny
epsilon when control_steered is empty, producing a misleading huge
selectivity_ratio; modify selectivity() to detect when controls is empty (e.g.,
if not control_steered or if len(controls)==0) and set "selectivity_ratio" to
None (or another sentinel) instead of computing target / max(mean_c, 1e-9); keep
the other fields the same (you may also set "mean_control_frac_changed" to None
if you prefer), and update any type hints or callers of selectivity to accept a
nullable selectivity_ratio.
- Line 32: The top-level imports in steer_analysis.py currently have only one
blank line following them; update the module by inserting a second blank line
after the import block so there are exactly two blank lines before the first
subsequent statement (e.g., before any module-level constants, function or class
definitions) to satisfy the isort configuration.
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/steer.py`:
- Line 38: The import block in steer.py currently has only one blank line after
the imports which violates the isort rule requiring two blank lines; update the
top-of-file imports section by adding one additional newline so there are two
blank lines separating the imports from the rest of the code (i.e., ensure the
imports block in steer.py is followed by two blank lines).
- Around line 77-78: The parsing of comma-separated inputs for controls and
strengths (variables controls and strengths derived from a.controls and
a.strengths) lacks error handling and will raise ValueError on malformed tokens;
wrap the list comprehensions in a try/except that catches ValueError, validate
each token with .strip() before converting (skip empty tokens), convert controls
with int() and strengths with float() inside the try block, verify the two lists
have the expected matching lengths and raise or exit with a clear error message
including the offending token and the original input (use the argparse object a
for context) so the script fails gracefully instead of crashing.
- Around line 71-76: The code can leave target as None when no labeled features
are in the top activations; after the loop that inspects vals, ids and sets
target from a.feature or eng.labels, add a validation: if target is still None,
either set a sensible fallback (e.g., target = int(ids[0]) to pick the
highest-activation feature) or raise a clear error asking the user to provide
--feature; ensure the check references target, a.feature, vals, ids and
eng.labels so downstream code that uses target as a dict key/for formatting
won't get a TypeError.
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_steer_analysis.py`:
- Line 21: The file test_steer_analysis.py has only one blank line after the
import block which violates the isort config requiring two blank lines; open
test_steer_analysis.py, locate the import section at the top of the module (the
import statements in the test_steer_analysis module), and add one additional
blank line so there are two consecutive blank lines immediately following the
imports before the first test or code block.
---
Nitpick comments:
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/steer_analysis.py`:
- Around line 33-68: Update the short docstrings for divergence, dose_response,
and selectivity to full Google-style docstrings: for divergence, add an Args
section documenting a: str and b: str, a Returns section describing the tuple
(first_differing_index: int, fraction_differing: float) and explain edge cases
(identical prefix => index == shared prefix length, fraction computed over
min(len(a), len(b)) and uses max(1, n) to avoid division by zero); for
dose_response, document Args (baseline: str, steered_by_strength: dict[float,
str]) and Returns (list[dict] with keys "strength": float, "first_divergence":
int, "frac_changed": float) and note rows are sorted by ascending strength; for
selectivity, document Args (baseline: str, target_steered: str, control_steered:
dict[int, str]) and Returns (dict with keys "target_frac_changed",
"control_frac_changed", "mean_control_frac_changed", "selectivity_ratio") and
mention handling of empty controls (mean_control_frac_changed==0 and
selectivity_ratio uses denominator max(mean_c, 1e-9)). Ensure each docstring
uses Google-style sections (Args, Returns, and mention Raises if applicable) and
references the existing function names divergence, dose_response, and
selectivity so reviewers can locate the changes.
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_steer_analysis.py`:
- Around line 27-50: Add unit tests covering edge cases: call divergence with
empty strings and with one empty vs non-empty (e.g., divergence("", "") and
divergence("A", "")) and assert expected behavior; add a test for selectivity
that passes an empty control_steered dict to ensure the function handles/raises
as intended; and add a test for dose_response with an empty steered_by_strength
dict to verify it returns an empty/appropriate response. Place these new tests
in the existing test_steer_analysis.py alongside the other tests and reference
the same functions divergence, selectivity, and dose_response so edge behaviors
are validated.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ea342482-73ae-48e5-b854-6dc5e460dff3
📒 Files selected for processing (3)
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/steer.pybionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/steer_analysis.pybionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_steer_analysis.py
e567efc to
9bedf2b
Compare
…d onto migrated #1622 Re-lands #1635 on the post-#1633 layout, on top of migrated #1622: the steering-eval harness (scripts/{steer,steer_analysis}.py) over #1622's generate(), with model-agnostic metrics. Validated: tests/test_steer_analysis.py -> 3 passed (CPU); harness scripts compile. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
74ed67c to
0189d98
Compare
…d onto migrated #1622 Re-lands #1635 on the post-#1633 layout, on top of migrated #1622: the steering-eval harness (scripts/{steer,steer_analysis}.py) over #1622's generate(), with model-agnostic metrics. Validated: tests/test_steer_analysis.py -> 3 passed (CPU); harness scripts compile. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
0189d98 to
7461e2b
Compare
474650d to
2dcc5c9
Compare
…d onto migrated #1622 Clamp an SAE feature via the production Evo2SAE.generate path and quantify the causal effect: dose-response (effect vs strength) + selectivity (target vs control features), persisted to a structured steering_results.json. Metric / robustness: * normalized edit (Levenshtein) distance, not positional Hamming. Greedy decode is autoregressive, so one early flipped token shifts every downstream base and pins Hamming at ~1.0 — erasing the dose curve. Edit distance is shift-robust; first_divergence (shared-prefix length) is the complementary monotone signal. Tested with the shift case. * surface the clamp cap: generate() silently caps |strength| to MAX_CLAMP_STRENGTH, so two requests above it produce an identical clamp (a fake plateau). run_steering warns, steers at the effective value, and records max_clamp_strength + capped_strengths. Consolidation: * harness + metrics live in the package (src/evo2_sae/steer_analysis.py), engine injected, so they import as a normal torch-free module like evo2_sae.fasta — dropped all four sys.path inserts. scripts/steer.py is now a thin CLI (matches train.py/extract.py). * pick_target reuses Evo2SAE.top_features (the CLI/server ranking) instead of re-deriving topk. * one CPU test file (metrics + fake-engine harness) instead of two; fake stays local, not in conftest, to avoid colliding with the sibling server PR's engine fixtures. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
Relocate the steering dose-response / selectivity metrics from evo2_sae.steer_analysis into the evo2_sae.eval package (src/evo2_sae/steer_analysis.py -> src/evo2_sae/eval/steering.py), alongside the eval/probing harness. Update the importers (scripts/steer.py CLI + tests/test_steer_analysis.py) to evo2_sae.eval.steering. The CI lane is dropped via the rebase onto the updated #1622. Pure-CPU tests, no GPU/model. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
2dcc5c9 to
d6b6813
Compare
Re-add .github/workflows/unit-tests-interpretability-recipes.yaml (removed in c31d4e9 "defer CI for now"), recovered verbatim from 9bedf2b. Path-gated GPU lane (L4 + megatron squashed image) that builds the evo2 SAE recipe via .ci_build.sh and runs pytest tests/ — including the GPU steering/encode tests. Rooted on this PR (#1622), so the stacked SAE PRs (#1623/#1629/#1635) inherit it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: polinabinder1 <pbinder@nvidia.com>
Signed-off-by: polinabinder1 <pbinder@nvidia.com>
… the recipe The steering clamp hook moved sae.steering -> evo2_sae.steering (#1622). This merges that base in and fixes the one remaining reference (a docstring in eval/steering.py). No import change was needed — the harness is engine-injected and never imported sae.steering directly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: polinabinder1 <pbinder@nvidia.com>
Summary
Evo2 SAE steering analysis — a dose-response + selectivity harness over the decode-only steering hook, with pure, CPU-tested metrics. Now lives in
evo2_sae.eval.steering(alongsideevo2_sae.eval.probing).Stacked on #1622 (uses the
Evo2SAEengine + itssae.steeringclamp hook).Contents —
evo2_sae.eval.steeringcommon_prefix_len,edit_distance,divergence,dose_response,selectivity— string math, no model.pick_target(top labeled feature on a sequence) andrun_steering(sweep clamp strengths on a target + control features → dose-response + selectivity).scripts/steer.py— thin CLI wrapper (builds anEvo2SAE, runs the sweep); imports the harness fromevo2_sae.eval.steering.Design decisions (the non-trivial ones)
temperature=0), so a single early insertion/shift makes every downstream position differ — Hamming would saturate to ~1.0 from one change and report a fixed sweep as maximally divergent.divergenceuses normalized edit distance (doesn't saturate on a shift) plusfirst_divergence(shared-prefix length), so the metric tracks the real magnitude of the effect.pick_target/run_steeringtake the engine as an argument, so the whole harness is torch-free and runs against a fake in tests — no GPU, no checkpoints, noevo2_sae.coreimport. (The fake is intentionally local to the test file, not a conftest fixture, so it doesn't collide with evo2 SAE serve: FastAPI server + CLI (on the engine) #1637's GPU fixtures.)Evo2SAE.generatepath (decode-only clamp) — the analysis can't drift from how steering actually behaves in the server. It also forces greedy/deterministic decode (temperature=0,top_k=1), so strength-0 reproduces the baseline exactly and dose-response isn't polluted by sampling noise.max_clampmirrorscore._sanitize_steering— requested strengths beyondMAX_CLAMP_STRENGTHare capped, and the result surfaces the capped values, so the sweep matches what the engine would actually apply.pick_targetprefers the top labeled feature (the interpretable target) while still surfacing the top-active feature in the returned table — so steering a meaningful concept is the default, not whatever fired hardest.How to run
No dedicated CI lane (deferred — see #1622; CI should fold into the repo-wide recipe lane later).
Tests
9 passed (CPU, no model): the string metrics (
divergence/dose_response/selectivity/edit_distance) against hand-computed references, pluspick_target/run_steeringdriven against an injected torch-free fake engine. No GPU tests (the real-engine steering path is covered bytest_steering.pyin #1622).