Commit 4b1d019

v0.49.0: vstack-diagnose --baseline CI ratchet + fix vdiff on real reports
- Feature: vstack-diagnose --baseline <report.json> gates (with --fail-on) only on findings NEW vs a saved baseline: the CI ratchet (don't fail on pre-existing findings). Prints 'vs baseline: N new, M pre-existing'.
- Fix: vstack.vdiff.diff_reports assumed per_pattern was a name-keyed dict, but real DiagnoseReports carry it as a list -> diff_reports / vstack-vdiff crashed on genuine vstack-diagnose output. Now normalizes both shapes (+regression test).
- _gate_exit_code now takes severities (gates all findings or only new ones).

3,236 tests.
1 parent 0e24ee2 commit 4b1d019

8 files changed

Lines changed: 180 additions & 21 deletions

CHANGELOG.md

Lines changed: 30 additions & 0 deletions
@@ -6,6 +6,36 @@ project adheres to [Semantic Versioning](https://semver.org/) from
 `1.0.0` onward. During the `0.x` series, minor bumps may include
 breaking changes (see API stability promise in `vstack/__init__.py`).
 
+## [0.49.0] — 2026-06-23
+
+The CI ratchet — gate only on *new* findings — plus a `vdiff` correctness fix.
+
+### Added
+
+- **`vstack-diagnose --baseline <report.json>`** — compare the current run
+  against a saved diagnose report and, with `--fail-on`, gate **only on
+  findings that are new relative to the baseline**. This is the standard
+  ratchet: a gate won't fail on pre-existing, already-accepted findings, and
+  it tightens as you re-baseline. Prints a `vs baseline: N new, M
+  pre-existing` summary to stderr.
+
+### Fixed
+
+- **`vstack.vdiff.diff_reports` crashed on real reports.** It assumed
+  `per_pattern` was a name-keyed dict, but actual `DiagnoseReport`s (and
+  their JSON) carry `per_pattern` as a *list*. `diff_reports` (and therefore
+  the `vstack-vdiff` CLI) raised `TypeError` on genuine `vstack-diagnose`
+  output; it now normalizes both shapes. Regression test added.
+
+### Changed
+
+- `_gate_exit_code` now takes a list of severities (so the gate can score
+  either all findings or only the new ones).
+
+### Compatibility
+
+- All tests pass. `--baseline` is opt-in; no breaking changes.
+
 ## [0.48.0] — 2026-06-23
 
 SARIF output — vstack findings now flow into GitHub code scanning (Security

README.md

Lines changed: 4 additions & 0 deletions
@@ -542,8 +542,12 @@ Not on GitHub Actions? The same gating works from any shell — the core CLI is
 ```bash
 vstack-diagnose --trace run.json --fail-on high # exit 3 if any finding ≥ high
 vstack-diagnose --trace run.json --sarif > vstack.sarif # SARIF 2.1.0 for any code-scanning tool
+vstack-diagnose --trace run.json --fail-on high \
+  --baseline last-good.json # ratchet: only fail on findings NEW vs the baseline
 ```
 
+The `--baseline` ratchet gates on *new* findings only, so a CI gate won't fail on pre-existing, already-accepted findings — save a report with `--json` once, commit it as the baseline, and the gate tightens over time.
+
 ## Framework adapters
 
 Same patterns, native to your framework:
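
The ratchet described in the README change above can be pictured as a set difference over finding identities. The following is a minimal sketch under stated assumptions, not the project's implementation: `finding_key` and `split_by_baseline` are hypothetical helpers, and identifying a finding by its `(pattern, title)` pair is an assumption (the diff only shows that `_finding_key` returns a tuple of two strings).

```python
from typing import Any

Finding = dict[str, Any]


def finding_key(finding: Finding) -> tuple[str, str]:
    # Assumption: a finding is identified by its pattern and title.
    return (str(finding.get("pattern", "")), str(finding.get("title", "")))


def split_by_baseline(
    baseline: list[Finding], current: list[Finding]
) -> tuple[list[Finding], list[Finding]]:
    """Partition `current` into (new, pre_existing) relative to `baseline`."""
    known = {finding_key(f) for f in baseline}
    new = [f for f in current if finding_key(f) not in known]
    pre_existing = [f for f in current if finding_key(f) in known]
    return new, pre_existing


accepted = [{"pattern": "aar", "severity": "high", "title": "known"}]
current = accepted + [{"pattern": "lewin", "severity": "high", "title": "regression"}]

new, old = split_by_baseline(accepted, current)
print(f"vs baseline: {len(new)} new finding(s), {len(old)} pre-existing.")
# vs baseline: 1 new finding(s), 1 pre-existing.

# Re-baselining on the current run accepts everything found so far,
# so the same run no longer reports anything as new.
new_after, old_after = split_by_baseline(current, current)
print(f"vs baseline: {len(new_after)} new finding(s), {len(old_after)} pre-existing.")
# vs baseline: 0 new finding(s), 2 pre-existing.
```

Only the `new` list would feed the `--fail-on` gate, which is why an accepted `high` finding in the baseline does not fail CI while a newly introduced one does.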

_diagnose/lib/cli.py

Lines changed: 35 additions & 7 deletions
@@ -306,6 +306,15 @@ def main(argv: Sequence[str] | None = None) -> int:
             "Use to gate CI directly on the diagnosis. Omit to never fail on findings."
         ),
     )
+    parser.add_argument(
+        "--baseline",
+        default=None,
+        metavar="REPORT.json",
+        help=(
+            "Path to a saved diagnose report (JSON). When set, --fail-on gates only "
+            "on findings that are NEW relative to the baseline — the CI ratchet."
+        ),
+    )
     args = parser.parse_args(argv)
 
     if args.list:
@@ -381,23 +390,42 @@ def main(argv: Sequence[str] | None = None) -> int:
     else:
         print(report.to_markdown())
 
-    # CI gate: exit non-zero when a finding reaches the --fail-on threshold.
-    return _gate_exit_code(report.findings, args.fail_on)
+    # CI gate. With --baseline, gate only on findings NEW vs the baseline
+    # (the "ratchet" — don't fail CI on pre-existing, accepted findings).
+    gated_severities = [f.severity for f in report.findings]
+    if args.baseline is not None:
+        from vstack.vdiff import diff_reports
+
+        try:
+            baseline = json.loads(Path(args.baseline).read_text())
+        except (OSError, json.JSONDecodeError) as e:
+            print(f"vstack-diagnose: could not read --baseline: {e}", file=sys.stderr)
+            return 2
+        delta = diff_reports(baseline, report)
+        new = delta.added
+        gated_severities = [d.severity_after for d in new if d.severity_after]
+        print(
+            f"vs baseline: {len(new)} new finding(s), "
+            f"{len(report.findings) - len(new)} pre-existing.",
+            file=sys.stderr,
+        )
+
+    return _gate_exit_code(gated_severities, args.fail_on)
 
 
-def _gate_exit_code(findings: "list[Any]", fail_on: str | None) -> int:
-    """Return 3 if any finding is at/above ``fail_on``, else 0.
+def _gate_exit_code(severities: list[str], fail_on: str | None) -> int:
+    """Return 3 if any severity is at/above ``fail_on``, else 0.
 
     ``fail_on=None`` never gates. Factored out for direct testing.
     """
     if fail_on is None:
         return 0
     threshold = severity_rank(fail_on)
-    above = [f for f in findings if severity_rank(f.severity) >= threshold]
+    above = [s for s in severities if severity_rank(s) >= threshold]
     if above:
-        worst = max(above, key=lambda f: severity_rank(f.severity))
+        worst = max(above, key=severity_rank)
         print(
-            f"vstack-diagnose: gate failed — found {worst.severity} finding (>= {fail_on}).",
+            f"vstack-diagnose: gate failed — found {worst} finding (>= {fail_on}).",
             file=sys.stderr,
         )
         return 3
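
To illustrate why the gate now accepts plain severities, here is a small sketch of the same threshold check applied first to every finding and then to only the new ones. `severity_rank` is not shown in this diff, so the ordering below (including the `medium` level and the numeric ranks) is an assumption for illustration, and `gate_exit_code` is a simplified stand-in rather than the project's `_gate_exit_code`.

```python
from typing import Optional

# Assumed ordering: the diff only mentions "low", "high", and "critical".
_RANKS = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def severity_rank(severity: str) -> int:
    """Map a severity name to a comparable integer (hypothetical ranks)."""
    return _RANKS[severity.lower()]


def gate_exit_code(severities: list[str], fail_on: Optional[str]) -> int:
    """Return 3 if any severity is at or above `fail_on`, else 0."""
    if fail_on is None:
        return 0
    threshold = severity_rank(fail_on)
    return 3 if any(severity_rank(s) >= threshold for s in severities) else 0


all_severities = ["high", "low"]  # every finding in the current run
new_severities = ["low"]          # only findings absent from the baseline

print(gate_exit_code(all_severities, "high"))  # 3: gating on everything fails
print(gate_exit_code(new_severities, "high"))  # 0: the only new finding is low
print(gate_exit_code(all_severities, None))    # 0: no threshold, never gates
```

Passing a list of strings keeps the gate independent of whether the caller holds `Finding` objects or diff entries, which is what lets `main` choose between the two inputs.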

_diagnose/tests/test_diagnose_cli.py

Lines changed: 68 additions & 8 deletions
@@ -136,19 +136,79 @@ def test_runs_diagnose_with_none_client_markdown_output() -> None:
 
 
 def test_gate_exit_code_logic() -> None:
-    from vstack.diagnose import Finding
     from vstack.diagnose.cli import _gate_exit_code
 
-    findings = [
-        Finding(pattern="p", severity="high", title="t"),
-        Finding(pattern="q", severity="low", title="u"),
-    ]
-    assert _gate_exit_code(findings, "high") == 3  # a high finding at/above 'high'
-    assert _gate_exit_code(findings, "critical") == 0  # nothing reaches 'critical'
-    assert _gate_exit_code(findings, None) == 0  # no gate
+    severities = ["high", "low"]
+    assert _gate_exit_code(severities, "high") == 3  # a high severity at/above 'high'
+    assert _gate_exit_code(severities, "critical") == 0  # nothing reaches 'critical'
+    assert _gate_exit_code(severities, None) == 0  # no gate
     assert _gate_exit_code([], "high") == 0  # no findings
 
 
+def test_baseline_ratchet_passes_on_preexisting_findings(tmp_path) -> None:
+    # Baseline already has a high finding; the current run (client none) finds
+    # nothing new, so --fail-on high passes because nothing is NEW.
+    baseline = tmp_path / "baseline.json"
+    baseline.write_text(
+        json.dumps(
+            {
+                "shape": "individual",
+                "findings": [
+                    {
+                        "pattern": "aar",
+                        "severity": "high",
+                        "title": "known",
+                        "evidence": "",
+                        "intervention": "",
+                    }
+                ],
+                "errors": {},
+                "per_pattern": [
+                    {"pattern": "aar", "elapsed_seconds": 0, "finding_count": 1, "error": None}
+                ],
+            }
+        )
+    )
+    payload = {
+        "agent_id": "a",
+        "goal": "g",
+        "steps": [{"timestamp": "2026-01-01T00:00:00Z", "type": "observation", "content": "x"}],
+        "outcome": "o",
+        "success": False,
+    }
+    code, _out, err = _run(
+        [
+            "--client",
+            "none",
+            "--shape",
+            "individual",
+            "--fail-on",
+            "high",
+            "--baseline",
+            str(baseline),
+        ],
+        stdin=json.dumps(payload),
+    )
+    assert code == 0
+    assert "new finding" in err  # the baseline summary line
+
+
+def test_baseline_missing_file_returns_2(tmp_path) -> None:
+    payload = {
+        "agent_id": "a",
+        "goal": "g",
+        "steps": [{"timestamp": "2026-01-01T00:00:00Z", "type": "observation", "content": "x"}],
+        "outcome": "o",
+        "success": False,
+    }
+    code, _out, err = _run(
+        ["--client", "none", "--baseline", str(tmp_path / "nope.json")],
+        stdin=json.dumps(payload),
+    )
+    assert code == 2
+    assert "--baseline" in err
+
+
 def test_fail_on_no_findings_exits_zero() -> None:
     # With --client none on a thin trace there are no findings, so the gate
     # passes (exit 0) even at the strictest threshold.

_packaging/vstack/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -73,7 +73,7 @@
 
 from __future__ import annotations
 
-__version__ = "0.48.0"
+__version__ = "0.49.0"
 
 # The diagnose() function and PATTERNS registry are lazy-imported below
 # so that ``import vstack`` itself stays cheap. Pattern sub-packages

_vdiff/lib/_diff.py

Lines changed: 18 additions & 4 deletions
@@ -18,11 +18,25 @@ def _get_findings(report: Any) -> list[dict[str, Any]]:
 
 
 def _get_per_pattern(report: Any) -> dict[str, Any]:
+    """Return ``{pattern_name: entry}`` regardless of report shape.
+
+    A real ``DiagnoseReport`` (and its JSON form) carries ``per_pattern`` as a
+    *list* of per-pattern results (objects or dicts, each with a ``pattern``
+    field); older/synthetic reports use a dict keyed by pattern name. Normalize
+    both so diffing works on actual ``vstack-diagnose`` output.
+    """
     if isinstance(report, dict):
-        return dict(report.get("per_pattern", {}))
-    if hasattr(report, "per_pattern"):
-        return dict(report.per_pattern)
-    return {}
+        raw = report.get("per_pattern", [])
+    else:
+        raw = getattr(report, "per_pattern", [])
+    if isinstance(raw, dict):
+        return dict(raw)
+    out: dict[str, Any] = {}
+    for item in raw or []:
+        name = item.get("pattern") if isinstance(item, dict) else getattr(item, "pattern", None)
+        if name is not None:
+            out[str(name)] = item
+    return out
 
 
 def _finding_key(finding: Any) -> tuple[str, str]:
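
As a rough illustration of the normalization in the hunk above, the sketch below feeds both `per_pattern` shapes through one function and derives the added patterns from the resulting keys. `normalize_per_pattern` is a hypothetical standalone restatement of the logic, and computing added patterns as a key set difference is an assumption about how a field like `patterns_added` could be produced, not a description of `diff_reports` itself.

```python
from typing import Any


def normalize_per_pattern(raw: Any) -> dict[str, Any]:
    """Return {pattern_name: entry} for a dict, a list of entries, or None."""
    if isinstance(raw, dict):
        return dict(raw)
    out: dict[str, Any] = {}
    for item in raw or []:
        # Entries may be plain dicts (JSON) or objects with a `pattern` attribute.
        name = item.get("pattern") if isinstance(item, dict) else getattr(item, "pattern", None)
        if name is not None:
            out[str(name)] = item
    return out


list_shaped = [
    {"pattern": "lewin", "finding_count": 0, "error": None},
    {"pattern": "aar", "finding_count": 1, "error": None},
]
dict_shaped = {"lewin": {"finding_count": 0}, "aar": {"finding_count": 1}}

before = normalize_per_pattern(list_shaped[:1])
after = normalize_per_pattern(list_shaped)

print(sorted(set(after) - set(before)))            # ['aar']
print(sorted(normalize_per_pattern(dict_shaped)))  # ['aar', 'lewin']
print(normalize_per_pattern(None))                 # {}
```

Once both shapes collapse to a name-keyed mapping, the rest of a diff can compare keys without caring where the report came from.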

_vdiff/tests/test_diff.py

Lines changed: 23 additions & 0 deletions
@@ -210,3 +210,26 @@ class FakeReport:
 
     delta = diff_reports(FakeReport(), FakeReport())
     assert delta.before_count == 1
+
+
+def test_diff_handles_real_list_shaped_per_pattern():
+    """Regression: real DiagnoseReport JSON carries per_pattern as a LIST of
+    dicts (not a name-keyed dict). diff_reports must not crash on it."""
+    from vstack.vdiff import diff_reports
+
+    before = {
+        "shape": "individual",
+        "findings": [],
+        "per_pattern": [{"pattern": "lewin", "finding_count": 0, "error": None}],
+    }
+    after = {
+        "shape": "individual",
+        "findings": [{"pattern": "aar", "severity": "high", "title": "new"}],
+        "per_pattern": [
+            {"pattern": "lewin", "finding_count": 0, "error": None},
+            {"pattern": "aar", "finding_count": 1, "error": None},
+        ],
+    }
+    delta = diff_reports(before, after)
+    assert "aar" in delta.patterns_added
+    assert [d.title for d in delta.added] == ["new"]

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [project]
 name = "valanistack"
-version = "0.48.0"
+version = "0.49.0"
 description = "Organizational behavior, practiced on AI agents."
 readme = "README.md"
 requires-python = ">=3.11"
