Skip to content

feat(skills): add nemoclaw-maintainer-verify-stale#3063

Open
prekshivyas wants to merge 45 commits intomainfrom
feat/verify-stale-skill
Open

feat(skills): add nemoclaw-maintainer-verify-stale#3063
prekshivyas wants to merge 45 commits intomainfrom
feat/verify-stale-skill

Conversation

@prekshivyas
Copy link
Copy Markdown
Contributor

@prekshivyas prekshivyas commented May 5, 2026

Summary

Adds maintainer skill nemoclaw-maintainer-verify-stale — automates the manual loop of: spin up a Brev box → install latest NemoClaw → try to reproduce an old bug → comment with findings. Tag-only (never auto-closes); a maintainer pulls the trigger after reviewing.

The skill has been driven end-to-end against 9 real candidates during development, surfacing the gaps and skill changes summarised below.

Issues tackled

Issue Reported Verified Verdict Score What this run surfaced (skill change)
#2791 v0.0.21 v0.0.35 wontfix-by-design By-design prototype. Step 8.5 detection framework (3 signals: maintainer attribution / removal commit / symbol absent). Maintainer reviewed the analysis and closed the issue manually.
#2168 v0.0.21 v0.0.35 wontfix-by-design (tagged) Repo-label vocabulary mismatch (status: wont-fix, not wontfix); markdown-linked citations in comments.
#2007 v0.0.18 v0.0.35 fixed-on-latest (tagged) 84/100 Docker-group sg, brev exec PATH gotcha, sandbox-build-rot reframe, multi-axis architectural-drift check.
#2592 v0.0.28 v0.0.35 fixed-on-latest (tagged) 84/100 Sentinel-file SSH-drop guard, openshell sandbox exec -n NAME -- syntax footgun, reproducer-extraction regex extension (openclaw|openshell).
#2519 v0.0.26 v0.0.35 fixed-on-latest (closed by maintainer mid-run) 84/100 Step 5 provider classification + API-key prompt, install.sh env-var passthrough, sandbox-name validator.
#2604 v0.0.28 v0.0.36 fixed-on-latest (tagged) 84/100 File-based API key propagation (kill argv exposure local→Brev). 300-word hard ceiling for comments + comment-authoring principle in Step 10.
#2611 v0.0.28 v0.0.36 fixed-on-latest (tagged) 70/100 On-box subshell argv leak (extend file-based pattern to Brev→inner shell). First non-cap-at-84 score (the −30 LLM-synth penalty pushed raw below the cap).
#2513 v0.0.26 maintainer self-verified mid-batch Gap-queue: skill should re-check state == OPEN before posting. Second mid-batch close-by-maintainer (alongside #2519) → promote from queue to a real Step 1/3 check.
#2166 v0.0.21 still-reproduces (source-only) Gap-queue: when pre-flight is conclusive, allow source-only verdicts to skip Brev entirely (Verification mode: static analysis at <tag>).

Effective touched count: 7 issues with skill-rendered comments + 2 side-discoveries. Two of the seven landed wontfix-by-design, five landed fixed-on-latest. Sandbox-build rot capped four of the five at 84 (the modal verdict shape with reporter @-mention). #2611 was the first run to land below the cap, validating the −30 LLM-synth penalty path.

v1 scope

In Linux only (Brev), bug issues against CLI / Sandbox / OpenShell / Docker / Getting Started / Ubuntu / DGX Spark / GB10
Out macOS, Windows/WSL, integration bugs needing third-party tokens, service-account bots, auto-close

Verification flow (one-line per step)

Step Purpose
6.5 — Preconditions Brev auth + install-URL reachability fail-fast (no Brev cost on dead deps)
6.7 — Local-first Pure-CLI bugs (no sandbox/Docker/GPU) run on the maintainer's local install — same evidence, zero cost
7 — Reuse-or-provision brev ls → reuse a verify-stale-* box if running, else brev create with a runtime-picked CPU SKU
8 — Validate-on-baseline, verify-on-latest Install reported version first, confirm the reproducer actually exposes the bug, reset, then install latest
8.5 — By-design detection Three signals (maintainer attribution / removal commit / symbol absent) short-circuit Brev cost on intentional removals
9 — Confidence scoring +50 / +25 / +25 / −30 / −50, clamped to [0,100]. ≥85 silent · 60–84 + reporter @-mention · <60 inconclusive
10 — Comment + label Comprehensive redaction (JWT / NVAPI / PAT / base64 / internal hostnames / emails) with HTML→text pre-pass for QA bodies. 300-word hard ceiling for fixed/wontfix; 30–80 for still-reproduces.
11 — Infra failure Sandbox-build rot is the dominant failure mode for any version >5–7 patches behind; cap-at-84 with reporter @-mention is by design
12 — Activity log Append per-issue + per-session entries to ~/development/daily-rhythm/activity/nemoclaw-verify-stale-log.md

Idempotency

Every posted comment carries <!-- nemoclaw-verify-stale v1 YYYY-MM-DD -->. Candidate filter excludes any issue whose comments match <!-- nemoclaw-verify-stale v\d+ YYYY-MM-DD --> within the last 7 days OR carries fixed-on-latest / verify-inconclusive / status: wont-fix. Release sweep clears the first two on each tag cut.

Active-discussion handling (two-tier)

Maintainer's unanswered comment age Skill behaviour
≤ 7 days Skip — running on top of fresh discussion would conflict with the maintainer's framing
> 7 days Proceed with the unanswered-question variant: lead with a markdown link to the unanswered comment, dual @-mention of both maintainer and reporter

Companion changes

  • nemoclaw-skills-guide/SKILL.md — adds the new skill to the maintainer skills table.
  • nemoclaw-maintainer-cut-release-tag/SKILL.md — release-time sweep of fixed-on-latest and verify-inconclusive from every open issue. Without it, "latest" drifts and verifications go silently stale. Verification records stay in comments; only labels reset.

CI status

Check Status
commit-lint, dco-check, installer-hash-check, legacy-path-guard, pr/changes
markdown-links ✅ after 3abffd72 (placeholder [label](url) examples in SKILL.md were being parsed as real broken links — restructured as fenced-block templates) and dc5d7143 (merge of main brought in the deletion of docs/get-started/brev-web-ui-quickstart.md per #3050, resolving the merge-side missing-file failure)

Test plan

Signed-off-by: Prekshi Vyas prekshiv@nvidia.com

Summary by CodeRabbit

  • Documentation

    • Added a post-release verification step that conditionally clears verification labels on open issues so they’re re-evaluated against the new release; wont-fix status and prior verification comments are preserved.
    • Added a comprehensive “verify stale” maintainer workflow and execution reference (batch processing limits, idempotency markers, verification/rubric templates, infra and comment-handling guidance).
  • Chores

    • Updated maintainer skill and role counts to include the new verify-stale skill.

Adds a maintainer skill that automates verifying whether old bug reports
still reproduce against the latest NemoClaw release. The skill picks
candidate issues opened against older versions, reuses or provisions a
Brev Linux box (CPU or GPU based on the bug profile), runs the extracted
reproducer, scores confidence, and posts an evidence-backed comment with
either a `fixed-on-latest` or `verify-inconclusive` label. Tag-only —
never auto-closes.

v1 scope is intentionally narrow: Linux only, no third-party integration
credentials, no service-account bot. Each maintainer runs it under their
own gh credentials.

Companion changes:

- Adds the new skill to the maintainer table in `nemoclaw-skills-guide`
  (count 7 -> 8 maintainer skills, 18 -> 19 total).
- Adds Step 7 to `nemoclaw-maintainer-cut-release-tag` that sweeps
  `fixed-on-latest` and `verify-inconclusive` labels off all open issues
  at release time, so the next verify-stale run re-evaluates against the
  new latest. Verification records remain in comment history; only the
  labels are reset.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 5, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 5, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds a new nemoclaw-maintainer-verify-stale skill to the NemoClaw maintainer skills catalog and expands the release workflow with a post-release label cleanup step that removes verification-related labels from open issues, allowing the next verification run to re-evaluate them against the new release.

Changes

NemoClaw Skills Catalog and Release Workflow

Layer / File(s) Summary
Skill Catalog Entry
.agents/skills/nemoclaw-skills-guide/SKILL.md
Introduces nemoclaw-maintainer-verify-stale skill to the maintainer skills table, increments maintainer skill bucket count from 7 to 8, and updates Roles table to reflect 19 total maintainer skills.
Verify-Stale Skill
.agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md
Adds a tag-only maintainer skill with front-matter, single/batch modes, latest-tag discovery, candidate filtering, reported-version parsing, classification, reproducer extraction, preflight checks, and a local-first short-circuit; hooks to Steps 7–12 reference.
Verify-Stale Execution Reference
.agents/skills/nemoclaw-maintainer-verify-stale/reference/execution-and-comment.md
Adds Steps 7–12: box reuse/provisioning, two-pass baseline→latest validation, dependency bootstrapping and environment quirks, architectural-drift checks, performance/rebuild rubrics, by-design branch, scoring, comment templates/redaction, infra handling, and activity logging.
Release Workflow Integration
.agents/skills/nemoclaw-maintainer-cut-release-tag/SKILL.md
Adds Step 7: conditional sweep of fixed-on-latest and verify-inconclusive labels on open issues after release tagging, excluding status: wont-fix, using Project 199 status when available, marker-age (10+ days), or regression-touch checks against paths cited in the verify marker.

Sequence Diagram(s)

sequenceDiagram
  participant Maintainer
  participant GitHub
  participant VerifyStale
  participant Brev
  Maintainer->>GitHub: discover candidate issues & latest tag
  GitHub-->>Maintainer: return issues, labels, comments
  Maintainer->>VerifyStale: request filtering/prioritization
  VerifyStale->>Brev: provision or reuse verification box
  Brev-->>VerifyStale: run baseline then latest reproducer
  VerifyStale-->>Maintainer: deliver transcripts, score, verdict
Loading

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 New skills hop into the spread,
A verify-stale step clears the bed,
Labels swept clean after tags are made,
Old marks stay in comments, history saved,
Fresh checks awaken on the repo's glade.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'feat(skills): add nemoclaw-maintainer-verify-stale' clearly and specifically summarizes the main change—introducing a new maintainer skill named nemoclaw-maintainer-verify-stale.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/verify-stale-skill

Warning

Review ran into problems

🔥 Problems

Timed out fetching pipeline failures after 30000ms

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

Three flaws surfaced when dry-running the Step 3 candidate filter
against the live open-issue list (141 open bugs).

1. Step 2 used `gh release view` to detect "latest". NemoClaw tags but
   does not publish GitHub releases, so this returned empty. Added a
   fallback that picks the highest semver from `git ls-remote --tags`,
   which is the load-bearing path today.

2. Step 3 said "2 or more minor versions behind". Wrong vocabulary for
   `0.0.x` repos — patch is what's iterating. Reworded to "at least 2
   versions behind in the rightmost-incrementing component", with a
   concrete `0.0.x` example and a forward-compat note for when NemoClaw
   moves to `0.1.x`.

3. Step 4 specified a naive `v?0\.0\.\d+` regex that matched IP-like
   strings in bug bodies (Ollama bind `0.0.0.0:11434`, loopback
   `127.0.0.1`, etc.) and produced phantom v0.0.0 / v0.0.1 candidates.
   Tightened to require `v` prefix + word boundaries first, with a
   context-anchored fallback that only matches `0.0.X` on lines
   containing `nemoclaw` or `version`. Added a clamp that rejects any
   parsed version greater than `$LATEST` (catches roadmap labels like
   `v0.0.35` from being treated as "reported on").

After these fixes the dry-run produces 33 plausible candidates out of
141 open bugs, with credible reported-version assignments verified
against issue bodies.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Two more issues surfaced from a deeper second-pass dry-run.

1. The Step 4 regex was hardcoded to v0.0.x. NemoClaw is on that line
   today, but the skill should keep working when it moves to v0.1.x or
   higher. Generalized to `\bv\d+\.\d+\.\d+\b` with a context-anchored
   `\b\d+\.\d+\.\d+\b` fallback on lines containing nemoclaw or version
   (so the loose form doesn't match `0.0.0.0:11434` and `127.0.0.1`).

2. Some bug reports cite NemoClaw versions that were never tagged
   (e.g. `v0.1.0` × 3 issues, calver `2026.3.11` × 1). Without a tag
   existence check the skill would point Brev verification at a version
   that does not exist. Added a `git ls-remote --tags` validation pass
   that drops the version if no matching tag is found.

Also added an implementer note documenting a real failure mode caught
during the dry-run: a naive `[scan(primary)] | first | .[0] | tonumber
// [scan(fallback)] | ...` pipeline silently dropped 9 valid candidates
because `null | first` errors when scan returns empty, and the `//`
chain did not propagate cleanly to the fallback. The SKILL.md regex
itself was correct — the failure was implementation-level fragility in
how the two passes were composed. The note tells implementers to bind
each pass to a named variable, coalesce at the end, and explicitly test
the empty-match path.

After these changes the live dry-run produces 47 plausible candidates
(out of 141 open bugs) and the 4 residual drops are all reporter typos
that cite a version that was never released — correctly unverifiable.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Third dry-run round surfaced a precision issue. The previous regex
matched any `\bv\d+\.\d+\.\d+\b` anywhere in the body, then validated
against the tag list. That picked up versions belonging to *other*
products that share the v0.0.x line — most often OpenShell, but also
Node.js (v22.x.x), and other dependencies. Tag validation then
correctly rejected the non-NemoClaw ones, but on bodies where multiple
products were listed (e.g. `Environment: openshell 0.0.4, nemoclaw
0.1.0, Node.js v22.16.0`) the parser would land on OpenShell's
tag-valid v0.0.4 and report it as the NemoClaw version, producing a
false-positive candidate. 12 such collisions exist in the current
backlog.

Fix: require the version to follow `nemoclaw` (case-insensitive) within
80 non-letter, non-newline characters. The anchored regex
`(?i)nemoclaw[^a-z\n]{0,80}v?(\d+\.\d+\.\d+)` matches `NemoClaw
v0.0.32`, `nemoclaw 0.0.28`, `- NemoClaw: v0.0.16`, but not `openshell
0.0.4` followed later by `nemoclaw 0.1.0`. Combined with the existing
tag-list validation, this collapses the four error classes — typos,
calver dates, roadmap labels, and product collisions — into one
"version must be a real NemoClaw release tag, mentioned next to
NemoClaw" check.

Also expanded the implementer note to cover three concrete failure
modes hit during the dry-run:

- Empty-match handling (`[]` flowing into `first | .[0]` errors).
- Capture-group inconsistency between branches in a parser pipeline.
- Variable-scoping bug in `select($tags | index(.))` where `.` rebinds
  to `$tags` and silently passes invalid labels.

After this change the live dry-run produces 33 candidates out of 141
open bugs — same count as the very first round, but every candidate's
parsed version is now anchored to NemoClaw and validated against the
real tag list. The earlier wider counts (47, 48) included roughly 12-15
issues that would have been wasted Brev runs.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…test

A clean run on latest is ambiguous on its own:
- Scenario A: bug really got fixed (script worked, latest passes)
- Scenario B: script was junk all along (script never triggered the
  bug; latest passes for the same reason baseline would)

These look identical from the latest pass alone, which means the
previous flow could land `fixed-on-latest` on a script that never had
a chance — a false positive that the +25 commit and +25 PR signals
only partially mitigate.

This change adds a baseline pass before the latest pass:

- Step 8a: install reported version on the box.
- Step 8b: run reproducer, compare output to issue's actual-result
  description. Match = script validated.
- Step 8c: if no match (script silent OR errored on the wrong thing),
  synth-repro using the issue body PLUS the baseline transcript so
  the LLM can react to the actual failure mode. Apply -30. Retry.
  Still no match -> verify-inconclusive, skip latest entirely.
- Step 8d: install latest, run validated reproducer. Score from this.

A single trigger drives synthesis: "the script didn't expose the bug
on baseline." Whether it was silent or noisy doesn't matter; both
mean the script needs work. This collapses the previous separate
"narrative-only -> synth" and "verbatim-failed -> ???" paths into
one rule.

Step 6 simplified accordingly: just extract verbatim if available,
otherwise carry the issue body forward to Step 8b for on-demand
synthesis. No more "give up before provisioning" branch — that
decision moves into Step 8c where it has more context to work with.

Step 9 adds a baseline-validation gating rule. If the reported-
version install fails (old releases rot — installer URLs, deps, OS
images drift), the score is capped at 84 unless commit-area or PR-
mention evidence also fires. That forces the @-mention-reporter band
when we lack independent corroboration, which is the honest position
when we couldn't validate the script ourselves.

Step 11 split into two failure types:
- Latest-install fail or harness error -> hard infra failure (no
  label, optional comment, move on).
- Baseline-install fail -> degraded mode (skip baseline gate, run
  latest anyway, apply Step 9 cap).

Wallclock budget bumped from 15 to 25 minutes per verification to
accommodate two installs.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Tier-1 (blocking for E2E):

- Add the still-reproduces-on-latest path. If the latest run output
  matches the issue symptom, the bug is still live — the skill posts
  a "still reproducible" comment with both transcripts, applies no
  label, and includes a dated marker `<!-- ... v1 YYYY-MM-DD -->`.
  No new label per user direction; a 7-day TTL on the marker handles
  re-verification on the next weekly cron.
- Update the comment template for the two-pass flow. The previous
  template described a single transcript; the new one has Baseline
  and Latest sections plus a dedicated still-reproduces template,
  and the idempotency marker now carries today's date.
- Update the Step 12 log template to record baseline-install,
  baseline-match, latest-install, and latest-result independently.
- Clarify in Step 4 that REPORTED_VERSION must be the full tag string
  ("v0.0.32"), not just the patch number (32). Step 8a's installer
  expects the full tag.

Tier-2:

- Spell out an explicit match rubric in Step 8b: exit code agreement,
  symptom-phrase match (LLM-judged semantic equivalence counts), and
  distinguishing the bug from infra noise (DNS / auth / rate limits).
- Replace the minimal `rm -rf ~/.nemoclaw` reset with a comprehensive
  one that stops nemoclaw/openshell processes, removes spawned Docker
  containers, frees common ports (8080, 18789, 9119), and removes the
  installed binaries and lib paths. Run before each install in 8a and
  8d; idempotent so it's safe on a fresh box.
- Add interactive-subcommand handling in 8b: auto-detect `nemoclaw
  onboard` / `configure` and try `--non-interactive`, then
  `--dangerously-skip-prompts`, then stdin pre-feed. Fall through to
  Step 8c (synth) if none work.

Tier-3 / opportunistic:

- Fix the actual installer command (gap 9). The installer accepts
  the target ref via `NEMOCLAW_INSTALL_TAG` env var, NOT a `--version`
  flag (verified against install.sh source). Updated 8a accordingly.
- Update Step 3 idempotency: drop on either label OR a marker
  comment within the last 7 days. Previously a marker excluded the
  issue forever; the release sweep only clears labels, so the marker
  alone made re-verification impossible. The 7-day TTL also handles
  the still-reproduces case naturally.
- Extend wallclock budget to 60 min when issue body contains
  time-sensitive keywords (`after N minutes`, `eventually`, `memory
  leak`, etc.). Hard ceiling at 60 — bugs that require hours fall
  out of v1 scope.
- Keep provisioned boxes for 30 minutes on verify-inconclusive
  outcomes so a maintainer can `brev shell` in and triage. Reused
  boxes always stay.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
@wscurran wscurran added Platform: Brev Support for Brev deployment enhancement: skill Improvements to NemoCall repository hygiene or user functionality with skills. labels May 6, 2026
prekshivyas added 20 commits May 6, 2026 14:48
Five concrete fixes from the pre-E2E gap analysis, plus a verified
finding on gap 2 that lowered its risk profile.

Gap 1 — sudo behavior. The reset script's `sudo rm -f` and `sudo rm
-rf` could hang on a password prompt if the Brev image's default user
lacks passwordless sudo. Switched all sudo calls to `sudo -n` (non-
interactive) so they fail fast instead. Added a precondition note
explaining the assumption: Brev stock images allow passwordless sudo,
custom images may not. The user-local install path (~/.nemoclaw) is
fully reset regardless of sudo state, so the worst case is a stale
/usr/local/bin/nemoclaw symlink that the next install overwrites.

Gap 2 — verified the install architecture supports old-version pinning
natively. The bootstrap clones the requested ref and runs that tag's
own install.sh (with a payload-marker fallback for legacy ones), so
NEMOCLAW_INSTALL_TAG=v0.0.32 works architecturally. Risk reduces to
"does v0.0.32's own installer still work on a 2026 OS image" — handled
by Step 11's degraded-mode fall-through. No SKILL.md change needed.

Gap 4 — path extraction was hand-waved. The +25 commits-touched-area
weight referenced `git log v<reported>..$LATEST -- <path>` without
specifying how to determine `<path>`. Added a three-tier extraction
procedure: stack-trace path mentions parsed from the body and mapped
to repo paths, then a component-label-to-directory map (NemoClaw CLI →
bin/, Sandbox → src/lib/sandbox/, etc.), then title-keyword fallbacks.
If none yields a path, skip the +25 signal entirely rather than
guessing — floating the weight would inflate scores meaninglessly.

Gap 5 — PR-search query was hand-waved. The +25 PR-mention weight had
no concrete query. Added two-stage search: first `gh pr list --search
"$ISSUE_NUMBER"` filtered for actual `#NNNN` references in body or
title, then a symptom-phrase fallback if direct reference returns
nothing. PRs that merged before the issue was filed are excluded via
`mergedAt > tag-date(REPORTED_VERSION)`.

Gap 6 — issues without an "Actual result" section had no match path.
Behavioral / configuration bugs (e.g., #1242 "should default to a
stable released version") describe a wrong default rather than a
runtime error. Added a fallback to Step 8b's match rubric: use the
title + full body as the symptom, match if reproducer's outcome
contradicts the issue's expected behavior. If neither error string
nor expected-behavior contradiction can be identified, route to Step
8c (synth-repro) so the LLM produces a more diagnostic script.

Gap 7 — transcript redaction was thin. Step 10 only covered tokens,
paths, and basic-auth URLs. Real transcripts leak more: Authorization
headers, JWTs, GitHub PATs, AWS keys, NVIDIA API keys, internal
hostnames, emails (PII), long base64 blobs. Expanded the redaction
table with concrete patterns for each. Added an order note:
longest/most-specific patterns first so the generic base64 catchall
doesn't mask what was actually redacted. Made it explicit that
redaction runs on every quoted chunk in the comment, not just issue-
body excerpts.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…l SKU pick

Preflight Brev auth and install-URL reachability before paying any cost
(Step 6.5). For pure-CLI bugs with no sandbox, Docker, or GPU dependency,
try the reproducer locally on the maintainer's `nemoclaw` install before
provisioning a Brev box (Step 6.7) — same evidence, zero cost.

Replace the `<your-team's-CPU-SKU>` placeholder in Step 7 with a runtime
pick via `brev search cpu --sort price --json | jq` so the skill keeps
working as SKUs change, with a `VERIFY_STALE_CPU_TYPE` env override for
teams that need to pin.

Drop the spurious `--yes` flag from `brev delete` — it does not exist on
that subcommand and would error. `brev delete` is non-interactive by
default. Print the instance name before the trap registers so manual
cleanup is possible if the trap doesn't fire.

Thread `INSTALL_URL` through Step 8a/8d so the URL override from Step 6.5
applies to both the baseline and latest installs.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Some bugs are filed against behavior that was deliberately removed or
changed in a merged PR. Running the standard rubric on these produces
a misleading verdict — the symptom "still reproduces" but the right
answer is "won't fix, see PR #X." Issue #2791 is the prototype:
`config set` was removed in #2227, the reporter tested a version that
already had it gone, and the standard rubric would have buried that
context under a low-confidence `verify-inconclusive` label.

Add Step 8.5 with three independent detection signals:

- Maintainer-attribution comment phrasing ("removed in #N", "by design",
  "wontfix", "intentional") from MEMBER/OWNER/COLLABORATOR authors.
- A removal commit between reported version and `$LATEST` whose subject
  matches `\b(remove|delete|drop|deprecate)\b` and whose diff deletes the
  symbol implicated by the reproducer.
- The implicated symbol absent from both reported version and `$LATEST`,
  meaning it was already gone when the issue was filed.

When any signal fires, skip Step 9 scoring and Brev provisioning, apply
a new `wontfix-by-design` label, and post a focused comment that links
to the responsible PR. Detection on signals 2 and 3 can run as soon as
the reported version is parsed, saving Brev cost on the entire class.

Generalize the Step 3 idempotency marker regex from `v1` to `v\d+` so
future skill versions can re-verify older-marked issues by tightening
the regex to require a specific marker version.

Add `wontfix-by-design` to the idempotency label list, the activity log
entry options, the session summary, and the release sweep in
`nemoclaw-maintainer-cut-release-tag`. Update the skill's frontmatter
description to mention the new label and the local-first short-circuit.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…var log path

The issue body that comes back from `gh issue view` for NV QA bugs is
HTML, not markdown — `<pre>...</pre>` blocks and tables, with HTML-encoded
entities. The previous Step 6 only mentioned `<pre>` in prose without an
extractor that actually parsed it. Add a Python extractor that handles
markdown fences (triple-backtick and tilde), HTML `<pre>`, and entity
unescape in one pass, so verbatim reproducers from QA-filed issues land
cleanly instead of falling through to LLM synthesis with a -30 penalty.

Step 10 already had a comprehensive redaction regex table, but the
patterns assumed plain text — tokens nested in HTML tags or attributes
slipped through unredacted. Add an HTML to text pre-pass for issue-body
excerpts before the regex table runs, so quoted excerpts from QA bodies
get the same redaction coverage transcripts already do.

Step 5's GPU keyword list flagged `inference` and `model serving`, both
of which false-positive on CPU bugs (`models.providers.inference.baseUrl`
is a config path, not a GPU need). Tighten to whole-word matches and
swap in `vllm`, `tensorrt`, `L40S`, `L4`, `T4`.

Step 12's activity log was hard-coded to one maintainer's personal
organizer path. Read from `VERIFY_STALE_LOG_DIR` with the existing path
as default, and require the skill to create the directory rather than
assuming it exists, so the skill works in CI / shared-volume / other
environments.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Path-extraction map in Step 9 was written from assumption rather than
verification — `src/lib/sandbox/`, `src/lib/openshell/`, and
`src/lib/policy/` don't exist in the current repo, and OpenShell lives
in a separate repo entirely. The +25 commits-touched-component signal
would have silently misfired on every Sandbox / OpenShell / policy
issue, underscoring real fixes. Replace with paths verified against
the current tree (`nemoclaw/src/blueprint/`, `nemoclaw-blueprint/`,
`nemoclaw/src/commands/`, etc.) and explicitly note the cross-repo
OpenShell case is out of scope for v1.

Step 8 reset wiped `~/.nemoclaw` but not `~/.openclaw`. Sandbox state
has been writable-by-default under `.openclaw` since #2227, so it
persists across the baseline → latest reinstall and contaminates the
latest run. Add it to the reset.

Anchor the `pkill -9 -f nemoclaw|openshell` patterns to a leading
slash (`/nemoclaw`, `/openshell`) so the kill matches actual installed
paths but not unrelated processes that mention these strings — including
the agent harness running this skill if its working directory contains
the word.

Reorder the Step 10 redaction table to match the stated "longest, most
specific patterns first" rule. The previous order had the generic
`(token|secret|password|...)` pattern executing before JWT/AWS/NVIDIA
patterns, which would have masked specific-token redactions with
generic ones — losing the signal of *what kind* of credential leaked.

Replace `git ls-remote git@github.com:NVIDIA/...` (Steps 2 and 4) with
`gh api repos/NVIDIA/NemoClaw/tags --paginate --jq '.[].name'`. The
SSH path required keys to be configured; `gh api` reuses whatever auth
the user already has.

Add a CLI-deps precondition check (`command -v gh brev jq python3 curl`)
to Step 6.5 so the skill fails fast if any required dependency is
missing, instead of failing late and confusingly mid-run.

Step 11's keep-box-on-inconclusive said "delay cleanup by 30 minutes"
but had no implementation. A backgrounded `sleep && brev delete`
doesn't survive session end. Replace with skip-the-trap-entirely plus
an explicit manual-cleanup reminder printed to the run output.

Out-of-Scope was contradicting itself on macOS — Step 6.7 explicitly
allows macOS for local-first runs. Clarify: Brev path is Linux-only,
the local-first short-circuit on a maintainer's laptop works on macOS
for manual single-issue runs.

Add a `local (no Brev — Step 6.7 short-circuit)` option to the Step 12
log entry's Box field so local-first runs round-trip the activity log
correctly.

Frontmatter description said "latest release" but NemoClaw uses tags,
not GitHub Releases — fix to "latest tag" to match Step 2.

Step 4's implementer note said "Two real failure modes" but listed
three; fix the off-by-one.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…-verification

Step 8.5 was producing comments that were correct in direction but
hand-wavy on substance. A side-by-side against an independent analysis
of #2168 surfaced the gap: that analysis cited specific file:line
locations, separated the literal bug-as-filed from a related failure
mode that still exists on latest, called out existing CI coverage of
the new workflow, and self-corrected an overstatement when prompted to
re-verify. Today the skill required none of that.

Restructure Step 8.5 into substeps so the rigor is mechanical, not
prompt-dependent:

- 8.5a: signal detection now requires verifiable evidence per signal —
  comment URL + quoted phrase (signal 1), commit SHA + diff range
  (signal 2), grep commands + outputs (signal 3). Add a sub-case for
  vestigial deprecation shims so a stub doesn't silently fail signal 3
  by appearing in latest.
- 8.5b: pre-check for related failure modes. Grep latest for the bug's
  symptom keywords (not the removed symbol) and require the comment to
  call out anything that surfaces. This is the section the side-by-side
  identified as missing — saying "the bug as filed can't reproduce" is
  not the same as "every bug shaped like this is fixed."
- 8.5c: check existing test coverage for the new workflow and cite up
  to three test paths if found. Strengthens the comment from "trust me,
  it was removed" to "the new workflow is exercised by these tests."
- 8.5d: explicit self-verification pass — re-run every cited command
  before composing the comment; bail to verify-inconclusive on any
  discrepancy. LLMs confidently overstate; mechanical re-verification
  catches it without needing a human prompt.

Replace the by-design comment template with one that has mandatory
sections matching the substeps: "What's structurally fixed", "Vestigial
references", "What's not literally the same bug", "Existing CI
coverage", "Recommendation". Each section either has concrete content
or is explicitly omitted — no hand-wavy claims slip through.

Add NVBugs cross-reference extraction to Step 4 (`grep -oE
'\[NVB#[0-9]+\]'`) so the by-design template can append the standard
"NVBugs#NNNNNNN will need a separate update; closing this GitHub issue
won't propagate" reminder when the issue body carries that footer (most
NV QA bugs do).

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
The previous signal-2 candidate query narrowed by commit subject
(`remove|delete|drop|deprecate`) before checking whether the diff
deleted the implicated symbol. That filter excludes the common case
where a removal lands inside a `refactor(...)` or `feat(...)` commit —
e.g., PR #2227 removed `--dangerously-skip-permissions` under a
"refactor(sandbox): default to mutable config" subject. A literal
follower of the previous rule would have concluded signal 2 doesn't
fire on issue #2168 even though the flag was deliberately removed.

Switch the primary candidate lookup to `git log -S<symbol>` pickaxe
search, which finds every commit whose diff changes the count of the
symbol regardless of subject wording. Keep the subject-keyword query
as a supplementary lookup for narrowing when the pickaxe returns many
commits. Note explicitly that the commit's actual subject doesn't need
to mention removal.

Surfaced while testing the new Step 8.5 rigor against #2168.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…horing

Two changes that surfaced from running Step 8.5 against issue #2168.

Use the existing repo `wontfix` label for the by-design path instead
of creating a new `wontfix-by-design` label. The repo already has
`wontfix` and it's already in Step 3's issue-type skip list, so applying
it gives idempotency for free without proliferating labels. The
by-design verdict is encoded in the marker comment, not the label
vocabulary.

Important consequence: the release sweep in
`nemoclaw-maintainer-cut-release-tag` does NOT clear `wontfix` — that
label is also applied for non-skill reasons (scope, priority, dup
decisions made by maintainers), and sweeping it would erase human
triage work. Sweep stays scoped to `fixed-on-latest` and
`verify-inconclusive`, which are skill-only.

Add a `**Verification mode:**` line to the by-design comment template
that explicitly says "static analysis at the verified-on tag — no
runtime reproduction." This was the honesty an independent side-by-side
analysis of #2168 caught: the previous template implied runtime
confirmation even though Step 8.5 short-circuits Brev provisioning by
design.

Add a tag-anchoring rule above the template too — every `file:line`
cite must refer to the verified-on tag, not the maintainer's HEAD,
since line numbers drift between tags and main and we want comments
to stay reproducible by anyone reading them later.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…ion check

Bare `file:line` paths in comments force the reader to navigate
manually — that's a usability bug, not a stylistic preference. Comments
posted by the skill should be self-serving artifacts: a maintainer or
QA reader should be able to click any citation and land on the exact
content at the verified-on tag.

Extend the tag-anchoring rule above the by-design template into a
"tag-anchoring + linking rule" with concrete formats for file blobs,
file:line, file:line-range, commit SHAs, and test paths. Note that
bare `#NNNN` PR/issue references already auto-link in same-repo
comments.

Strengthen Step 8.5d into two passes: evidence (re-run cited commands)
and link (resolve at least one rendered link per evidence section via
`gh api .../contents/<path>?ref=<tag>` or `curl -fsI`). A 404 on a
citation suggests verification that didn't actually happen — worse
than no link at all. Bail to verify-inconclusive on any failure.

Surfaced while reviewing the rendered #2168 comment: the citations
were correct and tag-anchored but not clickable. Better caught now
than after the first batch run posts a wall of bare paths.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Discovered while applying the by-design label to issue #2168: the skill
referenced `wontfix` and `needs-info` as bare strings, but the repo's
canonical labels are `status: wont-fix` and `status: needs-info` (with
prefix and hyphen). Same shape in Step 3's issue-type skip list — issues
labelled `status: wont-fix` or `status: needs-info` were NOT being
filtered out at candidate-time as the spec intended, because the skip
list keys didn't match the live labels.

Three coordinated fixes:

1. Step 3 issue-type skip list now uses `status: wont-fix` and
   `status: needs-info`. Annotate the rule so future readers know not to
   substitute the bare forms.

2. Step 8.5 by-design action applies `status: wont-fix` (with quoting
   on the CLI). Step 12 log entry, session summary, frontmatter
   description, and Companion Behavior section in both the verify-stale
   skill and `nemoclaw-maintainer-cut-release-tag` updated to match.

3. Step 6.5 preconditions now verifies all expected labels exist on the
   repo with `gh label list` before any later step tries to apply them.
   This is the gap that bit us on #2168 — the comment posted but the
   label couldn't be applied because the spec name didn't match. Failing
   fast in 6.5 means the run aborts before any Brev cost or comment
   posting can happen against a misconfigured label vocabulary.

Bare `wontfix` / `won't fix` / etc. are still in Signal 1's detection
regex — that's intentional, those are SEARCH PATTERNS for what
maintainers might write in free-text comments, not label names.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…s for faithfulness

Two changes that landed before the first real Brev E2E run (#2007).

The previous wallclock cap split — 25 min default with a 60 min extension
keyed off "memory leak" / "over time" / "eventually" keywords — optimised
for the wrong constraint. Most issues fit comfortably under 60 min once
you account for two full install passes plus comprehensive resets, and
the keyword-based extension forced re-runs whenever a real install or
bootstrap took longer than the optimistic 25 min budget. Bump the cap
to a single 60 min default and drop the keyword extension; bugs that
genuinely require more than an hour fall out of v1 scope.

Add a new Step 8a.5 for bootstrapping reproducer dependencies that the
stock Brev image doesn't ship — local model servers (Ollama, vLLM),
provider runtimes, third-party CLIs. Default policy is maximum
faithfulness: install the actual dependency the reporter used rather
than substituting a stub. Substituting trades faithfulness for speed,
and on a 60 min budget that trade is rarely worth it; it almost always
introduces a confound that makes the verdict less trustworthy.

Document the canonical Ollama and vLLM bootstraps with specific
attention to daemon survival between `brev exec` calls (Ollama's
installer registers a systemd service; vLLM uses nohup + log file).
Bootstrap runs once before Step 8b's baseline and is reused for Step
8d's latest run — model downloads are expensive and external state, so
the comprehensive reset must explicitly leave them alone.

Carve out one substitution case: providers that require API keys the
skill cannot safely supply (NIM, OpenAI, Anthropic). Stubbing a key
defeats faithfulness anyway, and a real key shouldn't sit in a
verify-stale run. Substitution applies the same -30 LLM-synth penalty
as Step 8b and must be documented in the rendered comment.

Bootstrap failure escalates to Step 11 infra failure — do NOT silently
substitute, since the maintainer opted into faithfulness for a reason.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…get on every comment

Two requirements surfaced from the #2007 e2e run that should be enforced
by the skill, not by per-session memory or by the prompter remembering.

Mandatory reporter @-mention with confirmation language. The skill cannot
independently confirm a closed-as-fixed verdict — only the reporter
knows whether their original symptom is gone in their environment. The
@-mention is what converts a "skill says it is fixed" claim into
actionable confirmation work for QA. Add the explicit closing block
(canonical wording: "please confirm the symptom is gone on a recent
build and reopen with a fresh reproducer if you observe otherwise") to
all three Step 10 templates: fixed-on-latest, still-reproduces, and the
Step 8.5 by-design template. The previous "If this verification is
wrong, please reopen..." line was passive; the new line names the
reporter and asks them to act.

Length target. Default rendered comments to 400-500 words. The evidence
table or by-design fixed/vestigial sections are the hero; everything
else has to either change the reader's mind about the verdict or be
deleted. The first #2007 draft ran ~750 words and the user pushed back
explicitly; encoding the cap into Step 10 means future agents do not
start from scratch on length each run.

Surfaced from: e2e Brev verification run on issue #2007 (the first
real Brev-track exercise of the skill end to end).

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Six concrete gaps that surfaced during the first end-to-end Brev run
(issue #2007). Each almost produced a wrong verdict or a wasted run; each
is now encoded so the next agent doesn't have to re-discover.

1. Step 9 baseline-validation cap-removal rule was too permissive. The
   `unless commits-touched OR PR-mention also fires` escape hatch let
   inferred fix evidence override the absence of a baseline run, which
   produced a misleading 100/100 on #2007 despite zero baseline
   confirmation. Tightened: cap at 84 holds regardless of corroboration;
   PR-search signals raise the score within the cap, never past it.
   Step 10 now requires an explicit one-line caveat naming the cap and
   the reason in the rendered Verdict section.

2. Step 6.5 install URL default switched from `nemoclaw.nvidia.com`
   (NVIDIA-internal, doesn't resolve from Brev) to `www.nvidia.com/
   nemoclaw.sh` (public Akamai 301 redirect). Brev runs were dying at
   bootstrap on the wrong host.

3. Step 7 CPU SKU picker now biases by reproducer-implied memory needs.
   On #2007 the cheapest 2 GB SKU couldn't load a 4.8 GiB Ollama probe;
   onboard failed at provider validation and we burned ~25 min before
   re-provisioning a 16 GB box. Adds a `CPU_RAM_FLOOR` env var and a
   memory-floor heuristic (16 GB when reproducer references a model
   server, 8 GB for sandbox onboarding without a model, 4 GB for
   pure-CLI bugs).

4. Step 11 failure taxonomy now has a "baseline-build rot" bucket
   distinct from "binary install rot." Same `BASELINE_INSTALL_FAILED=1`
   flag and same downstream cap-and-degrade behavior, but failures at
   the in-image Dockerfile build phase (what we hit on v0.0.18 — the
   `.openclaw-data/workspace/media` symlink layer, removed entirely by
   #2227) get a separate label so reviewers can see *why* the old
   image no longer builds without re-running.

5. New Step 8a.5b documents the two non-obvious `brev exec` quirks that
   reproducer scripts have to handle every time: PATH does not include
   `~/.local/bin` in non-login shells (so reproducers must `export PATH`
   at the top, or callers must wrap with `bash -lc`); and the docker
   group requires `sg docker -c '...'` because adding the user via
   `usermod -aG` doesn't take effect within the same Brev session.

6. New Step 8d.5 architectural-drift check. When the diff between
   `$REPORTED_VERSION` and `$LATEST` touches the *tool* the reproducer's
   expected output depends on (e.g. `openshell forward` between v0.0.18
   and v0.0.35), don't trust the reproducer's surface alone — multi-axis
   verification on OS-level surfaces (host listeners, NAT rules, docker
   ports, SSH tunnels, etc.) is required before claiming fixed-on-latest.
   This is the five-axis pattern we used to confirm #2007 wasn't a
   false positive; codifies it as a check the skill applies whenever
   pickaxe shows the reproducer's tool was reworked.

Surfaced from: end-to-end Brev verification of issue #2007. Build the
skill, exercise it on a real issue, fix what breaks, repeat — every
fix here came from a concrete failure mode in one real run.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…ll host

Two small UX fixes surfaced from a thorough re-read.

Step 6.5's brev-auth error printed three numbered options without a
clear recommendation. From a non-TTY agent harness (Claude Code or any
unattended runner) the right answer is option 2 — open a separate
terminal, run `brev login`, complete the browser flow, come back. The
previous message buried that path inside option 2's inline comment.
Replace with a directive recipe that names the recommended flow first
(open separate terminal, browser auth, re-run the skill) and lists the
headless alternatives below as "when option 1 isn't available." Also
explicitly notes that credentials persist to ~/.brev/credentials.json,
so re-running the skill picks them up automatically.

Step 6.5's install-URL error still suggested checking
`https://nemoclaw.nvidia.com` even after the default was switched to
`https://www.nvidia.com/nemoclaw.sh` in commit 296f7cd. Update the
suggestion to match the new default and add an explicit
"then re-run this skill" so the maintainer knows the flow is
fix-and-retry, not abort.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Five gaps verified against the source during a thorough re-read pass.
The audit also caught two false-positive claims I'd made earlier
(`NEMOCLAW_INSTALL_TAG` env var and `comments` field in batch fetch)
that didn't survive verification — those are not landed because there
was nothing to fix.

Step 1 batch discovery query bumped from `--limit 100` to `--limit 1000`
so the skill doesn't silently drop issues beyond the page. The recent
candidate triage on this repo found 129 open bugs; the previous limit
would have lost 29 of them. The 20-issue per-run processing cap is
downstream of Step 3/4 filters and unaffected.

Step 3's 7-day comment-marker TTL had the rule but no implementation.
Add a portable `gh issue view --json comments --jq` snippet that
extracts marker comments newer than the cutoff. macOS and Linux
date(1) syntax differ for relative-date math, so the cutoff line
tries both. Without this, the first cron run after any posting will
silently re-verify everything previously marked, posting duplicate
comments every Monday.

Step 6.5 preconditions now checks gh identity via `gh api user --jq
.login` and prints the resolved login before any later step runs.
Comments posted by Step 10 land under this account; surfacing it
explicitly catches the multi-token / wrong-tab / mid-session-reauth
case before a public comment lands under the wrong handle.

Step 10 now requires a `**Verification mode:**` header line in every
template (was previously by-design only). Reader should never have to
guess whether a verdict came from real install logs or from static
analysis. Filled in concrete defaults for the standard
fixed/inconclusive template ("runtime reproduction; baseline + latest
installed and run") and the still-reproduces template ("runtime
reproduction; bug confirmed live").

Step 10 also requires a link-pass self-verification on every
template, not just by-design's Step 8.5d. The "Tag-anchoring + linking
rule" already declared every citation must be a clickable markdown
link to the verified-on tag, but the verification step that resolves
those links (`gh api .../contents/<path>?ref=<tag>` or `curl -fsI`)
was scoped only to the by-design path. Same 404-cite risk applies to
the standard template — broken citation links advertise verification
work that didn't happen, and that's worse than no citation at all.

Surfaced from: end-to-end audit after the #2007 e2e run, with the
specific `gh issue view`/`grep` commands run against the SKILL.md to
confirm each claim before listing it as a gap.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Two changes that together turn the per-run batch cap from policy into code.

Lower the per-batch processing cap from 20 to 15. Sequential execution
with 1-2 reused Brev boxes works fine for runs in this size range; ~2-3
hours wallclock for a full 15-issue batch is the comfortable budget
before either the per-plan approval gate breaks down or the maintainer
session needs to span multiple sittings.

Add the slicing code that actually enforces the cap. Previously Step 1
declared "Cap at N issues per run" as policy and nothing applied it —
candidate sets larger than the cap would just process to completion
silently. Move the slice to the end of Step 4 (after Step 3 label
filters and the Step 4 version+candidate-rule filters narrow the pool)
and sort by `(-versions_behind, -age_days)` so the most stale come
first. Spillover beyond 15 stays eligible for the next run via the
marker-comment 7-day TTL added in 8976176.

Single-issue mode bypasses the cap entirely; the maintainer named the
issue explicitly.

Cadence section updated from "≤20 issues" to "≤15 issues" to match.

Surfaced from the post-#2007 audit: of the 13 gaps initially called
out, the cap-enforcement one was operational toil (skill could chew
through cost on a runaway batch). Lowering and enforcing closes that
toil path.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Looked at the 24 candidates surfaced by the recent triage run by *kind*
of bug rather than by version, and found six gaps where the standard
rubric would produce a wrong verdict or hang. All six landed in this
commit.

Step 3 platform skip: drop `Platform: Jetson AGX Thor/Orin`. Brev has
no equivalent silicon (embedded/edge ARM with integrated GPU is not
in the SKU catalog), so any Brev verification of a Jetson-only bug
produces a misleading "fixed-on-x86" verdict. `Platform: DGX Spark`
and `Platform: GB10` stay in scope but Step 10 now requires a
hardware-substitution caveat naming the Brev SKU we substituted.

Step 3 TUI / interactive-UI skip: drop issues whose title contains
TUI / dashboard UI / chat UI / keystroke / key press, or whose body
describes interactive UI behavior without a non-interactive reproducer.
`brev exec` does not allocate a real TTY by default; TUI reproducers
hang or silently fail at the first prompt. v1 documents this as
out-of-scope; v1.1 may add a script(1) / expect / tmux harness.

Step 5 now classifies bug-class in addition to CPU/GPU. Four classes:
`performance`, `rebuild-cycle`, `log-only`, `functional` (default).
Detection heuristics from the issue body (latency thresholds, "across
rebuilds" / "after restart", "see lots of error in <X> log"). Each
class routes to a different Step 8 rubric.

Step 8b: log-scraping. When `BUG_CLASS=log-only`, also pull
`~/.openclaw/logs/*.log` and `/var/log/nemoclaw/*.log` from inside the
sandbox after the reproducer runs and search them for the issue's
symptom phrase. Some bugs describe symptoms in internal log files,
not the reproducer's stdout; the previous match rubric only checked
the transcript.

Step 8b: flake-detection retry. For functional bugs, run baseline
three times if the first run shows the symptom inconsistently. Mixed
results (1 or 2 of 3 reproduce) trigger a "flake suspected" caveat,
−25 score adjustment, and downgrade `+50 latest clean` to `+25` so
a lucky-clean-latest-run on an intermittent bug doesn't silently
become a fixed-on-latest verdict.

New Step 8e: performance-bug verification. Multi-run latency
distribution rubric — N=10 runs each side, compute p50/p90, parse
SLA from issue body, match latest's distribution against the SLA
rather than against a symptom phrase. Cap at 60 unless the bug is
silicon-independent because Brev SKUs aren't faithful to DGX Spark
or GB10 hardware for performance-shape bugs.

New Step 8f: rebuild-cycle verification. Run-rebuild-rerun harness
for bugs that only manifest across destroy/recreate boundaries
(#2701-shape). Capture artifacts pre-rebuild, trigger
`nemoclaw destroy --all --force && nemoclaw onboard`, re-capture
post-rebuild, diff to determine whether the artifact persisted as the
issue expects.

Step 10 mandatory hardware-substitution caveat: when DGX Spark or
GB10 is in the issue's labels and Step 7 substituted with a different
silicon class, the comment metadata block must name the substitution
explicitly so silicon-shape bugs get the right reader expectation.

Surfaced from: stress-testing the skill against the actual shapes of
bugs in the 24-candidate set, not just against the bugs we already
verified.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Six gaps surfaced from the second e2e run (issue #2592) and the broader
audit of how the skill handles provider classification.

Step 5 now classifies provider in addition to CPU/GPU and bug-class.
Detection signals from labels and body keywords identify NIM, Gemini,
Anthropic, Bedrock, or Ollama. When the reproducer references a
non-Ollama provider AND actually exercises inference, the skill stops
at Step 5 and prompts the maintainer interactively before any Brev
cost — three options: provide the API key, accept Ollama substitution
with the -30 penalty, or skip to verify-inconclusive. Pure-CLI /
pure-sandbox bugs are exempt because the provider doesn't matter.

Step 6 reproducer-extraction regex extended to also match `openclaw`
and `openshell` invocations as anchor words. Issue #2592's reproducer
was `openclaw channels add telegram` run inside the sandbox; the
previous `nemoclaw`-only regex would have missed the verbatim block.

Step 8a now passes the provider env vars through to install.sh's
bundled onboard step, so it doesn't fall back to the default `build`
(NIM) provider and fail with the misleading NVIDIA_API_KEY missing
error. NEMOCLAW_PROVIDER=ollama (the default case) makes the bundled
onboard use the local Ollama set up in Step 8a.5; NVIDIA_API_KEY
only flows through if the maintainer provided one at Step 5's prompt.
The bundled onboard creates a throwaway sandbox that gets destroyed
before the reproducer runs. The reproducer's own onboard should pass
`--fresh` so a half-built install-sandbox doesn't trip the "previous
session failed" guard.

Step 8a.5 now has an Ollama-coverage table making explicit which bug
classes Ollama covers faithfully (CLI, sandbox, networking) vs which
ones it doesn't (provider-specific behavior, model-specific behavior,
silicon-dependent perf). Step 5's prompt keys off this table.

Step 8a.5b documents the openshell-sandbox-exec syntax footgun. The
correct form is `openshell sandbox exec -n NAME -- CMD`; the wrong
form silently auto-detects the sandbox by "last used" and stuffs the
leftover positional into bash's $0. Same section gets a brev-exec
re-execution guard via a sentinel file at ~/.verify-stale-running to
prevent the double-onboard scenario when SSH drops mid-run.

Step 11 reframes baseline-build-rot as the dominant failure mode for
any reported version >5-7 patches behind, not an edge case. Both
Brev e2e runs hit it. Cap-at-84 with reporter @-mention is the modal
verdict shape, not the exception — pre-flight carries more weight
than baseline runtime evidence for older bugs.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…sandbox name

Two fixes from a rot-debugging investigation that nearly produced a
wrong conclusion.

When testing whether old NemoClaw versions actually rot under the
literal end-user invocation (`https://www.nvidia.com/nemoclaw.sh`),
the first test run was botched by a shell-scoping mistake:
`NEMOCLAW_INSTALL_TAG=v0.0.26 curl ... | bash` scopes the env var to
curl, not to the downstream bash reading from the pipe. The bash side
defaulted to `latest`, resolved to v0.0.36, and the install proceeded
for several minutes producing convincing-looking output before the
mistake surfaced. The install script had no log line saying which
version it actually resolved, so the silent failure looked like a
working install of v0.0.26.

Skill-side fix: after each install (Step 8a baseline and Step 8d
latest), echo the resolved `nemoclaw --version` and case-match it
against the requested version. Mismatch in baseline sets
BASELINE_INSTALL_FAILED=1 to prevent verifying against the wrong
version. Mismatch in latest logs a WARN that gets surfaced in the
final comment. Cheapest possible guard against the class "I asked for
X, got Y" — and the principle generalizes: print the resolved state,
never trust the requested state.

Also fix the sandbox name in the install.sh-passthrough block: change
`NEMOCLAW_SANDBOX_NAME=__verify_stale_install__` (rejected by NemoClaw's
name validator: "Allowed format: lowercase, starts with a letter,
letters/numbers/internal hyphens only, ends with letter/number") to
`NEMOCLAW_SANDBOX_NAME=verify-stale-install`. Surfaced during the #2519
e2e run; net-zero impact on that run because install.sh fell through
on name validation rather than reaching the buggy code path, but the
spec was wrong.

The deeper context for the resolved-version-echo fix: when the rot
hypothesis was being tested, an independent agent's logical analysis
correctly identified that the public install URL DOES support
`NEMOCLAW_INSTALL_TAG=<ref>` and demanded an empirical re-test. The
re-test (with the env var on the bash side of the pipe) installed
v0.0.26 correctly and STILL hit `Patch 4 (replaceConfigFile EACCES)
not applied` — so the cap-at-84 framing held empirically, just not
for the original "git-clone is the only path" reason. The full
investigation is documented in findings.md Part 6's closing section
("Closing investigation: is baseline-build rot real, or our setup
artifact?").

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Add a Step 3 skip rule for issues where a MEMBER/OWNER/COLLABORATOR
comment was posted within the last 7 days AND the original reporter
hasn't replied since. The skill is for *stale* issues; actively-
discussed ones are in-flight, and posting a verify-stale verdict on
top of an open clarifying question from a maintainer would conflict
with their framing and confuse the reporter (now they have two
simultaneous asks).

Surfaced during pre-flight on issue #2757. The maintainer @cjagwani
had just commented questioning the bug's premise — specifically that
the reporter's "kill -9 took the parent down" framing didn't line up
with how the gateway is launched (`nohup ... &` detaches it) — and
asked the reporter to confirm what they actually observed on the
Station. Running verify-stale would have posted a "by-design, close
as wontfix" verdict on top of an active "wait, did this really
happen?" exchange. Wrong move; the skill should detect this state
and skip.

Implementation reuses the SEVEN_DAYS_AGO cutoff already computed for
the marker-TTL check, fetches the reporter via `gh issue view --json
author`, then jq-walks the comments array: find the most recent
maintainer comment within the cutoff, check whether any reporter
comment exists after that timestamp; if maintainer commented recently
AND no reporter reply since, skip.

Single-issue mode prints the skip reason and exits friendlily so the
maintainer knows why the skill bailed. Batch mode just moves on.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
prekshivyas and others added 5 commits May 7, 2026 17:06
…argv

The post-#2592 commit (1d625dd) added a Step 5 prompt asking the
maintainer to "Export NVIDIA_API_KEY=<key>" and propagate it via
`brev exec`. During the #2604 e2e run that played out as
`NVIDIA_API_KEY=<value> brev exec ...` — which puts the literal key
in the brev exec process's argv, visible in `ps -ef` to anyone with
shell access on either the maintainer's laptop or the Brev box for
the duration of the run. The skill's stated promise ("never logged")
got violated by argv visibility.

Fix: switch propagation to file-based. Maintainer writes the key to
~/.nvidia-api-key with 600 perms on their laptop; Step 6.5 brev-copy's
the file to the Brev box (encrypted SSH); install / reproducer scripts
inside the box read with `NVIDIA_API_KEY=$(cat ~/.nvidia-api-key)`,
which sets the env var in the script's own process — never on a
command line and never visible in `ps -ef`.

Update Step 5's option-1 prompt to instruct the maintainer to use the
file form and to `rm ~/.nvidia-api-key` after the run. Add a new
"API-key propagation pattern" section after Step 5 that documents the
file-based mechanism explicitly. Update Step 8a (baseline install) and
Step 8d (latest install) to source the key from the file inside the
Brev box's exec context, not from the local shell's env.

Note in the docs: if the key was previously propagated via the cmdline
(pre-fix), treat it as exposed and rotate. The #2604 run did this; the
maintainer was reminded to rotate the NIM key after the run.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Replace Step 10's narrow "Length target" rule with a richer "Comment
authoring principle" block that captures lessons accumulated across
the e2e runs.

The core rule: every section in a rendered comment must either change
a reader's mind about the verdict or be cut. Word counts follow from
that — 500 is a ceiling, not a target. Most comments land in 200-400.
Simple cases land under 200.

Document why this matters: comments posted by the skill compete for a
maintainer's attention against every other in-flight thread on the
repo, and AI-slop prose actively reduces signal-to-noise.

Add a worked-examples block citing two real iterations:
- #2007 first draft was 750 words; cut to 371.
- #2604 took THREE drafts before settling on 190 words because each
  draft padded with prose that didn't ground the verdict — which
  caused the verdict itself to drift. Rule learned: name the verdict
  in one sentence first, then cut any section that doesn't support it.

Add per-verdict length targets:
- fixed-on-latest: 200-400 words
- wontfix (by-design): 250-500 words
- verify-inconclusive: 100-200 words
- Still-reproduces (no label): 30-80 words — and no transcripts
  (issue body has them), no @-mention (reporter knows), no
  architectural prose. One sentence + marker.

Add an explicit "cut, by default" list naming the patterns that
historically padded comments without changing verdicts.

Generalizes the existing memory note into the skill body so future
agents reading SKILL.md inherit the principle directly.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
… ceiling

Followup to fe363cf. The 200-400 / 250-500 ranges in the per-verdict
table were still too permissive. Tighten to 200-300 for both
fixed-on-latest and wontfix; verify-inconclusive stays at 100-200;
still-reproduces stays at 30-80.

Update the principle preamble: "300 is a hard ceiling for the main
verdicts." Simple cases land under 200. If a draft is past 300, it's
padding — cut before re-reading.

Memory note (feedback_verify_stale_comment_length.md) updated to match.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…d-question variant

Previously skipped any issue with a maintainer comment in the last 7 days the
reporter hadn't replied to. After 7 days the question is no longer "active
discussion" — it's a stuck thread, and the skill can be the unsticking voice
rather than a clueless interruption.

Step 3: classifies the most-recent-unanswered-maintainer-comment as recent
(within 7d → skip) or stale (>7d → proceed with variant), exporting
UNANSWERED_MAINT_LOGIN/URL/DATE for the templater.

Step 10: new mandatory block describing the variant — prepend an
"@<maint>'s comment from <date> is still unanswered" lead paragraph and
swap the closing reporter-only @-mention for a dual maintainer+reporter
@-mention that flags the open question.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Layer 1 (local → Brev) was fixed in 22a3997 via brev copy. Layer 2
(on-box subshell) was unaddressed: scripts running on the Brev box
that called sg docker -c "...$NVIDIA_API_KEY..." re-introduced argv
exposure because the outer double-quoted heredoc interpolates the
key value into the inner shell's argv, visible in ps -ef on the box
for the onboard duration.

Surfaced during #2611 e2e run when the just-rotated key landed
verbatim in `sg docker -c` argv. The right pattern: escape the $
so the inner shell evaluates `cat ~/.nvidia-api-key` itself.

The skill section now documents both layers with concrete WRONG /
RIGHT side-by-side, and generalizes the rule to any command-string-
taking invocation (sg, bash -c, su -c, ssh).

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@prekshivyas prekshivyas removed the Platform: Brev Support for Brev deployment label May 8, 2026
prekshivyas and others added 3 commits May 7, 2026 18:35
Three skill paragraphs used `[label](placeholder)` syntax to describe the
shape of a markdown link the templater would render. The link checker
parses bracket-paren syntax everywhere except fenced code blocks (single
backticks aren't enough), so `(<url>)` and `(<UNANSWERED_MAINT_URL>)`
were flagged as broken local links.

Restructure: fenced code blocks for the literal template line in Step 10,
plain prose in Step 3 referencing the Step 10 template instead of
duplicating the link shape inline.

Verified locally with `bash test/e2e/e2e-cloud-experimental/check-docs.sh
--only-links --local-only` — passes cleanly.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@prekshivyas prekshivyas marked this pull request as ready for review May 8, 2026 01:47
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.agents/skills/nemoclaw-maintainer-cut-release-tag/SKILL.md:
- Around line 135-139: The shell loop that pipes issue numbers into xargs can
run a bogus gh issue edit when input is empty; update the pipeline that iterates
labels (the for label in fixed-on-latest verify-inconclusive; do ... gh issue
edit {}) to make xargs safe by adding the -r option (or otherwise guard on
non-empty input) so xargs will not run when there are no issue numbers, and
apply this change in the original docs source that generates the .agents/skills
markdown (not the generated SKILL.md).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5ce3439c-9759-45a0-a42b-14bbd316dfca

📥 Commits

Reviewing files that changed from the base of the PR and between d0cbe7c and 3bc1b8c.

📒 Files selected for processing (3)
  • .agents/skills/nemoclaw-maintainer-cut-release-tag/SKILL.md
  • .agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md
  • .agents/skills/nemoclaw-skills-guide/SKILL.md

Comment thread .agents/skills/nemoclaw-maintainer-cut-release-tag/SKILL.md
When no open issues carry `fixed-on-latest` or `verify-inconclusive`,
the previous pipeline still ran `gh issue edit` once with empty stdin,
producing a noisy failure. `xargs -r` skips the invocation entirely
when input is empty.

CodeRabbit suggestion (PR #3063 review). The skill is hand-authored
(not autogenerated from docs/), so the fix lands directly in
.agents/skills/.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
prekshivyas and others added 2 commits May 7, 2026 20:17
…rogressive disclosure

Per Anthropic's skill-authoring best practices guide, SKILL.md body should
stay under 500 lines so it doesn't dominate context once loaded. The
verify-stale SKILL.md was 1505 lines — 3× the target.

Single split at the natural decision-vs-execution boundary (between
Step 6.7 "try local first" and Step 7 "reuse or provision Brev"):

  SKILL.md (497 lines): Steps 1-6.7 — mode, latest detection, candidate
  filter, version parsing, environment classification, reproducer
  extraction, preconditions, local-first short-circuit. The decision tree
  for whether and how to verify.

  reference/execution-and-comment.md (1042 lines, with TOC): Steps 7-12
  + Cadence + Out of Scope + Companion. Brev provisioning, baseline +
  latest verification, by-design detection, scoring, comment templates,
  redaction, infra failure, activity log. The implementation detail
  loaded only when a Brev run is committed to.

The reference file's TOC at the top satisfies the >100-line reference
file rule from the doc; single-level depth (no nested references)
satisfies the "keep references one level deep" rule.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
@.agents/skills/nemoclaw-maintainer-verify-stale/reference/execution-and-comment.md:
- Around line 52-53: The current RUNNING computation counts all verify-stale-*
instances from INSTANCES; update the jq selection to count only instances whose
name starts with "verify-stale-" AND whose status indicates they are running
(e.g., .status == "RUNNING") so the concurrency cap checks only active boxes;
modify the expression that sets RUNNING (the jq filter applied to INSTANCES) to
include the running-status predicate.
- Around line 316-318: The reproducer is executed via brev exec with a plain
non-login bash which can miss ~/.local/bin and Docker group context; update the
two brev exec invocations that run "bash ~/reproducer.sh" (both the baseline and
latest run occurrences) to invoke the script through a login shell and wrapped
with sg docker -c (or otherwise explicitly export/include $HOME/.local/bin in
PATH) so the ~/.local/bin and docker group membership are available during
execution, then tee the output to the same
baseline-transcript.log/latest-transcript.log as before.
- Around line 767-768: The credential-redaction regex currently uses escaped
pipes so it matches literal "|" instead of alternation; update the pattern
`(?i)(token\|secret\|password\|api[_-]?key\|bearer)[^\n]*[:=][^\n]*` to use
actual alternation (e.g.
`(?i)(?:token|secret|password|api[_-]?key|bearer)[^\n]*[:=][^\n]*`) and consider
adding optional whitespace around `[:=]` or word boundaries if needed to
reliably catch `token=...`, `password: ...` and `api_key = ...`; change the
regex where referenced so functions/filters that use this pattern (the
credential redaction rule) use the corrected expression.

In @.agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md:
- Line 3: Update the description string in SKILL.md to use the canonical label
syntax: replace the occurrence of "status wont-fix" with "status: wont-fix" so
it matches the rest of the skill set and avoids operator confusion; edit the
description field near the top of the file (the `description:` paragraph) to
make this single-word change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2127a531-9c7d-4192-9fed-193e68fa3de8

📥 Commits

Reviewing files that changed from the base of the PR and between e48e573 and e0eab59.

📒 Files selected for processing (2)
  • .agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md
  • .agents/skills/nemoclaw-maintainer-verify-stale/reference/execution-and-comment.md

Comment thread .agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md Outdated
…Project 199

Adds tracker integration: when the verdict is fixed-on-latest, the skill
also moves the issue to "Needs Review" on the NemoClaw Development
Tracker (NVIDIA Project 199), so the maintainer's review queue picks
it up for confirmation. After the maintainer closes the issue, existing
project automation (or a manual move) advances it to Done. Other
verdicts (wontfix, verify-inconclusive, no-label-still-reproduces) do
not move the tracker — they have separate close paths.

Constants captured at write time (re-run gh project field-list 199
--owner NVIDIA --format json if the project drifts):
  Project ID:           PVT_kwDOABpemM4BSCP5
  Status field ID:      PVTSSF_lADOABpemM4BSCP5zg_r9p8
  Needs Review opt ID:  5c5922a9

Step 6.5 gains a one-line precondition that warns (does not fail) when
the gh CLI lacks the 'project' scope, so the auth gap surfaces before
Step 10 instead of mid-run. Step 12's activity-log template gains a
Tracker: row recording the move outcome.

The tracker step is gated to fixed-on-latest only — wontfix and
verify-inconclusive verdicts do not benefit from "Needs Review" since
they're either final-state (wontfix) or different review flow
(verify-inconclusive); attempting to move them would create churn.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (4)
.agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md (1)

3-3: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use canonical label spelling in the description.

Line 3 says status wont-fix, while the documented/actual label is status: wont-fix. Keeping the canonical string avoids operator mistakes when applying labels manually.

Suggested fix
-description: Verify whether old NVIDIA/NemoClaw bug reports still reproduce against the latest tag. Picks candidate issues opened against older versions, runs the reproducer locally first when possible (Linux or macOS), otherwise reuses or provisions a Brev Linux box (CPU or GPU), detects behavior that was intentionally changed, scores confidence, and posts an evidence-backed comment with a label (fixed-on-latest, status wont-fix, or verify-inconclusive). Tag-only — never auto-closes. Brev verification is Linux-only in v1; Windows and integration-token-dependent issues are skipped. Trigger keywords - verify stale, verify fixed, reproduce on latest, stale issue, old bug, fixed-on-latest, status wont-fix, verify-inconclusive, drain backlog, brev verify.
+description: Verify whether old NVIDIA/NemoClaw bug reports still reproduce against the latest tag. Picks candidate issues opened against older versions, runs the reproducer locally first when possible (Linux or macOS), otherwise reuses or provisions a Brev Linux box (CPU or GPU), detects behavior that was intentionally changed, scores confidence, and posts an evidence-backed comment with a label (fixed-on-latest, status: wont-fix, or verify-inconclusive). Tag-only — never auto-closes. Brev verification is Linux-only in v1; Windows and integration-token-dependent issues are skipped. Trigger keywords - verify stale, verify fixed, reproduce on latest, stale issue, old bug, fixed-on-latest, status: wont-fix, verify-inconclusive, drain backlog, brev verify.

Based on learnings: .agents/skills/nemoclaw-maintainer-* files are hand-authored source-of-truth and should be reviewed directly.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md at line 3, Update
the canonical label spelling in the SKILL.md description: replace the token
"status wont-fix" with the exact label "status: wont-fix" so the documented
labels (alongside "fixed-on-latest" and "verify-inconclusive") match the actual
label string operators use; edit the description line in
.agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md to make this
single-word correction.
.agents/skills/nemoclaw-maintainer-verify-stale/reference/execution-and-comment.md (3)

52-53: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Count only running boxes for the concurrency gate.

Line 52 currently counts all verify-stale-* instances, but the comment and guard are for running capacity. This can block new runs even when capacity exists.

Suggested fix
-  RUNNING=$(echo "$INSTANCES" | jq '[.[]? | select(.name | startswith("verify-stale-"))] | length')
+  RUNNING=$(echo "$INSTANCES" | jq '[.[]? | select(.name | startswith("verify-stale-") and .status == "RUNNING")] | length')
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
@.agents/skills/nemoclaw-maintainer-verify-stale/reference/execution-and-comment.md
around lines 52 - 53, The current RUNNING count uses INSTANCES with a jq
selector that matches all verify-stale-* names; change the jq filter used when
building RUNNING so it only counts instances whose name
startswith("verify-stale-") AND whose status indicates they are currently
running (e.g., .status.state == "running" or the equivalent field in your
instance objects), leaving the if [ "$RUNNING" -ge 4 ] guard unchanged; update
the jq expression referenced by RUNNING to include that running-state predicate.

316-318: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reproducer runs should use docker-group + login-shell wrapper.

Lines 317, 364, and 389 execute bash ~/reproducer.sh directly. That misses the documented docker group activation and can miss ~/.local/bin PATH setup, causing false inconclusive outcomes.

Suggested fix
-brev exec "$INSTANCE_NAME" "bash ~/reproducer.sh" 2>&1 | tee ./baseline-transcript.log
+brev exec "$INSTANCE_NAME" "sg docker -c 'bash -lc \"export PATH=\\\"$HOME/.local/bin:$PATH\\\"; bash ~/reproducer.sh\"'" 2>&1 | tee ./baseline-transcript.log
...
-brev exec "$INSTANCE_NAME" "bash ~/reproducer.sh" 2>&1 | tee ./baseline-transcript-2.log
+brev exec "$INSTANCE_NAME" "sg docker -c 'bash -lc \"export PATH=\\\"$HOME/.local/bin:$PATH\\\"; bash ~/reproducer.sh\"'" 2>&1 | tee ./baseline-transcript-2.log
...
-brev exec "$INSTANCE_NAME" "bash ~/reproducer.sh" 2>&1 | tee ./latest-transcript.log
+brev exec "$INSTANCE_NAME" "sg docker -c 'bash -lc \"export PATH=\\\"$HOME/.local/bin:$PATH\\\"; bash ~/reproducer.sh\"'" 2>&1 | tee ./latest-transcript.log

Also applies to: 363-365, 388-390

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
@.agents/skills/nemoclaw-maintainer-verify-stale/reference/execution-and-comment.md
around lines 316 - 318, Replace direct invocations of "bash ~/reproducer.sh" in
the brev exec calls with the documented docker-group + login-shell wrapper so
the docker group is activated and the user's login PATH (including ~/.local/bin)
is sourced; specifically update each brev exec "$INSTANCE_NAME" "bash
~/reproducer.sh" occurrence (and any identical occurrences around the earlier
shown lines) to call the wrapper that activates the docker group then launches
the reproducer inside a login shell.

767-769: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Redaction regex alternation is escaped and won’t match intended secrets/hosts.

Line 767 and Line 768 use \| where alternation is intended. That matches literal pipes instead of alternatives, weakening redaction and risking secret/hostname leakage.

Suggested fix
-| 8 | `(?i)(token\|secret\|password\|api[_-]?key\|bearer)[^\n]*[:=][^\n]*` | Inline credentials in env/config/log output |
-| 9 | `\b\w+\.(nvidia\.internal\|nv-internal\.com\|nvidia\.dev)\b` | Internal hostnames (extend list per team) |
+| 8 | `(?i)(?:token|secret|password|api[_-]?key|bearer)[^\n]*[:=][^\n]*` | Inline credentials in env/config/log output |
+| 9 | `\b\w+\.(?:nvidia\.internal|nv-internal\.com|nvidia\.dev)\b` | Internal hostnames (extend list per team) |
#!/bin/bash
python3 - <<'PY'
import re
bad_cred = re.compile(r'(?i)(token\|secret\|password\|api[_-]?key\|bearer)[^\n]*[:=][^\n]*')
good_cred = re.compile(r'(?i)(?:token|secret|password|api[_-]?key|bearer)[^\n]*[:=][^\n]*')
samples = ["token=abc", "password: hunter2", "api_key = nvapi-xyz"]
print("credential pattern:")
for s in samples:
    print(s, "bad=", bool(bad_cred.search(s)), "good=", bool(good_cred.search(s)))

bad_host = re.compile(r'\b\w+\.(nvidia\.internal\|nv-internal\.com\|nvidia\.dev)\b')
good_host = re.compile(r'\b\w+\.(?:nvidia\.internal|nv-internal\.com|nvidia\.dev)\b')
hosts = ["svc.nvidia.internal", "x.nv-internal.com", "api.nvidia.dev"]
print("\nhost pattern:")
for h in hosts:
    print(h, "bad=", bool(bad_host.search(h)), "good=", bool(good_host.search(h)))
PY
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
@.agents/skills/nemoclaw-maintainer-verify-stale/reference/execution-and-comment.md
around lines 767 - 769, The redaction regexes use escaped pipe sequences (e.g.
"(?i)(token\|secret\|password\|api[_-]?key\|bearer)[^\n]*[:=][^\n]*" and
"\b\w+\.(nvidia\.internal\|nv-internal\.com\|nvidia\.dev)\b") which match
literal '|' instead of alternation; replace them with proper alternation and
prefer non-capturing groups: use
"(?i)(?:token|secret|password|api[_-]?key|bearer)[^\n]*[:=][^\n]*" for inline
credentials and "\b\w+\.(?:nvidia\.internal|nv-internal\.com|nvidia\.dev)\b" for
internal hosts so the patterns in the file (the two regex strings) correctly
match secrets and hostnames.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In
@.agents/skills/nemoclaw-maintainer-verify-stale/reference/execution-and-comment.md:
- Around line 52-53: The current RUNNING count uses INSTANCES with a jq selector
that matches all verify-stale-* names; change the jq filter used when building
RUNNING so it only counts instances whose name startswith("verify-stale-") AND
whose status indicates they are currently running (e.g., .status.state ==
"running" or the equivalent field in your instance objects), leaving the if [
"$RUNNING" -ge 4 ] guard unchanged; update the jq expression referenced by
RUNNING to include that running-state predicate.
- Around line 316-318: Replace direct invocations of "bash ~/reproducer.sh" in
the brev exec calls with the documented docker-group + login-shell wrapper so
the docker group is activated and the user's login PATH (including ~/.local/bin)
is sourced; specifically update each brev exec "$INSTANCE_NAME" "bash
~/reproducer.sh" occurrence (and any identical occurrences around the earlier
shown lines) to call the wrapper that activates the docker group then launches
the reproducer inside a login shell.
- Around line 767-769: The redaction regexes use escaped pipe sequences (e.g.
"(?i)(token\|secret\|password\|api[_-]?key\|bearer)[^\n]*[:=][^\n]*" and
"\b\w+\.(nvidia\.internal\|nv-internal\.com\|nvidia\.dev)\b") which match
literal '|' instead of alternation; replace them with proper alternation and
prefer non-capturing groups: use
"(?i)(?:token|secret|password|api[_-]?key|bearer)[^\n]*[:=][^\n]*" for inline
credentials and "\b\w+\.(?:nvidia\.internal|nv-internal\.com|nvidia\.dev)\b" for
internal hosts so the patterns in the file (the two regex strings) correctly
match secrets and hostnames.

In @.agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md:
- Line 3: Update the canonical label spelling in the SKILL.md description:
replace the token "status wont-fix" with the exact label "status: wont-fix" so
the documented labels (alongside "fixed-on-latest" and "verify-inconclusive")
match the actual label string operators use; edit the description line in
.agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md to make this
single-word correction.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5c3d6f16-ab32-45e7-ad23-cf3a13a02591

📥 Commits

Reviewing files that changed from the base of the PR and between e0eab59 and 3cb23b0.

📒 Files selected for processing (2)
  • .agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md
  • .agents/skills/nemoclaw-maintainer-verify-stale/reference/execution-and-comment.md

The previous Step 7 cleared fixed-on-latest and verify-inconclusive from
every open issue at every release cut. That re-verifies labels that
were freshly applied yesterday — wasting Brev cost and creating noise.

Replace with a three-tier decision cascade:

1. If Project 199 status == Done → skip clear (maintainer accepted the
   verification; label stays until the issue closes).
2. Else if marker comment is >10 days old → clear (stale; reporter
   never confirmed in the review window).
3. Else if any PR merged since the marker date touches the paths the
   comment cited in `Relevant changes since v0.0.X` → clear (regression
   risk).
4. Else → skip clear (verification still holds).

Closed issues are not iterated (--state open filter).

Requires the `project` scope on gh for the Done-state lookup; falls
back to the time + regression checks alone (with a warning) when the
scope is missing.

Companion to verify-stale's Step 10 tracker integration. Together
they form a coherent verification lifecycle: skill applies label +
moves to Needs Review → maintainer reviews and either accepts (moves
to Done) or lets it sit → release sweep keeps Done-accepted labels,
clears stale or regression-risk ones.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.agents/skills/nemoclaw-maintainer-cut-release-tag/SKILL.md:
- Line 202: The git command that sets REGRESSED uses a hardcoded path (git -C
~/NemoClaw ...) which can fail if the repo is elsewhere; change it to use the
current repo or a configurable path: replace the hardcoded -C ~/NemoClaw with
either -C "$REPO_PATH" (add a REPO_PATH variable set earlier) or remove -C to
run from the current directory, keeping MARKER_DATE and PATHS unchanged so the
lookup still uses those variables.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b400dbe8-8762-4da3-9883-9aab71550ed7

📥 Commits

Reviewing files that changed from the base of the PR and between 3cb23b0 and 9b434bd.

📒 Files selected for processing (1)
  • .agents/skills/nemoclaw-maintainer-cut-release-tag/SKILL.md

Comment thread .agents/skills/nemoclaw-maintainer-cut-release-tag/SKILL.md Outdated
prekshivyas and others added 6 commits May 8, 2026 14:40
10 days felt too aggressive for a real maintainer review window —
weekends, holidays, and across-team handoffs commonly need 2 weeks.
14 days lines up with the natural sprint boundary and gives the
reporter + maintainer enough room before re-verification kicks in.

Updated the decision cascade table and the AGE_DAYS check in lockstep.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five findings from coderabbit's second-pass review, all valid:

F1 (Major) — execution-and-comment.md: concurrency cap counted ALL
  verify-stale-* boxes, not just RUNNING ones, so prior stopped boxes
  could falsely block provisioning. Add the same .status == "RUNNING"
  filter the reuse query above already uses.

F2 (Major) — execution-and-comment.md: the two `brev exec
  bash ~/reproducer.sh` calls (baseline + latest) didn't export PATH,
  so non-login shells could miss ~/.local/bin where the nemoclaw
  binary lives. Wrap both with explicit PATH export. (sg docker stays
  the script's responsibility — double-wrapping nests quoting.)

F3 (Critical) — execution-and-comment.md: credential redaction
  patterns 8 and 9 were inside a markdown table where `|` is the
  column delimiter. The table escaped them as `\|` to preserve the
  table structure, but `\|` in regex matches a LITERAL pipe character,
  not alternation. The redaction pattern
  `(?i)(token\|secret\|password\|api[_-]?key\|bearer)` therefore
  silently failed to match `token=...`, `password: ...`,
  `api_key = ...`. Replaced the table with a fenced code block so
  patterns can use real alternation.

F4 (Minor) — verify-stale SKILL.md: frontmatter description used
  "status wont-fix" (no colon) while the rest of the doc uses the
  canonical "status: wont-fix". Fixed in both the verdict-label list
  and the trigger-keywords list.

F5 (Major) — cut-release-tag SKILL.md: regression check used
  `git -C ~/NemoClaw log ...` which assumes a specific checkout
  location. Step 1's prerequisite already requires the maintainer
  to be inside the NemoClaw repo, so dropping `-C ~/NemoClaw` runs
  from the current directory.

Local check-docs.sh and markdownlint-cli2 pass. SKILL.md is still
exactly 500 lines after F4.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(1) Step 3 — unanswered-question variant should require a question,
not just a comment. Surfaced when @wscurran's templated triage ack
("✨ Thanks for reporting…") fired the variant on #1642; the rendered
"their comment from N days ago is still unanswered" lead made it
sound like there was a pending question when nothing was actually
asked. Extend the maintainer-comment selector with a question-shape
content check: comment body must contain `?`, a polite imperative
(please confirm/share/provide/clarify/tell/verify/check/let me know),
or a question starter (could/can/would you, do you have/know/see/use).
Pure acknowledgments fall through to the standard template.

(2) Step 8d — enforce OpenShell version pinning after the latest
checkout. Surfaced on #1642's run: v0.0.6 setup phase installed
OpenShell 0.0.37 (whichever was current), but v0.0.38's blueprint.yaml
caps max_openshell_version at 0.0.36 — onboard preflight then refuses
to run. The skill now reads MAX_OS from latest's blueprint.yaml after
checkout. If the installed binary is newer than the cap, install-
openshell.sh refuses to downgrade, so the skill force-installs the
capped version directly from the OpenShell GitHub release. If the
installed binary is at or below the cap, install-openshell.sh runs
normally to bring it forward.

Both fixes lint clean. SKILL.md is now 499 lines (under the 500
ceiling). reference/execution-and-comment.md grew by ~30 lines for
the OpenShell-pin block.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Long-running verifications race with maintainers closing issues
independently. #2513 and #2519 both closed mid-batch — @jyaunches
posted their own comprehensive verification before our run got to
the comment phase. The skill kept going and would have appended a
redundant comment after the close.

Add a pre-post state check: refresh the issue state right before
gh issue comment runs. If closed, apply the label tag-only (the
maintainer's close comment is now the authoritative record) and skip
both the comment and the Project 199 move.

Surfaced from findings2.md gap item 18.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Step 3's issue-type skip list used exact-match `enhancement`, but the
repo has both bare `enhancement` and 8 prefixed variants (enhancement:
feature, enhancement: MCP, enhancement: testing, enhancement: ui,
enhancement: provider, enhancement: platform, enhancement: policy,
enhancement: inference, enhancement: integration, enhancement:
performance, enhancement: skill). The prefixed forms slipped through
the filter — surfaced from #1752 (Gemini 3 Flash Preview), labeled
`enhancement: feature`. The reporter was asking whether the model
variant was supported, which is correctly an enhancement request.

Updated the skip clause prose to spell out the prefix-match rule and
list the variants. Implementations should match exact `enhancement`
OR any label starting with `enhancement:`.

Surfaced from findings2.md gap item 36 (added this run).

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement: skill Improvements to NemoCall repository hygiene or user functionality with skills.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants