feat(skills): add nemoclaw-maintainer-verify-stale by prekshivyas · Pull Request #3063 · NVIDIA/NemoClaw

prekshivyas · 2026-05-05T21:30:21Z

Summary

Adds maintainer skill nemoclaw-maintainer-verify-stale — automates the manual loop of: spin up a Brev box → install latest NemoClaw → try to reproduce an old bug → comment with findings. Tag-only (never auto-closes); a maintainer pulls the trigger after reviewing.

The skill has been driven end-to-end against 9 real candidates during development, surfacing the gaps and skill changes summarised below.

Issues tackled

Issue	Reported	Verified	Verdict	Score	What this run surfaced (skill change)
#2791	v0.0.21	v0.0.35	`wontfix-by-design`	—	By-design prototype. Step 8.5 detection framework (3 signals: maintainer attribution / removal commit / symbol absent). Maintainer reviewed the analysis and closed the issue manually.
#2168	v0.0.21	v0.0.35	`wontfix-by-design` (tagged)	—	Repo-label vocabulary mismatch (`status: wont-fix`, not `wontfix`); markdown-linked citations in comments.
#2007	v0.0.18	v0.0.35	`fixed-on-latest` (tagged)	84/100	Docker-group `sg`, `brev exec` PATH gotcha, sandbox-build-rot reframe, multi-axis architectural-drift check.
#2592	v0.0.28	v0.0.35	`fixed-on-latest` (tagged)	84/100	Sentinel-file SSH-drop guard, `openshell sandbox exec -n NAME --` syntax footgun, reproducer-extraction regex extension (`openclaw\|openshell`).
#2519	v0.0.26	v0.0.35	`fixed-on-latest` (closed by maintainer mid-run)	84/100	Step 5 provider classification + API-key prompt, install.sh env-var passthrough, sandbox-name validator.
#2604	v0.0.28	v0.0.36	`fixed-on-latest` (tagged)	84/100	File-based API key propagation (kill argv exposure local→Brev). 300-word hard ceiling for comments + comment-authoring principle in Step 10.
#2611	v0.0.28	v0.0.36	`fixed-on-latest` (tagged)	70/100	On-box subshell argv leak (extend file-based pattern to Brev→inner shell). First non-cap-at-84 score (the −30 LLM-synth penalty pushed raw below the cap).
#2513	v0.0.26	—	maintainer self-verified mid-batch	—	Gap-queue: skill should re-check `state == OPEN` before posting. Second mid-batch close-by-maintainer (alongside #2519) → promote from queue to a real Step 1/3 check.
#2166	v0.0.21	—	still-reproduces (source-only)	—	Gap-queue: when pre-flight is conclusive, allow source-only verdicts to skip Brev entirely (`Verification mode: static analysis at <tag>`).

Effective touched count: 7 issues with skill-rendered comments + 2 side-discoveries. Two of the seven landed wontfix-by-design, five landed fixed-on-latest. Sandbox-build rot capped four of the five at 84 (the modal verdict shape with reporter @-mention). #2611 was the first run to land below the cap, validating the −30 LLM-synth penalty path.

v1 scope


In	Linux only (Brev), `bug` issues against CLI / Sandbox / OpenShell / Docker / Getting Started / Ubuntu / DGX Spark / GB10
Out	macOS, Windows/WSL, integration bugs needing third-party tokens, service-account bots, auto-close

Verification flow (one-line per step)

Step	Purpose
6.5 — Preconditions	Brev auth + install-URL reachability fail-fast (no Brev cost on dead deps)
6.7 — Local-first	Pure-CLI bugs (no sandbox/Docker/GPU) run on the maintainer's local install — same evidence, zero cost
7 — Reuse-or-provision	`brev ls` → reuse a `verify-stale-*` box if running, else `brev create` with a runtime-picked CPU SKU
8 — Validate-on-baseline, verify-on-latest	Install reported version first, confirm the reproducer actually exposes the bug, reset, then install latest
8.5 — By-design detection	Three signals (maintainer attribution / removal commit / symbol absent) short-circuit Brev cost on intentional removals
9 — Confidence scoring	+50 / +25 / +25 / −30 / −50, clamped to [0,100]. ≥85 silent · 60–84 + reporter @-mention · <60 inconclusive
10 — Comment + label	Comprehensive redaction (JWT / NVAPI / PAT / base64 / internal hostnames / emails) with HTML→text pre-pass for QA bodies. 300-word hard ceiling for fixed/wontfix; 30–80 for still-reproduces.
11 — Infra failure	Sandbox-build rot is the dominant failure mode for any version >5–7 patches behind; cap-at-84 with reporter @-mention is by design
12 — Activity log	Append per-issue + per-session entries to `~/development/daily-rhythm/activity/nemoclaw-verify-stale-log.md`

Idempotency

Every posted comment carries . Candidate filter excludes any issue whose comments match  within the last 7 days OR carries fixed-on-latest / verify-inconclusive / status: wont-fix. Release sweep clears the first two on each tag cut.

Active-discussion handling (two-tier)

Maintainer's unanswered comment age	Skill behaviour
≤ 7 days	Skip — running on top of fresh discussion would conflict with the maintainer's framing
> 7 days	Proceed with the unanswered-question variant: lead with a markdown link to the unanswered comment, dual @-mention of both maintainer and reporter

Companion changes

nemoclaw-skills-guide/SKILL.md — adds the new skill to the maintainer skills table.
nemoclaw-maintainer-cut-release-tag/SKILL.md — release-time sweep of fixed-on-latest and verify-inconclusive from every open issue. Without it, "latest" drifts and verifications go silently stale. Verification records stay in comments; only labels reset.

CI status

Check	Status
commit-lint, dco-check, installer-hash-check, legacy-path-guard, pr/changes	✅
markdown-links	✅ after `3abffd72` (placeholder `[label](url)` examples in SKILL.md were being parsed as real broken links — restructured as fenced-block templates) and `dc5d7143` (merge of main brought in the deletion of `docs/get-started/brev-web-ui-quickstart.md` per #3050, resolving the merge-side missing-file failure)

Test plan

Signed-off-by: Prekshi Vyas prekshiv@nvidia.com

Summary by CodeRabbit

Documentation
- Added a post-release verification step that conditionally clears verification labels on open issues so they’re re-evaluated against the new release; wont-fix status and prior verification comments are preserved.
- Added a comprehensive “verify stale” maintainer workflow and execution reference (batch processing limits, idempotency markers, verification/rubric templates, infra and comment-handling guidance).
Chores
- Updated maintainer skill and role counts to include the new verify-stale skill.

Adds a maintainer skill that automates verifying whether old bug reports still reproduce against the latest NemoClaw release. The skill picks candidate issues opened against older versions, reuses or provisions a Brev Linux box (CPU or GPU based on the bug profile), runs the extracted reproducer, scores confidence, and posts an evidence-backed comment with either a `fixed-on-latest` or `verify-inconclusive` label. Tag-only — never auto-closes. v1 scope is intentionally narrow: Linux only, no third-party integration credentials, no service-account bot. Each maintainer runs it under their own gh credentials. Companion changes: - Adds the new skill to the maintainer table in `nemoclaw-skills-guide` (count 7 -> 8 maintainer skills, 18 -> 19 total). - Adds Step 7 to `nemoclaw-maintainer-cut-release-tag` that sweeps `fixed-on-latest` and `verify-inconclusive` labels off all open issues at release time, so the next verify-stale run re-evaluates against the new latest. Verification records remain in comment history; only the labels are reset. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

copy-pr-bot · 2026-05-05T21:30:25Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-05-05T21:30:28Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

This PR adds a new nemoclaw-maintainer-verify-stale skill to the NemoClaw maintainer skills catalog and expands the release workflow with a post-release label cleanup step that removes verification-related labels from open issues, allowing the next verification run to re-evaluate them against the new release.

Changes

NemoClaw Skills Catalog and Release Workflow

Layer / File(s)	Summary
Skill Catalog Entry `.agents/skills/nemoclaw-skills-guide/SKILL.md`	Introduces `nemoclaw-maintainer-verify-stale` skill to the maintainer skills table, increments maintainer skill bucket count from 7 to 8, and updates Roles table to reflect 19 total maintainer skills.
Verify-Stale Skill `.agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md`	Adds a tag-only maintainer skill with front-matter, single/batch modes, latest-tag discovery, candidate filtering, reported-version parsing, classification, reproducer extraction, preflight checks, and a local-first short-circuit; hooks to Steps 7–12 reference.
Verify-Stale Execution Reference `.agents/skills/nemoclaw-maintainer-verify-stale/reference/execution-and-comment.md`	Adds Steps 7–12: box reuse/provisioning, two-pass baseline→latest validation, dependency bootstrapping and environment quirks, architectural-drift checks, performance/rebuild rubrics, by-design branch, scoring, comment templates/redaction, infra handling, and activity logging.
Release Workflow Integration `.agents/skills/nemoclaw-maintainer-cut-release-tag/SKILL.md`	Adds Step 7: conditional sweep of `fixed-on-latest` and `verify-inconclusive` labels on open issues after release tagging, excluding `status: wont-fix`, using Project 199 status when available, marker-age (10+ days), or regression-touch checks against paths cited in the verify marker.

Sequence Diagram(s)

sequenceDiagram
  participant Maintainer
  participant GitHub
  participant VerifyStale
  participant Brev
  Maintainer->>GitHub: discover candidate issues & latest tag
  GitHub-->>Maintainer: return issues, labels, comments
  Maintainer->>VerifyStale: request filtering/prioritization
  VerifyStale->>Brev: provision or reuse verification box
  Brev-->>VerifyStale: run baseline then latest reproducer
  VerifyStale-->>Maintainer: deliver transcripts, score, verdict

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 New skills hop into the spread,
A verify-stale step clears the bed,
Labels swept clean after tags are made,
Old marks stay in comments, history saved,
Fresh checks awaken on the repo's glade.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title 'feat(skills): add nemoclaw-maintainer-verify-stale' clearly and specifically summarizes the main change—introducing a new maintainer skill named nemoclaw-maintainer-verify-stale.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/verify-stale-skill

Warning

Review ran into problems

🔥 Problems

Timed out fetching pipeline failures after 30000ms

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Three flaws surfaced when dry-running the Step 3 candidate filter against the live open-issue list (141 open bugs). 1. Step 2 used `gh release view` to detect "latest". NemoClaw tags but does not publish GitHub releases, so this returned empty. Added a fallback that picks the highest semver from `git ls-remote --tags`, which is the load-bearing path today. 2. Step 3 said "2 or more minor versions behind". Wrong vocabulary for `0.0.x` repos — patch is what's iterating. Reworded to "at least 2 versions behind in the rightmost-incrementing component", with a concrete `0.0.x` example and a forward-compat note for when NemoClaw moves to `0.1.x`. 3. Step 4 specified a naive `v?0\.0\.\d+` regex that matched IP-like strings in bug bodies (Ollama bind `0.0.0.0:11434`, loopback `127.0.0.1`, etc.) and produced phantom v0.0.0 / v0.0.1 candidates. Tightened to require `v` prefix + word boundaries first, with a context-anchored fallback that only matches `0.0.X` on lines containing `nemoclaw` or `version`. Added a clamp that rejects any parsed version greater than `$LATEST` (catches roadmap labels like `v0.0.35` from being treated as "reported on"). After these fixes the dry-run produces 33 plausible candidates out of 141 open bugs, with credible reported-version assignments verified against issue bodies. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Two more issues surfaced from a deeper second-pass dry-run. 1. The Step 4 regex was hardcoded to v0.0.x. NemoClaw is on that line today, but the skill should keep working when it moves to v0.1.x or higher. Generalized to `\bv\d+\.\d+\.\d+\b` with a context-anchored `\b\d+\.\d+\.\d+\b` fallback on lines containing nemoclaw or version (so the loose form doesn't match `0.0.0.0:11434` and `127.0.0.1`). 2. Some bug reports cite NemoClaw versions that were never tagged (e.g. `v0.1.0` × 3 issues, calver `2026.3.11` × 1). Without a tag existence check the skill would point Brev verification at a version that does not exist. Added a `git ls-remote --tags` validation pass that drops the version if no matching tag is found. Also added an implementer note documenting a real failure mode caught during the dry-run: a naive `[scan(primary)] | first | .[0] | tonumber // [scan(fallback)] | ...` pipeline silently dropped 9 valid candidates because `null | first` errors when scan returns empty, and the `//` chain did not propagate cleanly to the fallback. The SKILL.md regex itself was correct — the failure was implementation-level fragility in how the two passes were composed. The note tells implementers to bind each pass to a named variable, coalesce at the end, and explicitly test the empty-match path. After these changes the live dry-run produces 47 plausible candidates (out of 141 open bugs) and the 4 residual drops are all reporter typos that cite a version that was never released — correctly unverifiable. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Third dry-run round surfaced a precision issue. The previous regex matched any `\bv\d+\.\d+\.\d+\b` anywhere in the body, then validated against the tag list. That picked up versions belonging to *other* products that share the v0.0.x line — most often OpenShell, but also Node.js (v22.x.x), and other dependencies. Tag validation then correctly rejected the non-NemoClaw ones, but on bodies where multiple products were listed (e.g. `Environment: openshell 0.0.4, nemoclaw 0.1.0, Node.js v22.16.0`) the parser would land on OpenShell's tag-valid v0.0.4 and report it as the NemoClaw version, producing a false-positive candidate. 12 such collisions exist in the current backlog. Fix: require the version to follow `nemoclaw` (case-insensitive) within 80 non-letter, non-newline characters. The anchored regex `(?i)nemoclaw[^a-z\n]{0,80}v?(\d+\.\d+\.\d+)` matches `NemoClaw v0.0.32`, `nemoclaw 0.0.28`, `- NemoClaw: v0.0.16`, but not `openshell 0.0.4` followed later by `nemoclaw 0.1.0`. Combined with the existing tag-list validation, this collapses the four error classes — typos, calver dates, roadmap labels, and product collisions — into one "version must be a real NemoClaw release tag, mentioned next to NemoClaw" check. Also expanded the implementer note to cover three concrete failure modes hit during the dry-run: - Empty-match handling (`[]` flowing into `first | .[0]` errors). - Capture-group inconsistency between branches in a parser pipeline. - Variable-scoping bug in `select($tags | index(.))` where `.` rebinds to `$tags` and silently passes invalid labels. After this change the live dry-run produces 33 candidates out of 141 open bugs — same count as the very first round, but every candidate's parsed version is now anchored to NemoClaw and validated against the real tag list. The earlier wider counts (47, 48) included roughly 12-15 issues that would have been wasted Brev runs. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

…test A clean run on latest is ambiguous on its own: - Scenario A: bug really got fixed (script worked, latest passes) - Scenario B: script was junk all along (script never triggered the bug; latest passes for the same reason baseline would) These look identical from the latest pass alone, which means the previous flow could land `fixed-on-latest` on a script that never had a chance — a false positive that the +25 commit and +25 PR signals only partially mitigate. This change adds a baseline pass before the latest pass: - Step 8a: install reported version on the box. - Step 8b: run reproducer, compare output to issue's actual-result description. Match = script validated. - Step 8c: if no match (script silent OR errored on the wrong thing), synth-repro using the issue body PLUS the baseline transcript so the LLM can react to the actual failure mode. Apply -30. Retry. Still no match -> verify-inconclusive, skip latest entirely. - Step 8d: install latest, run validated reproducer. Score from this. A single trigger drives synthesis: "the script didn't expose the bug on baseline." Whether it was silent or noisy doesn't matter; both mean the script needs work. This collapses the previous separate "narrative-only -> synth" and "verbatim-failed -> ???" paths into one rule. Step 6 simplified accordingly: just extract verbatim if available, otherwise carry the issue body forward to Step 8b for on-demand synthesis. No more "give up before provisioning" branch — that decision moves into Step 8c where it has more context to work with. Step 9 adds a baseline-validation gating rule. If the reported- version install fails (old releases rot — installer URLs, deps, OS images drift), the score is capped at 84 unless commit-area or PR- mention evidence also fires. That forces the @-mention-reporter band when we lack independent corroboration, which is the honest position when we couldn't validate the script ourselves. Step 11 split into two failure types: - Latest-install fail or harness error -> hard infra failure (no label, optional comment, move on). - Baseline-install fail -> degraded mode (skip baseline gate, run latest anyway, apply Step 9 cap). Wallclock budget bumped from 15 to 25 minutes per verification to accommodate two installs. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Tier-1 (blocking for E2E): - Add the still-reproduces-on-latest path. If the latest run output matches the issue symptom, the bug is still live — the skill posts a "still reproducible" comment with both transcripts, applies no label, and includes a dated marker ``. No new label per user direction; a 7-day TTL on the marker handles re-verification on the next weekly cron. - Update the comment template for the two-pass flow. The previous template described a single transcript; the new one has Baseline and Latest sections plus a dedicated still-reproduces template, and the idempotency marker now carries today's date. - Update the Step 12 log template to record baseline-install, baseline-match, latest-install, and latest-result independently. - Clarify in Step 4 that REPORTED_VERSION must be the full tag string ("v0.0.32"), not just the patch number (32). Step 8a's installer expects the full tag. Tier-2: - Spell out an explicit match rubric in Step 8b: exit code agreement, symptom-phrase match (LLM-judged semantic equivalence counts), and distinguishing the bug from infra noise (DNS / auth / rate limits). - Replace the minimal `rm -rf ~/.nemoclaw` reset with a comprehensive one that stops nemoclaw/openshell processes, removes spawned Docker containers, frees common ports (8080, 18789, 9119), and removes the installed binaries and lib paths. Run before each install in 8a and 8d; idempotent so it's safe on a fresh box. - Add interactive-subcommand handling in 8b: auto-detect `nemoclaw onboard` / `configure` and try `--non-interactive`, then `--dangerously-skip-prompts`, then stdin pre-feed. Fall through to Step 8c (synth) if none work. Tier-3 / opportunistic: - Fix the actual installer command (gap 9). The installer accepts the target ref via `NEMOCLAW_INSTALL_TAG` env var, NOT a `--version` flag (verified against install.sh source). Updated 8a accordingly. - Update Step 3 idempotency: drop on either label OR a marker comment within the last 7 days. Previously a marker excluded the issue forever; the release sweep only clears labels, so the marker alone made re-verification impossible. The 7-day TTL also handles the still-reproduces case naturally. - Extend wallclock budget to 60 min when issue body contains time-sensitive keywords (`after N minutes`, `eventually`, `memory leak`, etc.). Hard ceiling at 60 — bugs that require hours fall out of v1 scope. - Keep provisioned boxes for 30 minutes on verify-inconclusive outcomes so a maintainer can `brev shell` in and triage. Reused boxes always stay. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Five concrete fixes from the pre-E2E gap analysis, plus a verified finding on gap 2 that lowered its risk profile. Gap 1 — sudo behavior. The reset script's `sudo rm -f` and `sudo rm -rf` could hang on a password prompt if the Brev image's default user lacks passwordless sudo. Switched all sudo calls to `sudo -n` (non- interactive) so they fail fast instead. Added a precondition note explaining the assumption: Brev stock images allow passwordless sudo, custom images may not. The user-local install path (~/.nemoclaw) is fully reset regardless of sudo state, so the worst case is a stale /usr/local/bin/nemoclaw symlink that the next install overwrites. Gap 2 — verified the install architecture supports old-version pinning natively. The bootstrap clones the requested ref and runs that tag's own install.sh (with a payload-marker fallback for legacy ones), so NEMOCLAW_INSTALL_TAG=v0.0.32 works architecturally. Risk reduces to "does v0.0.32's own installer still work on a 2026 OS image" — handled by Step 11's degraded-mode fall-through. No SKILL.md change needed. Gap 4 — path extraction was hand-waved. The +25 commits-touched-area weight referenced `git log v<reported>..$LATEST -- <path>` without specifying how to determine `<path>`. Added a three-tier extraction procedure: stack-trace path mentions parsed from the body and mapped to repo paths, then a component-label-to-directory map (NemoClaw CLI → bin/, Sandbox → src/lib/sandbox/, etc.), then title-keyword fallbacks. If none yields a path, skip the +25 signal entirely rather than guessing — floating the weight would inflate scores meaninglessly. Gap 5 — PR-search query was hand-waved. The +25 PR-mention weight had no concrete query. Added two-stage search: first `gh pr list --search "$ISSUE_NUMBER"` filtered for actual `#NNNN` references in body or title, then a symptom-phrase fallback if direct reference returns nothing. PRs that merged before the issue was filed are excluded via `mergedAt > tag-date(REPORTED_VERSION)`. Gap 6 — issues without an "Actual result" section had no match path. Behavioral / configuration bugs (e.g., #1242 "should default to a stable released version") describe a wrong default rather than a runtime error. Added a fallback to Step 8b's match rubric: use the title + full body as the symptom, match if reproducer's outcome contradicts the issue's expected behavior. If neither error string nor expected-behavior contradiction can be identified, route to Step 8c (synth-repro) so the LLM produces a more diagnostic script. Gap 7 — transcript redaction was thin. Step 10 only covered tokens, paths, and basic-auth URLs. Real transcripts leak more: Authorization headers, JWTs, GitHub PATs, AWS keys, NVIDIA API keys, internal hostnames, emails (PII), long base64 blobs. Expanded the redaction table with concrete patterns for each. Added an order note: longest/most-specific patterns first so the generic base64 catchall doesn't mask what was actually redacted. Made it explicit that redaction runs on every quoted chunk in the comment, not just issue- body excerpts. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

…l SKU pick Preflight Brev auth and install-URL reachability before paying any cost (Step 6.5). For pure-CLI bugs with no sandbox, Docker, or GPU dependency, try the reproducer locally on the maintainer's `nemoclaw` install before provisioning a Brev box (Step 6.7) — same evidence, zero cost. Replace the `<your-team's-CPU-SKU>` placeholder in Step 7 with a runtime pick via `brev search cpu --sort price --json | jq` so the skill keeps working as SKUs change, with a `VERIFY_STALE_CPU_TYPE` env override for teams that need to pin. Drop the spurious `--yes` flag from `brev delete` — it does not exist on that subcommand and would error. `brev delete` is non-interactive by default. Print the instance name before the trap registers so manual cleanup is possible if the trap doesn't fire. Thread `INSTALL_URL` through Step 8a/8d so the URL override from Step 6.5 applies to both the baseline and latest installs. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Some bugs are filed against behavior that was deliberately removed or changed in a merged PR. Running the standard rubric on these produces a misleading verdict — the symptom "still reproduces" but the right answer is "won't fix, see PR #X." Issue #2791 is the prototype: `config set` was removed in #2227, the reporter tested a version that already had it gone, and the standard rubric would have buried that context under a low-confidence `verify-inconclusive` label. Add Step 8.5 with three independent detection signals: - Maintainer-attribution comment phrasing ("removed in #N", "by design", "wontfix", "intentional") from MEMBER/OWNER/COLLABORATOR authors. - A removal commit between reported version and `$LATEST` whose subject matches `\b(remove|delete|drop|deprecate)\b` and whose diff deletes the symbol implicated by the reproducer. - The implicated symbol absent from both reported version and `$LATEST`, meaning it was already gone when the issue was filed. When any signal fires, skip Step 9 scoring and Brev provisioning, apply a new `wontfix-by-design` label, and post a focused comment that links to the responsible PR. Detection on signals 2 and 3 can run as soon as the reported version is parsed, saving Brev cost on the entire class. Generalize the Step 3 idempotency marker regex from `v1` to `v\d+` so future skill versions can re-verify older-marked issues by tightening the regex to require a specific marker version. Add `wontfix-by-design` to the idempotency label list, the activity log entry options, the session summary, and the release sweep in `nemoclaw-maintainer-cut-release-tag`. Update the skill's frontmatter description to mention the new label and the local-first short-circuit. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

…var log path The issue body that comes back from `gh issue view` for NV QA bugs is HTML, not markdown — `<pre>...</pre>` blocks and tables, with HTML-encoded entities. The previous Step 6 only mentioned `<pre>` in prose without an extractor that actually parsed it. Add a Python extractor that handles markdown fences (triple-backtick and tilde), HTML `<pre>`, and entity unescape in one pass, so verbatim reproducers from QA-filed issues land cleanly instead of falling through to LLM synthesis with a -30 penalty. Step 10 already had a comprehensive redaction regex table, but the patterns assumed plain text — tokens nested in HTML tags or attributes slipped through unredacted. Add an HTML to text pre-pass for issue-body excerpts before the regex table runs, so quoted excerpts from QA bodies get the same redaction coverage transcripts already do. Step 5's GPU keyword list flagged `inference` and `model serving`, both of which false-positive on CPU bugs (`models.providers.inference.baseUrl` is a config path, not a GPU need). Tighten to whole-word matches and swap in `vllm`, `tensorrt`, `L40S`, `L4`, `T4`. Step 12's activity log was hard-coded to one maintainer's personal organizer path. Read from `VERIFY_STALE_LOG_DIR` with the existing path as default, and require the skill to create the directory rather than assuming it exists, so the skill works in CI / shared-volume / other environments. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Path-extraction map in Step 9 was written from assumption rather than verification — `src/lib/sandbox/`, `src/lib/openshell/`, and `src/lib/policy/` don't exist in the current repo, and OpenShell lives in a separate repo entirely. The +25 commits-touched-component signal would have silently misfired on every Sandbox / OpenShell / policy issue, underscoring real fixes. Replace with paths verified against the current tree (`nemoclaw/src/blueprint/`, `nemoclaw-blueprint/`, `nemoclaw/src/commands/`, etc.) and explicitly note the cross-repo OpenShell case is out of scope for v1. Step 8 reset wiped `~/.nemoclaw` but not `~/.openclaw`. Sandbox state has been writable-by-default under `.openclaw` since #2227, so it persists across the baseline → latest reinstall and contaminates the latest run. Add it to the reset. Anchor the `pkill -9 -f nemoclaw|openshell` patterns to a leading slash (`/nemoclaw`, `/openshell`) so the kill matches actual installed paths but not unrelated processes that mention these strings — including the agent harness running this skill if its working directory contains the word. Reorder the Step 10 redaction table to match the stated "longest, most specific patterns first" rule. The previous order had the generic `(token|secret|password|...)` pattern executing before JWT/AWS/NVIDIA patterns, which would have masked specific-token redactions with generic ones — losing the signal of *what kind* of credential leaked. Replace `git ls-remote git@github.com:NVIDIA/...` (Steps 2 and 4) with `gh api repos/NVIDIA/NemoClaw/tags --paginate --jq '.[].name'`. The SSH path required keys to be configured; `gh api` reuses whatever auth the user already has. Add a CLI-deps precondition check (`command -v gh brev jq python3 curl`) to Step 6.5 so the skill fails fast if any required dependency is missing, instead of failing late and confusingly mid-run. Step 11's keep-box-on-inconclusive said "delay cleanup by 30 minutes" but had no implementation. A backgrounded `sleep && brev delete` doesn't survive session end. Replace with skip-the-trap-entirely plus an explicit manual-cleanup reminder printed to the run output. Out-of-Scope was contradicting itself on macOS — Step 6.7 explicitly allows macOS for local-first runs. Clarify: Brev path is Linux-only, the local-first short-circuit on a maintainer's laptop works on macOS for manual single-issue runs. Add a `local (no Brev — Step 6.7 short-circuit)` option to the Step 12 log entry's Box field so local-first runs round-trip the activity log correctly. Frontmatter description said "latest release" but NemoClaw uses tags, not GitHub Releases — fix to "latest tag" to match Step 2. Step 4's implementer note said "Two real failure modes" but listed three; fix the off-by-one. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

…-verification Step 8.5 was producing comments that were correct in direction but hand-wavy on substance. A side-by-side against an independent analysis of #2168 surfaced the gap: that analysis cited specific file:line locations, separated the literal bug-as-filed from a related failure mode that still exists on latest, called out existing CI coverage of the new workflow, and self-corrected an overstatement when prompted to re-verify. Today the skill required none of that. Restructure Step 8.5 into substeps so the rigor is mechanical, not prompt-dependent: - 8.5a: signal detection now requires verifiable evidence per signal — comment URL + quoted phrase (signal 1), commit SHA + diff range (signal 2), grep commands + outputs (signal 3). Add a sub-case for vestigial deprecation shims so a stub doesn't silently fail signal 3 by appearing in latest. - 8.5b: pre-check for related failure modes. Grep latest for the bug's symptom keywords (not the removed symbol) and require the comment to call out anything that surfaces. This is the section the side-by-side identified as missing — saying "the bug as filed can't reproduce" is not the same as "every bug shaped like this is fixed." - 8.5c: check existing test coverage for the new workflow and cite up to three test paths if found. Strengthens the comment from "trust me, it was removed" to "the new workflow is exercised by these tests." - 8.5d: explicit self-verification pass — re-run every cited command before composing the comment; bail to verify-inconclusive on any discrepancy. LLMs confidently overstate; mechanical re-verification catches it without needing a human prompt. Replace the by-design comment template with one that has mandatory sections matching the substeps: "What's structurally fixed", "Vestigial references", "What's not literally the same bug", "Existing CI coverage", "Recommendation". Each section either has concrete content or is explicitly omitted — no hand-wavy claims slip through. Add NVBugs cross-reference extraction to Step 4 (`grep -oE '\[NVB#[0-9]+\]'`) so the by-design template can append the standard "NVBugs#NNNNNNN will need a separate update; closing this GitHub issue won't propagate" reminder when the issue body carries that footer (most NV QA bugs do). Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

The previous signal-2 candidate query narrowed by commit subject (`remove|delete|drop|deprecate`) before checking whether the diff deleted the implicated symbol. That filter excludes the common case where a removal lands inside a `refactor(...)` or `feat(...)` commit — e.g., PR #2227 removed `--dangerously-skip-permissions` under a "refactor(sandbox): default to mutable config" subject. A literal follower of the previous rule would have concluded signal 2 doesn't fire on issue #2168 even though the flag was deliberately removed. Switch the primary candidate lookup to `git log -S<symbol>` pickaxe search, which finds every commit whose diff changes the count of the symbol regardless of subject wording. Keep the subject-keyword query as a supplementary lookup for narrowing when the pickaxe returns many commits. Note explicitly that the commit's actual subject doesn't need to mention removal. Surfaced while testing the new Step 8.5 rigor against #2168. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

…horing Two changes that surfaced from running Step 8.5 against issue #2168. Use the existing repo `wontfix` label for the by-design path instead of creating a new `wontfix-by-design` label. The repo already has `wontfix` and it's already in Step 3's issue-type skip list, so applying it gives idempotency for free without proliferating labels. The by-design verdict is encoded in the marker comment, not the label vocabulary. Important consequence: the release sweep in `nemoclaw-maintainer-cut-release-tag` does NOT clear `wontfix` — that label is also applied for non-skill reasons (scope, priority, dup decisions made by maintainers), and sweeping it would erase human triage work. Sweep stays scoped to `fixed-on-latest` and `verify-inconclusive`, which are skill-only. Add a `**Verification mode:**` line to the by-design comment template that explicitly says "static analysis at the verified-on tag — no runtime reproduction." This was the honesty an independent side-by-side analysis of #2168 caught: the previous template implied runtime confirmation even though Step 8.5 short-circuits Brev provisioning by design. Add a tag-anchoring rule above the template too — every `file:line` cite must refer to the verified-on tag, not the maintainer's HEAD, since line numbers drift between tags and main and we want comments to stay reproducible by anyone reading them later. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

…ion check Bare `file:line` paths in comments force the reader to navigate manually — that's a usability bug, not a stylistic preference. Comments posted by the skill should be self-serving artifacts: a maintainer or QA reader should be able to click any citation and land on the exact content at the verified-on tag. Extend the tag-anchoring rule above the by-design template into a "tag-anchoring + linking rule" with concrete formats for file blobs, file:line, file:line-range, commit SHAs, and test paths. Note that bare `#NNNN` PR/issue references already auto-link in same-repo comments. Strengthen Step 8.5d into two passes: evidence (re-run cited commands) and link (resolve at least one rendered link per evidence section via `gh api .../contents/<path>?ref=<tag>` or `curl -fsI`). A 404 on a citation suggests verification that didn't actually happen — worse than no link at all. Bail to verify-inconclusive on any failure. Surfaced while reviewing the rendered #2168 comment: the citations were correct and tag-anchored but not clickable. Better caught now than after the first batch run posts a wall of bare paths. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Discovered while applying the by-design label to issue #2168: the skill referenced `wontfix` and `needs-info` as bare strings, but the repo's canonical labels are `status: wont-fix` and `status: needs-info` (with prefix and hyphen). Same shape in Step 3's issue-type skip list — issues labelled `status: wont-fix` or `status: needs-info` were NOT being filtered out at candidate-time as the spec intended, because the skip list keys didn't match the live labels. Three coordinated fixes: 1. Step 3 issue-type skip list now uses `status: wont-fix` and `status: needs-info`. Annotate the rule so future readers know not to substitute the bare forms. 2. Step 8.5 by-design action applies `status: wont-fix` (with quoting on the CLI). Step 12 log entry, session summary, frontmatter description, and Companion Behavior section in both the verify-stale skill and `nemoclaw-maintainer-cut-release-tag` updated to match. 3. Step 6.5 preconditions now verifies all expected labels exist on the repo with `gh label list` before any later step tries to apply them. This is the gap that bit us on #2168 — the comment posted but the label couldn't be applied because the spec name didn't match. Failing fast in 6.5 means the run aborts before any Brev cost or comment posting can happen against a misconfigured label vocabulary. Bare `wontfix` / `won't fix` / etc. are still in Signal 1's detection regex — that's intentional, those are SEARCH PATTERNS for what maintainers might write in free-text comments, not label names. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

…s for faithfulness Two changes that landed before the first real Brev E2E run (#2007). The previous wallclock cap split — 25 min default with a 60 min extension keyed off "memory leak" / "over time" / "eventually" keywords — optimised for the wrong constraint. Most issues fit comfortably under 60 min once you account for two full install passes plus comprehensive resets, and the keyword-based extension forced re-runs whenever a real install or bootstrap took longer than the optimistic 25 min budget. Bump the cap to a single 60 min default and drop the keyword extension; bugs that genuinely require more than an hour fall out of v1 scope. Add a new Step 8a.5 for bootstrapping reproducer dependencies that the stock Brev image doesn't ship — local model servers (Ollama, vLLM), provider runtimes, third-party CLIs. Default policy is maximum faithfulness: install the actual dependency the reporter used rather than substituting a stub. Substituting trades faithfulness for speed, and on a 60 min budget that trade is rarely worth it; it almost always introduces a confound that makes the verdict less trustworthy. Document the canonical Ollama and vLLM bootstraps with specific attention to daemon survival between `brev exec` calls (Ollama's installer registers a systemd service; vLLM uses nohup + log file). Bootstrap runs once before Step 8b's baseline and is reused for Step 8d's latest run — model downloads are expensive and external state, so the comprehensive reset must explicitly leave them alone. Carve out one substitution case: providers that require API keys the skill cannot safely supply (NIM, OpenAI, Anthropic). Stubbing a key defeats faithfulness anyway, and a real key shouldn't sit in a verify-stale run. Substitution applies the same -30 LLM-synth penalty as Step 8b and must be documented in the rendered comment. Bootstrap failure escalates to Step 11 infra failure — do NOT silently substitute, since the maintainer opted into faithfulness for a reason. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

…get on every comment Two requirements surfaced from the #2007 e2e run that should be enforced by the skill, not by per-session memory or by the prompter remembering. Mandatory reporter @-mention with confirmation language. The skill cannot independently confirm a closed-as-fixed verdict — only the reporter knows whether their original symptom is gone in their environment. The @-mention is what converts a "skill says it is fixed" claim into actionable confirmation work for QA. Add the explicit closing block (canonical wording: "please confirm the symptom is gone on a recent build and reopen with a fresh reproducer if you observe otherwise") to all three Step 10 templates: fixed-on-latest, still-reproduces, and the Step 8.5 by-design template. The previous "If this verification is wrong, please reopen..." line was passive; the new line names the reporter and asks them to act. Length target. Default rendered comments to 400-500 words. The evidence table or by-design fixed/vestigial sections are the hero; everything else has to either change the reader's mind about the verdict or be deleted. The first #2007 draft ran ~750 words and the user pushed back explicitly; encoding the cap into Step 10 means future agents do not start from scratch on length each run. Surfaced from: e2e Brev verification run on issue #2007 (the first real Brev-track exercise of the skill end to end). Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Six concrete gaps that surfaced during the first end-to-end Brev run (issue #2007). Each almost produced a wrong verdict or a wasted run; each is now encoded so the next agent doesn't have to re-discover. 1. Step 9 baseline-validation cap-removal rule was too permissive. The `unless commits-touched OR PR-mention also fires` escape hatch let inferred fix evidence override the absence of a baseline run, which produced a misleading 100/100 on #2007 despite zero baseline confirmation. Tightened: cap at 84 holds regardless of corroboration; PR-search signals raise the score within the cap, never past it. Step 10 now requires an explicit one-line caveat naming the cap and the reason in the rendered Verdict section. 2. Step 6.5 install URL default switched from `nemoclaw.nvidia.com` (NVIDIA-internal, doesn't resolve from Brev) to `www.nvidia.com/ nemoclaw.sh` (public Akamai 301 redirect). Brev runs were dying at bootstrap on the wrong host. 3. Step 7 CPU SKU picker now biases by reproducer-implied memory needs. On #2007 the cheapest 2 GB SKU couldn't load a 4.8 GiB Ollama probe; onboard failed at provider validation and we burned ~25 min before re-provisioning a 16 GB box. Adds a `CPU_RAM_FLOOR` env var and a memory-floor heuristic (16 GB when reproducer references a model server, 8 GB for sandbox onboarding without a model, 4 GB for pure-CLI bugs). 4. Step 11 failure taxonomy now has a "baseline-build rot" bucket distinct from "binary install rot." Same `BASELINE_INSTALL_FAILED=1` flag and same downstream cap-and-degrade behavior, but failures at the in-image Dockerfile build phase (what we hit on v0.0.18 — the `.openclaw-data/workspace/media` symlink layer, removed entirely by #2227) get a separate label so reviewers can see *why* the old image no longer builds without re-running. 5. New Step 8a.5b documents the two non-obvious `brev exec` quirks that reproducer scripts have to handle every time: PATH does not include `~/.local/bin` in non-login shells (so reproducers must `export PATH` at the top, or callers must wrap with `bash -lc`); and the docker group requires `sg docker -c '...'` because adding the user via `usermod -aG` doesn't take effect within the same Brev session. 6. New Step 8d.5 architectural-drift check. When the diff between `$REPORTED_VERSION` and `$LATEST` touches the *tool* the reproducer's expected output depends on (e.g. `openshell forward` between v0.0.18 and v0.0.35), don't trust the reproducer's surface alone — multi-axis verification on OS-level surfaces (host listeners, NAT rules, docker ports, SSH tunnels, etc.) is required before claiming fixed-on-latest. This is the five-axis pattern we used to confirm #2007 wasn't a false positive; codifies it as a check the skill applies whenever pickaxe shows the reproducer's tool was reworked. Surfaced from: end-to-end Brev verification of issue #2007. Build the skill, exercise it on a real issue, fix what breaks, repeat — every fix here came from a concrete failure mode in one real run. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

…ll host Two small UX fixes surfaced from a thorough re-read. Step 6.5's brev-auth error printed three numbered options without a clear recommendation. From a non-TTY agent harness (Claude Code or any unattended runner) the right answer is option 2 — open a separate terminal, run `brev login`, complete the browser flow, come back. The previous message buried that path inside option 2's inline comment. Replace with a directive recipe that names the recommended flow first (open separate terminal, browser auth, re-run the skill) and lists the headless alternatives below as "when option 1 isn't available." Also explicitly notes that credentials persist to ~/.brev/credentials.json, so re-running the skill picks them up automatically. Step 6.5's install-URL error still suggested checking `https://nemoclaw.nvidia.com` even after the default was switched to `https://www.nvidia.com/nemoclaw.sh` in commit 296f7cd. Update the suggestion to match the new default and add an explicit "then re-run this skill" so the maintainer knows the flow is fix-and-retry, not abort. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Five gaps verified against the source during a thorough re-read pass. The audit also caught two false-positive claims I'd made earlier (`NEMOCLAW_INSTALL_TAG` env var and `comments` field in batch fetch) that didn't survive verification — those are not landed because there was nothing to fix. Step 1 batch discovery query bumped from `--limit 100` to `--limit 1000` so the skill doesn't silently drop issues beyond the page. The recent candidate triage on this repo found 129 open bugs; the previous limit would have lost 29 of them. The 20-issue per-run processing cap is downstream of Step 3/4 filters and unaffected. Step 3's 7-day comment-marker TTL had the rule but no implementation. Add a portable `gh issue view --json comments --jq` snippet that extracts marker comments newer than the cutoff. macOS and Linux date(1) syntax differ for relative-date math, so the cutoff line tries both. Without this, the first cron run after any posting will silently re-verify everything previously marked, posting duplicate comments every Monday. Step 6.5 preconditions now checks gh identity via `gh api user --jq .login` and prints the resolved login before any later step runs. Comments posted by Step 10 land under this account; surfacing it explicitly catches the multi-token / wrong-tab / mid-session-reauth case before a public comment lands under the wrong handle. Step 10 now requires a `**Verification mode:**` header line in every template (was previously by-design only). Reader should never have to guess whether a verdict came from real install logs or from static analysis. Filled in concrete defaults for the standard fixed/inconclusive template ("runtime reproduction; baseline + latest installed and run") and the still-reproduces template ("runtime reproduction; bug confirmed live"). Step 10 also requires a link-pass self-verification on every template, not just by-design's Step 8.5d. The "Tag-anchoring + linking rule" already declared every citation must be a clickable markdown link to the verified-on tag, but the verification step that resolves those links (`gh api .../contents/<path>?ref=<tag>` or `curl -fsI`) was scoped only to the by-design path. Same 404-cite risk applies to the standard template — broken citation links advertise verification work that didn't happen, and that's worse than no citation at all. Surfaced from: end-to-end audit after the #2007 e2e run, with the specific `gh issue view`/`grep` commands run against the SKILL.md to confirm each claim before listing it as a gap. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Two changes that together turn the per-run batch cap from policy into code. Lower the per-batch processing cap from 20 to 15. Sequential execution with 1-2 reused Brev boxes works fine for runs in this size range; ~2-3 hours wallclock for a full 15-issue batch is the comfortable budget before either the per-plan approval gate breaks down or the maintainer session needs to span multiple sittings. Add the slicing code that actually enforces the cap. Previously Step 1 declared "Cap at N issues per run" as policy and nothing applied it — candidate sets larger than the cap would just process to completion silently. Move the slice to the end of Step 4 (after Step 3 label filters and the Step 4 version+candidate-rule filters narrow the pool) and sort by `(-versions_behind, -age_days)` so the most stale come first. Spillover beyond 15 stays eligible for the next run via the marker-comment 7-day TTL added in 8976176. Single-issue mode bypasses the cap entirely; the maintainer named the issue explicitly. Cadence section updated from "≤20 issues" to "≤15 issues" to match. Surfaced from the post-#2007 audit: of the 13 gaps initially called out, the cap-enforcement one was operational toil (skill could chew through cost on a runaway batch). Lowering and enforcing closes that toil path. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Looked at the 24 candidates surfaced by the recent triage run by *kind* of bug rather than by version, and found six gaps where the standard rubric would produce a wrong verdict or hang. All six landed in this commit. Step 3 platform skip: drop `Platform: Jetson AGX Thor/Orin`. Brev has no equivalent silicon (embedded/edge ARM with integrated GPU is not in the SKU catalog), so any Brev verification of a Jetson-only bug produces a misleading "fixed-on-x86" verdict. `Platform: DGX Spark` and `Platform: GB10` stay in scope but Step 10 now requires a hardware-substitution caveat naming the Brev SKU we substituted. Step 3 TUI / interactive-UI skip: drop issues whose title contains TUI / dashboard UI / chat UI / keystroke / key press, or whose body describes interactive UI behavior without a non-interactive reproducer. `brev exec` does not allocate a real TTY by default; TUI reproducers hang or silently fail at the first prompt. v1 documents this as out-of-scope; v1.1 may add a script(1) / expect / tmux harness. Step 5 now classifies bug-class in addition to CPU/GPU. Four classes: `performance`, `rebuild-cycle`, `log-only`, `functional` (default). Detection heuristics from the issue body (latency thresholds, "across rebuilds" / "after restart", "see lots of error in <X> log"). Each class routes to a different Step 8 rubric. Step 8b: log-scraping. When `BUG_CLASS=log-only`, also pull `~/.openclaw/logs/*.log` and `/var/log/nemoclaw/*.log` from inside the sandbox after the reproducer runs and search them for the issue's symptom phrase. Some bugs describe symptoms in internal log files, not the reproducer's stdout; the previous match rubric only checked the transcript. Step 8b: flake-detection retry. For functional bugs, run baseline three times if the first run shows the symptom inconsistently. Mixed results (1 or 2 of 3 reproduce) trigger a "flake suspected" caveat, −25 score adjustment, and downgrade `+50 latest clean` to `+25` so a lucky-clean-latest-run on an intermittent bug doesn't silently become a fixed-on-latest verdict. New Step 8e: performance-bug verification. Multi-run latency distribution rubric — N=10 runs each side, compute p50/p90, parse SLA from issue body, match latest's distribution against the SLA rather than against a symptom phrase. Cap at 60 unless the bug is silicon-independent because Brev SKUs aren't faithful to DGX Spark or GB10 hardware for performance-shape bugs. New Step 8f: rebuild-cycle verification. Run-rebuild-rerun harness for bugs that only manifest across destroy/recreate boundaries (#2701-shape). Capture artifacts pre-rebuild, trigger `nemoclaw destroy --all --force && nemoclaw onboard`, re-capture post-rebuild, diff to determine whether the artifact persisted as the issue expects. Step 10 mandatory hardware-substitution caveat: when DGX Spark or GB10 is in the issue's labels and Step 7 substituted with a different silicon class, the comment metadata block must name the substitution explicitly so silicon-shape bugs get the right reader expectation. Surfaced from: stress-testing the skill against the actual shapes of bugs in the 24-candidate set, not just against the bugs we already verified. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Six gaps surfaced from the second e2e run (issue #2592) and the broader audit of how the skill handles provider classification. Step 5 now classifies provider in addition to CPU/GPU and bug-class. Detection signals from labels and body keywords identify NIM, Gemini, Anthropic, Bedrock, or Ollama. When the reproducer references a non-Ollama provider AND actually exercises inference, the skill stops at Step 5 and prompts the maintainer interactively before any Brev cost — three options: provide the API key, accept Ollama substitution with the -30 penalty, or skip to verify-inconclusive. Pure-CLI / pure-sandbox bugs are exempt because the provider doesn't matter. Step 6 reproducer-extraction regex extended to also match `openclaw` and `openshell` invocations as anchor words. Issue #2592's reproducer was `openclaw channels add telegram` run inside the sandbox; the previous `nemoclaw`-only regex would have missed the verbatim block. Step 8a now passes the provider env vars through to install.sh's bundled onboard step, so it doesn't fall back to the default `build` (NIM) provider and fail with the misleading NVIDIA_API_KEY missing error. NEMOCLAW_PROVIDER=ollama (the default case) makes the bundled onboard use the local Ollama set up in Step 8a.5; NVIDIA_API_KEY only flows through if the maintainer provided one at Step 5's prompt. The bundled onboard creates a throwaway sandbox that gets destroyed before the reproducer runs. The reproducer's own onboard should pass `--fresh` so a half-built install-sandbox doesn't trip the "previous session failed" guard. Step 8a.5 now has an Ollama-coverage table making explicit which bug classes Ollama covers faithfully (CLI, sandbox, networking) vs which ones it doesn't (provider-specific behavior, model-specific behavior, silicon-dependent perf). Step 5's prompt keys off this table. Step 8a.5b documents the openshell-sandbox-exec syntax footgun. The correct form is `openshell sandbox exec -n NAME -- CMD`; the wrong form silently auto-detects the sandbox by "last used" and stuffs the leftover positional into bash's $0. Same section gets a brev-exec re-execution guard via a sentinel file at ~/.verify-stale-running to prevent the double-onboard scenario when SSH drops mid-run. Step 11 reframes baseline-build-rot as the dominant failure mode for any reported version >5-7 patches behind, not an edge case. Both Brev e2e runs hit it. Cap-at-84 with reporter @-mention is the modal verdict shape, not the exception — pre-flight carries more weight than baseline runtime evidence for older bugs. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

…sandbox name Two fixes from a rot-debugging investigation that nearly produced a wrong conclusion. When testing whether old NemoClaw versions actually rot under the literal end-user invocation (`https://www.nvidia.com/nemoclaw.sh`), the first test run was botched by a shell-scoping mistake: `NEMOCLAW_INSTALL_TAG=v0.0.26 curl ... | bash` scopes the env var to curl, not to the downstream bash reading from the pipe. The bash side defaulted to `latest`, resolved to v0.0.36, and the install proceeded for several minutes producing convincing-looking output before the mistake surfaced. The install script had no log line saying which version it actually resolved, so the silent failure looked like a working install of v0.0.26. Skill-side fix: after each install (Step 8a baseline and Step 8d latest), echo the resolved `nemoclaw --version` and case-match it against the requested version. Mismatch in baseline sets BASELINE_INSTALL_FAILED=1 to prevent verifying against the wrong version. Mismatch in latest logs a WARN that gets surfaced in the final comment. Cheapest possible guard against the class "I asked for X, got Y" — and the principle generalizes: print the resolved state, never trust the requested state. Also fix the sandbox name in the install.sh-passthrough block: change `NEMOCLAW_SANDBOX_NAME=__verify_stale_install__` (rejected by NemoClaw's name validator: "Allowed format: lowercase, starts with a letter, letters/numbers/internal hyphens only, ends with letter/number") to `NEMOCLAW_SANDBOX_NAME=verify-stale-install`. Surfaced during the #2519 e2e run; net-zero impact on that run because install.sh fell through on name validation rather than reaching the buggy code path, but the spec was wrong. The deeper context for the resolved-version-echo fix: when the rot hypothesis was being tested, an independent agent's logical analysis correctly identified that the public install URL DOES support `NEMOCLAW_INSTALL_TAG=<ref>` and demanded an empirical re-test. The re-test (with the env var on the bash side of the pipe) installed v0.0.26 correctly and STILL hit `Patch 4 (replaceConfigFile EACCES) not applied` — so the cap-at-84 framing held empirically, just not for the original "git-clone is the only path" reason. The full investigation is documented in findings.md Part 6's closing section ("Closing investigation: is baseline-build rot real, or our setup artifact?"). Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

@cjagwani

Add a Step 3 skip rule for issues where a MEMBER/OWNER/COLLABORATOR comment was posted within the last 7 days AND the original reporter hasn't replied since. The skill is for *stale* issues; actively- discussed ones are in-flight, and posting a verify-stale verdict on top of an open clarifying question from a maintainer would conflict with their framing and confuse the reporter (now they have two simultaneous asks). Surfaced during pre-flight on issue #2757. The maintainer @cjagwani had just commented questioning the bug's premise — specifically that the reporter's "kill -9 took the parent down" framing didn't line up with how the gateway is launched (`nohup ... &` detaches it) — and asked the reporter to confirm what they actually observed on the Station. Running verify-stale would have posted a "by-design, close as wontfix" verdict on top of an active "wait, did this really happen?" exchange. Wrong move; the skill should detect this state and skip. Implementation reuses the SEVEN_DAYS_AGO cutoff already computed for the marker-TTL check, fetches the reporter via `gh issue view --json author`, then jq-walks the comments array: find the most recent maintainer comment within the cutoff, check whether any reporter comment exists after that timestamp; if maintainer commented recently AND no reporter reply since, skip. Single-issue mode prints the skip reason and exits friendlily so the maintainer knows why the skill bailed. Batch mode just moves on. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

…argv The post-#2592 commit (1d625dd) added a Step 5 prompt asking the maintainer to "Export NVIDIA_API_KEY=<key>" and propagate it via `brev exec`. During the #2604 e2e run that played out as `NVIDIA_API_KEY=<value> brev exec ...` — which puts the literal key in the brev exec process's argv, visible in `ps -ef` to anyone with shell access on either the maintainer's laptop or the Brev box for the duration of the run. The skill's stated promise ("never logged") got violated by argv visibility. Fix: switch propagation to file-based. Maintainer writes the key to ~/.nvidia-api-key with 600 perms on their laptop; Step 6.5 brev-copy's the file to the Brev box (encrypted SSH); install / reproducer scripts inside the box read with `NVIDIA_API_KEY=$(cat ~/.nvidia-api-key)`, which sets the env var in the script's own process — never on a command line and never visible in `ps -ef`. Update Step 5's option-1 prompt to instruct the maintainer to use the file form and to `rm ~/.nvidia-api-key` after the run. Add a new "API-key propagation pattern" section after Step 5 that documents the file-based mechanism explicitly. Update Step 8a (baseline install) and Step 8d (latest install) to source the key from the file inside the Brev box's exec context, not from the local shell's env. Note in the docs: if the key was previously propagated via the cmdline (pre-fix), treat it as exposed and rotate. The #2604 run did this; the maintainer was reminded to rotate the NIM key after the run. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Replace Step 10's narrow "Length target" rule with a richer "Comment authoring principle" block that captures lessons accumulated across the e2e runs. The core rule: every section in a rendered comment must either change a reader's mind about the verdict or be cut. Word counts follow from that — 500 is a ceiling, not a target. Most comments land in 200-400. Simple cases land under 200. Document why this matters: comments posted by the skill compete for a maintainer's attention against every other in-flight thread on the repo, and AI-slop prose actively reduces signal-to-noise. Add a worked-examples block citing two real iterations: - #2007 first draft was 750 words; cut to 371. - #2604 took THREE drafts before settling on 190 words because each draft padded with prose that didn't ground the verdict — which caused the verdict itself to drift. Rule learned: name the verdict in one sentence first, then cut any section that doesn't support it. Add per-verdict length targets: - fixed-on-latest: 200-400 words - wontfix (by-design): 250-500 words - verify-inconclusive: 100-200 words - Still-reproduces (no label): 30-80 words — and no transcripts (issue body has them), no @-mention (reporter knows), no architectural prose. One sentence + marker. Add an explicit "cut, by default" list naming the patterns that historically padded comments without changing verdicts. Generalizes the existing memory note into the skill body so future agents reading SKILL.md inherit the principle directly. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

… ceiling Followup to fe363cf. The 200-400 / 250-500 ranges in the per-verdict table were still too permissive. Tighten to 200-300 for both fixed-on-latest and wontfix; verify-inconclusive stays at 100-200; still-reproduces stays at 30-80. Update the principle preamble: "300 is a hard ceiling for the main verdicts." Simple cases land under 200. If a draft is past 300, it's padding — cut before re-reading. Memory note (feedback_verify_stale_comment_length.md) updated to match. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

…d-question variant Previously skipped any issue with a maintainer comment in the last 7 days the reporter hadn't replied to. After 7 days the question is no longer "active discussion" — it's a stuck thread, and the skill can be the unsticking voice rather than a clueless interruption. Step 3: classifies the most-recent-unanswered-maintainer-comment as recent (within 7d → skip) or stale (>7d → proceed with variant), exporting UNANSWERED_MAINT_LOGIN/URL/DATE for the templater. Step 10: new mandatory block describing the variant — prepend an "@<maint>'s comment from <date> is still unanswered" lead paragraph and swap the closing reporter-only @-mention for a dual maintainer+reporter @-mention that flags the open question. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Layer 1 (local → Brev) was fixed in 22a3997 via brev copy. Layer 2 (on-box subshell) was unaddressed: scripts running on the Brev box that called sg docker -c "...$NVIDIA_API_KEY..." re-introduced argv exposure because the outer double-quoted heredoc interpolates the key value into the inner shell's argv, visible in ps -ef on the box for the onboard duration. Surfaced during #2611 e2e run when the just-rotated key landed verbatim in `sg docker -c` argv. The right pattern: escape the $ so the inner shell evaluates `cat ~/.nvidia-api-key` itself. The skill section now documents both layers with concrete WRONG / RIGHT side-by-side, and generalizes the rule to any command-string- taking invocation (sg, bash -c, su -c, ssh). Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three skill paragraphs used `[label](placeholder)` syntax to describe the shape of a markdown link the templater would render. The link checker parses bracket-paren syntax everywhere except fenced code blocks (single backticks aren't enough), so `(<url>)` and `(<UNANSWERED_MAINT_URL>)` were flagged as broken local links. Restructure: fenced code blocks for the literal template line in Step 10, plain prose in Step 3 referencing the Step 10 template instead of duplicating the link shape inline. Verified locally with `bash test/e2e/e2e-cloud-experimental/check-docs.sh --only-links --local-only` — passes cleanly. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.agents/skills/nemoclaw-maintainer-cut-release-tag/SKILL.md:
- Around line 135-139: The shell loop that pipes issue numbers into xargs can
run a bogus gh issue edit when input is empty; update the pipeline that iterates
labels (the for label in fixed-on-latest verify-inconclusive; do ... gh issue
edit {}) to make xargs safe by adding the -r option (or otherwise guard on
non-empty input) so xargs will not run when there are no issue numbers, and
apply this change in the original docs source that generates the .agents/skills
markdown (not the generated SKILL.md).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5ce3439c-9759-45a0-a42b-14bbd316dfca

📥 Commits

Reviewing files that changed from the base of the PR and between d0cbe7c and 3bc1b8c.

📒 Files selected for processing (3)

.agents/skills/nemoclaw-maintainer-cut-release-tag/SKILL.md
.agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md
.agents/skills/nemoclaw-skills-guide/SKILL.md

When no open issues carry `fixed-on-latest` or `verify-inconclusive`, the previous pipeline still ran `gh issue edit` once with empty stdin, producing a noisy failure. `xargs -r` skips the invocation entirely when input is empty. CodeRabbit suggestion (PR #3063 review). The skill is hand-authored (not autogenerated from docs/), so the fix lands directly in .agents/skills/. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rogressive disclosure Per Anthropic's skill-authoring best practices guide, SKILL.md body should stay under 500 lines so it doesn't dominate context once loaded. The verify-stale SKILL.md was 1505 lines — 3× the target. Single split at the natural decision-vs-execution boundary (between Step 6.7 "try local first" and Step 7 "reuse or provision Brev"): SKILL.md (497 lines): Steps 1-6.7 — mode, latest detection, candidate filter, version parsing, environment classification, reproducer extraction, preconditions, local-first short-circuit. The decision tree for whether and how to verify. reference/execution-and-comment.md (1042 lines, with TOC): Steps 7-12 + Cadence + Out of Scope + Companion. Brev provisioning, baseline + latest verification, by-design detection, scoring, comment templates, redaction, infra failure, activity log. The implementation detail loaded only when a Brev run is committed to. The reference file's TOC at the top satisfies the >100-line reference file rule from the doc; single-level depth (no nested references) satisfies the "keep references one level deep" rule. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

copy-pr-bot · 2026-05-08T19:23:52Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
@.agents/skills/nemoclaw-maintainer-verify-stale/reference/execution-and-comment.md:
- Around line 52-53: The current RUNNING computation counts all verify-stale-*
instances from INSTANCES; update the jq selection to count only instances whose
name starts with "verify-stale-" AND whose status indicates they are running
(e.g., .status == "RUNNING") so the concurrency cap checks only active boxes;
modify the expression that sets RUNNING (the jq filter applied to INSTANCES) to
include the running-status predicate.
- Around line 316-318: The reproducer is executed via brev exec with a plain
non-login bash which can miss ~/.local/bin and Docker group context; update the
two brev exec invocations that run "bash ~/reproducer.sh" (both the baseline and
latest run occurrences) to invoke the script through a login shell and wrapped
with sg docker -c (or otherwise explicitly export/include $HOME/.local/bin in
PATH) so the ~/.local/bin and docker group membership are available during
execution, then tee the output to the same
baseline-transcript.log/latest-transcript.log as before.
- Around line 767-768: The credential-redaction regex currently uses escaped
pipes so it matches literal "|" instead of alternation; update the pattern
`(?i)(token\|secret\|password\|api[_-]?key\|bearer)[^\n]*[:=][^\n]*` to use
actual alternation (e.g.
`(?i)(?:token|secret|password|api[_-]?key|bearer)[^\n]*[:=][^\n]*`) and consider
adding optional whitespace around `[:=]` or word boundaries if needed to
reliably catch `token=...`, `password: ...` and `api_key = ...`; change the
regex where referenced so functions/filters that use this pattern (the
credential redaction rule) use the corrected expression.

In @.agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md:
- Line 3: Update the description string in SKILL.md to use the canonical label
syntax: replace the occurrence of "status wont-fix" with "status: wont-fix" so
it matches the rest of the skill set and avoids operator confusion; edit the
description field near the top of the file (the `description:` paragraph) to
make this single-word change.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2127a531-9c7d-4192-9fed-193e68fa3de8

📥 Commits

Reviewing files that changed from the base of the PR and between e48e573 and e0eab59.

📒 Files selected for processing (2)

.agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md
.agents/skills/nemoclaw-maintainer-verify-stale/reference/execution-and-comment.md

…Project 199 Adds tracker integration: when the verdict is fixed-on-latest, the skill also moves the issue to "Needs Review" on the NemoClaw Development Tracker (NVIDIA Project 199), so the maintainer's review queue picks it up for confirmation. After the maintainer closes the issue, existing project automation (or a manual move) advances it to Done. Other verdicts (wontfix, verify-inconclusive, no-label-still-reproduces) do not move the tracker — they have separate close paths. Constants captured at write time (re-run gh project field-list 199 --owner NVIDIA --format json if the project drifts): Project ID: PVT_kwDOABpemM4BSCP5 Status field ID: PVTSSF_lADOABpemM4BSCP5zg_r9p8 Needs Review opt ID: 5c5922a9 Step 6.5 gains a one-line precondition that warns (does not fail) when the gh CLI lacks the 'project' scope, so the auth gap surfaces before Step 10 instead of mid-run. Step 12's activity-log template gains a Tracker: row recording the move outcome. The tracker step is gated to fixed-on-latest only — wontfix and verify-inconclusive verdicts do not benefit from "Needs Review" since they're either final-state (wontfix) or different review flow (verify-inconclusive); attempting to move them would create churn. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai

♻️ Duplicate comments (4)

.agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md (1)

3-3: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use canonical label spelling in the description.

Line 3 says status wont-fix, while the documented/actual label is status: wont-fix. Keeping the canonical string avoids operator mistakes when applying labels manually.

Suggested fix

-description: Verify whether old NVIDIA/NemoClaw bug reports still reproduce against the latest tag. Picks candidate issues opened against older versions, runs the reproducer locally first when possible (Linux or macOS), otherwise reuses or provisions a Brev Linux box (CPU or GPU), detects behavior that was intentionally changed, scores confidence, and posts an evidence-backed comment with a label (fixed-on-latest, status wont-fix, or verify-inconclusive). Tag-only — never auto-closes. Brev verification is Linux-only in v1; Windows and integration-token-dependent issues are skipped. Trigger keywords - verify stale, verify fixed, reproduce on latest, stale issue, old bug, fixed-on-latest, status wont-fix, verify-inconclusive, drain backlog, brev verify.
+description: Verify whether old NVIDIA/NemoClaw bug reports still reproduce against the latest tag. Picks candidate issues opened against older versions, runs the reproducer locally first when possible (Linux or macOS), otherwise reuses or provisions a Brev Linux box (CPU or GPU), detects behavior that was intentionally changed, scores confidence, and posts an evidence-backed comment with a label (fixed-on-latest, status: wont-fix, or verify-inconclusive). Tag-only — never auto-closes. Brev verification is Linux-only in v1; Windows and integration-token-dependent issues are skipped. Trigger keywords - verify stale, verify fixed, reproduce on latest, stale issue, old bug, fixed-on-latest, status: wont-fix, verify-inconclusive, drain backlog, brev verify.

Based on learnings: .agents/skills/nemoclaw-maintainer-* files are hand-authored source-of-truth and should be reviewed directly.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md at line 3, Update
the canonical label spelling in the SKILL.md description: replace the token
"status wont-fix" with the exact label "status: wont-fix" so the documented
labels (alongside "fixed-on-latest" and "verify-inconclusive") match the actual
label string operators use; edit the description line in
.agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md to make this
single-word correction.

.agents/skills/nemoclaw-maintainer-verify-stale/reference/execution-and-comment.md (3)

52-53: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Count only running boxes for the concurrency gate.

Line 52 currently counts all verify-stale-* instances, but the comment and guard are for running capacity. This can block new runs even when capacity exists.

Suggested fix

-  RUNNING=$(echo "$INSTANCES" | jq '[.[]? | select(.name | startswith("verify-stale-"))] | length')
+  RUNNING=$(echo "$INSTANCES" | jq '[.[]? | select(.name | startswith("verify-stale-") and .status == "RUNNING")] | length')

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
@.agents/skills/nemoclaw-maintainer-verify-stale/reference/execution-and-comment.md
around lines 52 - 53, The current RUNNING count uses INSTANCES with a jq
selector that matches all verify-stale-* names; change the jq filter used when
building RUNNING so it only counts instances whose name
startswith("verify-stale-") AND whose status indicates they are currently
running (e.g., .status.state == "running" or the equivalent field in your
instance objects), leaving the if [ "$RUNNING" -ge 4 ] guard unchanged; update
the jq expression referenced by RUNNING to include that running-state predicate.

316-318: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reproducer runs should use docker-group + login-shell wrapper.

Lines 317, 364, and 389 execute bash ~/reproducer.sh directly. That misses the documented docker group activation and can miss ~/.local/bin PATH setup, causing false inconclusive outcomes.

Suggested fix

-brev exec "$INSTANCE_NAME" "bash ~/reproducer.sh" 2>&1 | tee ./baseline-transcript.log
+brev exec "$INSTANCE_NAME" "sg docker -c 'bash -lc \"export PATH=\\\"$HOME/.local/bin:$PATH\\\"; bash ~/reproducer.sh\"'" 2>&1 | tee ./baseline-transcript.log
...
-brev exec "$INSTANCE_NAME" "bash ~/reproducer.sh" 2>&1 | tee ./baseline-transcript-2.log
+brev exec "$INSTANCE_NAME" "sg docker -c 'bash -lc \"export PATH=\\\"$HOME/.local/bin:$PATH\\\"; bash ~/reproducer.sh\"'" 2>&1 | tee ./baseline-transcript-2.log
...
-brev exec "$INSTANCE_NAME" "bash ~/reproducer.sh" 2>&1 | tee ./latest-transcript.log
+brev exec "$INSTANCE_NAME" "sg docker -c 'bash -lc \"export PATH=\\\"$HOME/.local/bin:$PATH\\\"; bash ~/reproducer.sh\"'" 2>&1 | tee ./latest-transcript.log

Also applies to: 363-365, 388-390

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
@.agents/skills/nemoclaw-maintainer-verify-stale/reference/execution-and-comment.md
around lines 316 - 318, Replace direct invocations of "bash ~/reproducer.sh" in
the brev exec calls with the documented docker-group + login-shell wrapper so
the docker group is activated and the user's login PATH (including ~/.local/bin)
is sourced; specifically update each brev exec "$INSTANCE_NAME" "bash
~/reproducer.sh" occurrence (and any identical occurrences around the earlier
shown lines) to call the wrapper that activates the docker group then launches
the reproducer inside a login shell.

767-769: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Redaction regex alternation is escaped and won’t match intended secrets/hosts.

Line 767 and Line 768 use \| where alternation is intended. That matches literal pipes instead of alternatives, weakening redaction and risking secret/hostname leakage.

Suggested fix

-| 8 | `(?i)(token\|secret\|password\|api[_-]?key\|bearer)[^\n]*[:=][^\n]*` | Inline credentials in env/config/log output |
-| 9 | `\b\w+\.(nvidia\.internal\|nv-internal\.com\|nvidia\.dev)\b` | Internal hostnames (extend list per team) |
+| 8 | `(?i)(?:token|secret|password|api[_-]?key|bearer)[^\n]*[:=][^\n]*` | Inline credentials in env/config/log output |
+| 9 | `\b\w+\.(?:nvidia\.internal|nv-internal\.com|nvidia\.dev)\b` | Internal hostnames (extend list per team) |

#!/bin/bash
python3 - <<'PY'
import re
bad_cred = re.compile(r'(?i)(token\|secret\|password\|api[_-]?key\|bearer)[^\n]*[:=][^\n]*')
good_cred = re.compile(r'(?i)(?:token|secret|password|api[_-]?key|bearer)[^\n]*[:=][^\n]*')
samples = ["token=abc", "password: hunter2", "api_key = nvapi-xyz"]
print("credential pattern:")
for s in samples:
    print(s, "bad=", bool(bad_cred.search(s)), "good=", bool(good_cred.search(s)))

bad_host = re.compile(r'\b\w+\.(nvidia\.internal\|nv-internal\.com\|nvidia\.dev)\b')
good_host = re.compile(r'\b\w+\.(?:nvidia\.internal|nv-internal\.com|nvidia\.dev)\b')
hosts = ["svc.nvidia.internal", "x.nv-internal.com", "api.nvidia.dev"]
print("\nhost pattern:")
for h in hosts:
    print(h, "bad=", bool(bad_host.search(h)), "good=", bool(good_host.search(h)))
PY

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
@.agents/skills/nemoclaw-maintainer-verify-stale/reference/execution-and-comment.md
around lines 767 - 769, The redaction regexes use escaped pipe sequences (e.g.
"(?i)(token\|secret\|password\|api[_-]?key\|bearer)[^\n]*[:=][^\n]*" and
"\b\w+\.(nvidia\.internal\|nv-internal\.com\|nvidia\.dev)\b") which match
literal '|' instead of alternation; replace them with proper alternation and
prefer non-capturing groups: use
"(?i)(?:token|secret|password|api[_-]?key|bearer)[^\n]*[:=][^\n]*" for inline
credentials and "\b\w+\.(?:nvidia\.internal|nv-internal\.com|nvidia\.dev)\b" for
internal hosts so the patterns in the file (the two regex strings) correctly
match secrets and hostnames.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In
@.agents/skills/nemoclaw-maintainer-verify-stale/reference/execution-and-comment.md:
- Around line 52-53: The current RUNNING count uses INSTANCES with a jq selector
that matches all verify-stale-* names; change the jq filter used when building
RUNNING so it only counts instances whose name startswith("verify-stale-") AND
whose status indicates they are currently running (e.g., .status.state ==
"running" or the equivalent field in your instance objects), leaving the if [
"$RUNNING" -ge 4 ] guard unchanged; update the jq expression referenced by
RUNNING to include that running-state predicate.
- Around line 316-318: Replace direct invocations of "bash ~/reproducer.sh" in
the brev exec calls with the documented docker-group + login-shell wrapper so
the docker group is activated and the user's login PATH (including ~/.local/bin)
is sourced; specifically update each brev exec "$INSTANCE_NAME" "bash
~/reproducer.sh" occurrence (and any identical occurrences around the earlier
shown lines) to call the wrapper that activates the docker group then launches
the reproducer inside a login shell.
- Around line 767-769: The redaction regexes use escaped pipe sequences (e.g.
"(?i)(token\|secret\|password\|api[_-]?key\|bearer)[^\n]*[:=][^\n]*" and
"\b\w+\.(nvidia\.internal\|nv-internal\.com\|nvidia\.dev)\b") which match
literal '|' instead of alternation; replace them with proper alternation and
prefer non-capturing groups: use
"(?i)(?:token|secret|password|api[_-]?key|bearer)[^\n]*[:=][^\n]*" for inline
credentials and "\b\w+\.(?:nvidia\.internal|nv-internal\.com|nvidia\.dev)\b" for
internal hosts so the patterns in the file (the two regex strings) correctly
match secrets and hostnames.

In @.agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md:
- Line 3: Update the canonical label spelling in the SKILL.md description:
replace the token "status wont-fix" with the exact label "status: wont-fix" so
the documented labels (alongside "fixed-on-latest" and "verify-inconclusive")
match the actual label string operators use; edit the description line in
.agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md to make this
single-word correction.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5c3d6f16-ab32-45e7-ad23-cf3a13a02591

📥 Commits

Reviewing files that changed from the base of the PR and between e0eab59 and 3cb23b0.

📒 Files selected for processing (2)

.agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md
.agents/skills/nemoclaw-maintainer-verify-stale/reference/execution-and-comment.md

The previous Step 7 cleared fixed-on-latest and verify-inconclusive from every open issue at every release cut. That re-verifies labels that were freshly applied yesterday — wasting Brev cost and creating noise. Replace with a three-tier decision cascade: 1. If Project 199 status == Done → skip clear (maintainer accepted the verification; label stays until the issue closes). 2. Else if marker comment is >10 days old → clear (stale; reporter never confirmed in the review window). 3. Else if any PR merged since the marker date touches the paths the comment cited in `Relevant changes since v0.0.X` → clear (regression risk). 4. Else → skip clear (verification still holds). Closed issues are not iterated (--state open filter). Requires the `project` scope on gh for the Done-state lookup; falls back to the time + regression checks alone (with a warning) when the scope is missing. Companion to verify-stale's Step 10 tracker integration. Together they form a coherent verification lifecycle: skill applies label + moves to Needs Review → maintainer reviews and either accepts (moves to Done) or lets it sit → release sweep keeps Done-accepted labels, clears stale or regression-risk ones. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.agents/skills/nemoclaw-maintainer-cut-release-tag/SKILL.md:
- Line 202: The git command that sets REGRESSED uses a hardcoded path (git -C
~/NemoClaw ...) which can fail if the repo is elsewhere; change it to use the
current repo or a configurable path: replace the hardcoded -C ~/NemoClaw with
either -C "$REPO_PATH" (add a REPO_PATH variable set earlier) or remove -C to
run from the current directory, keeping MARKER_DATE and PATHS unchanged so the
lookup still uses those variables.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b400dbe8-8762-4da3-9883-9aab71550ed7

📥 Commits

Reviewing files that changed from the base of the PR and between 3cb23b0 and 9b434bd.

📒 Files selected for processing (1)

.agents/skills/nemoclaw-maintainer-cut-release-tag/SKILL.md

10 days felt too aggressive for a real maintainer review window — weekends, holidays, and across-team handoffs commonly need 2 weeks. 14 days lines up with the natural sprint boundary and gives the reporter + maintainer enough room before re-verification kicks in. Updated the decision cascade table and the AGE_DAYS check in lockstep. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Five findings from coderabbit's second-pass review, all valid: F1 (Major) — execution-and-comment.md: concurrency cap counted ALL verify-stale-* boxes, not just RUNNING ones, so prior stopped boxes could falsely block provisioning. Add the same .status == "RUNNING" filter the reuse query above already uses. F2 (Major) — execution-and-comment.md: the two `brev exec bash ~/reproducer.sh` calls (baseline + latest) didn't export PATH, so non-login shells could miss ~/.local/bin where the nemoclaw binary lives. Wrap both with explicit PATH export. (sg docker stays the script's responsibility — double-wrapping nests quoting.) F3 (Critical) — execution-and-comment.md: credential redaction patterns 8 and 9 were inside a markdown table where `|` is the column delimiter. The table escaped them as `\|` to preserve the table structure, but `\|` in regex matches a LITERAL pipe character, not alternation. The redaction pattern `(?i)(token\|secret\|password\|api[_-]?key\|bearer)` therefore silently failed to match `token=...`, `password: ...`, `api_key = ...`. Replaced the table with a fenced code block so patterns can use real alternation. F4 (Minor) — verify-stale SKILL.md: frontmatter description used "status wont-fix" (no colon) while the rest of the doc uses the canonical "status: wont-fix". Fixed in both the verdict-label list and the trigger-keywords list. F5 (Major) — cut-release-tag SKILL.md: regression check used `git -C ~/NemoClaw log ...` which assumes a specific checkout location. Step 1's prerequisite already requires the maintainer to be inside the NemoClaw repo, so dropping `-C ~/NemoClaw` runs from the current directory. Local check-docs.sh and markdownlint-cli2 pass. SKILL.md is still exactly 500 lines after F4. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@wscurran

(1) Step 3 — unanswered-question variant should require a question, not just a comment. Surfaced when @wscurran's templated triage ack ("✨ Thanks for reporting…") fired the variant on #1642; the rendered "their comment from N days ago is still unanswered" lead made it sound like there was a pending question when nothing was actually asked. Extend the maintainer-comment selector with a question-shape content check: comment body must contain `?`, a polite imperative (please confirm/share/provide/clarify/tell/verify/check/let me know), or a question starter (could/can/would you, do you have/know/see/use). Pure acknowledgments fall through to the standard template. (2) Step 8d — enforce OpenShell version pinning after the latest checkout. Surfaced on #1642's run: v0.0.6 setup phase installed OpenShell 0.0.37 (whichever was current), but v0.0.38's blueprint.yaml caps max_openshell_version at 0.0.36 — onboard preflight then refuses to run. The skill now reads MAX_OS from latest's blueprint.yaml after checkout. If the installed binary is newer than the cap, install- openshell.sh refuses to downgrade, so the skill force-installs the capped version directly from the OpenShell GitHub release. If the installed binary is at or below the cap, install-openshell.sh runs normally to bring it forward. Both fixes lint clean. SKILL.md is now 499 lines (under the 500 ceiling). reference/execution-and-comment.md grew by ~30 lines for the OpenShell-pin block. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@jyaunches

Long-running verifications race with maintainers closing issues independently. #2513 and #2519 both closed mid-batch — @jyaunches posted their own comprehensive verification before our run got to the comment phase. The skill kept going and would have appended a redundant comment after the close. Add a pre-post state check: refresh the issue state right before gh issue comment runs. If closed, apply the label tag-only (the maintainer's close comment is now the authoritative record) and skip both the comment and the Project 199 move. Surfaced from findings2.md gap item 18. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Step 3's issue-type skip list used exact-match `enhancement`, but the repo has both bare `enhancement` and 8 prefixed variants (enhancement: feature, enhancement: MCP, enhancement: testing, enhancement: ui, enhancement: provider, enhancement: platform, enhancement: policy, enhancement: inference, enhancement: integration, enhancement: performance, enhancement: skill). The prefixed forms slipped through the filter — surfaced from #1752 (Gemini 3 Flash Preview), labeled `enhancement: feature`. The reporter was asking whether the model variant was supported, which is correctly an enhancement request. Updated the skip clause prose to spell out the prefix-match rule and list the variants. Implementations should match exact `enhancement` OR any label starting with `enhancement:`. Surfaced from findings2.md gap item 36 (added this run). Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

prekshivyas added 5 commits May 5, 2026 14:42

wscurran added Platform: Brev Support for Brev deployment enhancement: skill Improvements to NemoCall repository hygiene or user functionality with skills. labels May 6, 2026

prekshivyas added 20 commits May 6, 2026 14:48

prekshivyas and others added 5 commits May 7, 2026 17:06

prekshivyas removed the Platform: Brev Support for Brev deployment label May 8, 2026

prekshivyas and others added 3 commits May 7, 2026 18:35

Merge branch 'main' into feat/verify-stale-skill

dc5d714

Merge branch 'main' into feat/verify-stale-skill

3bc1b8c

prekshivyas marked this pull request as ready for review May 8, 2026 01:47

coderabbitai Bot reviewed May 8, 2026

View reviewed changes

Comment thread .agents/skills/nemoclaw-maintainer-cut-release-tag/SKILL.md

cv approved these changes May 8, 2026

View reviewed changes

prekshivyas and others added 2 commits May 7, 2026 20:17

Merge branch 'main' into feat/verify-stale-skill

97407e1

coderabbitai Bot reviewed May 8, 2026

View reviewed changes

Comment thread .agents/skills/nemoclaw-maintainer-cut-release-tag/SKILL.md Outdated

prekshivyas and others added 6 commits May 8, 2026 14:40

Merge branch 'main' into feat/verify-stale-skill

4102027

Conversation

prekshivyas commented May 5, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Issues tackled

v1 scope

Verification flow (one-line per step)

Idempotency

Active-discussion handling (two-tier)

Companion changes

CI status

Test plan

Summary by CodeRabbit

Uh oh!

copy-pr-bot Bot commented May 5, 2026

Uh oh!

coderabbitai Bot commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Poem

Review ran into problems

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

copy-pr-bot Bot commented May 8, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

prekshivyas commented May 5, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 5, 2026 •

edited

Loading