feat(skills): add nemoclaw-maintainer-verify-stale#3063
feat(skills): add nemoclaw-maintainer-verify-stale#3063prekshivyas wants to merge 45 commits intomainfrom
Conversation
Adds a maintainer skill that automates verifying whether old bug reports still reproduce against the latest NemoClaw release. The skill picks candidate issues opened against older versions, reuses or provisions a Brev Linux box (CPU or GPU based on the bug profile), runs the extracted reproducer, scores confidence, and posts an evidence-backed comment with either a `fixed-on-latest` or `verify-inconclusive` label. Tag-only — never auto-closes. v1 scope is intentionally narrow: Linux only, no third-party integration credentials, no service-account bot. Each maintainer runs it under their own gh credentials. Companion changes: - Adds the new skill to the maintainer table in `nemoclaw-skills-guide` (count 7 -> 8 maintainer skills, 18 -> 19 total). - Adds Step 7 to `nemoclaw-maintainer-cut-release-tag` that sweeps `fixed-on-latest` and `verify-inconclusive` labels off all open issues at release time, so the next verify-stale run re-evaluates against the new latest. Verification records remain in comment history; only the labels are reset. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR adds a new ChangesNemoClaw Skills Catalog and Release Workflow
Sequence Diagram(s)sequenceDiagram
participant Maintainer
participant GitHub
participant VerifyStale
participant Brev
Maintainer->>GitHub: discover candidate issues & latest tag
GitHub-->>Maintainer: return issues, labels, comments
Maintainer->>VerifyStale: request filtering/prioritization
VerifyStale->>Brev: provision or reuse verification box
Brev-->>VerifyStale: run baseline then latest reproducer
VerifyStale-->>Maintainer: deliver transcripts, score, verdict
🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsTimed out fetching pipeline failures after 30000ms Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
Three flaws surfaced when dry-running the Step 3 candidate filter against the live open-issue list (141 open bugs). 1. Step 2 used `gh release view` to detect "latest". NemoClaw tags but does not publish GitHub releases, so this returned empty. Added a fallback that picks the highest semver from `git ls-remote --tags`, which is the load-bearing path today. 2. Step 3 said "2 or more minor versions behind". Wrong vocabulary for `0.0.x` repos — patch is what's iterating. Reworded to "at least 2 versions behind in the rightmost-incrementing component", with a concrete `0.0.x` example and a forward-compat note for when NemoClaw moves to `0.1.x`. 3. Step 4 specified a naive `v?0\.0\.\d+` regex that matched IP-like strings in bug bodies (Ollama bind `0.0.0.0:11434`, loopback `127.0.0.1`, etc.) and produced phantom v0.0.0 / v0.0.1 candidates. Tightened to require `v` prefix + word boundaries first, with a context-anchored fallback that only matches `0.0.X` on lines containing `nemoclaw` or `version`. Added a clamp that rejects any parsed version greater than `$LATEST` (catches roadmap labels like `v0.0.35` from being treated as "reported on"). After these fixes the dry-run produces 33 plausible candidates out of 141 open bugs, with credible reported-version assignments verified against issue bodies. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Two more issues surfaced from a deeper second-pass dry-run. 1. The Step 4 regex was hardcoded to v0.0.x. NemoClaw is on that line today, but the skill should keep working when it moves to v0.1.x or higher. Generalized to `\bv\d+\.\d+\.\d+\b` with a context-anchored `\b\d+\.\d+\.\d+\b` fallback on lines containing nemoclaw or version (so the loose form doesn't match `0.0.0.0:11434` and `127.0.0.1`). 2. Some bug reports cite NemoClaw versions that were never tagged (e.g. `v0.1.0` × 3 issues, calver `2026.3.11` × 1). Without a tag existence check the skill would point Brev verification at a version that does not exist. Added a `git ls-remote --tags` validation pass that drops the version if no matching tag is found. Also added an implementer note documenting a real failure mode caught during the dry-run: a naive `[scan(primary)] | first | .[0] | tonumber // [scan(fallback)] | ...` pipeline silently dropped 9 valid candidates because `null | first` errors when scan returns empty, and the `//` chain did not propagate cleanly to the fallback. The SKILL.md regex itself was correct — the failure was implementation-level fragility in how the two passes were composed. The note tells implementers to bind each pass to a named variable, coalesce at the end, and explicitly test the empty-match path. After these changes the live dry-run produces 47 plausible candidates (out of 141 open bugs) and the 4 residual drops are all reporter typos that cite a version that was never released — correctly unverifiable. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Third dry-run round surfaced a precision issue. The previous regex
matched any `\bv\d+\.\d+\.\d+\b` anywhere in the body, then validated
against the tag list. That picked up versions belonging to *other*
products that share the v0.0.x line — most often OpenShell, but also
Node.js (v22.x.x), and other dependencies. Tag validation then
correctly rejected the non-NemoClaw ones, but on bodies where multiple
products were listed (e.g. `Environment: openshell 0.0.4, nemoclaw
0.1.0, Node.js v22.16.0`) the parser would land on OpenShell's
tag-valid v0.0.4 and report it as the NemoClaw version, producing a
false-positive candidate. 12 such collisions exist in the current
backlog.
Fix: require the version to follow `nemoclaw` (case-insensitive) within
80 non-letter, non-newline characters. The anchored regex
`(?i)nemoclaw[^a-z\n]{0,80}v?(\d+\.\d+\.\d+)` matches `NemoClaw
v0.0.32`, `nemoclaw 0.0.28`, `- NemoClaw: v0.0.16`, but not `openshell
0.0.4` followed later by `nemoclaw 0.1.0`. Combined with the existing
tag-list validation, this collapses the four error classes — typos,
calver dates, roadmap labels, and product collisions — into one
"version must be a real NemoClaw release tag, mentioned next to
NemoClaw" check.
Also expanded the implementer note to cover three concrete failure
modes hit during the dry-run:
- Empty-match handling (`[]` flowing into `first | .[0]` errors).
- Capture-group inconsistency between branches in a parser pipeline.
- Variable-scoping bug in `select($tags | index(.))` where `.` rebinds
to `$tags` and silently passes invalid labels.
After this change the live dry-run produces 33 candidates out of 141
open bugs — same count as the very first round, but every candidate's
parsed version is now anchored to NemoClaw and validated against the
real tag list. The earlier wider counts (47, 48) included roughly 12-15
issues that would have been wasted Brev runs.
Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…test A clean run on latest is ambiguous on its own: - Scenario A: bug really got fixed (script worked, latest passes) - Scenario B: script was junk all along (script never triggered the bug; latest passes for the same reason baseline would) These look identical from the latest pass alone, which means the previous flow could land `fixed-on-latest` on a script that never had a chance — a false positive that the +25 commit and +25 PR signals only partially mitigate. This change adds a baseline pass before the latest pass: - Step 8a: install reported version on the box. - Step 8b: run reproducer, compare output to issue's actual-result description. Match = script validated. - Step 8c: if no match (script silent OR errored on the wrong thing), synth-repro using the issue body PLUS the baseline transcript so the LLM can react to the actual failure mode. Apply -30. Retry. Still no match -> verify-inconclusive, skip latest entirely. - Step 8d: install latest, run validated reproducer. Score from this. A single trigger drives synthesis: "the script didn't expose the bug on baseline." Whether it was silent or noisy doesn't matter; both mean the script needs work. This collapses the previous separate "narrative-only -> synth" and "verbatim-failed -> ???" paths into one rule. Step 6 simplified accordingly: just extract verbatim if available, otherwise carry the issue body forward to Step 8b for on-demand synthesis. No more "give up before provisioning" branch — that decision moves into Step 8c where it has more context to work with. Step 9 adds a baseline-validation gating rule. If the reported- version install fails (old releases rot — installer URLs, deps, OS images drift), the score is capped at 84 unless commit-area or PR- mention evidence also fires. That forces the @-mention-reporter band when we lack independent corroboration, which is the honest position when we couldn't validate the script ourselves. Step 11 split into two failure types: - Latest-install fail or harness error -> hard infra failure (no label, optional comment, move on). - Baseline-install fail -> degraded mode (skip baseline gate, run latest anyway, apply Step 9 cap). Wallclock budget bumped from 15 to 25 minutes per verification to accommodate two installs. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Tier-1 (blocking for E2E):
- Add the still-reproduces-on-latest path. If the latest run output
matches the issue symptom, the bug is still live — the skill posts
a "still reproducible" comment with both transcripts, applies no
label, and includes a dated marker `<!-- ... v1 YYYY-MM-DD -->`.
No new label per user direction; a 7-day TTL on the marker handles
re-verification on the next weekly cron.
- Update the comment template for the two-pass flow. The previous
template described a single transcript; the new one has Baseline
and Latest sections plus a dedicated still-reproduces template,
and the idempotency marker now carries today's date.
- Update the Step 12 log template to record baseline-install,
baseline-match, latest-install, and latest-result independently.
- Clarify in Step 4 that REPORTED_VERSION must be the full tag string
("v0.0.32"), not just the patch number (32). Step 8a's installer
expects the full tag.
Tier-2:
- Spell out an explicit match rubric in Step 8b: exit code agreement,
symptom-phrase match (LLM-judged semantic equivalence counts), and
distinguishing the bug from infra noise (DNS / auth / rate limits).
- Replace the minimal `rm -rf ~/.nemoclaw` reset with a comprehensive
one that stops nemoclaw/openshell processes, removes spawned Docker
containers, frees common ports (8080, 18789, 9119), and removes the
installed binaries and lib paths. Run before each install in 8a and
8d; idempotent so it's safe on a fresh box.
- Add interactive-subcommand handling in 8b: auto-detect `nemoclaw
onboard` / `configure` and try `--non-interactive`, then
`--dangerously-skip-prompts`, then stdin pre-feed. Fall through to
Step 8c (synth) if none work.
Tier-3 / opportunistic:
- Fix the actual installer command (gap 9). The installer accepts
the target ref via `NEMOCLAW_INSTALL_TAG` env var, NOT a `--version`
flag (verified against install.sh source). Updated 8a accordingly.
- Update Step 3 idempotency: drop on either label OR a marker
comment within the last 7 days. Previously a marker excluded the
issue forever; the release sweep only clears labels, so the marker
alone made re-verification impossible. The 7-day TTL also handles
the still-reproduces case naturally.
- Extend wallclock budget to 60 min when issue body contains
time-sensitive keywords (`after N minutes`, `eventually`, `memory
leak`, etc.). Hard ceiling at 60 — bugs that require hours fall
out of v1 scope.
- Keep provisioned boxes for 30 minutes on verify-inconclusive
outcomes so a maintainer can `brev shell` in and triage. Reused
boxes always stay.
Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Five concrete fixes from the pre-E2E gap analysis, plus a verified finding on gap 2 that lowered its risk profile. Gap 1 — sudo behavior. The reset script's `sudo rm -f` and `sudo rm -rf` could hang on a password prompt if the Brev image's default user lacks passwordless sudo. Switched all sudo calls to `sudo -n` (non- interactive) so they fail fast instead. Added a precondition note explaining the assumption: Brev stock images allow passwordless sudo, custom images may not. The user-local install path (~/.nemoclaw) is fully reset regardless of sudo state, so the worst case is a stale /usr/local/bin/nemoclaw symlink that the next install overwrites. Gap 2 — verified the install architecture supports old-version pinning natively. The bootstrap clones the requested ref and runs that tag's own install.sh (with a payload-marker fallback for legacy ones), so NEMOCLAW_INSTALL_TAG=v0.0.32 works architecturally. Risk reduces to "does v0.0.32's own installer still work on a 2026 OS image" — handled by Step 11's degraded-mode fall-through. No SKILL.md change needed. Gap 4 — path extraction was hand-waved. The +25 commits-touched-area weight referenced `git log v<reported>..$LATEST -- <path>` without specifying how to determine `<path>`. Added a three-tier extraction procedure: stack-trace path mentions parsed from the body and mapped to repo paths, then a component-label-to-directory map (NemoClaw CLI → bin/, Sandbox → src/lib/sandbox/, etc.), then title-keyword fallbacks. If none yields a path, skip the +25 signal entirely rather than guessing — floating the weight would inflate scores meaninglessly. Gap 5 — PR-search query was hand-waved. The +25 PR-mention weight had no concrete query. Added two-stage search: first `gh pr list --search "$ISSUE_NUMBER"` filtered for actual `#NNNN` references in body or title, then a symptom-phrase fallback if direct reference returns nothing. PRs that merged before the issue was filed are excluded via `mergedAt > tag-date(REPORTED_VERSION)`. Gap 6 — issues without an "Actual result" section had no match path. Behavioral / configuration bugs (e.g., #1242 "should default to a stable released version") describe a wrong default rather than a runtime error. Added a fallback to Step 8b's match rubric: use the title + full body as the symptom, match if reproducer's outcome contradicts the issue's expected behavior. If neither error string nor expected-behavior contradiction can be identified, route to Step 8c (synth-repro) so the LLM produces a more diagnostic script. Gap 7 — transcript redaction was thin. Step 10 only covered tokens, paths, and basic-auth URLs. Real transcripts leak more: Authorization headers, JWTs, GitHub PATs, AWS keys, NVIDIA API keys, internal hostnames, emails (PII), long base64 blobs. Expanded the redaction table with concrete patterns for each. Added an order note: longest/most-specific patterns first so the generic base64 catchall doesn't mask what was actually redacted. Made it explicit that redaction runs on every quoted chunk in the comment, not just issue- body excerpts. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…l SKU pick Preflight Brev auth and install-URL reachability before paying any cost (Step 6.5). For pure-CLI bugs with no sandbox, Docker, or GPU dependency, try the reproducer locally on the maintainer's `nemoclaw` install before provisioning a Brev box (Step 6.7) — same evidence, zero cost. Replace the `<your-team's-CPU-SKU>` placeholder in Step 7 with a runtime pick via `brev search cpu --sort price --json | jq` so the skill keeps working as SKUs change, with a `VERIFY_STALE_CPU_TYPE` env override for teams that need to pin. Drop the spurious `--yes` flag from `brev delete` — it does not exist on that subcommand and would error. `brev delete` is non-interactive by default. Print the instance name before the trap registers so manual cleanup is possible if the trap doesn't fire. Thread `INSTALL_URL` through Step 8a/8d so the URL override from Step 6.5 applies to both the baseline and latest installs. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Some bugs are filed against behavior that was deliberately removed or changed in a merged PR. Running the standard rubric on these produces a misleading verdict — the symptom "still reproduces" but the right answer is "won't fix, see PR #X." Issue #2791 is the prototype: `config set` was removed in #2227, the reporter tested a version that already had it gone, and the standard rubric would have buried that context under a low-confidence `verify-inconclusive` label. Add Step 8.5 with three independent detection signals: - Maintainer-attribution comment phrasing ("removed in #N", "by design", "wontfix", "intentional") from MEMBER/OWNER/COLLABORATOR authors. - A removal commit between reported version and `$LATEST` whose subject matches `\b(remove|delete|drop|deprecate)\b` and whose diff deletes the symbol implicated by the reproducer. - The implicated symbol absent from both reported version and `$LATEST`, meaning it was already gone when the issue was filed. When any signal fires, skip Step 9 scoring and Brev provisioning, apply a new `wontfix-by-design` label, and post a focused comment that links to the responsible PR. Detection on signals 2 and 3 can run as soon as the reported version is parsed, saving Brev cost on the entire class. Generalize the Step 3 idempotency marker regex from `v1` to `v\d+` so future skill versions can re-verify older-marked issues by tightening the regex to require a specific marker version. Add `wontfix-by-design` to the idempotency label list, the activity log entry options, the session summary, and the release sweep in `nemoclaw-maintainer-cut-release-tag`. Update the skill's frontmatter description to mention the new label and the local-first short-circuit. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…var log path The issue body that comes back from `gh issue view` for NV QA bugs is HTML, not markdown — `<pre>...</pre>` blocks and tables, with HTML-encoded entities. The previous Step 6 only mentioned `<pre>` in prose without an extractor that actually parsed it. Add a Python extractor that handles markdown fences (triple-backtick and tilde), HTML `<pre>`, and entity unescape in one pass, so verbatim reproducers from QA-filed issues land cleanly instead of falling through to LLM synthesis with a -30 penalty. Step 10 already had a comprehensive redaction regex table, but the patterns assumed plain text — tokens nested in HTML tags or attributes slipped through unredacted. Add an HTML to text pre-pass for issue-body excerpts before the regex table runs, so quoted excerpts from QA bodies get the same redaction coverage transcripts already do. Step 5's GPU keyword list flagged `inference` and `model serving`, both of which false-positive on CPU bugs (`models.providers.inference.baseUrl` is a config path, not a GPU need). Tighten to whole-word matches and swap in `vllm`, `tensorrt`, `L40S`, `L4`, `T4`. Step 12's activity log was hard-coded to one maintainer's personal organizer path. Read from `VERIFY_STALE_LOG_DIR` with the existing path as default, and require the skill to create the directory rather than assuming it exists, so the skill works in CI / shared-volume / other environments. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Path-extraction map in Step 9 was written from assumption rather than verification — `src/lib/sandbox/`, `src/lib/openshell/`, and `src/lib/policy/` don't exist in the current repo, and OpenShell lives in a separate repo entirely. The +25 commits-touched-component signal would have silently misfired on every Sandbox / OpenShell / policy issue, underscoring real fixes. Replace with paths verified against the current tree (`nemoclaw/src/blueprint/`, `nemoclaw-blueprint/`, `nemoclaw/src/commands/`, etc.) and explicitly note the cross-repo OpenShell case is out of scope for v1. Step 8 reset wiped `~/.nemoclaw` but not `~/.openclaw`. Sandbox state has been writable-by-default under `.openclaw` since #2227, so it persists across the baseline → latest reinstall and contaminates the latest run. Add it to the reset. Anchor the `pkill -9 -f nemoclaw|openshell` patterns to a leading slash (`/nemoclaw`, `/openshell`) so the kill matches actual installed paths but not unrelated processes that mention these strings — including the agent harness running this skill if its working directory contains the word. Reorder the Step 10 redaction table to match the stated "longest, most specific patterns first" rule. The previous order had the generic `(token|secret|password|...)` pattern executing before JWT/AWS/NVIDIA patterns, which would have masked specific-token redactions with generic ones — losing the signal of *what kind* of credential leaked. Replace `git ls-remote git@github.com:NVIDIA/...` (Steps 2 and 4) with `gh api repos/NVIDIA/NemoClaw/tags --paginate --jq '.[].name'`. The SSH path required keys to be configured; `gh api` reuses whatever auth the user already has. Add a CLI-deps precondition check (`command -v gh brev jq python3 curl`) to Step 6.5 so the skill fails fast if any required dependency is missing, instead of failing late and confusingly mid-run. Step 11's keep-box-on-inconclusive said "delay cleanup by 30 minutes" but had no implementation. A backgrounded `sleep && brev delete` doesn't survive session end. Replace with skip-the-trap-entirely plus an explicit manual-cleanup reminder printed to the run output. Out-of-Scope was contradicting itself on macOS — Step 6.7 explicitly allows macOS for local-first runs. Clarify: Brev path is Linux-only, the local-first short-circuit on a maintainer's laptop works on macOS for manual single-issue runs. Add a `local (no Brev — Step 6.7 short-circuit)` option to the Step 12 log entry's Box field so local-first runs round-trip the activity log correctly. Frontmatter description said "latest release" but NemoClaw uses tags, not GitHub Releases — fix to "latest tag" to match Step 2. Step 4's implementer note said "Two real failure modes" but listed three; fix the off-by-one. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…-verification Step 8.5 was producing comments that were correct in direction but hand-wavy on substance. A side-by-side against an independent analysis of #2168 surfaced the gap: that analysis cited specific file:line locations, separated the literal bug-as-filed from a related failure mode that still exists on latest, called out existing CI coverage of the new workflow, and self-corrected an overstatement when prompted to re-verify. Today the skill required none of that. Restructure Step 8.5 into substeps so the rigor is mechanical, not prompt-dependent: - 8.5a: signal detection now requires verifiable evidence per signal — comment URL + quoted phrase (signal 1), commit SHA + diff range (signal 2), grep commands + outputs (signal 3). Add a sub-case for vestigial deprecation shims so a stub doesn't silently fail signal 3 by appearing in latest. - 8.5b: pre-check for related failure modes. Grep latest for the bug's symptom keywords (not the removed symbol) and require the comment to call out anything that surfaces. This is the section the side-by-side identified as missing — saying "the bug as filed can't reproduce" is not the same as "every bug shaped like this is fixed." - 8.5c: check existing test coverage for the new workflow and cite up to three test paths if found. Strengthens the comment from "trust me, it was removed" to "the new workflow is exercised by these tests." - 8.5d: explicit self-verification pass — re-run every cited command before composing the comment; bail to verify-inconclusive on any discrepancy. LLMs confidently overstate; mechanical re-verification catches it without needing a human prompt. Replace the by-design comment template with one that has mandatory sections matching the substeps: "What's structurally fixed", "Vestigial references", "What's not literally the same bug", "Existing CI coverage", "Recommendation". Each section either has concrete content or is explicitly omitted — no hand-wavy claims slip through. Add NVBugs cross-reference extraction to Step 4 (`grep -oE '\[NVB#[0-9]+\]'`) so the by-design template can append the standard "NVBugs#NNNNNNN will need a separate update; closing this GitHub issue won't propagate" reminder when the issue body carries that footer (most NV QA bugs do). Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
The previous signal-2 candidate query narrowed by commit subject (`remove|delete|drop|deprecate`) before checking whether the diff deleted the implicated symbol. That filter excludes the common case where a removal lands inside a `refactor(...)` or `feat(...)` commit — e.g., PR #2227 removed `--dangerously-skip-permissions` under a "refactor(sandbox): default to mutable config" subject. A literal follower of the previous rule would have concluded signal 2 doesn't fire on issue #2168 even though the flag was deliberately removed. Switch the primary candidate lookup to `git log -S<symbol>` pickaxe search, which finds every commit whose diff changes the count of the symbol regardless of subject wording. Keep the subject-keyword query as a supplementary lookup for narrowing when the pickaxe returns many commits. Note explicitly that the commit's actual subject doesn't need to mention removal. Surfaced while testing the new Step 8.5 rigor against #2168. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…horing Two changes that surfaced from running Step 8.5 against issue #2168. Use the existing repo `wontfix` label for the by-design path instead of creating a new `wontfix-by-design` label. The repo already has `wontfix` and it's already in Step 3's issue-type skip list, so applying it gives idempotency for free without proliferating labels. The by-design verdict is encoded in the marker comment, not the label vocabulary. Important consequence: the release sweep in `nemoclaw-maintainer-cut-release-tag` does NOT clear `wontfix` — that label is also applied for non-skill reasons (scope, priority, dup decisions made by maintainers), and sweeping it would erase human triage work. Sweep stays scoped to `fixed-on-latest` and `verify-inconclusive`, which are skill-only. Add a `**Verification mode:**` line to the by-design comment template that explicitly says "static analysis at the verified-on tag — no runtime reproduction." This was the honesty an independent side-by-side analysis of #2168 caught: the previous template implied runtime confirmation even though Step 8.5 short-circuits Brev provisioning by design. Add a tag-anchoring rule above the template too — every `file:line` cite must refer to the verified-on tag, not the maintainer's HEAD, since line numbers drift between tags and main and we want comments to stay reproducible by anyone reading them later. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…ion check Bare `file:line` paths in comments force the reader to navigate manually — that's a usability bug, not a stylistic preference. Comments posted by the skill should be self-serving artifacts: a maintainer or QA reader should be able to click any citation and land on the exact content at the verified-on tag. Extend the tag-anchoring rule above the by-design template into a "tag-anchoring + linking rule" with concrete formats for file blobs, file:line, file:line-range, commit SHAs, and test paths. Note that bare `#NNNN` PR/issue references already auto-link in same-repo comments. Strengthen Step 8.5d into two passes: evidence (re-run cited commands) and link (resolve at least one rendered link per evidence section via `gh api .../contents/<path>?ref=<tag>` or `curl -fsI`). A 404 on a citation suggests verification that didn't actually happen — worse than no link at all. Bail to verify-inconclusive on any failure. Surfaced while reviewing the rendered #2168 comment: the citations were correct and tag-anchored but not clickable. Better caught now than after the first batch run posts a wall of bare paths. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Discovered while applying the by-design label to issue #2168: the skill referenced `wontfix` and `needs-info` as bare strings, but the repo's canonical labels are `status: wont-fix` and `status: needs-info` (with prefix and hyphen). Same shape in Step 3's issue-type skip list — issues labelled `status: wont-fix` or `status: needs-info` were NOT being filtered out at candidate-time as the spec intended, because the skip list keys didn't match the live labels. Three coordinated fixes: 1. Step 3 issue-type skip list now uses `status: wont-fix` and `status: needs-info`. Annotate the rule so future readers know not to substitute the bare forms. 2. Step 8.5 by-design action applies `status: wont-fix` (with quoting on the CLI). Step 12 log entry, session summary, frontmatter description, and Companion Behavior section in both the verify-stale skill and `nemoclaw-maintainer-cut-release-tag` updated to match. 3. Step 6.5 preconditions now verifies all expected labels exist on the repo with `gh label list` before any later step tries to apply them. This is the gap that bit us on #2168 — the comment posted but the label couldn't be applied because the spec name didn't match. Failing fast in 6.5 means the run aborts before any Brev cost or comment posting can happen against a misconfigured label vocabulary. Bare `wontfix` / `won't fix` / etc. are still in Signal 1's detection regex — that's intentional, those are SEARCH PATTERNS for what maintainers might write in free-text comments, not label names. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…s for faithfulness Two changes that landed before the first real Brev E2E run (#2007). The previous wallclock cap split — 25 min default with a 60 min extension keyed off "memory leak" / "over time" / "eventually" keywords — optimised for the wrong constraint. Most issues fit comfortably under 60 min once you account for two full install passes plus comprehensive resets, and the keyword-based extension forced re-runs whenever a real install or bootstrap took longer than the optimistic 25 min budget. Bump the cap to a single 60 min default and drop the keyword extension; bugs that genuinely require more than an hour fall out of v1 scope. Add a new Step 8a.5 for bootstrapping reproducer dependencies that the stock Brev image doesn't ship — local model servers (Ollama, vLLM), provider runtimes, third-party CLIs. Default policy is maximum faithfulness: install the actual dependency the reporter used rather than substituting a stub. Substituting trades faithfulness for speed, and on a 60 min budget that trade is rarely worth it; it almost always introduces a confound that makes the verdict less trustworthy. Document the canonical Ollama and vLLM bootstraps with specific attention to daemon survival between `brev exec` calls (Ollama's installer registers a systemd service; vLLM uses nohup + log file). Bootstrap runs once before Step 8b's baseline and is reused for Step 8d's latest run — model downloads are expensive and external state, so the comprehensive reset must explicitly leave them alone. Carve out one substitution case: providers that require API keys the skill cannot safely supply (NIM, OpenAI, Anthropic). Stubbing a key defeats faithfulness anyway, and a real key shouldn't sit in a verify-stale run. Substitution applies the same -30 LLM-synth penalty as Step 8b and must be documented in the rendered comment. Bootstrap failure escalates to Step 11 infra failure — do NOT silently substitute, since the maintainer opted into faithfulness for a reason. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…get on every comment Two requirements surfaced from the #2007 e2e run that should be enforced by the skill, not by per-session memory or by the prompter remembering. Mandatory reporter @-mention with confirmation language. The skill cannot independently confirm a closed-as-fixed verdict — only the reporter knows whether their original symptom is gone in their environment. The @-mention is what converts a "skill says it is fixed" claim into actionable confirmation work for QA. Add the explicit closing block (canonical wording: "please confirm the symptom is gone on a recent build and reopen with a fresh reproducer if you observe otherwise") to all three Step 10 templates: fixed-on-latest, still-reproduces, and the Step 8.5 by-design template. The previous "If this verification is wrong, please reopen..." line was passive; the new line names the reporter and asks them to act. Length target. Default rendered comments to 400-500 words. The evidence table or by-design fixed/vestigial sections are the hero; everything else has to either change the reader's mind about the verdict or be deleted. The first #2007 draft ran ~750 words and the user pushed back explicitly; encoding the cap into Step 10 means future agents do not start from scratch on length each run. Surfaced from: e2e Brev verification run on issue #2007 (the first real Brev-track exercise of the skill end to end). Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Six concrete gaps that surfaced during the first end-to-end Brev run (issue #2007). Each almost produced a wrong verdict or a wasted run; each is now encoded so the next agent doesn't have to re-discover. 1. Step 9 baseline-validation cap-removal rule was too permissive. The `unless commits-touched OR PR-mention also fires` escape hatch let inferred fix evidence override the absence of a baseline run, which produced a misleading 100/100 on #2007 despite zero baseline confirmation. Tightened: cap at 84 holds regardless of corroboration; PR-search signals raise the score within the cap, never past it. Step 10 now requires an explicit one-line caveat naming the cap and the reason in the rendered Verdict section. 2. Step 6.5 install URL default switched from `nemoclaw.nvidia.com` (NVIDIA-internal, doesn't resolve from Brev) to `www.nvidia.com/ nemoclaw.sh` (public Akamai 301 redirect). Brev runs were dying at bootstrap on the wrong host. 3. Step 7 CPU SKU picker now biases by reproducer-implied memory needs. On #2007 the cheapest 2 GB SKU couldn't load a 4.8 GiB Ollama probe; onboard failed at provider validation and we burned ~25 min before re-provisioning a 16 GB box. Adds a `CPU_RAM_FLOOR` env var and a memory-floor heuristic (16 GB when reproducer references a model server, 8 GB for sandbox onboarding without a model, 4 GB for pure-CLI bugs). 4. Step 11 failure taxonomy now has a "baseline-build rot" bucket distinct from "binary install rot." Same `BASELINE_INSTALL_FAILED=1` flag and same downstream cap-and-degrade behavior, but failures at the in-image Dockerfile build phase (what we hit on v0.0.18 — the `.openclaw-data/workspace/media` symlink layer, removed entirely by #2227) get a separate label so reviewers can see *why* the old image no longer builds without re-running. 5. New Step 8a.5b documents the two non-obvious `brev exec` quirks that reproducer scripts have to handle every time: PATH does not include `~/.local/bin` in non-login shells (so reproducers must `export PATH` at the top, or callers must wrap with `bash -lc`); and the docker group requires `sg docker -c '...'` because adding the user via `usermod -aG` doesn't take effect within the same Brev session. 6. New Step 8d.5 architectural-drift check. When the diff between `$REPORTED_VERSION` and `$LATEST` touches the *tool* the reproducer's expected output depends on (e.g. `openshell forward` between v0.0.18 and v0.0.35), don't trust the reproducer's surface alone — multi-axis verification on OS-level surfaces (host listeners, NAT rules, docker ports, SSH tunnels, etc.) is required before claiming fixed-on-latest. This is the five-axis pattern we used to confirm #2007 wasn't a false positive; codifies it as a check the skill applies whenever pickaxe shows the reproducer's tool was reworked. Surfaced from: end-to-end Brev verification of issue #2007. Build the skill, exercise it on a real issue, fix what breaks, repeat — every fix here came from a concrete failure mode in one real run. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…ll host Two small UX fixes surfaced from a thorough re-read. Step 6.5's brev-auth error printed three numbered options without a clear recommendation. From a non-TTY agent harness (Claude Code or any unattended runner) the right answer is option 2 — open a separate terminal, run `brev login`, complete the browser flow, come back. The previous message buried that path inside option 2's inline comment. Replace with a directive recipe that names the recommended flow first (open separate terminal, browser auth, re-run the skill) and lists the headless alternatives below as "when option 1 isn't available." Also explicitly notes that credentials persist to ~/.brev/credentials.json, so re-running the skill picks them up automatically. Step 6.5's install-URL error still suggested checking `https://nemoclaw.nvidia.com` even after the default was switched to `https://www.nvidia.com/nemoclaw.sh` in commit 296f7cd. Update the suggestion to match the new default and add an explicit "then re-run this skill" so the maintainer knows the flow is fix-and-retry, not abort. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Five gaps verified against the source during a thorough re-read pass.
The audit also caught two false-positive claims I'd made earlier
(`NEMOCLAW_INSTALL_TAG` env var and `comments` field in batch fetch)
that didn't survive verification — those are not landed because there
was nothing to fix.
Step 1 batch discovery query bumped from `--limit 100` to `--limit 1000`
so the skill doesn't silently drop issues beyond the page. The recent
candidate triage on this repo found 129 open bugs; the previous limit
would have lost 29 of them. The 20-issue per-run processing cap is
downstream of Step 3/4 filters and unaffected.
Step 3's 7-day comment-marker TTL had the rule but no implementation.
Add a portable `gh issue view --json comments --jq` snippet that
extracts marker comments newer than the cutoff. macOS and Linux
date(1) syntax differ for relative-date math, so the cutoff line
tries both. Without this, the first cron run after any posting will
silently re-verify everything previously marked, posting duplicate
comments every Monday.
Step 6.5 preconditions now checks gh identity via `gh api user --jq
.login` and prints the resolved login before any later step runs.
Comments posted by Step 10 land under this account; surfacing it
explicitly catches the multi-token / wrong-tab / mid-session-reauth
case before a public comment lands under the wrong handle.
Step 10 now requires a `**Verification mode:**` header line in every
template (was previously by-design only). Reader should never have to
guess whether a verdict came from real install logs or from static
analysis. Filled in concrete defaults for the standard
fixed/inconclusive template ("runtime reproduction; baseline + latest
installed and run") and the still-reproduces template ("runtime
reproduction; bug confirmed live").
Step 10 also requires a link-pass self-verification on every
template, not just by-design's Step 8.5d. The "Tag-anchoring + linking
rule" already declared every citation must be a clickable markdown
link to the verified-on tag, but the verification step that resolves
those links (`gh api .../contents/<path>?ref=<tag>` or `curl -fsI`)
was scoped only to the by-design path. Same 404-cite risk applies to
the standard template — broken citation links advertise verification
work that didn't happen, and that's worse than no citation at all.
Surfaced from: end-to-end audit after the #2007 e2e run, with the
specific `gh issue view`/`grep` commands run against the SKILL.md to
confirm each claim before listing it as a gap.
Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Two changes that together turn the per-run batch cap from policy into code. Lower the per-batch processing cap from 20 to 15. Sequential execution with 1-2 reused Brev boxes works fine for runs in this size range; ~2-3 hours wallclock for a full 15-issue batch is the comfortable budget before either the per-plan approval gate breaks down or the maintainer session needs to span multiple sittings. Add the slicing code that actually enforces the cap. Previously Step 1 declared "Cap at N issues per run" as policy and nothing applied it — candidate sets larger than the cap would just process to completion silently. Move the slice to the end of Step 4 (after Step 3 label filters and the Step 4 version+candidate-rule filters narrow the pool) and sort by `(-versions_behind, -age_days)` so the most stale come first. Spillover beyond 15 stays eligible for the next run via the marker-comment 7-day TTL added in 8976176. Single-issue mode bypasses the cap entirely; the maintainer named the issue explicitly. Cadence section updated from "≤20 issues" to "≤15 issues" to match. Surfaced from the post-#2007 audit: of the 13 gaps initially called out, the cap-enforcement one was operational toil (skill could chew through cost on a runaway batch). Lowering and enforcing closes that toil path. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Looked at the 24 candidates surfaced by the recent triage run by *kind* of bug rather than by version, and found six gaps where the standard rubric would produce a wrong verdict or hang. All six landed in this commit. Step 3 platform skip: drop `Platform: Jetson AGX Thor/Orin`. Brev has no equivalent silicon (embedded/edge ARM with integrated GPU is not in the SKU catalog), so any Brev verification of a Jetson-only bug produces a misleading "fixed-on-x86" verdict. `Platform: DGX Spark` and `Platform: GB10` stay in scope but Step 10 now requires a hardware-substitution caveat naming the Brev SKU we substituted. Step 3 TUI / interactive-UI skip: drop issues whose title contains TUI / dashboard UI / chat UI / keystroke / key press, or whose body describes interactive UI behavior without a non-interactive reproducer. `brev exec` does not allocate a real TTY by default; TUI reproducers hang or silently fail at the first prompt. v1 documents this as out-of-scope; v1.1 may add a script(1) / expect / tmux harness. Step 5 now classifies bug-class in addition to CPU/GPU. Four classes: `performance`, `rebuild-cycle`, `log-only`, `functional` (default). Detection heuristics from the issue body (latency thresholds, "across rebuilds" / "after restart", "see lots of error in <X> log"). Each class routes to a different Step 8 rubric. Step 8b: log-scraping. When `BUG_CLASS=log-only`, also pull `~/.openclaw/logs/*.log` and `/var/log/nemoclaw/*.log` from inside the sandbox after the reproducer runs and search them for the issue's symptom phrase. Some bugs describe symptoms in internal log files, not the reproducer's stdout; the previous match rubric only checked the transcript. Step 8b: flake-detection retry. For functional bugs, run baseline three times if the first run shows the symptom inconsistently. Mixed results (1 or 2 of 3 reproduce) trigger a "flake suspected" caveat, −25 score adjustment, and downgrade `+50 latest clean` to `+25` so a lucky-clean-latest-run on an intermittent bug doesn't silently become a fixed-on-latest verdict. New Step 8e: performance-bug verification. Multi-run latency distribution rubric — N=10 runs each side, compute p50/p90, parse SLA from issue body, match latest's distribution against the SLA rather than against a symptom phrase. Cap at 60 unless the bug is silicon-independent because Brev SKUs aren't faithful to DGX Spark or GB10 hardware for performance-shape bugs. New Step 8f: rebuild-cycle verification. Run-rebuild-rerun harness for bugs that only manifest across destroy/recreate boundaries (#2701-shape). Capture artifacts pre-rebuild, trigger `nemoclaw destroy --all --force && nemoclaw onboard`, re-capture post-rebuild, diff to determine whether the artifact persisted as the issue expects. Step 10 mandatory hardware-substitution caveat: when DGX Spark or GB10 is in the issue's labels and Step 7 substituted with a different silicon class, the comment metadata block must name the substitution explicitly so silicon-shape bugs get the right reader expectation. Surfaced from: stress-testing the skill against the actual shapes of bugs in the 24-candidate set, not just against the bugs we already verified. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Six gaps surfaced from the second e2e run (issue #2592) and the broader audit of how the skill handles provider classification. Step 5 now classifies provider in addition to CPU/GPU and bug-class. Detection signals from labels and body keywords identify NIM, Gemini, Anthropic, Bedrock, or Ollama. When the reproducer references a non-Ollama provider AND actually exercises inference, the skill stops at Step 5 and prompts the maintainer interactively before any Brev cost — three options: provide the API key, accept Ollama substitution with the -30 penalty, or skip to verify-inconclusive. Pure-CLI / pure-sandbox bugs are exempt because the provider doesn't matter. Step 6 reproducer-extraction regex extended to also match `openclaw` and `openshell` invocations as anchor words. Issue #2592's reproducer was `openclaw channels add telegram` run inside the sandbox; the previous `nemoclaw`-only regex would have missed the verbatim block. Step 8a now passes the provider env vars through to install.sh's bundled onboard step, so it doesn't fall back to the default `build` (NIM) provider and fail with the misleading NVIDIA_API_KEY missing error. NEMOCLAW_PROVIDER=ollama (the default case) makes the bundled onboard use the local Ollama set up in Step 8a.5; NVIDIA_API_KEY only flows through if the maintainer provided one at Step 5's prompt. The bundled onboard creates a throwaway sandbox that gets destroyed before the reproducer runs. The reproducer's own onboard should pass `--fresh` so a half-built install-sandbox doesn't trip the "previous session failed" guard. Step 8a.5 now has an Ollama-coverage table making explicit which bug classes Ollama covers faithfully (CLI, sandbox, networking) vs which ones it doesn't (provider-specific behavior, model-specific behavior, silicon-dependent perf). Step 5's prompt keys off this table. Step 8a.5b documents the openshell-sandbox-exec syntax footgun. The correct form is `openshell sandbox exec -n NAME -- CMD`; the wrong form silently auto-detects the sandbox by "last used" and stuffs the leftover positional into bash's $0. Same section gets a brev-exec re-execution guard via a sentinel file at ~/.verify-stale-running to prevent the double-onboard scenario when SSH drops mid-run. Step 11 reframes baseline-build-rot as the dominant failure mode for any reported version >5-7 patches behind, not an edge case. Both Brev e2e runs hit it. Cap-at-84 with reporter @-mention is the modal verdict shape, not the exception — pre-flight carries more weight than baseline runtime evidence for older bugs. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…sandbox name Two fixes from a rot-debugging investigation that nearly produced a wrong conclusion. When testing whether old NemoClaw versions actually rot under the literal end-user invocation (`https://www.nvidia.com/nemoclaw.sh`), the first test run was botched by a shell-scoping mistake: `NEMOCLAW_INSTALL_TAG=v0.0.26 curl ... | bash` scopes the env var to curl, not to the downstream bash reading from the pipe. The bash side defaulted to `latest`, resolved to v0.0.36, and the install proceeded for several minutes producing convincing-looking output before the mistake surfaced. The install script had no log line saying which version it actually resolved, so the silent failure looked like a working install of v0.0.26. Skill-side fix: after each install (Step 8a baseline and Step 8d latest), echo the resolved `nemoclaw --version` and case-match it against the requested version. Mismatch in baseline sets BASELINE_INSTALL_FAILED=1 to prevent verifying against the wrong version. Mismatch in latest logs a WARN that gets surfaced in the final comment. Cheapest possible guard against the class "I asked for X, got Y" — and the principle generalizes: print the resolved state, never trust the requested state. Also fix the sandbox name in the install.sh-passthrough block: change `NEMOCLAW_SANDBOX_NAME=__verify_stale_install__` (rejected by NemoClaw's name validator: "Allowed format: lowercase, starts with a letter, letters/numbers/internal hyphens only, ends with letter/number") to `NEMOCLAW_SANDBOX_NAME=verify-stale-install`. Surfaced during the #2519 e2e run; net-zero impact on that run because install.sh fell through on name validation rather than reaching the buggy code path, but the spec was wrong. The deeper context for the resolved-version-echo fix: when the rot hypothesis was being tested, an independent agent's logical analysis correctly identified that the public install URL DOES support `NEMOCLAW_INSTALL_TAG=<ref>` and demanded an empirical re-test. The re-test (with the env var on the bash side of the pipe) installed v0.0.26 correctly and STILL hit `Patch 4 (replaceConfigFile EACCES) not applied` — so the cap-at-84 framing held empirically, just not for the original "git-clone is the only path" reason. The full investigation is documented in findings.md Part 6's closing section ("Closing investigation: is baseline-build rot real, or our setup artifact?"). Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Add a Step 3 skip rule for issues where a MEMBER/OWNER/COLLABORATOR comment was posted within the last 7 days AND the original reporter hasn't replied since. The skill is for *stale* issues; actively- discussed ones are in-flight, and posting a verify-stale verdict on top of an open clarifying question from a maintainer would conflict with their framing and confuse the reporter (now they have two simultaneous asks). Surfaced during pre-flight on issue #2757. The maintainer @cjagwani had just commented questioning the bug's premise — specifically that the reporter's "kill -9 took the parent down" framing didn't line up with how the gateway is launched (`nohup ... &` detaches it) — and asked the reporter to confirm what they actually observed on the Station. Running verify-stale would have posted a "by-design, close as wontfix" verdict on top of an active "wait, did this really happen?" exchange. Wrong move; the skill should detect this state and skip. Implementation reuses the SEVEN_DAYS_AGO cutoff already computed for the marker-TTL check, fetches the reporter via `gh issue view --json author`, then jq-walks the comments array: find the most recent maintainer comment within the cutoff, check whether any reporter comment exists after that timestamp; if maintainer commented recently AND no reporter reply since, skip. Single-issue mode prints the skip reason and exits friendlily so the maintainer knows why the skill bailed. Batch mode just moves on. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…argv The post-#2592 commit (1d625dd) added a Step 5 prompt asking the maintainer to "Export NVIDIA_API_KEY=<key>" and propagate it via `brev exec`. During the #2604 e2e run that played out as `NVIDIA_API_KEY=<value> brev exec ...` — which puts the literal key in the brev exec process's argv, visible in `ps -ef` to anyone with shell access on either the maintainer's laptop or the Brev box for the duration of the run. The skill's stated promise ("never logged") got violated by argv visibility. Fix: switch propagation to file-based. Maintainer writes the key to ~/.nvidia-api-key with 600 perms on their laptop; Step 6.5 brev-copy's the file to the Brev box (encrypted SSH); install / reproducer scripts inside the box read with `NVIDIA_API_KEY=$(cat ~/.nvidia-api-key)`, which sets the env var in the script's own process — never on a command line and never visible in `ps -ef`. Update Step 5's option-1 prompt to instruct the maintainer to use the file form and to `rm ~/.nvidia-api-key` after the run. Add a new "API-key propagation pattern" section after Step 5 that documents the file-based mechanism explicitly. Update Step 8a (baseline install) and Step 8d (latest install) to source the key from the file inside the Brev box's exec context, not from the local shell's env. Note in the docs: if the key was previously propagated via the cmdline (pre-fix), treat it as exposed and rotate. The #2604 run did this; the maintainer was reminded to rotate the NIM key after the run. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Replace Step 10's narrow "Length target" rule with a richer "Comment authoring principle" block that captures lessons accumulated across the e2e runs. The core rule: every section in a rendered comment must either change a reader's mind about the verdict or be cut. Word counts follow from that — 500 is a ceiling, not a target. Most comments land in 200-400. Simple cases land under 200. Document why this matters: comments posted by the skill compete for a maintainer's attention against every other in-flight thread on the repo, and AI-slop prose actively reduces signal-to-noise. Add a worked-examples block citing two real iterations: - #2007 first draft was 750 words; cut to 371. - #2604 took THREE drafts before settling on 190 words because each draft padded with prose that didn't ground the verdict — which caused the verdict itself to drift. Rule learned: name the verdict in one sentence first, then cut any section that doesn't support it. Add per-verdict length targets: - fixed-on-latest: 200-400 words - wontfix (by-design): 250-500 words - verify-inconclusive: 100-200 words - Still-reproduces (no label): 30-80 words — and no transcripts (issue body has them), no @-mention (reporter knows), no architectural prose. One sentence + marker. Add an explicit "cut, by default" list naming the patterns that historically padded comments without changing verdicts. Generalizes the existing memory note into the skill body so future agents reading SKILL.md inherit the principle directly. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
… ceiling Followup to fe363cf. The 200-400 / 250-500 ranges in the per-verdict table were still too permissive. Tighten to 200-300 for both fixed-on-latest and wontfix; verify-inconclusive stays at 100-200; still-reproduces stays at 30-80. Update the principle preamble: "300 is a hard ceiling for the main verdicts." Simple cases land under 200. If a draft is past 300, it's padding — cut before re-reading. Memory note (feedback_verify_stale_comment_length.md) updated to match. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…d-question variant Previously skipped any issue with a maintainer comment in the last 7 days the reporter hadn't replied to. After 7 days the question is no longer "active discussion" — it's a stuck thread, and the skill can be the unsticking voice rather than a clueless interruption. Step 3: classifies the most-recent-unanswered-maintainer-comment as recent (within 7d → skip) or stale (>7d → proceed with variant), exporting UNANSWERED_MAINT_LOGIN/URL/DATE for the templater. Step 10: new mandatory block describing the variant — prepend an "@<maint>'s comment from <date> is still unanswered" lead paragraph and swap the closing reporter-only @-mention for a dual maintainer+reporter @-mention that flags the open question. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Layer 1 (local → Brev) was fixed in 22a3997 via brev copy. Layer 2 (on-box subshell) was unaddressed: scripts running on the Brev box that called sg docker -c "...$NVIDIA_API_KEY..." re-introduced argv exposure because the outer double-quoted heredoc interpolates the key value into the inner shell's argv, visible in ps -ef on the box for the onboard duration. Surfaced during #2611 e2e run when the just-rotated key landed verbatim in `sg docker -c` argv. The right pattern: escape the $ so the inner shell evaluates `cat ~/.nvidia-api-key` itself. The skill section now documents both layers with concrete WRONG / RIGHT side-by-side, and generalizes the rule to any command-string- taking invocation (sg, bash -c, su -c, ssh). Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three skill paragraphs used `[label](placeholder)` syntax to describe the shape of a markdown link the templater would render. The link checker parses bracket-paren syntax everywhere except fenced code blocks (single backticks aren't enough), so `(<url>)` and `(<UNANSWERED_MAINT_URL>)` were flagged as broken local links. Restructure: fenced code blocks for the literal template line in Step 10, plain prose in Step 3 referencing the Step 10 template instead of duplicating the link shape inline. Verified locally with `bash test/e2e/e2e-cloud-experimental/check-docs.sh --only-links --local-only` — passes cleanly. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.agents/skills/nemoclaw-maintainer-cut-release-tag/SKILL.md:
- Around line 135-139: The shell loop that pipes issue numbers into xargs can
run a bogus gh issue edit when input is empty; update the pipeline that iterates
labels (the for label in fixed-on-latest verify-inconclusive; do ... gh issue
edit {}) to make xargs safe by adding the -r option (or otherwise guard on
non-empty input) so xargs will not run when there are no issue numbers, and
apply this change in the original docs source that generates the .agents/skills
markdown (not the generated SKILL.md).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5ce3439c-9759-45a0-a42b-14bbd316dfca
📒 Files selected for processing (3)
.agents/skills/nemoclaw-maintainer-cut-release-tag/SKILL.md.agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md.agents/skills/nemoclaw-skills-guide/SKILL.md
When no open issues carry `fixed-on-latest` or `verify-inconclusive`, the previous pipeline still ran `gh issue edit` once with empty stdin, producing a noisy failure. `xargs -r` skips the invocation entirely when input is empty. CodeRabbit suggestion (PR #3063 review). The skill is hand-authored (not autogenerated from docs/), so the fix lands directly in .agents/skills/. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rogressive disclosure Per Anthropic's skill-authoring best practices guide, SKILL.md body should stay under 500 lines so it doesn't dominate context once loaded. The verify-stale SKILL.md was 1505 lines — 3× the target. Single split at the natural decision-vs-execution boundary (between Step 6.7 "try local first" and Step 7 "reuse or provision Brev"): SKILL.md (497 lines): Steps 1-6.7 — mode, latest detection, candidate filter, version parsing, environment classification, reproducer extraction, preconditions, local-first short-circuit. The decision tree for whether and how to verify. reference/execution-and-comment.md (1042 lines, with TOC): Steps 7-12 + Cadence + Out of Scope + Companion. Brev provisioning, baseline + latest verification, by-design detection, scoring, comment templates, redaction, infra failure, activity log. The implementation detail loaded only when a Brev run is committed to. The reference file's TOC at the top satisfies the >100-line reference file rule from the doc; single-level depth (no nested references) satisfies the "keep references one level deep" rule. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
@.agents/skills/nemoclaw-maintainer-verify-stale/reference/execution-and-comment.md:
- Around line 52-53: The current RUNNING computation counts all verify-stale-*
instances from INSTANCES; update the jq selection to count only instances whose
name starts with "verify-stale-" AND whose status indicates they are running
(e.g., .status == "RUNNING") so the concurrency cap checks only active boxes;
modify the expression that sets RUNNING (the jq filter applied to INSTANCES) to
include the running-status predicate.
- Around line 316-318: The reproducer is executed via brev exec with a plain
non-login bash which can miss ~/.local/bin and Docker group context; update the
two brev exec invocations that run "bash ~/reproducer.sh" (both the baseline and
latest run occurrences) to invoke the script through a login shell and wrapped
with sg docker -c (or otherwise explicitly export/include $HOME/.local/bin in
PATH) so the ~/.local/bin and docker group membership are available during
execution, then tee the output to the same
baseline-transcript.log/latest-transcript.log as before.
- Around line 767-768: The credential-redaction regex currently uses escaped
pipes so it matches literal "|" instead of alternation; update the pattern
`(?i)(token\|secret\|password\|api[_-]?key\|bearer)[^\n]*[:=][^\n]*` to use
actual alternation (e.g.
`(?i)(?:token|secret|password|api[_-]?key|bearer)[^\n]*[:=][^\n]*`) and consider
adding optional whitespace around `[:=]` or word boundaries if needed to
reliably catch `token=...`, `password: ...` and `api_key = ...`; change the
regex where referenced so functions/filters that use this pattern (the
credential redaction rule) use the corrected expression.
In @.agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md:
- Line 3: Update the description string in SKILL.md to use the canonical label
syntax: replace the occurrence of "status wont-fix" with "status: wont-fix" so
it matches the rest of the skill set and avoids operator confusion; edit the
description field near the top of the file (the `description:` paragraph) to
make this single-word change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2127a531-9c7d-4192-9fed-193e68fa3de8
📒 Files selected for processing (2)
.agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md.agents/skills/nemoclaw-maintainer-verify-stale/reference/execution-and-comment.md
…Project 199 Adds tracker integration: when the verdict is fixed-on-latest, the skill also moves the issue to "Needs Review" on the NemoClaw Development Tracker (NVIDIA Project 199), so the maintainer's review queue picks it up for confirmation. After the maintainer closes the issue, existing project automation (or a manual move) advances it to Done. Other verdicts (wontfix, verify-inconclusive, no-label-still-reproduces) do not move the tracker — they have separate close paths. Constants captured at write time (re-run gh project field-list 199 --owner NVIDIA --format json if the project drifts): Project ID: PVT_kwDOABpemM4BSCP5 Status field ID: PVTSSF_lADOABpemM4BSCP5zg_r9p8 Needs Review opt ID: 5c5922a9 Step 6.5 gains a one-line precondition that warns (does not fail) when the gh CLI lacks the 'project' scope, so the auth gap surfaces before Step 10 instead of mid-run. Step 12's activity-log template gains a Tracker: row recording the move outcome. The tracker step is gated to fixed-on-latest only — wontfix and verify-inconclusive verdicts do not benefit from "Needs Review" since they're either final-state (wontfix) or different review flow (verify-inconclusive); attempting to move them would create churn. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
♻️ Duplicate comments (4)
.agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md (1)
3-3:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUse canonical label spelling in the description.
Line 3 says
status wont-fix, while the documented/actual label isstatus: wont-fix. Keeping the canonical string avoids operator mistakes when applying labels manually.Suggested fix
-description: Verify whether old NVIDIA/NemoClaw bug reports still reproduce against the latest tag. Picks candidate issues opened against older versions, runs the reproducer locally first when possible (Linux or macOS), otherwise reuses or provisions a Brev Linux box (CPU or GPU), detects behavior that was intentionally changed, scores confidence, and posts an evidence-backed comment with a label (fixed-on-latest, status wont-fix, or verify-inconclusive). Tag-only — never auto-closes. Brev verification is Linux-only in v1; Windows and integration-token-dependent issues are skipped. Trigger keywords - verify stale, verify fixed, reproduce on latest, stale issue, old bug, fixed-on-latest, status wont-fix, verify-inconclusive, drain backlog, brev verify. +description: Verify whether old NVIDIA/NemoClaw bug reports still reproduce against the latest tag. Picks candidate issues opened against older versions, runs the reproducer locally first when possible (Linux or macOS), otherwise reuses or provisions a Brev Linux box (CPU or GPU), detects behavior that was intentionally changed, scores confidence, and posts an evidence-backed comment with a label (fixed-on-latest, status: wont-fix, or verify-inconclusive). Tag-only — never auto-closes. Brev verification is Linux-only in v1; Windows and integration-token-dependent issues are skipped. Trigger keywords - verify stale, verify fixed, reproduce on latest, stale issue, old bug, fixed-on-latest, status: wont-fix, verify-inconclusive, drain backlog, brev verify.Based on learnings:
.agents/skills/nemoclaw-maintainer-*files are hand-authored source-of-truth and should be reviewed directly.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md at line 3, Update the canonical label spelling in the SKILL.md description: replace the token "status wont-fix" with the exact label "status: wont-fix" so the documented labels (alongside "fixed-on-latest" and "verify-inconclusive") match the actual label string operators use; edit the description line in .agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md to make this single-word correction..agents/skills/nemoclaw-maintainer-verify-stale/reference/execution-and-comment.md (3)
52-53:⚠️ Potential issue | 🟠 Major | ⚡ Quick winCount only running boxes for the concurrency gate.
Line 52 currently counts all
verify-stale-*instances, but the comment and guard are for running capacity. This can block new runs even when capacity exists.Suggested fix
- RUNNING=$(echo "$INSTANCES" | jq '[.[]? | select(.name | startswith("verify-stale-"))] | length') + RUNNING=$(echo "$INSTANCES" | jq '[.[]? | select(.name | startswith("verify-stale-") and .status == "RUNNING")] | length')🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.agents/skills/nemoclaw-maintainer-verify-stale/reference/execution-and-comment.md around lines 52 - 53, The current RUNNING count uses INSTANCES with a jq selector that matches all verify-stale-* names; change the jq filter used when building RUNNING so it only counts instances whose name startswith("verify-stale-") AND whose status indicates they are currently running (e.g., .status.state == "running" or the equivalent field in your instance objects), leaving the if [ "$RUNNING" -ge 4 ] guard unchanged; update the jq expression referenced by RUNNING to include that running-state predicate.
316-318:⚠️ Potential issue | 🟠 Major | ⚡ Quick winReproducer runs should use docker-group + login-shell wrapper.
Lines 317, 364, and 389 execute
bash ~/reproducer.shdirectly. That misses the documented docker group activation and can miss~/.local/binPATH setup, causing false inconclusive outcomes.Suggested fix
-brev exec "$INSTANCE_NAME" "bash ~/reproducer.sh" 2>&1 | tee ./baseline-transcript.log +brev exec "$INSTANCE_NAME" "sg docker -c 'bash -lc \"export PATH=\\\"$HOME/.local/bin:$PATH\\\"; bash ~/reproducer.sh\"'" 2>&1 | tee ./baseline-transcript.log ... -brev exec "$INSTANCE_NAME" "bash ~/reproducer.sh" 2>&1 | tee ./baseline-transcript-2.log +brev exec "$INSTANCE_NAME" "sg docker -c 'bash -lc \"export PATH=\\\"$HOME/.local/bin:$PATH\\\"; bash ~/reproducer.sh\"'" 2>&1 | tee ./baseline-transcript-2.log ... -brev exec "$INSTANCE_NAME" "bash ~/reproducer.sh" 2>&1 | tee ./latest-transcript.log +brev exec "$INSTANCE_NAME" "sg docker -c 'bash -lc \"export PATH=\\\"$HOME/.local/bin:$PATH\\\"; bash ~/reproducer.sh\"'" 2>&1 | tee ./latest-transcript.logAlso applies to: 363-365, 388-390
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.agents/skills/nemoclaw-maintainer-verify-stale/reference/execution-and-comment.md around lines 316 - 318, Replace direct invocations of "bash ~/reproducer.sh" in the brev exec calls with the documented docker-group + login-shell wrapper so the docker group is activated and the user's login PATH (including ~/.local/bin) is sourced; specifically update each brev exec "$INSTANCE_NAME" "bash ~/reproducer.sh" occurrence (and any identical occurrences around the earlier shown lines) to call the wrapper that activates the docker group then launches the reproducer inside a login shell.
767-769:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winRedaction regex alternation is escaped and won’t match intended secrets/hosts.
Line 767 and Line 768 use
\|where alternation is intended. That matches literal pipes instead of alternatives, weakening redaction and risking secret/hostname leakage.Suggested fix
-| 8 | `(?i)(token\|secret\|password\|api[_-]?key\|bearer)[^\n]*[:=][^\n]*` | Inline credentials in env/config/log output | -| 9 | `\b\w+\.(nvidia\.internal\|nv-internal\.com\|nvidia\.dev)\b` | Internal hostnames (extend list per team) | +| 8 | `(?i)(?:token|secret|password|api[_-]?key|bearer)[^\n]*[:=][^\n]*` | Inline credentials in env/config/log output | +| 9 | `\b\w+\.(?:nvidia\.internal|nv-internal\.com|nvidia\.dev)\b` | Internal hostnames (extend list per team) |#!/bin/bash python3 - <<'PY' import re bad_cred = re.compile(r'(?i)(token\|secret\|password\|api[_-]?key\|bearer)[^\n]*[:=][^\n]*') good_cred = re.compile(r'(?i)(?:token|secret|password|api[_-]?key|bearer)[^\n]*[:=][^\n]*') samples = ["token=abc", "password: hunter2", "api_key = nvapi-xyz"] print("credential pattern:") for s in samples: print(s, "bad=", bool(bad_cred.search(s)), "good=", bool(good_cred.search(s))) bad_host = re.compile(r'\b\w+\.(nvidia\.internal\|nv-internal\.com\|nvidia\.dev)\b') good_host = re.compile(r'\b\w+\.(?:nvidia\.internal|nv-internal\.com|nvidia\.dev)\b') hosts = ["svc.nvidia.internal", "x.nv-internal.com", "api.nvidia.dev"] print("\nhost pattern:") for h in hosts: print(h, "bad=", bool(bad_host.search(h)), "good=", bool(good_host.search(h))) PY🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.agents/skills/nemoclaw-maintainer-verify-stale/reference/execution-and-comment.md around lines 767 - 769, The redaction regexes use escaped pipe sequences (e.g. "(?i)(token\|secret\|password\|api[_-]?key\|bearer)[^\n]*[:=][^\n]*" and "\b\w+\.(nvidia\.internal\|nv-internal\.com\|nvidia\.dev)\b") which match literal '|' instead of alternation; replace them with proper alternation and prefer non-capturing groups: use "(?i)(?:token|secret|password|api[_-]?key|bearer)[^\n]*[:=][^\n]*" for inline credentials and "\b\w+\.(?:nvidia\.internal|nv-internal\.com|nvidia\.dev)\b" for internal hosts so the patterns in the file (the two regex strings) correctly match secrets and hostnames.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In
@.agents/skills/nemoclaw-maintainer-verify-stale/reference/execution-and-comment.md:
- Around line 52-53: The current RUNNING count uses INSTANCES with a jq selector
that matches all verify-stale-* names; change the jq filter used when building
RUNNING so it only counts instances whose name startswith("verify-stale-") AND
whose status indicates they are currently running (e.g., .status.state ==
"running" or the equivalent field in your instance objects), leaving the if [
"$RUNNING" -ge 4 ] guard unchanged; update the jq expression referenced by
RUNNING to include that running-state predicate.
- Around line 316-318: Replace direct invocations of "bash ~/reproducer.sh" in
the brev exec calls with the documented docker-group + login-shell wrapper so
the docker group is activated and the user's login PATH (including ~/.local/bin)
is sourced; specifically update each brev exec "$INSTANCE_NAME" "bash
~/reproducer.sh" occurrence (and any identical occurrences around the earlier
shown lines) to call the wrapper that activates the docker group then launches
the reproducer inside a login shell.
- Around line 767-769: The redaction regexes use escaped pipe sequences (e.g.
"(?i)(token\|secret\|password\|api[_-]?key\|bearer)[^\n]*[:=][^\n]*" and
"\b\w+\.(nvidia\.internal\|nv-internal\.com\|nvidia\.dev)\b") which match
literal '|' instead of alternation; replace them with proper alternation and
prefer non-capturing groups: use
"(?i)(?:token|secret|password|api[_-]?key|bearer)[^\n]*[:=][^\n]*" for inline
credentials and "\b\w+\.(?:nvidia\.internal|nv-internal\.com|nvidia\.dev)\b" for
internal hosts so the patterns in the file (the two regex strings) correctly
match secrets and hostnames.
In @.agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md:
- Line 3: Update the canonical label spelling in the SKILL.md description:
replace the token "status wont-fix" with the exact label "status: wont-fix" so
the documented labels (alongside "fixed-on-latest" and "verify-inconclusive")
match the actual label string operators use; edit the description line in
.agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md to make this
single-word correction.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5c3d6f16-ab32-45e7-ad23-cf3a13a02591
📒 Files selected for processing (2)
.agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md.agents/skills/nemoclaw-maintainer-verify-stale/reference/execution-and-comment.md
The previous Step 7 cleared fixed-on-latest and verify-inconclusive from every open issue at every release cut. That re-verifies labels that were freshly applied yesterday — wasting Brev cost and creating noise. Replace with a three-tier decision cascade: 1. If Project 199 status == Done → skip clear (maintainer accepted the verification; label stays until the issue closes). 2. Else if marker comment is >10 days old → clear (stale; reporter never confirmed in the review window). 3. Else if any PR merged since the marker date touches the paths the comment cited in `Relevant changes since v0.0.X` → clear (regression risk). 4. Else → skip clear (verification still holds). Closed issues are not iterated (--state open filter). Requires the `project` scope on gh for the Done-state lookup; falls back to the time + regression checks alone (with a warning) when the scope is missing. Companion to verify-stale's Step 10 tracker integration. Together they form a coherent verification lifecycle: skill applies label + moves to Needs Review → maintainer reviews and either accepts (moves to Done) or lets it sit → release sweep keeps Done-accepted labels, clears stale or regression-risk ones. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.agents/skills/nemoclaw-maintainer-cut-release-tag/SKILL.md:
- Line 202: The git command that sets REGRESSED uses a hardcoded path (git -C
~/NemoClaw ...) which can fail if the repo is elsewhere; change it to use the
current repo or a configurable path: replace the hardcoded -C ~/NemoClaw with
either -C "$REPO_PATH" (add a REPO_PATH variable set earlier) or remove -C to
run from the current directory, keeping MARKER_DATE and PATHS unchanged so the
lookup still uses those variables.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b400dbe8-8762-4da3-9883-9aab71550ed7
📒 Files selected for processing (1)
.agents/skills/nemoclaw-maintainer-cut-release-tag/SKILL.md
10 days felt too aggressive for a real maintainer review window — weekends, holidays, and across-team handoffs commonly need 2 weeks. 14 days lines up with the natural sprint boundary and gives the reporter + maintainer enough room before re-verification kicks in. Updated the decision cascade table and the AGE_DAYS check in lockstep. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five findings from coderabbit's second-pass review, all valid: F1 (Major) — execution-and-comment.md: concurrency cap counted ALL verify-stale-* boxes, not just RUNNING ones, so prior stopped boxes could falsely block provisioning. Add the same .status == "RUNNING" filter the reuse query above already uses. F2 (Major) — execution-and-comment.md: the two `brev exec bash ~/reproducer.sh` calls (baseline + latest) didn't export PATH, so non-login shells could miss ~/.local/bin where the nemoclaw binary lives. Wrap both with explicit PATH export. (sg docker stays the script's responsibility — double-wrapping nests quoting.) F3 (Critical) — execution-and-comment.md: credential redaction patterns 8 and 9 were inside a markdown table where `|` is the column delimiter. The table escaped them as `\|` to preserve the table structure, but `\|` in regex matches a LITERAL pipe character, not alternation. The redaction pattern `(?i)(token\|secret\|password\|api[_-]?key\|bearer)` therefore silently failed to match `token=...`, `password: ...`, `api_key = ...`. Replaced the table with a fenced code block so patterns can use real alternation. F4 (Minor) — verify-stale SKILL.md: frontmatter description used "status wont-fix" (no colon) while the rest of the doc uses the canonical "status: wont-fix". Fixed in both the verdict-label list and the trigger-keywords list. F5 (Major) — cut-release-tag SKILL.md: regression check used `git -C ~/NemoClaw log ...` which assumes a specific checkout location. Step 1's prerequisite already requires the maintainer to be inside the NemoClaw repo, so dropping `-C ~/NemoClaw` runs from the current directory. Local check-docs.sh and markdownlint-cli2 pass. SKILL.md is still exactly 500 lines after F4. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(1) Step 3 — unanswered-question variant should require a question, not just a comment. Surfaced when @wscurran's templated triage ack ("✨ Thanks for reporting…") fired the variant on #1642; the rendered "their comment from N days ago is still unanswered" lead made it sound like there was a pending question when nothing was actually asked. Extend the maintainer-comment selector with a question-shape content check: comment body must contain `?`, a polite imperative (please confirm/share/provide/clarify/tell/verify/check/let me know), or a question starter (could/can/would you, do you have/know/see/use). Pure acknowledgments fall through to the standard template. (2) Step 8d — enforce OpenShell version pinning after the latest checkout. Surfaced on #1642's run: v0.0.6 setup phase installed OpenShell 0.0.37 (whichever was current), but v0.0.38's blueprint.yaml caps max_openshell_version at 0.0.36 — onboard preflight then refuses to run. The skill now reads MAX_OS from latest's blueprint.yaml after checkout. If the installed binary is newer than the cap, install- openshell.sh refuses to downgrade, so the skill force-installs the capped version directly from the OpenShell GitHub release. If the installed binary is at or below the cap, install-openshell.sh runs normally to bring it forward. Both fixes lint clean. SKILL.md is now 499 lines (under the 500 ceiling). reference/execution-and-comment.md grew by ~30 lines for the OpenShell-pin block. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Long-running verifications race with maintainers closing issues independently. #2513 and #2519 both closed mid-batch — @jyaunches posted their own comprehensive verification before our run got to the comment phase. The skill kept going and would have appended a redundant comment after the close. Add a pre-post state check: refresh the issue state right before gh issue comment runs. If closed, apply the label tag-only (the maintainer's close comment is now the authoritative record) and skip both the comment and the Project 199 move. Surfaced from findings2.md gap item 18. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Step 3's issue-type skip list used exact-match `enhancement`, but the repo has both bare `enhancement` and 8 prefixed variants (enhancement: feature, enhancement: MCP, enhancement: testing, enhancement: ui, enhancement: provider, enhancement: platform, enhancement: policy, enhancement: inference, enhancement: integration, enhancement: performance, enhancement: skill). The prefixed forms slipped through the filter — surfaced from #1752 (Gemini 3 Flash Preview), labeled `enhancement: feature`. The reporter was asking whether the model variant was supported, which is correctly an enhancement request. Updated the skip clause prose to spell out the prefix-match rule and list the variants. Implementations should match exact `enhancement` OR any label starting with `enhancement:`. Surfaced from findings2.md gap item 36 (added this run). Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds maintainer skill
nemoclaw-maintainer-verify-stale— automates the manual loop of: spin up a Brev box → install latest NemoClaw → try to reproduce an old bug → comment with findings. Tag-only (never auto-closes); a maintainer pulls the trigger after reviewing.The skill has been driven end-to-end against 9 real candidates during development, surfacing the gaps and skill changes summarised below.
Issues tackled
wontfix-by-designwontfix-by-design(tagged)status: wont-fix, notwontfix); markdown-linked citations in comments.fixed-on-latest(tagged)sg,brev execPATH gotcha, sandbox-build-rot reframe, multi-axis architectural-drift check.fixed-on-latest(tagged)openshell sandbox exec -n NAME --syntax footgun, reproducer-extraction regex extension (openclaw|openshell).fixed-on-latest(closed by maintainer mid-run)fixed-on-latest(tagged)fixed-on-latest(tagged)state == OPENbefore posting. Second mid-batch close-by-maintainer (alongside #2519) → promote from queue to a real Step 1/3 check.Verification mode: static analysis at <tag>).Effective touched count: 7 issues with skill-rendered comments + 2 side-discoveries. Two of the seven landed
wontfix-by-design, five landedfixed-on-latest. Sandbox-build rot capped four of the five at 84 (the modal verdict shape with reporter @-mention). #2611 was the first run to land below the cap, validating the −30 LLM-synth penalty path.v1 scope
bugissues against CLI / Sandbox / OpenShell / Docker / Getting Started / Ubuntu / DGX Spark / GB10Verification flow (one-line per step)
brev ls→ reuse averify-stale-*box if running, elsebrev createwith a runtime-picked CPU SKU~/development/daily-rhythm/activity/nemoclaw-verify-stale-log.mdIdempotency
Every posted comment carries
<!-- nemoclaw-verify-stale v1 YYYY-MM-DD -->. Candidate filter excludes any issue whose comments match<!-- nemoclaw-verify-stale v\d+ YYYY-MM-DD -->within the last 7 days OR carriesfixed-on-latest/verify-inconclusive/status: wont-fix. Release sweep clears the first two on each tag cut.Active-discussion handling (two-tier)
Companion changes
nemoclaw-skills-guide/SKILL.md— adds the new skill to the maintainer skills table.nemoclaw-maintainer-cut-release-tag/SKILL.md— release-time sweep offixed-on-latestandverify-inconclusivefrom every open issue. Without it, "latest" drifts and verifications go silently stale. Verification records stay in comments; only labels reset.CI status
3abffd72(placeholder[label](url)examples in SKILL.md were being parsed as real broken links — restructured as fenced-block templates) anddc5d7143(merge of main brought in the deletion ofdocs/get-started/brev-web-ui-quickstart.mdper #3050, resolving the merge-side missing-file failure)Test plan
wontfix-by-design, maintainer agreed and closedwontfix-by-design, tagged + open awaiting closeopenclaw channels add/removenot implemented — no host-side hint shown #2592, [Linux][Brev]Ollama-local provider returns HTTP 401 Unauthorized via inference.local — wizard caches OPENAI_API_KEY env var into credentials.json and not refreshable via env unset #2519, [NemoClaw][DGX Spark][Ubuntu 24.04][CLI] nemoclaw status omits Connected/Inference fields and shows cloudflared stopped with no context #2604, [Linux][Policy] sandbox logs repeatedly emit os.networkInterfaces guard errors after NVIDIA endpoint onboard #2611) — provision → baseline → reset → latest → repro → comment + labelPlatform: DGX Sparklabel, x86_64 actual)f0ec6b8e) but untested against a live candidate (no qualifying issue in the candidate set yet)cut-release-tagtested at design level onlySigned-off-by: Prekshi Vyas prekshiv@nvidia.com
Summary by CodeRabbit
Documentation
Chores