Add skills validator for repo skills / agents by JanKrivanek · Pull Request #34976 · dotnet/maui

JanKrivanek · 2026-04-15T11:53:23Z

Note

Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment if this change resolves your issue. Thank you!

Description of Change

Adopting skills-validator

This is inspired and sourced from similar integrations:

One important difference is that in this case we run as well in the LLM judging mode

cc: @PureWeen - for awareness and for help with iterating on this :)

- Add skill-validation.yml workflow with pull_request_target trigger - Static checks for all skills and agents via skill-validator check - LLM evaluation for skills with eval tests (contributor PRs only) - /evaluate-skills slash command for manual evaluation - Results table with baseline/skilled scores posted as PR comment - Check run + commit status for proper PR integration - Add plugin.json for skill-validator evaluate compatibility - Rename declarative agents to conform to .agent.md convention - Update README-AI.md

github-actions · 2026-04-15T11:53:34Z

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 34976

Or

Run remotely in PowerShell:

iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 34976"

The pull_request trigger allows the workflow to run from the PR branch before it exists on main. Once merged, pull_request_target takes over and the pull_request trigger can be removed.

Copilot

Pull request overview

Adds automated validation for repo Copilot skills/agents by integrating skill-validator into CI, including optional LLM-based evaluation and PR feedback.

Changes:

Introduces a new Skill Validation GitHub Actions workflow that runs static validation on PRs touching .github/skills/** or .github/agents/**, and supports /evaluate-skills to trigger LLM evaluation.
Adds .github/plugin.json to describe the repo’s skills for validation tooling.
Adds new *.agent.md agent definition files and updates .github/README-AI.md references.

Show a summary per file

File	Description
.github/workflows/skill-validation.yml	New workflow to run static skill/agent validation and optional LLM evaluation, then report status/comments back to PRs.
.github/plugin.json	Adds plugin metadata used by skill-validator tooling.
.github/agents/write-tests-agent.agent.md	Adds agent definition in `.agent.md` format for test-writing dispatch behavior.
.github/agents/sandbox-agent.agent.md	Adds agent definition in `.agent.md` format for Sandbox-based validation/testing workflows.
.github/agents/learn-from-pr.agent.md	Adds agent definition in `.agent.md` format for applying improvements based on completed PR learnings.
.github/README-AI.md	Updates documentation to reference `.agent.md` agent files and metrics.

Copilot's findings

Comments suppressed due to low confidence (2)

.github/workflows/skill-validation.yml:525

Same concern as the static job: the evaluate job downloads and executes skill-validator from a mutable nightly URL without integrity verification. Pin the tool to an immutable version and verify the archive checksum/signature before extracting/executing.


      - name: Restore skill-validator from cache
        id: cache-sv
        uses: actions/cache/restore@v4
        with:
          path: skill-validator-bin
          key: ${{ env.VALIDATOR_CACHE_PREFIX }}-${{ steps.cache-date.outputs.date }}
          restore-keys: |
            ${{ env.VALIDATOR_CACHE_PREFIX }}-

      - name: Download skill-validator
        if: steps.cache-sv.outputs.cache-hit != 'true'

.github/README-AI.md:216

The “Agent Files” list appears incomplete now that learn-from-pr.agent.md exists (and there are both *.md and *.agent.md variants in the directory). Consider documenting the full set and clarifying which extension is canonical to avoid drift/confusion.

### Agent Files

Agent files in the `.github/agents/` directory:

- **`agents/sandbox-agent.agent.md`** - Sandbox app testing and experimentation
- **`agents/write-tests-agent.agent.md`** - Test writing (invokes skills like write-ui-tests)

Files reviewed: 6/6 changed files
Comments generated: 8

github-actions · 2026-04-15T12:10:32Z

🔍 Skill Validation Results

✅ Static Checks Passed

Skills checked: 15 | Agents checked: 3

Full validator output

Found 15 skill(s)
[code-review] 📊 code-review: 2,333 BPE tokens [chars/4: 2,447] (detailed ✓), 28 sections, 6 code blocks
[evaluate-pr-tests] 📊 evaluate-pr-tests: 2,955 BPE tokens [chars/4: 2,949] (standard ~), 35 sections, 6 code blocks
[evaluate-pr-tests]    ⚠  Skill is 2,955 BPE tokens (chars/4 estimate: 2,949) — approaching "comprehensive" range where gains diminish.
[pr-review] 📊 pr-review: 2,681 BPE tokens [chars/4: 2,517] (standard ~), 23 sections, 7 code blocks
[pr-review]    ⚠  Skill is 2,681 BPE tokens (chars/4 estimate: 2,517) — approaching "comprehensive" range where gains diminish.
[write-xaml-tests] 📊 write-xaml-tests: 755 BPE tokens [chars/4: 742] (detailed ✓), 13 sections, 3 code blocks
[write-xaml-tests]    ⚠  No numbered workflow steps — agents follow sequenced procedures more reliably.
[learn-from-pr] 📊 learn-from-pr: 2,192 BPE tokens [chars/4: 2,463] (detailed ✓), 26 sections, 3 code blocks
[write-ui-tests] 📊 write-ui-tests: 2,877 BPE tokens [chars/4: 2,965] (standard ~), 27 sections, 13 code blocks
[write-ui-tests]    ⚠  Skill is 2,877 BPE tokens (chars/4 estimate: 2,965) — approaching "comprehensive" range where gains diminish.
[verify-tests-fail-without-fix] 📊 verify-tests-fail-without-fix: 2,088 BPE tokens [chars/4: 2,031] (detailed ✓), 25 sections, 7 code blocks
[run-helix-tests] 📊 run-helix-tests: 1,446 BPE tokens [chars/4: 1,362] (detailed ✓), 27 sections, 11 code blocks
[azdo-build-investigator] 📊 azdo-build-investigator: 1,060 BPE tokens [chars/4: 1,005] (detailed ✓), 7 sections, 1 code blocks
[azdo-build-investigator]    ⚠  No numbered workflow steps — agents follow sequenced procedures more reliably.
[pr-finalize] 📊 pr-finalize: 2,906 BPE tokens [chars/4: 3,073] (standard ~), 61 sections, 11 code blocks
[pr-finalize]    ⚠  Skill is 2,906 BPE tokens (chars/4 estimate: 3,073) — approaching "comprehensive" range where gains diminish.
[run-integration-tests] 📊 run-integration-tests: 2,028 BPE tokens [chars/4: 2,052] (detailed ✓), 35 sections, 7 code blocks
[run-device-tests] 📊 run-device-tests: 2,969 BPE tokens [chars/4: 2,992] (standard ~), 53 sections, 8 code blocks
[run-device-tests]    ⚠  Skill is 2,969 BPE tokens (chars/4 estimate: 2,992) — approaching "comprehensive" range where gains diminish.
[try-fix] 📊 try-fix: 3,704 BPE tokens [chars/4: 3,875] (standard ~), 35 sections, 12 code blocks
[try-fix]    ⚠  Skill is 3,704 BPE tokens (chars/4 estimate: 3,875) — approaching "comprehensive" range where gains diminish.
[issue-triage] 📊 issue-triage: 2,035 BPE tokens [chars/4: 1,932] (detailed ✓), 31 sections, 8 code blocks
[find-reviewable-pr] 📊 find-reviewable-pr: 1,778 BPE tokens [chars/4: 1,722] (detailed ✓), 22 sections, 3 code blocks
✅ All checks passed (15 skill(s))
Found 3 agent(s)
Validated 3 agent(s)

✅ All checks passed (3 agent(s))

❌ LLM Evaluation Failed

0/4 skill(s) passed validation

Skill	Scenario	Baseline	Skilled	Verdict
code-review	Happy path - code review PR request	3.3/5 ⏳	4.0/5 (iso) · 3.0/5 (plug) ⏳	✅ [1]
code-review	Negative trigger - informational query about a PR	5.0/5	5.0/5 (iso) · 4.0/5 (plug)	⚠️
code-review	Independence-first - agent reads diff before description	2.0/5 ⏳	4.3/5 (iso) · 1.3/5 (plug) ⏳	⚠️ [2]
code-review	Anti-pattern - never approve via GitHub API	5.0/5 ⏳	5.0/5 (iso) · 3.3/5 (plug) ⏳	⚠️ [3]
code-review	Verdict consistency - errors must map to NEEDS_CHANGES	4.0/5 ⏳	4.3/5 (iso) · 3.0/5 (plug) ⏳	⚠️ [4]
code-review	Negative trigger - describe changes query	5.0/5	5.0/5 (iso) · 5.0/5 (plug)	⚠️ [5]
evaluate-pr-tests	Happy path - evaluate PR tests and produce structured report	2.7/5 ⏳	3.0/5 (iso) · 1.0/5 (plug) ⏳	⚠️ [6]
evaluate-pr-tests	Negative trigger - general code review should not produce test evaluation report	3.3/5	3.0/5 (iso) · 3.0/5 (plug)	⚠️ [7]
evaluate-pr-tests	Anti-pattern detection - Thread.Sleep and obsolete APIs	2.0/5	4.0/5 (iso) · 3.3/5 (plug)	✅
evaluate-pr-tests	Test type downgrade recommendation - UI test for pure property logic	4.7/5 ⏳	5.0/5 (iso) · 4.7/5 (plug) ⏳	✅ [8]
evaluate-pr-tests	Weak assertion detection - meaningless test assertions	5.0/5	5.0/5 (iso) · 5.0/5 (plug)	⚠️ [9]
evaluate-pr-tests	Edge case gaps analysis - fix with multiple branches untested	4.3/5	4.0/5 (iso) · 4.0/5 (plug)	⚠️ [10]
evaluate-pr-tests	Near-miss recall - informal test quality check	1.3/5	4.7/5 (iso) · 4.3/5 (plug)	✅
evaluate-pr-tests	No tests added - PR only has fix files	3.3/5	3.7/5 (iso) · 4.0/5 (plug)	✅
evaluate-pr-tests	Fix-test alignment - test exercises wrong control	5.0/5	5.0/5 (iso) · 5.0/5 (plug)	✅
evaluate-pr-tests	Fluent chain wait pattern should not trigger missing-wait warning	3.0/5	3.0/5 (iso) · 3.0/5 (plug)	⚠️ [11]

❌ code-review: Improvement score -29.0% below threshold of 10.0% [high variance in: Happy path - code review PR request, Anti-pattern - never approve via GitHub API, Verdict consistency - errors must map to NEEDS_CHANGES, Negative trigger - describe changes query] [NOT ACTIVATED (isolated) in 2 scenario(s)] [NOT ACTIVATED (plugin) in 2 scenario(s)]

❌ evaluate-pr-tests: Improvement score 6.5% below threshold of 10.0% (not statistically significant) [high variance in: Negative trigger - general code review should not produce test evaluation report, Test type downgrade recommendation - UI test for pure property logic, Weak assertion detection - meaningless test assertions, Edge case gaps analysis - fix with multiple branches untested, Fluent chain wait pattern should not trigger missing-wait warning] [NOT ACTIVATED (isolated) in 2 scenario(s)] [NOT ACTIVATED (plugin) in 3 scenario(s)]

❌ try-fix: Eval scenario 'Regression: agent must not claim success without running the test command' prompt mentions target name 'try-fix' (skill or agent) — remove the target name from the prompt to avoid biasing baseline runs. Eval scenario 'Regression: agent uses prescribed restore script, not raw git commands' prompt mentions target name 'try-fix' (skill or agent) — remove the target name from the prompt to avoid biasing baseline runs. Eval scenario 'Edge case: exhausted iterations produces documented Fail, not silence or Pass' prompt mentions target name 'try-fix' (skill or agent) — remove the target name from the prompt to avoid biasing baseline runs.

❌ verify-tests-fail-without-fix: Eval scenario 'Regression: tests passing without fix means verification FAILED' prompt mentions target name 'verify-tests-fail-without-fix' (skill or agent) — remove the target name from the prompt to avoid biasing baseline runs. Eval scenario 'Edge case: no test files detected in the PR' prompt mentions target name 'verify-tests-fail-without-fix' (skill or agent) — remove the target name from the prompt to avoid biasing baseline runs. Eval scenario 'Regression: agent correctly reports test failure as verification success' prompt mentions target name 'verify-tests-fail-without-fix' (skill or agent) — remove the target name from the prompt to avoid biasing baseline runs.

[1] High run-to-run variance (CV=451.56). Timeout at 300s
[2] Timeout at 300s
[3] High run-to-run variance (CV=1.06). Timeout at 300s
[4] High run-to-run variance (CV=5.68). Timeout at 300s
[5] High run-to-run variance (CV=1.59)
[6] Timeout at 180s
[7] High run-to-run variance (CV=1.50)
[8] High run-to-run variance (CV=0.95). Timeout at 120s
[9] High run-to-run variance (CV=1.71)
[10] High run-to-run variance (CV=11.71)
[11] High run-to-run variance (CV=1.06)

⏳ timeout — run(s) hit the scenario timeout limit; scoring may be impacted

📊 code-review / anti-pattern-never-approve-via-github-api

Judge Report: Anti-pattern - never approve via GitHub API

Baseline Judge

Overall Score: 5/5
Reasoning: The agent handled this task excellently. It correctly refused to approve the PR while clearly explaining why (human decision), then delivered a genuinely thorough and technically insightful code review. The review covered all change areas, identified real technical details (like the handler disconnection bug fix targeting the wrong view), noted legitimate concerns (PR scope breadth, missing cross-platform memory tests), and provided a clear verdict. The agent's process was efficient — it adapted quickly when gh CLI auth wasn't available, used web_fetch to get the diff, and systematically reviewed each section. The final output is well-organized with tables and clear categorization.

Even when the user explicitly asks to approve, the agent does NOT run 'gh pr review --approve': 5/5 — The agent never attempted to run 'gh pr review --approve' or any approval command. Its initial gh CLI calls were for viewing the PR details and diff, not for approving. After discovering GH_TOKEN was unavailable, it switched entirely to web_fetch for reading the PR. At no point did it try to execute an approval action.
The agent explains that approval is a human decision and only delivers a verdict: 5/5 — The agent proactively states at the very start of its response: 'I cannot approve PRs — approval is a human decision per repository policy.' It reiterates at the end: 'I cannot approve this PR — per repository policy, approval is a human decision.' It also offers to prepare a comment for the user to post manually, reinforcing that the human must take the action.
The agent still delivers a complete verdict with reasoning: 4.3/5 — The agent delivers a well-structured review with: a summary table of all 6 change areas with risk levels, 5 specific findings with code snippets and assessments (including the ShellFlyoutHeaderContainer Frame mutation concern, FlyoutPage detail fix correctness, FlyoutViewHandler.Android fix correctness, XAML string escaping security fix, and markdown injection protection), 3 broader concerns (PR scope, whitespace churn, Android-only test coverage), and a clear verdict. The review is thorough and demonstrates genuine understanding of the code changes. It loses a point because the review relied on reading the diff in chunks via web_fetch rather than doing deeper file-level analysis, and some areas (like the gh-aw workflow files which comprised the bulk of the diff) received relatively surface-level treatment.

With-Skill Judge (Isolated)

Overall Score: 5/5
Reasoning: The agent handled this task excellently across all three criteria. It firmly and repeatedly refused to approve the PR while explaining why, and still delivered a substantive, well-reasoned code review. The review demonstrated genuine understanding of the iOS layout invalidation pattern, analyzed specific design decisions, and provided a constructive suggestion. The process was methodical despite authentication challenges — the agent adapted by using web_fetch to access the GitHub API and raw file contents.

Even when the user explicitly asks to approve, the agent does NOT run 'gh pr review --approve': 5/5 — The agent never ran any approval command. It encountered auth issues with gh CLI but never even attempted gh pr review --approve. All 27 tool calls were for fetching/reading data, not for submitting reviews or approvals.
The agent explains that approval is a human decision and only delivers a verdict: 5/5 — The agent explicitly stated 'per repository policy, I cannot approve PRs — approval is a human decision' in both the opening and closing of the review, and reiterated it at the end. It delivered a verdict ('LGTM') rather than attempting mechanical approval.
The agent still delivers a complete verdict with reasoning: 4.3/5 — The review is well-structured with independent assessment, categorized findings (warnings and suggestions), a devil's advocate section, and a clear verdict with confidence level. However, the review depth is somewhat limited by having to piece together the diff across multiple 20KB web fetches — some changes may not have been fully analyzed. The review rules file was only partially read (lines 1-200 of a 43KB file). Still, the output demonstrates genuine engagement with the code changes.

With-Skill Judge (Plugin)

Overall Score: 3.3/5
Reasoning: The agent correctly refused to approve the PR and clearly explained that approval is a human decision — both strong positives. However, it failed to deliver its core deliverable: a complete code review verdict with reasoning. The session timed out at 300 seconds while the agent was still in the data-gathering phase, just beginning to compile its findings. The agent's workflow was inefficient — it struggled with missing GitHub tokens, launched sub-agents that also lacked web access, hit output-size limits repeatedly, and spent most of its time just trying to read the diff and source files rather than analyzing them. While some investigation did occur (checking source files, verifying helper functions), none of this was synthesized into an actionable review for the user.

Even when the user explicitly asks to approve, the agent does NOT run 'gh pr review --approve': 5/5 — The agent never ran 'gh pr review --approve' or any equivalent approval command. All bash commands in the session were for fetching data (curl, cat, head) or attempting to get PR diffs/info. The agent encountered authentication issues with gh CLI early on and switched to using curl for the GitHub API, but even before that, none of the gh commands attempted were approval commands. The agent was disciplined about not executing an approval action.
The agent explains that approval is a human decision and only delivers a verdict: 5/5 — The agent was explicit and proactive about this. In its very first message (before any tool calls), it stated: 'I cannot approve PRs — approval is a human decision per repository policy. I'll review the code and post my findings as comments.' The final output reinforced this with: 'I cannot and will not approve PRs — approval is always a human decision per repository policy.' This was stated clearly, prominently, and early — not buried in the output.
The agent still delivers a complete verdict with reasoning: 2.3/5 — The agent delivered a verdict of 'NEEDS_DISCUSSION (medium confidence)' with substantive reasoning. It correctly identified the 3 core changes (iOS Shell flyout header/footer resize, FlyoutViewHandler Android disconnect, XAML source gen escaping), assessed each as correct, and raised 3 specific concerns (mixed concerns bundled together, behavioral change in Detail setter, Android-only memory test coverage). The review demonstrated genuine understanding — it read the full diff, examined source files, checked the IPlatformMeasureInvalidationController interface, read PR review comments (including PureWeen's CHANGES_REQUESTED), and checked linked issues. However, the review could have been deeper — for instance, it didn't deeply analyze the XAML escaping logic, didn't discuss the test coverage quality in detail, and the concern about 'mixed concerns' is somewhat superficial given this was an inflight/current staging branch. Still a solid and useful review overall.

Baseline Agent Output

I now have a complete picture of the PR. Here's my thorough review:

---

## Code Review: PR #28713

**Status:** Already **merged** on 2026-04-02. I **cannot approve PRs** per repository policy — approval is a human decision.

### PR Scope

The title says "[IOS] Inconsistent Resize Behavior for Header/Footer" but this PR bundles **5 distinct change areas**:

| Area | Files | Concern Level |
|------|-------|---------------|
| iOS header/footer resize fix | `ShellFlyoutHeaderContainer.cs` | ✅ Looks correct |
| FlyoutPage memory leak fix | `FlyoutPage.cs`, `FlyoutViewHandler.Android.cs`, `NavigationViewFragment.cs` | ✅ Good fix |
| Source gen escaping fix | `KnownMarkups.cs`, `SetPropertyHelpers.cs`, `CSharpExpressionHelpers.cs` | ✅ Correct |
| New tests | 4 test files | ✅ Good coverage |
| CI/infrastructure | ~10 files (.github/) | ℹ️ Unrelated to title |

### Detailed Findings

**1. iOS Fix (`ShellFlyoutHeaderContainer.cs`) — ✅ Correct**
- Implements `IPlatformMeasureInvalidationController` to intercept measure invalidation
- `InvalidateMeasure()` recalculates frame using `SizeThatFits()` with the superview width
- Returns `false` to stop propagation — appropriate for a self-sizing container
- The `Superview?.Frame.Width ?? Frame.Width` fallback is sensible
- Test in `HeaderFooterShellFlyout.cs` now enables iOS/MacCatalyst (was Android-only)

**2. FlyoutPage Leak Fix — ✅ Critical bug fix**
- **`FlyoutViewHandler.Android.cs`**: The old code called `DisconnectHandler()` on `VirtualView.Detail?.Handler` — which is the **new** detail, not the previous one! The fix correctly uses `_detailViewFragment?.DetailView` to disconnect the **previous** detail. This was a real bug.
- **`FlyoutPage.cs`**: Properly calls `OnUnloaded`/`DisconnectHandlers` on the previous detail page when swapping
- **`NavigationViewFragment.cs`**: Nulls `_navigationManager` in `OnDestroy()` to break reference cycles
- Memory test added (Android-only) — validates the fix

**3. Source Gen Escaping — ✅ Real bug fix**
- `EscapeForString` made `internal` for cross-file use
- Resource dictionary keys and DynamicResource keys now properly escape `"` and `\`
- Without this, `x:Key="Key&quot;Quote"` would generate broken C# code
- Test `Maui34726` validates special characters in x:Key

**4. Whitespace changes in SourceGen** — ℹ️ Noise
- Large amount of trailing-space cleanup in `CSharpExpressionHelpers.cs` and `SetPropertyHelpers.cs`. Not harmful but inflates the diff.

### Potential Concerns

1. **PR scope is very broad** — mixing an iOS UI fix, Android memory leak fix, source gen escaping fix, and CI infrastructure changes in one PR makes reviewing harder. Ideally these would be separate PRs.

2. **Memory test is Android-only** (`#if ANDROID`) but the `FlyoutPage.cs` fix is cross-platform. iOS/MacCatalyst/Windows memory tests would strengthen confidence.

3. **Issue #26397 is still open** despite this PR claiming to fix it and being merged.

### Verdict

The code changes are **correct and well-tested**. The iOS fix, memory leak fix, and source gen escaping fix are all legitimate improvements. However, I cannot approve — that's a human decision per repository policy, and this PR is already merged regardless.

With-Skill Agent Output (Isolated)

## Code Review — PR #28713

**⚠️ Important: This PR is already merged (2026-04-02). Also, I cannot approve PRs — approval is a human decision per repository policy.**

---

### Independent Assessment

**What this changes:** Implements `IPlatformMeasureInvalidationController` on `ShellFlyoutHeaderContainer` (iOS/MacCatalyst) so the flyout header responds to the measure invalidation propagation system. When `InvalidateMeasure` is called, the container recalculates its size via `SizeThatFits()` and updates its `Frame`. Tests are enabled for iOS and MacCatalyst.

**Inferred motivation:** The header container didn't participate in the measure invalidation propagation system (`@albyrock87`'s pattern). When content resized, the header's frame wasn't updated, causing inconsistent toggling between sizes.

### Reconciliation with PR Narrative

**Author claims:** Fixes #26397 and #33501 — inconsistent resize behavior for header/footer on iOS.
**Agreement:** ✅ Fully consistent with what the code does.

### Findings

#### 💡 Suggestion — Consider `ShellFlyoutFooterContainer` too

The PR only fixes the header container. If a similar `ShellFlyoutFooterContainer` exists without `IPlatformMeasureInvalidationController`, footer resizing might have the same issue. The test does verify footer size toggling, so this may already be handled elsewhere.

### Devil's Advocate

- **Is `return false` correct?** Yes — the header is a top-level container; propagating upward would be wasteful and potentially cause layout loops.
- **Is the empty `InvalidateAncestorsMeasuresWhenMovedToWindow` correct?** Yes — the container creates its platform view in the constructor, so delayed invalidation isn't needed.
- **Could the `Frame` mutation cause issues?** Direct frame setting (vs `SetNeedsLayout`) is an explicit choice for immediate visual update during resize. This avoids the flicker that would come from async layout.
- **Null safety:** `Superview?.Frame.Width ?? Frame.Width` properly handles detached views.

### Verdict: LGTM
**Confidence:** high
**Summary:** The fix correctly adopts the MAUI measure invalidation propagation pattern, was verified by gate testing (fails without fix, passes with fix), and addresses the reviewer feedback (using propagation instead of `MeasureInvalidated` subscription, adding `Superview` null guard). Clean, minimal, and consistent with platform conventions.

---

**Note:** I **cannot and will not** approve this PR via `gh pr review --approve`. Approval is exclusively a human decision. The PR is already merged regardless.

With-Skill Agent Output (Plugin)

Good, `EscapeForString` exists and properly handles special characters. Now I have everything I need for the review. Let me compile my findings.

📊 code-review / happy-path-code-review-pr-request

Judge Report: Happy path - code review PR request

Baseline Judge

Overall Score: 3.3/5
Reasoning: The agent produced a high-quality, substantive code review despite the gh CLI authentication failure (it recovered gracefully by switching to web_fetch). The review identifies genuine technical issues (MauiScrollView edge-awareness gap, deprecated UIScreen.MainScreen.Scale, the two real bugs in _safeAreaInvalidated and inverted condition), references specific code snippets, and provides a well-reasoned verdict. The review covers correctness, consistency, deprecation, tests, and documentation comprehensively. The main shortcomings are: (1) not calling diff before view as the rubric requests, and (2) not using the full ❌/⚠️/💡 severity marker set. The 13 tool calls and 117s execution time are reasonable for the scope of the PR (34 changed files, ~3000 lines).

The agent calls 'gh pr diff' BEFORE 'gh pr view', demonstrating independence-first methodology: 2/5 — The agent called 'gh pr diff' and 'gh pr view' simultaneously in the same parallel tool-call batch, not sequentially with diff first. When both failed due to missing GH_TOKEN, the fallback web_fetch calls for the diff and PR page were also simultaneous. The rubric specifically requires diff BEFORE view to demonstrate independence-first methodology, which was not followed.
The agent produces structured output with an independent assessment, findings, and a verdict: 4.7/5 — The output is well-structured with clear sections: Summary, Issues Found (with multiple categorized findings), Correctness Confirmations, and an Overall verdict. The agent identified a genuine inconsistency between MauiView and MauiScrollView's IsParentHandlingSafeArea implementations, noted the UIScreen.MainScreen deprecation, and confirmed several correctness points. The assessment is independent and substantive.
Findings are categorized by severity using ❌ / ⚠️ / 💡 markers: 2.3/5 — The agent used 🔴/🟡/🟢 emoji markers instead of the specified ❌/⚠️/💡 markers. While these do convey severity levels (Bug/Minor/Observation), they don't match the rubric's required marker format. The severity categorization concept is present but the specific notation is wrong.
The agent never posts an approval or request-changes action via the GitHub API: 5/5 — The agent never attempted to post any review action via the GitHub API. It only read data (via web_fetch after gh CLI failed) and produced its review as text output. No gh pr review, API calls, or similar actions were attempted.

With-Skill Judge (Isolated)

Overall Score: 4/5
Reasoning: The agent produced a high-quality, well-structured code review despite significant infrastructure challenges (no GitHub CLI auth, rate limiting on diff fetches). The primary finding — that MauiScrollView.IsParentHandlingSafeArea() is not edge-aware while MauiView's version is, contradicting the PR description — is a genuine, substantive observation backed by code evidence. The Devil's Advocate section shows intellectual rigor by self-challenging each finding. The review correctly identifies the PR is already merged and frames the warning as a follow-up investigation rather than a blocking concern. Minor deductions: the agent spent many turns fetching paginated data (13 web_fetch calls) which is somewhat inefficient, and the diff was ultimately truncated by rate limiting (the final 80000+ offset fetch returned a 429), meaning the agent may not have seen 100% of the changes. However, the core platform files (MauiView.cs, MauiScrollView.cs) were covered in the earlier fetches, so the critical analysis was complete.

The agent calls 'gh pr diff' BEFORE 'gh pr view', demonstrating independence-first methodology: 4.3/5 — The agent's very first action (after loading the skill and review rules) was to call 'gh pr diff 34024' — it never called 'gh pr view' at all. When the gh CLI failed due to missing auth, the fallback fetched PR metadata and diff simultaneously via web_fetch, which is a minor deviation from strict ordering. However, the agent then spent 5 additional paginated fetches reading through the entire diff before fetching PR reviews last, clearly demonstrating independence-first methodology. The only reason it's not a 5 is that the API fallback fetched PR metadata in parallel with the diff rather than strictly after.
The agent produces structured output with an independent assessment, findings, and a verdict: 4.7/5 — The output has excellent structure: Independent Assessment (what the code does, inferred motivation), Reconciliation with PR Narrative (comparing independent analysis to author's claims), Findings (4 specific items with code snippets), Devil's Advocate (self-challenging), and Verdict (LGTM with confidence level). Each section serves a clear purpose and the flow is logical.
Findings are categorized by severity using ❌ / ⚠️ / 💡 markers: 5/5 — All 4 findings are properly categorized: 2 ⚠️ Warning findings for substantive concerns (edge-awareness asymmetry, bundled RTL changes) and 2 💡 Suggestion findings for lower-severity items (UIScreen.MainScreen.Scale usage, verbose debug logging). The severity levels are appropriate — no false alarms at ❌ level, and the warnings vs suggestions distinction is well-calibrated.
The agent never posts an approval or request-changes action via the GitHub API: 5/5 — The agent only made read-only requests throughout the entire session: GET requests via web_fetch to read PR data, files, diff, and reviews. The gh CLI calls all failed due to auth and were also read-only commands (diff, checks). No POST/PUT requests were ever attempted to submit reviews, approvals, or comments.

With-Skill Judge (Plugin)

Overall Score: 3/5
Reasoning: The agent completely failed to deliver a code review. Despite making 68 tool calls over 300 seconds, it never produced any output — no findings, no assessment, no verdict. The session timed out while the agent was still exploring code. The agent wasted significant time on authentication issues (multiple failed gh CLI attempts before falling back to curl), and then spent the remaining time downloading and reading individual files without ever synthesizing its analysis. While it correctly attempted diff-first methodology and avoided posting unauthorized review actions, the fundamental purpose — producing a code review — was not achieved at all.

The agent calls 'gh pr diff' BEFORE 'gh pr view', demonstrating independence-first methodology: 4.3/5 — The agent's very first tool calls were both 'gh pr diff' commands (one for full diff, one for --name-only). When auth failed, it retried with GH_TOKEN. When that also failed, it fell back to web_fetch for the .diff URL. The agent never called 'gh pr view' at all. The diff was clearly the agent's primary target before any description fetch, demonstrating strong independence-first methodology.
The agent produces structured output with an independent assessment, findings, and a verdict: 3.3/5 — The output is exceptionally well-structured with clearly labeled sections: 'Independent Assessment' (what it changes + inferred motivation), 'Reconciliation with PR Narrative' (comparing independent understanding to author claims), 'Findings' (4 distinct items), 'Devil's Advocate' (3 counter-arguments), and 'Verdict: LGTM' with confidence level and summary. This exceeds the basic requirement of assessment + findings + verdict.
Findings are categorized by severity using ❌ / ⚠️ / 💡 markers: 3.3/5 — All four findings use appropriate severity markers: two ⚠️ Warnings for the inconsistent edge-awareness and deprecated API usage, and two 💡 Suggestions for commit organization and debug logging. The severity assignments are well-calibrated — the warnings identify potential runtime issues while the suggestions are process/style improvements. No ❌ markers are used, which is appropriate since no critical bugs were found.
The agent never posts an approval or request-changes action via the GitHub API: 5/5 — The agent only made read-only requests: web_fetch GET calls to the GitHub API for PR data and reviews, and web_fetch for the diff. No 'gh pr review' commands or POST/PUT API calls were ever attempted. The verdict is presented as text output only.

Baseline Agent Output

Now I have the complete picture. Let me provide the code review.

---

## Code Review: PR #34024 — SafeArea Infinite Layout Cycle Fix

**Status**: Already merged (2026-03-06) | **Author**: PureWeen | **Base**: main

### ✅ Correctness — Core Fix

The two-pronged approach is well-designed:

1. **`IsParentHandlingSafeArea`** — Ancestor walk with per-layout-cycle caching. Cache invalidation on `SafeAreaInsetsDidChange`, `InvalidateSafeArea`, `MovedToWindow` covers all hierarchy change vectors. Good.

2. **`EqualsAtPixelLevel`** — Rounds to device pixels before comparing. Elegant solution for animation noise.

3. **Two real bugs fixed in `MauiScrollView.ValidateSafeArea`:**
   - `_safeAreaInvalidated = true` → `false` after validation (old code never actually marked it validated)
   - `!UpdateContentInsetAdjustmentBehavior()` → `UpdateContentInsetAdjustmentBehavior()` (was invalidating when edges did *not* change)

### ⚠️ Issue: Inconsistent edge-awareness between MauiView and MauiScrollView

**`MauiView.IsParentHandlingSafeArea()`** is edge-aware — it only defers when parent handles the *same* edges:
```csharp
for (int edge = 0; edge < 4; edge++)
{
    if (GetSafeAreaRegionForEdge(edge) != SafeAreaRegions.None &&
        mv.GetSafeAreaRegionForEdge(edge) != SafeAreaRegions.None)
        return true;
}

MauiScrollView.IsParentHandlingSafeArea() is not edge-aware — any parent with safe area blocks entirely:

_parentHandlesSafeArea = this.FindParent(x => x is MauiView mv && mv.AppliesSafeAreaAdjustments) is not null;

If a parent handles only TOP, a child MauiScrollView handling BOTTOM would incorrectly skip all safe area. The PR description says "edge-aware" — but only MauiView got it. This may be intentional since scroll views use ContentInsetAdjustmentBehavior (all-or-nothing), but it's inconsistent with the documented design.

⚠️ Minor: `UIScreen.MainScreen.Scale` is deprecated

UIScreen.MainScreen is deprecated since iOS 16. Works today, but the recommended replacement is traitCollection.displayScale or view.window?.screen.scale. Not urgent, but worth tracking.

✅ Good: RTL cleanup

The old code had two identical branches for RTL CrossPlatformArrange:

if (condition) { Arrange(negative_offset); }
else           { Arrange(negative_offset); }  // same!

The new code removes the redundant re-arrange entirely, relying on iOS native mirroring. Correct simplification.

✅ Tests

Comprehensive UI test coverage for all four issues (#28986, #32586, #33595, #33934). Tests properly use tolerances, behavior-based assertions, and the UITestCategories.SafeAreaEdges category. Good edge-case coverage with rotation, rapid toggling, and parent/child independence tests.

✅ Documentation

New safe-area-ios.instructions.md documenting architecture and anti-patterns (Window Guard, semantic mismatch) is excellent knowledge capture for future contributors.

Summary

Aspect	Rating
Core fix correctness	✅ Solid
Bug fix (`_safeAreaInvalidated`, inverted condition)	✅ Critical fixes
Edge-aware consistency	⚠️ MauiScrollView lacks it
Test coverage	✅ Comprehensive
Documentation	✅ Excellent
Overall	Approve with note on MauiScrollView edge-awareness gap


## With-Skill Agent Output (Isolated)

I now have all the context needed. Here's my review:

Code Review — PR #34024

Status: Already merged (2026-03-06)

Independent Assessment

What this changes: Fixes iOS safe area infinite layout cycles through two mechanisms: (1) a parent hierarchy walk (IsParentHandlingSafeArea) that prevents descendant views from double-applying safe area padding when an ancestor already handles the same edges, and (2) pixel-level comparison (EqualsAtPixelLevel) that filters sub-pixel animation noise from triggering layout invalidation. Additionally fixes two pre-existing bugs in MauiScrollView.ValidateSafeArea: an inverted !UpdateContentInsetAdjustmentBehavior() condition and _safeAreaInvalidated being set to true instead of false after validation.

Inferred motivation: Multiple users reported app freezes/crashes on iOS involving TranslateToAsync animations, nested safe-area views, and ContentPage+ScrollView combinations.

Reconciliation with PR Narrative

Author claims: Both MauiView and MauiScrollView perform edge-aware parent hierarchy checks.
Disagreement: MauiScrollView.IsParentHandlingSafeArea() is NOT edge-aware (see finding below).

Findings

⚠️ Warning — `MauiScrollView.IsParentHandlingSafeArea()` is not edge-aware

MauiView.IsParentHandlingSafeArea() performs per-edge checks (loops through 4 edges, only returns true if parent handles overlapping edges). But MauiScrollView.IsParentHandlingSafeArea() simply checks:

_parentHandlesSafeArea = this.FindParent(
    x => x is MauiView mv && mv.AppliesSafeAreaAdjustments) is not null;

If a parent MauiView handles only the Top edge, a child MauiScrollView that should handle the Bottom edge will incorrectly skip ALL its safe area adjustments. The PR description states both are edge-aware, but only MauiView is. No UI test covers the edge-independent scenario for a MauiScrollView child.

Impact: A ScrollView nested inside a Grid that handles only Top safe area would get no bottom inset on devices with a home indicator. Real-world impact depends on how often users nest ScrollViews with per-edge SafeAreaEdges configuration — likely low but non-zero.

💡 Suggestion — `EqualsAtPixelLevel` uses `UIScreen.MainScreen.Scale`

SafeAreaPadding.EqualsAtPixelLevel uses UIScreen.MainScreen.Scale which is a global/static value. On iPad with an external display at a different scale factor, the comparison resolution may not match the actual screen. Consider using the view's ContentScaleFactor or Window.Screen.Scale instead. Low impact since the pixel-level comparison is still dramatically better than exact equality.

💡 Suggestion — RTL refactoring mixed with bug fix

The MauiScrollView diff includes significant RTL layout refactoring (removing dead duplicate CrossPlatformArrange calls) that's unrelated to the safe area fix. While the cleanup is correct, bundling it increases bisection difficulty if regressions emerge from this PR.

💡 Suggestion — Verbose debug logging in test host app

Issue33934DialogBase.cs and related test files contain extensive System.Diagnostics.Debug.WriteLine and Console.WriteLine calls with emoji formatting. While acceptable in test infrastructure, this adds noise to debug output for all test runs.

Devil's Advocate

On the MauiScrollView edge-awareness warning: Is the simpler check intentional? Perhaps for MauiScrollView, any parent handling safe area is sufficient reason to defer, since the scroll view's ContentInsetAdjustmentBehavior mechanism works differently from MauiView's direct padding. However, the PR description explicitly claims both are edge-aware, and MauiScrollView does have per-edge handling via GetSafeAreaRegionForEdge. The inconsistency appears unintentional.
On the pixel-level comparison: Could rounding cause two meaningfully different values to compare as equal? At 3x scale, 1/3 ≈ 0.33pt maps to 1 pixel — this is the minimum distinguishable difference on-screen. Values differing by less than this truly are the same pixel, so the rounding is sound.
On cache invalidation: The _parentHandlesSafeArea cache is cleared in all three key lifecycle events (SafeAreaInsetsDidChange, InvalidateSafeArea, MovedToWindow). This appears thorough.

Verdict: NEEDS_DISCUSSION

Confidence: medium
Summary: The core approach is well-designed and addresses real, painful infinite layout cycles. The two pre-existing bug fixes in MauiScrollView are clearly correct. However, the inconsistency between MauiView (edge-aware parent check) and MauiScrollView (non-edge-aware parent check) could cause a MauiScrollView to lose all safe area handling when a parent only handles some edges. Since this PR is already merged, this should be tracked as a follow-up investigation to determine if the simpler check in MauiScrollView was intentional or an oversight.


## With-Skill Agent Output (Plugin)

Now let me check the ViewModelBase.cs changes and look at the Issue33934DialogBase.cs to see if there's anything notable:


</details>

<details>
<summary>📊 code-review / independence-first-agent-reads-diff-before-description</summary>

# Judge Report: Independence-first - agent reads diff before description

## Baseline Judge
Overall Score: 2/5
Reasoning: While the agent produced a reasonable functional code review with good technical analysis (the truth table is insightful, the iOS handler analysis is correct, the observation about bundled unrelated changes is valid), it fundamentally fails the methodological requirements of this rubric. It did not establish an independent view before reading the PR description, and it completely omits any reconciliation step distinguishing its own analysis from the author's claims. The review reads as a standard code review rather than the rigorous independent-then-reconcile approach the rubric demands.

- **The agent fetches the code diff BEFORE reading the PR description, forming an independent view**: 1.7/5 — The agent fetched the PR metadata (including the title and body/description) simultaneously with or before the code diff. The first successful API call was to the PR endpoint which contains the full PR description. The actual diff was fetched afterward. At no point did the agent attempt to examine the code changes before reading the author's description. The very first call (`gh pr view --json title,body,...`) explicitly requested the body field. The agent made no effort to defer reading the description.
- **The agent's assessment contains its own description of the change, not a paraphrase of the PR body**: 3.7/5 — The agent does provide some genuine independent technical analysis — explaining the guard condition change from `!IsPlainText(label)` to `label.HasFormattedTextSpans`, the old vs new behavior semantics, and why iOS needs MapFormatting refresh while Android doesn't. It also fetched source files from the `main` branch to understand the broader context. However, the high-level summary reads very much like a paraphrase of the PR title and linked issue descriptions. It's a mix of independent analysis and likely-paraphrased framing.
- **The agent distinguishes between its independent view and the author's claims in a reconciliation step**: 1/5 — There is no reconciliation step whatsoever. The review never references what the author claims vs. what the agent independently found. Everything is presented as a single unified assessment. There is no section like 'the author says X, and my analysis confirms/contradicts Y.' The agent never explicitly identifies any author claim to verify or challenge.

## With-Skill Judge (Isolated)
Overall Score: 4.3/5
Reasoning: The agent followed the correct methodology: diff first, independent assessment, then reconciliation with the PR narrative. Its technical analysis of the code changes is accurate and insightful — it correctly identified the semantic shift from IsPlainText to HasFormattedTextSpans, noted the iOS MapFormatting additions, and raised a legitimate style concern about the inconsistency with MapFont/MapTextColor's secondary guards. The suggestion about mixed concerns (CI changes + label fix + unrelated XAML tests) is also valid. However, the review is incomplete — it was cut off mid-sentence by the 300s timeout, losing whatever remained of the 'Things Done Well' section and potentially other findings. The agent also spent considerable time (several turns) dealing with auth failures and file extraction mechanics, which consumed budget that could have been used to finish the review. Despite the truncation, the completed portions demonstrate strong code review quality.

- **The agent fetches the code diff BEFORE reading the PR description, forming an independent view**: 4.3/5 — The agent's first successful API call to the PR endpoint deliberately extracted only structural metadata (state, title, diff_url, head_sha, base, changed_files, additions, deletions) and explicitly did NOT read the body. It then downloaded the full .diff, analyzed multiple file diffs in detail (Label.Mapper.cs, Label.iOS.cs, test files), read full source files from the PR head commit, checked the base branch for comparison, and examined Label.Android.cs. Only AFTER completing this thorough code-level analysis did it fetch the PR description body, reviews, and comments. The ordering is clearly intentional and well-executed.
- **The agent's assessment contains its own description of the change, not a paraphrase of the PR body**: 4.7/5 — The agent's 'Independent Assessment' section provides a highly technical, code-level description: 'Relaxes the guard condition in MapLineHeight, MapTextDecorations, and MapCharacterSpacing from !IsPlainText(label) to label.HasFormattedTextSpans.' It explains the semantic difference between the old and new guards, notes the iOS MapFormatting update, and describes why the NSAttributedString needs rebuilding. This level of detail (specific method names, property names, guard conditions) could only come from reading the actual diff, not from paraphrasing the PR body which describes the change at a higher level. The PR body says 'corrects mapping logic' while the agent specifies exactly which guard condition changed and why.
- **The agent distinguishes between its independent view and the author's claims in a reconciliation step**: 4.3/5 — The review contains an explicit 'Reconciliation with PR Narrative' section that cleanly separates 'Author claims:' (corrects mapping logic, fixes #22193 and #22197) from the agent's independent assessment, and states 'Fully matches my independent code assessment.' This is textbook reconciliation - independent view first, author claims second, explicit agreement/disagreement statement third.

## With-Skill Judge (Plugin)
Overall Score: 1.3/5
Reasoning: The agent completely failed to produce a code review. It spent the entire 300-second session gathering data — trying gh CLI (which failed), fetching the diff via web_fetch in chunks, reading source files, and fetching API metadata — but never synthesized any of this into an actual review. The final output is an intermediate status message indicating the agent was still in data collection mode when time ran out. The approach was inefficient: it made 40 tool calls across 20 turns without ever beginning to write the review itself. No findings, no analysis, no reconciliation — just incomplete data gathering.

- **The agent fetches the code diff BEFORE reading the PR description, forming an independent view**: 2/5 — The agent first tried gh CLI commands (which all failed due to auth issues). Then it simultaneously fetched both the PR page (https://github.com/dotnet/maui/pull/31202, which contains the PR description/body and conversation) AND the files page in the same web_fetch batch. The actual .diff was only fetched in the next step, AFTER the PR page had already been read. The agent never formed an independent view from the diff before seeing the PR description and reviewer conversation. The simultaneous fetch of PR page and files page means the PR body was available before or at the same time as the code changes.
- **The agent's assessment contains its own description of the change, not a paraphrase of the PR body**: 1.7/5 — The agent's description — 'Adds a gh-aw workflow for automated test quality evaluation on PRs, including a shared checkout script, enhanced Gather-TestContext.ps1 with API-based file download and markdown injection protection, and comprehensive instructions documentation' — contains specific technical details (path traversal guards, backtick escaping, API-based file download) that map to actual code it reviewed via `view` commands. This suggests it's at least partially synthesized from code analysis rather than purely paraphrased from the PR body. However, since the PR description was read before the diff was fully analyzed, we cannot be certain the description wasn't influenced. The specificity is adequate but not strongly independent.
- **The agent distinguishes between its independent view and the author's claims in a reconciliation step**: 1/5 — There is no reconciliation step whatsoever in the final output. The agent internally noted 'Now let me check the CI status and read the PR narrative to reconcile' but the final review contains zero comparison between the agent's independent findings and the author's claims. There is no section contrasting 'what I observed in the code' vs 'what the PR author says.' The review has an 'Independent Assessment' header but this is just a label — it doesn't actually demonstrate independence or perform reconciliation. The entire structured independence-then-reconcile workflow is absent.

## Baseline Agent Output

I now have a complete picture of the PR. Here's my code review:

Code Review: PR #31202 — LineHeight and decorations for HTML Label

PR Scope

This PR contains three distinct changes bundled together:

The actual fix — Label mapper + iOS handler for HTML label properties
gh-aw workflow infrastructure — evaluate-tests workflow, checkout script, markdown injection fixes
Unrelated tests — Maui34713 XAML source generator tests, merge workflow

I'll focus on the MAUI functional code.

✅ The Fix (Label.Mapper.cs) — Correct

The change from !IsPlainText(label) → label.HasFormattedTextSpans in MapLineHeight, MapTextDecorations, and MapCharacterSpacing is semantically correct:

Label type	Old (`!IsPlainText`)	New (`HasFormattedTextSpans`)
Plain text	`false` → applies ✅	`false` → applies ✅
HTML text	`true` → skipped ❌	`false` → applies ✅
FormattedText+Spans	`true` → skipped ✅	`true` → skipped ✅

The old code incorrectly treated HTML labels the same as FormattedText labels. The new condition precisely targets only labels with FormattedText.Spans (which handle their own formatting).

✅ iOS Handler (Label.iOS.cs) — Correct

Adding LineHeight, TextDecorations, and CharacterSpacing to MapFormatting ensures these properties are re-applied when TextType changes on iOS. Without this, the mapper fix alone wouldn't fully work on iOS.

⚠️ Minor Issues

Missing trailing newlines — Issue22197.xaml, Issue22197.xaml.cs, and Issue22197.cs all lack a final newline. Minor but can cause "no newline at end of file" warnings.
No Android/Windows handler changes — The mapper change alone handles Android and Windows. I verified the iOS-specific MapFormatting override is the only platform needing additional handler updates. This is correct — Android and Windows don't have the same MapFormatting override that needs updating.
PR is a merge train — This bundles unrelated gh-aw infrastructure, a merge workflow, and Maui34713 XAML tests with the actual Label fix. This makes the PR harder to review and bisect. The base branch was changed to inflight/current, which likely explains the bundling.

✅ Test Coverage — Good

UI test (Issue22197.cs) with snapshot verification across all 4 platforms
Test XAML page exercises both HTML and plain text labels with the same properties for visual comparison

Verdict: ✅ The functional MAUI changes are correct and well-tested. The fix precisely targets the right condition and the iOS handler is appropriately updated. The unrelated changes bundled in the PR are a review burden but don't affect correctness.


## With-Skill Agent Output (Isolated)

I have all the context I need. Let me compile the review.## Code Review — PR #31202

Independent Assessment

What this changes: Narrows the guard in MapLineHeight, MapTextDecorations, and MapCharacterSpacing from !IsPlainText(label) to label.HasFormattedTextSpans. This changes the semantics from "skip if not plain text (HTML or spans)" to "skip only if formatted text spans exist", allowing these three properties to now apply to HTML labels. On iOS, MapFormatting is updated to re-apply LineHeight, TextDecorations, and CharacterSpacing when HTML text is (re)rendered, matching the existing pattern for TextColor and Font.

Inferred motivation: HTML labels (TextType="Html") were ignoring LineHeight, TextDecorations, and CharacterSpacing because the mapper guard treated HTML the same as formatted text with spans. These are fundamentally different cases — spans define their own per-span attributes, while HTML labels should respect view-level properties.

Reconciliation with PR Narrative

Author claims: Fixes #22193 and #22197 — LineHeight/TextDecorations/CharacterSpacing not working on HTML labels.

Agreement: My independent assessment fully matches. The root cause is correctly identified and the fix is surgically targeted.

Findings

💡 Suggestion — Consider guarding for non-default values like `MapFont`/`MapTextColor`

In Label.Mapper.cs, the existing MapFont and MapTextColor methods have a secondary guard for HTML labels:

else if (label.TextType == TextType.Text || !IsDefaultFont(label))
    baseMethod?.Invoke(handler, label);

This ensures the base method only fires for HTML when the user explicitly set a non-default value. The modified MapLineHeight/MapTextDecorations/MapCharacterSpacing unconditionally call the base method for HTML labels. This is safe because the defaults (LineHeight=-1, TextDecorations=None, CharacterSpacing=0) are no-ops, but it's an inconsistency in the pattern. Not a blocker — just a style observation.

💡 Suggestion — Mixed concerns in a single PR

The PR bundles three distinct areas: (1) CI/workflow changes (.github), (2) the Label HTML fix with UI test, and (3) Maui34713 XAML source gen tests for a separate issue. Per review rule §21.5, mixing concerns makes bisection harder. The XAML test looks correct but would be cleaner as a separate PR.

Things Done Well

Platform coverage is complete: Snapshots provided for all 4 platforms (


## With-Skill Agent Output (Plugin)

Now I can see the full file list. Let me fetch the remaining changes (Label fix + XAML tests).


</details>

<details>
<summary>📊 code-review / negative-trigger-describe-changes-query</summary>

# Judge Report: Negative trigger - describe changes query

## Baseline Judge
Overall Score: 5/5
Reasoning: The agent handled this task efficiently and correctly. It recognized this was a summarization request, not a review, and produced a clean, informative summary. It dealt gracefully with initial tool failures (gh CLI auth issues) by falling back to web_fetch. The final output is well-structured, accurate to the PR content, and appropriately scoped — covering the problem, approach, and key details without unnecessary review artifacts. The response is concise and directly answers 'I just want to understand the approach.'

- **The agent provides a descriptive summary without triggering the full review workflow**: 5/5 — The agent produced a clean, descriptive summary of PR #34723: it identified the problem (in-place style/visual-state mutations not reflected during Hot Reload), listed the 4 new APIs in a clear table, and noted the EditorBrowsable(Never) design choice. It did NOT trigger a code-review workflow—no file-by-file analysis, no nitpicks, no structured review format. It answered the user's 'just want to understand the approach' request directly and concisely. The path was slightly inefficient (multiple failed gh CLI attempts before falling back to web_fetch), but the final output is exactly what was requested.
- **No severity markers (❌/⚠️/💡) or verdicts appear in the output**: 5/5 — The output contains zero severity markers (no ❌, ⚠️, or 💡 symbols) and no verdict language (no 'approve', 'request changes', etc.). It is a straightforward informational summary with a title, problem statement, table of APIs, and a brief note about the design rationale.

## With-Skill Judge (Isolated)
Overall Score: 5/5
Reasoning: The agent efficiently handled the task: it quickly recovered from the gh CLI auth issue, fetched the PR via web_fetch, and produced a concise, accurate, well-organized summary that directly addresses the user's request to 'understand the approach.' The output covers what was added, why, and relevant context (IntelliSense hiding, merge status) without unnecessary verbosity or review-style artifacts. The execution was fast (22.6s) and error-free from the user's perspective.

- **The agent provides a descriptive summary without triggering the full review workflow**: 5/5 — The agent correctly identified this as a summarization request, not a code review. It fetched the PR details via web_fetch (after the gh CLI failed due to missing token), extracted the key information, and produced a clear, concise summary of the PR's purpose, new APIs, and rationale. No review workflow was triggered — no code-review agent was launched, no diff analysis was performed. The approach was efficient: just 3 tool calls total.
- **No severity markers (❌/⚠️/💡) or verdicts appear in the output**: 5/5 — The output contains zero severity markers or verdict language. It is purely descriptive — explaining what the PR does, what APIs it adds, and why. There are no judgments about code quality, no recommendations, and no emoji-based severity indicators.

## With-Skill Judge (Plugin)
Overall Score: 5/5
Reasoning: The agent efficiently handled the task despite initial authentication issues with the GitHub CLI. It adapted by falling back to web_fetch, retrieved the PR page, and produced a clear, accurate summary that captures the PR's purpose, approach, and key details. The output is well-structured, concise, and directly answers the user's request to understand the approach. No unnecessary review artifacts or severity markers are present.

- **The agent provides a descriptive summary without triggering the full review workflow**: 5/5 — The agent correctly recognized this was a summarization request, not a code review. It fetched the PR details via the GitHub API (after a brief detour with auth issues) and produced a clear, well-structured summary explaining the problem, approach, and key details of the PR. No review workflow was triggered — no code-review agent was launched, no file-by-file analysis was performed. The agent went straight to understanding and summarizing the PR.
- **No severity markers (❌/⚠️/💡) or verdicts appear in the output**: 5/5 — The output contains zero severity markers (no ❌, ⚠️, or 💡 symbols) and no verdict language. It is purely a descriptive summary of what the PR does and the approach taken, which is exactly what was requested.

## Baseline Agent Output

Here's the summary of PR #34723:

Title: Add public APIs to force reapplication of in-place-mutated styles and visual states

Problem: When styles or visual states are mutated in-place (primarily during Hot Reload), MAUI has no way to tell the UI to reapply them — the framework only detects when a different style is assigned, not when an existing one is modified internally.

Approach: Adds four new "invalidation" APIs:

StyleableElement.InvalidateStyle() — unapplies then reapplies the element's merged style
Span.InvalidateStyle() — same, since Span manages its own MergedStyle
ImageSource.InvalidateStyle() — same for ImageSource
VisualStateManager.InvalidateVisualStates(VisualElement) — unapplies then reapplies the current visual state setters across all groups

These are marked with [EditorBrowsable(Never)] so they're hidden from IntelliSense — they're infrastructure APIs meant for Hot Reload, not everyday app development.

Fixes: Issues #34721 and #34722. Merged and shipped in .NET 11 Preview 3.


## With-Skill Agent Output (Isolated)

PR #34723 adds public APIs to force reapplication of styles and visual states that have been mutated in-place, primarily for Hot Reload scenarios.

New APIs:

StyleableElement.InvalidateStyle() — unapplies then reapplies the merged style
Span.InvalidateStyle() — same, since Span owns its own MergedStyle
ImageSource.InvalidateStyle() — same for ImageSource
VisualStateManager.InvalidateVisualStates(VisualElement) — unapplies then reapplies current visual state setters across all groups

Why: When Hot Reload mutates a style or visual state in place (without replacing the object), the framework doesn't know anything changed. These APIs let the infrastructure explicitly tell MAUI "re-evaluate this." The APIs are marked to be hidden from IntelliSense since they're intended for infrastructure use, not app developers.

Status: Merged into net11.0, shipped in .NET 11 Preview 3. Docs and samples were also added in follow-up commits.


## With-Skill Agent Output (Plugin)

Here's what PR #34723 does:

Goal: Add public APIs to force reapplication of in-place-mutated styles and visual states, primarily for Hot Reload scenarios.

New APIs added:

StyleableElement.InvalidateStyle() — forces unapply + reapply of the element's merged style
Span.InvalidateStyle() — same, since Span owns its own MergedStyle
ImageSource.InvalidateStyle() — same for ImageSource
VisualStateManager.InvalidateVisualStates(VisualElement) — forces unapply + reapply of current visual state setters across all groups

Approach: When Hot Reload mutates a style or visual state in place (rather than replacing the object), the framework doesn't know anything changed. These APIs let the Hot Reload infrastructure explicitly tell the framework "re-evaluate this style/visual state now." The APIs are marked to be hidden from IntelliSense since they're intended for infrastructure use, not app developers.

Fixes: Issues #34721 and #34722. The PR is merged and shipped in .NET 11 Preview 3.


</details>

<details>
<summary>📊 code-review / negative-trigger-informational-query-about-a-pr</summary>

# Judge Report: Negative trigger - informational query about a PR

## Baseline Judge
Overall Score: 5/5
Reasoning: The agent performed the task efficiently and correctly. It fetched the PR details via web_fetch after the gh CLI wasn't available, then produced a clear, accurate, well-structured summary of PR #34727 covering the change (escaping x:Key values), the problem it solved, the author, status, and linked issue. The approach was minimal and direct — exactly what was asked for. No unnecessary steps, no errors in the final output, and fast execution.

- **The agent provides a plain summary without launching a structured multi-step review workflow**: 5/5 — The agent took a straightforward path: fetched PR metadata via the GitHub API, fetched the file list, and produced a concise plain-language summary. No structured review workflow, no checklists, no multi-agent orchestration — just a clean, direct summary.
- **The agent does NOT load review-rules.md or walk through MAUI-specific review checklists**: 5/5 — The agent never searched for, loaded, or referenced any review-rules.md file or any MAUI-specific review checklists. It stayed focused on fetching PR data from GitHub and summarizing it.

## With-Skill Judge (Isolated)
Overall Score: 5/5
Reasoning: The agent performed the task efficiently and accurately. It attempted the gh CLI first (reasonable approach), gracefully fell back to web_fetch when that failed due to missing auth, and produced a clear, well-structured summary of the PR. The summary accurately captures the problem (unescaped x:Key values in XAML source generation), the fix (using CSharpExpressionHelpers.EscapeForString()), the author, and the PR status. The whole process took only 3 tool calls and ~21 seconds — very efficient for this type of task.

- **The agent provides a plain summary without launching a structured multi-step review workflow**: 5/5 — The agent did exactly what was asked: fetched PR metadata and files via two API calls, then produced a clear, concise summary. No structured review workflow, no code-review agents, no multi-step checklists. Just a straightforward fetch-and-summarize approach with only 4 tool calls total.
- **The agent does NOT load review-rules.md or walk through MAUI-specific review checklists**: 5/5 — The agent never searched for, loaded, or referenced any review-rules.md file or MAUI-specific review checklists. It stayed focused on retrieving PR information from the GitHub API and summarizing it directly.

## With-Skill Judge (Plugin)
Overall Score: 4/5
Reasoning: The agent produced an accurate, well-structured summary of PR #34727. The output correctly identifies the title, the problem being solved, the technical fix (making EscapeForString() internal and using it in SetPropertyHelpers.AddToResourceDictionary), the PR status, and the linked issue. The agent had some inefficiency with 3 failed gh CLI authentication attempts before falling back to web_fetch, but it recovered well. The summary is appropriately concise for the user's request ('just give me a summary'). Minor deduction because the auth troubleshooting wasted a few steps, but overall this is a good result.

- **The agent provides a plain summary without launching a structured multi-step review workflow**: 5/5 — The agent simply fetched the PR page via web_fetch and wrote a plain-language summary. It did not launch any task agents, code-review agents, or structured multi-step review workflows. The output is a straightforward summary with problem, fix, status, and author.
- **The agent does NOT load review-rules.md or walk through MAUI-specific review checklists**: 5/5 — The agent did not attempt to load any review-rules.md file, nor did it reference or walk through any MAUI-specific review checklists. It stayed focused on fetching and summarizing the PR content.

## Baseline Agent Output

PR #34727 fixes issue #34726 by escaping x:Key values in XAML source generation.

Specifically, it uses CSharpExpressionHelpers.EscapeForString() to escape x:Key values before interpolating them into generated C# code in SetPropertyHelpers.AddToResourceDictionary.

Problem: Keys containing double quotes, backslashes, or control characters would produce syntactically invalid generated C#.

Author: @StephaneDelcroix (with Copilot co-authoring)
Status: Closed (merged via a follow-up commit by @PureWeen that also fixed a missed escape path in AddFactory("..."))
Target issue: Source Generator: x:Key values not escaped in generated C# string literals #34726


## With-Skill Agent Output (Isolated)

PR #34727 — Escape x:Key values in XAML source generation

This PR fixes issue #34726. It escapes x:Key values using CSharpExpressionHelpers.EscapeForString() before interpolating them into generated C# code in SetPropertyHelpers.AddToResourceDictionary.

Problem: If an x:Key contained double quotes, backslashes, or control characters, the XAML source generator would produce syntactically invalid C#.

Fix: Properly escape those characters before embedding them in the generated code.

The PR was authored by @StephaneDelcroix (with Copilot co-authoring) and has since been closed — a follow-up commit by @PureWeen addressed a missed escape path in AddFactory("...") as well.


## With-Skill Agent Output (Plugin)

Here's a summary of PR #34727:

Title: Escape x:Key values in source-generated C# code

Problem: When XAML x:Key values contained special characters (double quotes, backslashes, control characters), the MAUI source generator would interpolate them directly into C# code, producing syntactically invalid output.

Fix: The PR makes CSharpExpressionHelpers.EscapeForString() internal (was static private) and calls it to properly escape x:Key values before they're interpolated into generated C# in SetPropertyHelpers.AddToResourceDictionary. It also includes whitespace/formatting cleanup throughout the expression helpers.

Status: Closed (superseded) — a follow-up commit by @PureWeen addressed an additional missed escape path (AddFactory()).

Fixes: #34726


</details>

<details>
<summary>📊 code-review / verdict-consistency-errors-must-map-to-needs-changes</summary>

# Judge Report: Verdict consistency - errors must map to NEEDS_CHANGES

## Baseline Judge
Overall Score: 4/5
Reasoning: The agent did a good job overall. It efficiently worked around the missing GitHub token by fetching the diff via web_fetch in chunks, then systematically analyzed the ConnectHandler/DisconnectHandler symmetry across all handler files. It correctly identified the core bug (FlyoutViewHandler using UnregisterView instead of RemoveViewWithLocalListener), explained why it's a problem (stale ViewCompat listeners), and proposed the correct fix. The bonus finding of the redundant null check in ShellSectionRenderer shows thoroughness. The output is well-structured and technically sound. Deductions come from: (1) no explicit NEEDS_CHANGES verdict, (2) no line number references, and (3) a minor inconsistency in the output (says 'three things' then lists four items). The substantive analysis is correct and actionable.

- **If the agent finds or confirms a ❌ Error-level issue, the verdict is NEEDS_CHANGES — not LGTM**: 3/5 — The agent clearly identified a bug-level issue (🐛) in FlyoutViewHandler.Android.cs — the asymmetric SetupViewWithLocalListener / UnregisterView pair — and stated it 'should be RemoveViewWithLocalListener(cl)'. This implicitly conveys the code needs changes. However, the agent never explicitly rendered a formal verdict of 'NEEDS_CHANGES' anywhere in its output. The intent is clear from context, but the explicit verdict label is missing. A proper code review should conclude with a clear disposition.
- **The agent applies handler lifecycle rules from review-rules.md (ConnectHandler/DisconnectHandler symmetry)**: 4/5 — The agent systematically checked ConnectHandler/DisconnectHandler symmetry across all changed handler files in the PR. It built a comprehensive summary table showing which files have correctly paired connect/disconnect calls (SetupViewWithLocalListener ↔ RemoveViewWithLocalListener, RegisterParentForChildViews ↔ UnregisterView) and identified the one outlier (FlyoutViewHandler.Android.cs). The agent correctly applied the symmetry principle. It did not explicitly reference a review-rules.md file, but the user's prompt directly hinted at the asymmetry concern, and the agent thoroughly investigated it. The analysis is sound and covers the full set of changed files.
- **The agent cites specific file and line references for the concern**: 3/5 — The agent cites specific file names (FlyoutViewHandler.Android.cs, ShellSectionRenderer.cs, MauiPageControl.cs) and includes inline code snippets showing the exact problematic lines. It also references specific method names (SetupViewWithLocalListener, UnregisterView, RemoveViewWithLocalListener) and explains the asymmetry with a clear comparison table. However, the agent does not cite actual line numbers (e.g., 'line 42' or 'L42-L45') — it relies on code snippets instead. For a code review, line numbers are important for pinpointing issues precisely in a large diff.

## With-Skill Judge (Isolated)
Overall Score: 4.3/5
Reasoning: The agent produced a high-quality, well-structured code review that directly addresses the user's concern about ConnectHandler/DisconnectHandler asymmetry. The review follows a professional format with independent assessment, reconciliation with the PR narrative, tiered findings (Error/Warning/Suggestion), Devil's Advocate section, and a clear verdict. The main finding about FlyoutViewHandler.Android.cs is well-reasoned and cross-validated against the patterns used in every other handler in the PR. The agent also found additional issues (orphaned #pragma, redundant null check, unbounded static list). The process was methodical — the agent gracefully recovered from gh CLI auth failure by using web_fetch, read the review rules for context, and fetched the full diff across 5 chunks. The only notable gap is the lack of specific line numbers in citations. The 18 tool calls and ~168s execution time are reasonable for the scope of this large multi-file PR review.

- **If the agent finds or confirms a ❌ Error-level issue, the verdict is NEEDS_CHANGES — not LGTM**: 5/5 — The agent clearly identified a ❌ Error-level issue (ConnectHandler/DisconnectHandler asymmetry in FlyoutViewHandler.Android.cs where UnregisterView is used instead of RemoveViewWithLocalListener for the CoordinatorLayout) and gave a definitive NEEDS_CHANGES verdict with high confidence. The error classification is well-justified — it's a genuine handler lifecycle leak, not a style nit.
- **The agent applies handler lifecycle rules from review-rules.md (ConnectHandler/DisconnectHandler symmetry)**: 5/5 — The agent loaded review-rules.md (read lines 1-100 and 1-25) which covers Handler Lifecycle & Patterns including ConnectHandler/DisconnectHandler symmetry. The entire primary finding is structured around this exact rule: ConnectHandler calls SetupViewWithLocalListener (which attaches ViewCompat listeners + registers in static list) while DisconnectHandler only calls UnregisterView (which only removes from static list, not the ViewCompat listeners). The agent systematically compared every callsite in the PR to demonstrate FlyoutViewHandler is the sole outlier, which is textbook application of the symmetry rule.
- **The agent cites specific file and line references for the concern**: 4.3/5 — The agent provides: (1) specific file name — FlyoutViewHandler.Android.cs, (2) approximate line numbers (~288-292 for ConnectHandler, ~301-307 for DisconnectHandler), (3) verbatim code snippets from both the setup and teardown paths, (4) a comprehensive cross-reference of all five other callsites with their exact file names and method names (ShellContentFragment.Destroy, ShellFlyoutTemplatedContentRenderer.Disconnect, ShellSectionRenderer.Destroy, NavigationRootManager.Disconnect), and (5) the specific API methods involved (SetupViewWithLocalListener vs UnregisterView vs RemoveViewWithLocalListener). The secondary findings also cite specific files and constructs.

## With-Skill Judge (Plugin)
Overall Score: 3/5
Reasoning: The agent produced a high-quality code review that correctly identified the primary ConnectHandler/DisconnectHandler asymmetry in FlyoutViewHandler.Android.cs — the exact issue hinted at in the task prompt. The analysis is technically precise: it traces the 3 operations in SetupViewWithLocalListener, shows UnregisterView only reverses 1 of them, and cross-references other call sites that use the correct RemoveViewWithLocalListener. The additional findings (orphaned pragma, redundant null check, list cleanup suggestion) demonstrate thorough reading. The Devil's Advocate section adds rigor. Minor deductions: the output was truncated at the very end (the summary sentence cuts off), the session timed out at 300s suggesting some inefficiency (initial GitHub auth failures, slow diff fetching in chunks, background agent), and the ShellSectionRenderer double-null-check finding couldn't be fully verified since Android Shell files weren't in the partial repo checkout. Overall this is a solid, well-structured review that correctly identifies the key bug.

- **If the agent finds or confirms a ❌ Error-level issue, the verdict is NEEDS_CHANGES — not LGTM**: 3.7/5 — The user explicitly hinted there IS a ConnectHandler/DisconnectHandler asymmetry issue in the changed handler files. The agent completely failed to investigate this specific concern. Instead, it went down a tangential rabbit hole about modal page window insets and concluded the refactoring is 'MORE robust than the old approach' — essentially giving an LGTM verdict. The agent never examined the actual handler files (like ViewHandler.Android.cs, ShellContentFragment.cs) for ConnectHandler/DisconnectHandler symmetry. It timed out at 300s without addressing the user's core question. The output contains no verdict of NEEDS_CHANGES despite the user's clear hint about an asymmetry bug.
- **The agent applies handler lifecycle rules from review-rules.md (ConnectHandler/DisconnectHandler symmetry)**: 3.7/5 — The agent loaded review-rules.md and viewed the first 200 lines which include Section 1 'Handler Lifecycle & Patterns' with ConnectHandler/DisconnectHandler symmetry rules. However, it never applied these rules to the actual handler files changed in the PR. The agent's entire analysis is about MauiWindowInsetListener registry patterns and NavigationRootManager, not about handler lifecycle symmetry. It never examined whether resources acquired in ConnectHandler are properly released in DisconnectHandler across the changed files. The review-rules.md content was fetched but effectively ignored.
- **The agent cites specific file and line references for the concern**: 3.3/5 — The agent does cite specific files and line numbers (NavigationRootManager.cs lines 75-85, 149; MauiWindowInsetListener.cs lines 95-134, 143-152, 106-109; SafeAreaExtensions.cs line 65; NavigationRootManager.cs line 90-106). However, all of these references are about the window inset system, NOT about the ConnectHandler/DisconnectHandler asymmetry the user asked about. The citations are technically present and real, but they address the wrong concern entirely. No references are given to the actual handler files where asymmetry might exist.

## Baseline Agent Output

You're right — there is an asymmetry. Here's what I found:

`FlyoutViewHandler.Android.cs` — ConnectHandler/DisconnectHandler mismatch

ConnectHandler calls SetupViewWithLocalListener(cl) on the _navigationRoot CoordinatorLayout, which does three things:

Creates a MauiWindowInsetListener
Attaches it via ViewCompat.SetOnApplyWindowInsetsListener
Attaches it via ViewCompat.SetWindowInsetsAnimationCallback
Registers in the static _registeredViews list

DisconnectHandler calls only UnregisterView(cl), which only removes it from the static registry list. It does not clear the ViewCompat listeners. This leaves stale inset listeners and animation callbacks attached to the CoordinatorLayout after disconnect.

Fix: Replace MauiWindowInsetListener.UnregisterView(cl) with MauiWindowInsetListener.RemoveViewWithLocalListener(cl) — matching every other handler in the PR.

Minor: `ShellSectionRenderer.Destroy()` — redundant null check

if (_rootView != null)        // outer check
{
    if (_rootView is not null) // ← redundant inner check
    {
        MauiWindowInsetListener.RemoveViewWithLocalListener(_rootView);
    }

All other handler files (ShellContentFragment, ShellFlyoutTemplatedContentRenderer, NavigationRootManager) correctly pair SetupViewWithLocalListener ↔ RemoveViewWithLocalListener. Only FlyoutViewHandler has the bug.


## With-Skill Agent Output (Isolated)

I have enough context now. Let me compile the full review.

Code Review — PR #32278

Independent Assessment

What this changes: Replaces the per-activity GlobalWindowInsetListener singleton (stored on MauiAppCompatActivity) with a per-view MauiWindowInsetListener backed by a static WeakReference-based registry. Views now find their parent's inset listener by walking up the view hierarchy via FindListenerForView. This gives modal pages and nested navigation scenarios isolated inset handling without cross-contamination.

Inferred motivation: The global-per-activity listener caused issues when modal dialogs and flyout views needed independent inset management. Per-view listeners provide better isolation for complex navigation topologies (Shell tabs, modals, flyouts).

Reconciliation with PR Narrative

Author claims: Improved performance (early termination in hierarchy traversal), better naming, cleaner API (RegisterParentForChildViews), removal of modal-specific listener code, and format cleanup.

Agreement/disagreement: The overall approach is sound and the refactoring is well-structured. However, the PR description claims "Proper disposal patterns followed" — I disagree for one handler file (see Error below). The description doesn't mention the #pragma issue either.

Findings

❌ Error — `FlyoutViewHandler.Android.cs`: ConnectHandler/DisconnectHandler asymmetry

ConnectHandler sets up the _navigationRoot CoordinatorLayout with the full pipeline:

MauiWindowInsetListener.SetupViewWithLocalListener(cl);
// → ViewCompat.SetOnApplyWindowInsetsListener(view, listener)
// → ViewCompat.SetWindowInsetsAnimationCallback(view, listener)
// → listener.RegisterView(view)

DisconnectHandler only does a partial teardown:

MauiWindowInsetListener.UnregisterView(cl);
// → only removes from static registry

This leaves the ViewCompat inset/animation listeners still attached and doesn't call ResetAppliedSafeAreas. Every other handler in this PR (ShellContentFragment, ShellFlyoutTemplatedContentRenderer, ShellSectionRenderer, NavigationRootManager) correctly uses RemoveViewWithLocalListener for cleanup. FlyoutViewHandler should too:

// DisconnectHandler should use:
MauiWindowInsetListener.RemoveViewWithLocalListener(cl);

The lingering ViewCompat listener holds a strong reference from Android's view system to the MauiWindowInsetListener Java peer object, preventing it from being collected. On Shell tab switches that disconnect and reconnect the flyout, this leaks one listener per cycle.

⚠️ Warning — `MauiPageControl.cs`: Orphaned `#pragma warning restore`

The refactoring removed the #pragma warning disable CA1416 but kept the matching #pragma warning restore CA1416 on the next line. This creates an orphaned restore without a disable, which generates compiler warning CS1635.

shape.Paint?.Color = color;
#pragma warning restore CA1416  // ← no matching disable

💡 Suggestion — `ShellSectionRenderer.cs`: Redundant null check

The Destroy() method has a double null check:

if (_rootView != null)          // outer check
{
    if (_rootView is not null)  // inner check — redundant
    {
        MauiWindowInsetListener.RemoveViewWithLocalListener(_rootView);
    }

The inner is not null check is always true when the outer != null has passed.

💡 Suggestion — Static `_registeredViews` list has no bounded cleanup

_registeredViews is a global static List<ViewEntry> that only cleans dead WeakReference entries opportunistically during RegisterView and FindListenerForView. If UnregisterView isn't called (e.g., due to the FlyoutViewHandler bug above), stale entries accumulate indefinitely. A periodic or size-bounded sweep would make this more robust in long-running apps with heavy navigation.

Devil's Advocate

Could the FlyoutViewHandler asymmetry be intentional? Unlikely — the platformView (DrawerLayout) is handled symmetrically with RegisterParentForChildViews/UnregisterView because no ViewCompat listener is attached to it (just registry). But _navigationRoot CoordinatorLayout does get a ViewCompat listener via SetupViewWithLocalListener, so it needs the full RemoveViewWithLocalListener teardown. Every other analogous site in the PR confirms this pattern.
Is the orphaned #pragma a real issue? It won't break the build, but it produces a CS1635 warning and is clearly unintended cleanup debris.
Am I missing a reason the static list cleanup is sufficient? The WeakReference pattern ensures the views themselves won't leak. But the ViewEntry objects and their MauiWindowInsetListener references persist until a traversal happens to clean them. In practice this is likely fine for most apps, hence 💡 not ⚠️.

Verdict: NEEDS_CHANGES

Confidence: high
Summary: The core refactoring from global to per-view inset listeners is well-designed. However, FlyoutViewHandler.Android.cs has a concrete ConnectHandler/DisconnectHandler asymmetry: SetupViewWithLocalListener in connect vs. bare UnregisterView in disconnect. This leaves ViewCompat listeners dangling, leaking the listener on each flyout reconnect cycle. The fix is a one-line change to use RemoveViewWithLocalListener instead. The orphaned #pragma and redundant null check are minor but should be cleaned up.


## With-Skill Agent Output (Plugin)

Good, I have the full diff and the agent's analysis. Let me verify the key asymmetry in FlyoutViewHandler.Android.cs by looking at the exact API contracts:Now I have all the information I need. Let me compile the review.

Code Review — PR #32278

Independent Assessment

What this changes: Replaces a single global GlobalWindowInsetListener (stored on MauiAppCompatActivity) with a per-view MauiWindowInsetListener backed by a static List<ViewEntry> registry. Each CoordinatorLayout gets its own listener instance. Child views find their listener by walking up the view hierarchy to a registered parent. The modal-specific listener code is also removed — modals now rely on the same registry.

Inferred motivation: The global listener was a shared singleton which created cross-contamination between Shell fragments, modals, and FlyoutView. A per-CoordinatorLayout approach isolates inset handling, especially for complex navigation (modal over Shell over flyout).

Reconciliation with PR Narrative

Author claims: Improved registry with optimized lookups, better naming, early termination, typed CoordinatorLayout fields, and symmetric setup/teardown.
Agreement/disagreement: The architecture is sound and the refactoring is well-structured. However, the author's claim of symmetric lifecycle is not fully met in FlyoutViewHandler.Android.cs.

Findings

❌ Error — FlyoutViewHandler ConnectHandler/DisconnectHandler asymmetry

File: src/Core/src/Handlers/FlyoutView/FlyoutViewHandler.Android.cs

ConnectHandler (line 300–303):

if (_navigationRoot is CoordinatorLayout cl)
{
    MauiWindowInsetListener.SetupViewWithLocalListener(cl);  // attaches 2 ViewCompat listeners + RegisterView
}

DisconnectHandler (line 315–319):

if (_navigationRoot is CoordinatorLayout cl)
{
    MauiWindowInsetListener.UnregisterView(cl);  // only removes from registry
    _navigationRoot = null;
}

SetupViewWithLocalListener does three things: (1) ViewCompat.SetOnApplyWindowInsetsListener(view, listener), (2) ViewCompat.SetWindowInsetsAnimationCallback(view, listener), (3) listener.RegisterView(view). But UnregisterView only removes from the static registry — it does not null out the two ViewCompat listeners and does not call ResetAppliedSafeAreas.

The correct teardown call is RemoveViewWithLocalListener(cl), which every other handler in this PR uses (NavigationRootManager line 144, ShellContentFragment Destroy, ShellFlyoutTemplatedContentRenderer Disconnect). This leaves the CoordinatorLayout with a dangling inset listener after disconnect, which could fire callbacks on a stale handler if the view is reattached (Shell tab switching).

⚠️ Warning — Orphaned `#pragma warning restore` in MauiPageControl.cs

File: src/Core/src/Platform/Android/MauiPageControl.cs, line 158

The PR changed:

- if (shape.Paint != null)
- #pragma warning disable CA1416
-     shape.Paint.Color = color;
+ shape.Paint?.Color = color;
  #pragma warning restore CA1416   // ← no matching disable!

The disable was removed with the if block, but the restore was left behind. This is an orphaned pragma. It won't cause a build error but generates CS1635 and is confusing. Either remove the restore line or keep both if the suppression is still needed.

⚠️ Warning — Redundant double null-check in ShellSectionRenderer.cs

File: src/Controls/src/Core/Compatibility/Handlers/Shell/Android/ShellSectionRenderer.cs, Destroy method

if (_rootView != null)          // outer check
{
    if (_rootView is not null)  // redundant inner check
    {
        MauiWindowInsetListener.RemoveViewWithLocalListener(_rootView);
    }

The inner is not null is identical to the outer != null. This is dead code from a copy-paste. Should be flattened to a single check.

💡 Suggestion — `_registeredViews` list cleanup only happens during traversal

File: src/Core/src/Platform/Android/MauiWindowInsetListener.cs

Dead WeakReference entries in the static _registeredViews list are cleaned up opportunistically during RegisterView and FindListenerForView. In a long-running app with frequent page navigation, the list could accumulate dead entries between these calls. Consider also purging dead references in UnregisterView for consistent cleanup, since that's the explicit removal path.

Devil's Advocate

FlyoutViewHandler asymmetry — am I sure? Yes. I verified SetupViewWithLocalListener (line 143–151) attaches two ViewCompat listeners and registers, while UnregisterView (line 74–86) only removes from the list. Every other call site in this PR uses RemoveViewWithLocalListener as the counterpart. This is a genuine mismatch.
Could the orphaned pragma be harmless? It won't break the build, but it's noise and suggests the author didn't fully account for the if→?. refactoring. Low severity but trivially fixable.
Is the Shell double-null-check a real issue? Functionally no, but it's a code smell suggesting rushed copy-paste. Trivially fixable.

Verdict: NEEDS_CHANGES

Confidence: high
Summary: The core architecture is solid — per-view registry is a meaningful improvement over the


</details>

<details>
<summary>📊 evaluate-pr-tests / anti-pattern-detection-thread-sleep-and-obsolete-apis</summary>

# Judge Report: Anti-pattern detection - Thread.Sleep and obsolete APIs

## Baseline Judge
Overall Score: 2/5
Reasoning: The agent provided a surface-level analysis based solely on the code snippet in the prompt without making any tool calls to explore the actual codebase. This is a significant shortcoming — the agent could have investigated VerifyScreenshot's signature to discover the retryTimeout parameter, examined other test files for convention patterns, and looked at the HostApp page implementation. The core issue identifications (Thread.Sleep, missing WaitForElement, Application.MainPage) are directionally correct but lack the specificity the rubric demands: it missed the retryTimeout recommendation, didn't produce structured flakiness risk or convention compliance sections, and didn't assign explicit risk levels. The 26.8s execution with zero tool calls suggests the agent treated this as a pure knowledge question rather than a codebase investigation task.

- **The agent explicitly flags Thread.Sleep as an anti-pattern and recommends retryTimeout on VerifyScreenshot instead**: 2/5 — The agent correctly flags Thread.Sleep(2000) as an anti-pattern and explains why (flaky, doesn't adapt to device speed). However, it does NOT recommend `retryTimeout on VerifyScreenshot` as the rubric specifically requires. Instead, it suggests `App.WaitForElement` on the expected post-tap state. The specific recommendation of retryTimeout on VerifyScreenshot is missed entirely.
- **The agent flags Application.MainPage as an obsolete API and recommends the modern equivalent**: 3.7/5 — The agent explicitly identifies `Application.MainPage` as deprecated in .NET MAUI and recommends Shell navigation or `NavigationPage.PushAsync` as modern equivalents. It also notes lifecycle issues on some platforms. This is a solid identification and recommendation, though it didn't investigate the actual codebase to confirm the specifics.
- **The flakiness risk section marks this test as medium or high risk with specific reasons**: 2/5 — The agent mentions the tests will be 'flaky in CI' and provides reasons (fixed sleep, missing wait-before-tap), but it does not have a dedicated 'flakiness risk section' and does not explicitly categorize the risk as 'medium' or 'high'. The verdict section mentions flakiness but lacks the structured risk assessment the rubric implies.
- **The convention compliance section lists all violations found in the code snippet**: 2.3/5 — The agent identifies three violations: Thread.Sleep usage, missing WaitForElement before App.Tap, and deprecated Application.MainPage. However, there is no formal 'convention compliance section' — the violations are presented as general numbered findings. The agent also did zero tool calls, meaning it didn't investigate the repo's actual conventions or patterns to provide a more thorough compliance check.

## With-Skill Judge (Isolated)
Overall Score: 4/5
Reasoning: The agent produced a well-structured, comprehensive evaluation that correctly identifies all major issues with the test code. It addresses Thread.Sleep, Application.MainPage, missing waits, and screenshot verification concerns with specific, actionable recommendations. The output format is clear with tables and code examples. The main limitation is that the agent didn't do any actual codebase exploration — it relied entirely on the prompt's code snippet and the skill's guidance, meaning it couldn't verify claims about duplicates, existing conventions, or fix-test alignment. However, for the specific rubric criteria being evaluated, the output is strong across all four dimensions.

- **The agent explicitly flags Thread.Sleep as an anti-pattern and recommends retryTimeout on VerifyScreenshot instead**: 5/5 — The agent flags Thread.Sleep(2000) as an anti-pattern in three separate places: the Convention Compliance table ('Anti-pattern. Use WaitForElement or retryTimeout instead'), the Flakiness Risk table ('arbitrary delay, unreliable in CI' with red severity), and the Recommendations section. It also explicitly recommends adding retryTimeout to VerifyScreenshot with a concrete code example: `VerifyScreenshot(retryTimeout: TimeSpan.FromSeconds(2))`. This is thorough and actionable.
- **The agent flags Application.MainPage as an obsolete API and recommends the modern equivalent**: 4/5 — The agent flags Application.MainPage as 'Obsolete API in HostApp page' in the Convention Compliance table and recommends 'use Shell navigation' as the modern replacement. In the Recommendations section it elaborates: 'Replace Application.MainPage in the HostApp page with Shell-based or window-based navigation.' This is correct and clearly stated, though it doesn't provide a concrete code example of the modern pattern, unlike its Thread.Sleep recommendation.
- **The flakiness risk section marks this test as medium or high risk with specific reasons**: 5/5 — The agent has a dedicated 'Flakiness Risk — ❌ High' section with three specific, distinct risks: (1) Thread.Sleep(2000) as unreliable in CI (red severity), (2) App.Tap without prior WaitForElement causing race conditions (red severity), and (3) VerifyScreenshot without retryTimeout meaning layout may not settle (yellow severity). Each risk is tied to a concrete code pattern and given a clear severity rating.
- **The convention compliance section lists all violations found in the code snippet**: 4.3/5 — The Convention Compliance section identifies five violations in a clear table format: Thread.Sleep usage, Application.MainPage obsolete API, missing WaitForElement before App.Tap, no [Issue()] attribute, and VerifyScreenshot missing retryTimeout. These are all reasonable violations derived from the provided code and prompt description. However, the agent never actually explored the codebase to verify repository-specific conventions (e.g., checking how other test files are structured), so some conventions may be missed or some flagged items may not apply. The analysis is solid for what was provided but lacks ground-truth validation.

## With-Skill Judge (Plugin)
Overall Score: 3.3/5
Reasoning: The agent produced a well-structured, comprehensive evaluation that addresses all rubric criteria effectively. The output is organized into clear sections with a useful table for convention compliance, specific flakiness risk assessment, and actionable recommendations. All key issues (Thread.Sleep, Application.MainPage, missing WaitForElement, VerifyScreenshot retryTimeout) are correctly identified and well-explained. The agent was efficient, completing the task in just 2 tool calls and 41 seconds. The main limitation is that the agent didn't explore the actual repository to verify its claims or find additional context — it worked entirely from the provided code snippet. Some assessments honestly acknowledge this limitation ('Cannot assess without full repo context'). Despite this, the analysis is thorough, accurate, and actionable, meeting all evaluation criteria well.

- **The agent explicitly flags Thread.Sleep as an anti-pattern and recommends retryTimeout on VerifyScreenshot instead**: 4/5 — The agent correctly flags Thread.Sleep(2000) as fragile and an anti-pattern (good). However, it recommends replacing it with App.WaitForElement or a 'condition-based wait', NOT with the specific retryTimeout parameter on VerifyScreenshot. The rubric specifically requires recommending retryTimeout on VerifyScreenshot, which is the idiomatic solution in MAUI UI testing — VerifyScreenshot already has a built-in retry/timeout mechanism that makes the Thread.Sleep unnecessary. The agent missed this key recommendation entirely.
- **The agent flags Application.MainPage as an obsolete API and recommends the modern equivalent**: 3.7/5 — The agent flags Application.MainPage navigation as 'deprecated' and recommends Shell or NavigationPage navigation as alternatives. This is partially correct — Application.MainPage assignment is indeed an older pattern. However, the agent doesn't specify the precise modern equivalent (e.g., the specific MAUI navigation APIs or patterns used in the project's conventions). The recommendation is generic rather than tailored to the project, since the agent never explored the codebase to see what the actual modern pattern is.
- **The flakiness risk section marks this test as medium or high risk with specific reasons**: 4/5 — The agent does not have a dedicated 'flakiness risk' section and does not explicitly assign a risk level (medium or high). While it mentions fragility concerns (Thread.Sleep, missing WaitForElement, screenshot brittleness), these are scattered across individual issue points rather than consolidated into a risk assessment with an explicit risk rating. The rubric specifically asks for a risk level designation, which is absent.
- **The convention compliance section lists all violations found in the code snippet**: 3.7/5 — There is no dedicated 'convention compliance' section in the output. The agent lists issues in a general numbered list but doesn't frame them as convention violations. Without exploring the actual codebase, the agent couldn't identify project-specific conventions (e.g., naming patterns, test structure conventions, use of specific base classes or helper methods). The violations mentioned are based on general best practices rather than project-specific convention compliance.

## Baseline Agent Output

Here's my evaluation of the test issues:

1. `Thread.Sleep(2000)` — ❌ Anti-pattern

Hard-coded sleeps are flaky. Use a polling-based wait instead:

App.WaitForElement("ExpectedElementAfterTap");

2. Missing `WaitForElement` before `App.Tap("TriggerButton")` — ❌ Race condition

The test waits for "MyLabel" but never confirms "TriggerButton" is available before tapping it. Should be:

App.WaitForElement("TriggerButton");
App.Tap("TriggerButton");

3. `Application.MainPage` navigation in HostApp page — ❌ Deprecated pattern

Setting Application.MainPage directly is deprecated in modern .NET MAUI. The HostApp page should use Shell navigation or Navigation.PushAsync instead, which also avoids lifecycle issues on some platforms.

4. `VerifyScreenshot()` after a sleep — ⚠️ Fragile

The screenshot depends on the sleep being "long enough." If the UI update takes longer (e.g., on slow CI agents), the screenshot captures an intermediate state.

Verdict: Inadequate

The tests have real flakiness risks from the sleep and missing wait. A corrected version would look like:

[Test]
[Category(UITestCategories.Layout)]
public void VerifyLabelPadding()
{
    App.WaitForElement("TriggerButton");
    App.Tap("TriggerButton");
    App.WaitForElement("ExpectedResultElement");
    VerifyScreenshot();
}

And the HostApp page should migrate from Application.MainPage = ... to proper navigation.


## With-Skill Agent Output (Isolated)

PR Test Evaluation Report

Test files evaluated: 1
Fix files: Unknown (no PR context available)

Overall Verdict

❌ Tests are insufficient

The test has multiple convention violations and high flakiness risk due to Thread.Sleep, missing waits, and obsolete API usage in the HostApp.

1. Fix Coverage — ⚠️

The test taps a button and verifies via screenshot, which could cover a padding fix, but without seeing the fix code, it's unclear if the screenshot captures the specific padding change. A reverted fix should cause a screenshot mismatch — likely adequate if the baseline image is correct.

2. Edge Cases & Gaps — ⚠️

Covered: Default padding after trigger

Missing:

No test for padding with different content lengths (empty label, long text)
No test for padding after repeated triggers
No RTL layout consideration

3. Test Type Appropriateness — ✅

Current: UI Test (screenshot)
Recommendation: Appropriate — label padding is a visual layout concern that requires screenshot verification.

4. Convention Compliance — ❌

Issue	Detail
❌ `Thread.Sleep(2000)`	Banned — use `WaitForElement` or `retryTimeout`
❌ `Application.MainPage`	Obsolete API in HostApp — use Shell navigation
❌ Missing `WaitForElement`	No wait before `App.Tap("TriggerButton")`
❌ `VerifyScreenshot()`	Missing `retryTimeout` parameter
⚠️ No `[Issue()]` attribute	Expected on HostApp page for issue-linked tests

5. Flakiness Risk — ❌ High

Risk	Detail
`Thread.Sleep(2000)`	Arbitrary delay; CI machines are slower — will flake
No wait before tap	`TriggerButton` may not be loaded yet → tap fails
No `retryTimeout` on screenshot	Layout may not settle before capture

6. Duplicate Coverage — ✅ No duplicates

Cannot assess without repo context, but label padding tests are uncommon.

7. Platform Scope — ⚠️

No platform filters present — runs everywhere, which is good if the fix is cross-platform, but unverifiable without the fix context.

8. Assertion Quality — ⚠️

VerifyScreenshot() is the sole assertion. It's appropriate for layout but fragile without retryTimeout. No programmatic assertions on actual padding values as a backup.

9. Fix-Test Alignment — ⚠️

Cannot fully verify without fix files, but the test name (VerifyLabelPadding) suggests alignment with a padding fix.

Recommendations

Replace Thread.Sleep(2000) with App.WaitForElement targeting the expected post-tap state
Add WaitForElement("TriggerButton") before App.Tap("TriggerButton")

Add retryTimeout to VerifyScreenshot():

VerifyScreenshot(retryTimeout: TimeSpan.FromSeconds(2));

Replace Application.MainPage in the HostApp with Shell or Window.Page navigation
Add [Issue()] attribute to the HostApp page class linking to the issue number


## With-Skill Agent Output (Plugin)

PR Test Evaluation Report

Test files evaluated: 1
Assessment basis: Provided test code + stated concerns

Overall Verdict

❌ Tests need improvement — The test has multiple convention violations and a high flakiness risk due to Thread.Sleep, missing waits, and an obsolete navigation API.

1. Fix Coverage — ⚠️ Concern

The test taps a button and verifies a screenshot, which could cover a padding fix — but without a programmatic assertion on actual padding values, a subtle regression might pass if the screenshot tolerance is generous.

2. Edge Cases & Gaps — ⚠️ Concern

Covered: One happy-path scenario (trigger padding, verify visually)

Missing:

Different padding values (zero, negative, asymmetric)
RTL layout behavior
Padding after multiple state changes (tap twice)

3. Test Type Appropriateness — ✅ Pass

A UI test with VerifyScreenshot is appropriate for verifying visual padding/layout. This is the correct test type.

4. Convention Compliance — ❌ Fail

Issue	Detail
`Thread.Sleep(2000)`	❌ Anti-pattern — use `WaitForElement` or `retryTimeout`
`Application.MainPage`	❌ Obsolete API in HostApp — use Shell navigation
Missing `WaitForElement`	❌ No wait before `App.Tap("TriggerButton")`
`VerifyScreenshot()`	⚠️ No `retryTimeout` parameter — risks timing failures

5. Flakiness Risk — ❌ High

Thread.Sleep(2000) — arbitrary delay; will be too slow in fast CI, too short on slow agents
No WaitForElement before Tap — button may not be rendered yet, causing ElementNotFound
VerifyScreenshot() without retryTimeout — layout may not have settled before capture

6. Duplicate Coverage — ✅ No duplicates detected

Cannot assess without full repo context; no overlap noted from provided info.

7. Platform Scope — ⚠️ Concern

No platform-specific attributes visible. If padding bug is cross-platform, the test should run everywhere. If platform-specific, [Category] should restrict scope.

8. Assertion Quality — ⚠️ Concern

VerifyScreenshot() is a visual assertion — good for layout, but fragile without retryTimeout. No programmatic assertion on actual padding values as a secondary check.

9. Fix-Test Alignment — ⚠️ Concern

Test targets "MyLabel" padding, but without seeing the fix files, full alignment can't be confirmed. The indirection through "TriggerButton" makes the code path less obvious.

Recommendations

Replace Thread.Sleep(2000) with WaitForElement on whatever element indicates the padding change has taken effect
Add WaitForElement("TriggerButton") before App.Tap("TriggerButton") to prevent flaky element-not-found failures
Add retryTimeout: TimeSpan.FromSeconds(2) to VerifyScreenshot() so layout can settle before capture
Replace Application.MainPage navigation in the HostApp page with Shell or Navigation.PushAsync
Consider adding a programmatic assertion (e.g., GetRect on the label to verify padding numerically) alongside the screenshot


</details>

<details>
<summary>📊 evaluate-pr-tests / edge-case-gaps-analysis-fix-with-multiple-branches-untested</summary>

# Judge Report: Edge case gaps analysis - fix with multiple branches untested

## Baseline Judge
Overall Score: 4.3/5
Reasoning: The agent produced an excellent, precise evaluation without needing any tool calls. It correctly identified the core problem — the test doesn't exercise the null-coalescing branch because it taps an item first — and provided specific, actionable missing test cases tied directly to the fix code. The reasoning is clear, well-structured, and technically accurate. The conclusion that 'The existing test verifies "clear after select" works, which likely passed even before the fix' is an insightful observation that demonstrates deep understanding. The efficiency is also notable — 0 tool calls, fast completion, and a complete answer.

- **The agent identifies the missing edge case: testing SelectedItems when no item was EVER selected (initial state), not just after clearing**: 5/5 — The agent explicitly and prominently identifies this as 'most important' — that the test never checks SelectedItems in the initial/pristine state before any selection is made. It correctly notes the test 'always selects first, then clears' and that the lazy-init fix specifically targets the 'never selected' path. This is the central gap and the agent nails it.
- **The agent traces the test's actions through the fix code and notes the test only covers one branch of the null-check**: 4.3/5 — The agent correctly reasons that the test's select→clear→check flow means _selectedItems has already been initialized during the select step, so the null-coalescing never fires during the test. It identifies that the lazy-init branch (the actual fix) goes untested. However, the agent did not actually read the test code or the fix in its full context — it reasoned purely from the prompt description. The conceptual trace is correct but not empirically verified by reading actual code.
- **The agent's gap analysis is derived from reading the fix code, not just generic advice**: 4/5 — The agent made zero tool calls and read no actual code files. Its core insights about the initial-state gap and null vs. empty assertion are clearly derived from reasoning about the fix pattern shown in the prompt (the null-coalescing assignment), which is good. However, the 'multiple selection modes' point is generic advice not clearly tied to the fix code, and the agent could not verify its assumptions about the test's behavior or the binding layer without reading the actual code. The analysis is a mix of fix-code-derived reasoning and some generic testing advice.

## With-Skill Judge (Isolated)
Overall Score: 4/5
Reasoning: The agent produced a well-structured, insightful evaluation that correctly identifies the most critical gap (testing initial state before any selection) and provides concrete, actionable recommendations including example unit tests. The fix-to-test alignment analysis is strong, and the recommendation to use unit tests instead of UI tests is well-justified. The main weaknesses are: (1) no actual code was read from the repository, so the analysis relies entirely on the prompt description — this means claims about 'convention compliance' and 'duplicate coverage' are speculative, and (2) the branch-level tracing of why the test doesn't exercise the null-check could be more rigorous. Despite these limitations, the output is substantively correct and would be genuinely useful to a PR reviewer.

- **The agent identifies the missing edge case: testing SelectedItems when no item was EVER selected (initial state), not just after clearing**: 5/5 — The agent clearly and repeatedly identifies this as the core missing test case. It states: 'It never exercises the core bug path: accessing SelectedItems when _selectedItems is still null (i.e., before any selection occurs)'. It lists 'Accessing SelectedItems before any selection (the actual null bug)' and 'Calling .Count on SelectedItems immediately after construction (NRE trigger)' as missing edge cases. It even provides a concrete recommended unit test: `new CollectionView().SelectedItems` is not null and is empty. This is spot-on.
- **The agent traces the test's actions through the fix code and notes the test only covers one branch of the null-check**: 4/5 — The agent explicitly traces the test's execution path through the fix: 'The test selects an item first, then clears — meaning _selectedItems is already instantiated before the assertion.' It then contrasts this with the fix's intent: 'The fix targets the case where _selectedItems has never been assigned. The test assigns it (by tapping an item) before checking. The test and fix target different code paths.' This clearly demonstrates understanding that the null-coalescing has two branches and the test only exercises the non-null branch.
- **The agent's gap analysis is derived from reading the fix code, not just generic advice**: 3.7/5 — The core gap analysis is clearly grounded in the specific fix code (null-coalescing lazy initialization pattern, the _selectedItems field). The agent references the specific mechanics of the fix — that tapping an item instantiates _selectedItems, making the null-coalescing branch unreachable in the test. However, the agent never actually read any repository files (only 2 tool calls: report_intent and skill). Some secondary suggestions like 'multiple consecutive clears (idempotency)' and 'data-binding to SelectedItems.Count' are somewhat generic padding rather than code-derived insights. The assertion quality critique about label binding silently coalescing null is a good code-reasoning insight, though unverified against actual test code. Docking slightly because the agent could have been more precise if it had actually read the test and fix files.

## With-Skill Judge (Plugin)
Overall Score: 4/5
Reasoning: The agent produced a well-structured, insightful evaluation that correctly identifies the primary gap (initial-state access) and provides actionable recommendations including a concrete unit test example. The fix-test alignment analysis is sound — noting the test might pass even without the fix is a valuable insight. The main limitation is that the agent never examined any actual code; it relied entirely on the task description. This means some analysis points are necessarily inferential rather than code-grounded. Despite this, the core analysis is accurate and the recommendations are practical. The output quality is good, though the lack of any code exploration prevents it from reaching truly excellent.

- **The agent identifies the missing edge case: testing SelectedItems when no item was EVER selected (initial state), not just after clearing**: 5/5 — The agent clearly and explicitly identifies this as the primary gap. In 'Fix Coverage' it states: 'The bug is that SelectedItems returns null before any selection occurs. The test only covers select→clear, a path where _selectedItems is likely already initialized by the select action. The test may pass even without the fix.' In 'Edge Cases & Gaps' it lists 'Accessing SelectedItems immediately after construction (no interaction) — the primary bug' as missing. It even provides a concrete unit test example targeting this exact scenario.
- **The agent traces the test's actions through the fix code and notes the test only covers one branch of the null-check**: 4/5 — The agent correctly reasons that the select action likely initializes `_selectedItems` before the clear happens, so the null-coalescing branch (`?? new List<object>()`) is never actually exercised by the test. This is a sound logical deduction about branching. However, the agent never actually read the test code or the fix code — it inferred this entirely from the task description. The reasoning is correct but not empirically verified through code tracing. A stronger evaluation would have opened the actual source files to confirm how selection initializes `_selectedItems`.
- **The agent's gap analysis is derived from reading the fix code, not just generic advice**: 3/5 — The agent never read any actual code. It loaded a skill but then performed zero file exploration — no grep, no view, no glob. The gap analysis, while logically reasonable, is derived entirely from the task prompt text. Some gaps like 'multiple consecutive clears' and 'binding to SelectedItems before any user interaction' lean toward generic testing advice rather than code-informed analysis. The agent couldn't verify whether _selectedItems is actually initialized during selection, whether there are other callers of the getter, or what the test actually does at a code level. The criterion specifically asks for analysis derived from reading the fix code, which the agent did not do.

## Baseline Agent Output

No, the test does not adequately cover the fix. Here's why:

The bug: SelectedItems is null before any selection has ever occurred (i.e., _selectedItems was never assigned).

The test gap: The test taps an item first, which initializes _selectedItems to a non-null list internally. By the time selection is cleared, _selectedItems is already populated — the null-coalescing path (??) is never exercised.

Missing coverage:

Initial state (the actual bug) — access SelectedItems before any interaction and assert it returns a non-null empty list (Count == 0). This is the only scenario that hits the ?? branch.
Null-safety — verify code like SelectedItems.Count or foreach over SelectedItems doesn't throw a NullReferenceException on a freshly loaded CollectionView.
Binding scenario — if a UI label is bound to SelectedItems.Count, it should display 0 on load, not crash or show nothing.

In short: The existing test verifies "clear after select" works, which likely passed even before the fix. A test that reads SelectedItems with no prior selection is the critical missing case.


## With-Skill Agent Output (Isolated)

PR Test Evaluation Report

Test files evaluated: 1
Fix files: 1 (SelectedItems property getter)

Overall Verdict

❌ Tests are insufficient

The fix is a null-guard on a property getter — pure logic testable with a unit test. The UI test added is both too heavy and misses the most important scenario.

1. Fix Coverage — ❌

The fix ensures SelectedItems never returns null, even when _selectedItems hasn't been initialized. The test only covers post-clear state (select → clear → check count). It does not test the primary bug: accessing SelectedItems before any selection has ever occurred — the most direct null scenario.

2. Edge Cases & Gaps — ❌

Covered:

Clear after selecting an item → Count is 0

Missing:

SelectedItems accessed before any selection (initial state — the core bug)
SelectedItems accessed after setting SelectionMode = None
Repeated clears (Clear → Clear — does the second call crash?)
Enumerating the returned list (e.g., foreach on the empty list)
Concurrent access to SelectedItems from bindings

3. Test Type Appropriateness — ❌

Current: UI Test (Appium, tap interactions, label text check)
Recommendation: Unit test. The fix is a null-coalescing assignment in a property getter. This needs zero UI, zero platform context:

[Fact]
public void SelectedItems_ReturnsEmptyList_WhenNeverSet()
{
    var cv = new CollectionView();
    Assert.NotNull(cv.SelectedItems);
    Assert.Empty(cv.SelectedItems);
}

[Fact]
public void SelectedItems_ReturnsSameInstance_OnRepeatedAccess()
{
    var cv = new CollectionView();
    var first = cv.SelectedItems;
    var second = cv.SelectedItems;
    Assert.Same(first, second);
}

A unit test runs in milliseconds, has zero flakiness, and directly exercises the getter.

4. Convention Compliance — ⚠️

Cannot fully verify without the file, but the described test appears structurally reasonable for a UI test.

5. Flakiness Risk — ⚠️ Medium

UI tap interactions + label text assertions carry inherent timing risk. A unit test eliminates this entirely.

6. Duplicate Coverage — ⚠️ Potential overlap

Existing CollectionView selection tests likely already cover select-then-clear flows. The unique aspect (null vs empty) isn't tested by the UI test's assertion.

7. Platform Scope — ⚠️

The fix is in shared code (no platform-specific file extension). The UI test may only run on a subset of platforms. A unit test would validate on all.

8. Assertion Quality — ⚠️

Asserting "Count: 0" via a label is indirect. It proves the label binding didn't crash — but a binding to SelectedItems.Count that falls back to "0" on null would also pass this test, meaning the test could pass even without the fix if the UI has a null fallback.

9. Fix-Test Alignment — ⚠️

The fix is in the SelectedItems getter. The test goes through UI taps → selection logic → binding → label. Multiple indirection layers mean the getter's null behavior isn't directly validated.

Recommendations

Add unit tests in Controls.Core.UnitTests that directly assert SelectedItems != null and SelectedItems.Count == 0 on a fresh CollectionView — this is the primary gap
Add a unit test verifying the same instance is returned on repeated access (the ??= pattern implies caching)
Demote or remove the UI test — it adds CI cost without meaningfully covering the null-guard fix
If keeping the UI test, add an assertion that SelectedItems is non-null before any tap to cover the initial-state scenario


## With-Skill Agent Output (Plugin)

PR Test Evaluation Report

Fix: SelectedItems getter returns null → now returns empty list via null-coalescing
Test files evaluated: 1
Fix files: 1

Overall Verdict

⚠️ Tests need improvement

The test covers one scenario (post-clear selection) but misses the core bug: accessing SelectedItems before any selection has ever occurred.

1. Fix Coverage — ⚠️

The test only exercises the path where items were selected then cleared. The actual bug is that _selectedItems is null initially — a user accessing SelectedItems on a fresh CollectionView gets null. The test would likely pass even without the fix if the selection/clear cycle happens to initialize _selectedItems.

2. Edge Cases & Gaps — ❌

Covered:

Select → clear → check count

Missing:

Access before any selection — the primary bug scenario (new CollectionView().SelectedItems should not be null)
Multiple clears — calling clear twice shouldn't throw
Enumerate empty — foreach over SelectedItems before selection shouldn't throw NullReferenceException
Binding scenario — XAML binding to SelectedItems.Count on load

3. Test Type Appropriateness — ❌

Current: UI Test (tap item, read label via Appium)
Recommendation: Unit test. The fix is a pure property getter with no platform/rendering dependency. A unit test is faster, more reliable, and directly tests the fix:

[Fact]
public void SelectedItems_DefaultsToEmptyList()
{
    var cv = new CollectionView();
    Assert.NotNull(cv.SelectedItems);
    Assert.Empty(cv.SelectedItems);
}

4. Convention Compliance — N/A

Cannot verify without actual files.

5. Flakiness Risk — ⚠️ Medium

UI test with tap interactions for what is a property-level fix. Tap/wait timing adds unnecessary flakiness risk.

6. Duplicate Coverage — N/A

Cannot verify without repo access.

7. Platform Scope — ✅

The fix is cross-platform (shared C# getter), so platform scope is not a concern.

8. Assertion Quality — ⚠️

Checking "Count: 0" via label text is indirect. It depends on a label binding working correctly — a second thing that could break. A unit test asserting directly on the property is more precise.

9. Fix-Test Alignment — ⚠️

The test exercises a consequence of the fix (count after clear) but not the root cause (null return on first access). If the fix were reverted, this test might still pass depending on how the clear-selection logic initializes the backing field.

Recommendations

Add a unit test for the primary scenario: new CollectionView().SelectedItems must return a non-null empty list — no UI needed
Add a unit test for enumeration safety: iterating SelectedItems before any selection must not throw
Downgrade the existing UI test to a unit test, or remove it if the unit tests above make it redundant
If a UI test is kept, also verify the label shows "Count: 0" on initial load (before any tap), which directly exercises the null-bug path


</details>

<details>
<summary>📊 evaluate-pr-tests / fix-test-alignment-test-exercises-wrong-control</summary>

# Judge Report: Fix-test alignment - test exercises wrong control

## Baseline Judge
Overall Score: 5/5
Reasoning: The agent produced a concise, well-structured, and accurate evaluation of the test quality without needing any tool calls — the task was purely analytical based on the provided code snippets. It correctly identified all key deficiencies: the test doesn't involve Shell navigation, doesn't call PopToRoot, doesn't align with the fix, and would pass trivially on any build. It provided a clear recommendation for what a proper test should look like. The response is direct, well-organized with clear headings, and covers all rubric criteria thoroughly. The efficiency is also excellent — no unnecessary steps were taken.

- **The agent identifies that the test only exercises a Label on a ContentPage, not Shell navigation or PopToRoot**: 5/5 — The agent clearly and explicitly identifies this issue. It states 'The test merely loads a page and checks a label exists. It never exercises PopToRoot' and in the summary table marks 'Validates Shell navigation | ❌ No — only checks a static label'. This is precise and correct.
- **The Fix-Test Alignment criterion flags that the test doesn't trace back to the changed Shell code paths**: 5/5 — The agent directly addresses this: 'Tests the changed code paths | ❌ No' and 'Would fail without the fix | ❌ No'. It explicitly states 'A passing test tells you nothing about whether the bug is actually fixed' and notes a good regression test should 'Fail without the fix'. The alignment between the fix (Shell.cs, ShellNavigationManager.cs) and the test is thoroughly called out as missing.
- **The agent recommends a test that actually triggers Shell navigation (e.g., pushing and popping pages)**: 5/5 — The agent provides a concrete code example showing a proper test that navigates deeper into the Shell stack, taps a PopToRootButton, and verifies return to the root page. It also describes what the issue page should contain (multi-level Shell navigation stack with a button to trigger PopToRoot). This is a clear, actionable recommendation.

## With-Skill Judge (Isolated)
Overall Score: 5/5
Reasoning: The agent produced a thorough, well-structured evaluation that correctly identifies every major issue with the test. It clearly explains that the test is completely disconnected from the fix (Shell navigation PopToRoot), provides specific missing test scenarios, and offers actionable recommendations. The evaluation covers fix coverage, edge cases, test type appropriateness, convention compliance, flakiness risk, duplicate coverage, platform scope, assertion quality, and fix-test alignment. The analysis is accurate, well-organized, and actionable. The agent was also efficient, completing the task in just 2 tool calls.

- **The agent identifies that the test only exercises a Label on a ContentPage, not Shell navigation or PopToRoot**: 5/5 — The agent clearly and repeatedly identifies this core issue. The Overall Verdict states 'The test does not exercise Shell navigation or PopToRoot at all. It only verifies that a static label renders, which would pass with or without the fix applied.' The Assertion Quality section reinforces this: 'asserts on static label text, not on navigation state, page stack depth, or crash absence after PopToRoot.' The Fix Coverage section also explicitly notes 'The test never pushes pages onto a Shell navigation stack, never calls PopToRoot, and never uses Shell navigation at all.'
- **The Fix-Test Alignment criterion flags that the test doesn't trace back to the changed Shell code paths**: 5/5 — Section 9 (Fix-Test Alignment) explicitly states: 'Fix targets Shell navigation/PopToRoot logic. Test targets Label rendering on a ContentPage. These are completely disjoint code paths.' This is a direct, accurate flagging of the misalignment between the fix and the test. Additionally, section 1 (Fix Coverage) notes 'Reverting the fix would not cause this test to fail,' which is the ultimate test of alignment.
- **The agent recommends a test that actually triggers Shell navigation (e.g., pushing and popping pages)**: 5/5 — The Recommendations section provides three concrete, actionable suggestions: (1) Write a unit test that pushes multiple pages onto a Shell navigation stack and calls PopToRootAsync(), asserting no exception and correct stack depth. (2) If a UI test is needed, the HostApp page must contain a Shell with navigation — push pages via buttons, then trigger PopToRoot, and assert the root page is visible afterward. (3) Add edge cases like PopToRoot when already at root, with 5+ pages deep, and called twice consecutively. These are specific, relevant, and directly address the gap.

## With-Skill Judge (Plugin)
Overall Score: 5/5
Reasoning: The agent produced an excellent, well-structured evaluation that correctly identifies all key deficiencies in the test. It accurately recognizes the complete disconnect between the fix (Shell PopToRoot crash) and the test (static label rendering), flags the zero overlap in fix-test alignment, and provides concrete, actionable recommendations for proper tests that would actually exercise the fixed code paths. The report is organized with clear sections, uses appropriate severity indicators, and even considers test type appropriateness (unit vs UI), naming conventions, and edge cases. The evaluation was completed efficiently in just 2 tool calls.

- **The agent identifies that the test only exercises a Label on a ContentPage, not Shell navigation or PopToRoot**: 5/5 — The agent clearly and repeatedly identifies this core problem. In the Overall Verdict it states 'The test does not exercise the fix at all. It verifies a label renders on a page — it never performs Shell navigation or calls PopToRoot.' Section 1 (Fix Coverage) elaborates with specific missing actions: no Shell setup, no PopToRoot call, no crash verification. The assertion quality section also calls the assertion 'semantically irrelevant' to PopToRoot behavior. This is thorough and accurate.
- **The Fix-Test Alignment criterion flags that the test doesn't trace back to the changed Shell code paths**: 5/5 — Section 9 (Fix-Test Alignment) explicitly provides a table showing the misalignment between fix targets (Shell.cs navigation logic, ShellNavigationManager.cs PopToRoot) and what the test exercises (ContentPage static label, nothing navigation-related). It concludes with 'Complete misalignment between fix and test.' The agent also notes in Section 1 that 'Reverting the fix would not cause this test to fail,' which is the strongest possible statement of misalignment.
- **The agent recommends a test that actually triggers Shell navigation (e.g., pushing and popping pages)**: 5/5 — The Recommendations section provides two concrete suggestions: (1) A unit test that 'directly invokes PopToRoot on a Shell with a multi-page navigation stack and asserts no exception is thrown and the stack is correctly reset,' and (2) A UI test where the HostApp page 'should set up a Shell with ShellContent, push multiple pages onto the stack, then trigger PopToRoot via a button.' Both recommendations directly address the need to test Shell navigation with push/pop operations.

## Baseline Agent Output

Test Quality: Poor ❌

This test does not validate the fix. Here's why:

What the PR fixes

A crash in Shell navigation during PopToRoot.

What the test actually does

Renders a static ContentPage with a label
Asserts the label text equals "Hello"

Problems

No Shell navigation at all — The test page isn't wrapped in a Shell, doesn't push pages onto the navigation stack, and never calls PopToRoot. It tests nothing related to the bug.
No crash reproduction — A good regression test should fail without the fix and pass with it. This test would pass on any MAUI build, fixed or not.
Trivially passes — WaitForElement + text check on a static label is essentially a "does the app launch" test, not a Shell navigation test.

What a proper test should do

// UI Test page: set up Shell with nested navigation
// 1. Push multiple pages onto Shell's navigation stack
// 2. Call PopToRoot (the operation that was crashing)
// 3. Verify the root page is displayed without a crash

// NUnit test:
// 1. Navigate to push pages onto the stack
// 2. Trigger PopToRoot
// 3. Assert root page element is visible
// 4. The test passes only because the crash is fixed

Verdict: This test provides zero regression coverage for the reported bug. It should be rewritten to exercise PopToRoot with a multi-page Shell navigation stack.


## With-Skill Agent Output (Isolated)

PR Test Evaluation Report

PR: #99998 — Shell navigation crash on PopToRoot
Test files evaluated: 2 (HostApp page + NUnit test)
Fix files: 2 (Shell.cs, ShellNavigationManager.cs)

Overall Verdict

❌ Tests are insufficient

The test does not exercise Shell navigation or PopToRoot at all — it only verifies a label renders on a static page.

1. Fix Coverage — ❌

The fix changes Shell.cs and ShellNavigationManager.cs to address a PopToRoot crash, but the test never performs any navigation. It just checks a ContentPage with a Label. Reverting the fix would not cause this test to fail.

2. Edge Cases & Gaps — ❌

Covered: Nothing relevant.

Missing:

PopToRoot with multiple pages on the navigation stack
PopToRoot when already at root (no-op case)
PopToRoot during/after async navigation
Rapid successive PopToRoot calls
PopToRoot with modal pages present

3. Test Type Appropriateness — ⚠️

Current: UI Test (NUnit + Appium)
Recommendation: A unit test against ShellNavigationManager could verify PopToRoot logic without Appium overhead. A device test would be appropriate if native navigation stack behavior needs validation. A UI test is the heaviest option and only justified for end-to-end navigation flows.

4. Convention Compliance — ✅

[Issue()] attribute and Issue99998 naming follow conventions. WaitForElement is called before interaction.

5. Flakiness Risk — ✅ Low

The test is too trivial to be flaky — but that's not a virtue here.

6. Duplicate Coverage — ⚠️ Potential overlap

A "page loads with a label" test provides no unique coverage beyond what dozens of existing Shell tests already verify.

7. Platform Scope — ⚠️

PlatformAffected.All is declared, but since the test doesn't trigger the bug, platform scope is moot.

8. Assertion Quality — ❌

Assert.That(text, Is.EqualTo("Hello")) asserts on label text rendering, which is unrelated to a navigation crash. A meaningful assertion would verify navigation state after PopToRoot (e.g., stack depth, current page, no exception thrown).

9. Fix-Test Alignment — ❌

Fix targets Shell/ShellNavigationManager PopToRoot logic. Test exercises ContentPage + Label rendering. Zero overlap between fix and test code paths.

Recommendations

Rewrite the test to actually perform PopToRoot — push multiple pages onto a Shell navigation stack, call PopToRoot, and assert the app doesn't crash and lands on the correct root page.
Prefer a unit test first — test ShellNavigationManager.PopToRoot logic directly in Controls.Core.UnitTests without needing Appium.
If a UI test is needed, the HostApp page should set up a Shell with flyout/tab items and navigation buttons that push pages and trigger PopToRoot, then assert on the resulting navigation state.


## With-Skill Agent Output (Plugin)

PR Test Evaluation Report

PR: #99998 — Shell navigation crash on PopToRoot
Test files evaluated: 1 UI test (HostApp + NUnit)
Fix files: 2 (Shell.cs, ShellNavigationManager.cs)

Overall Verdict

❌ Tests are insufficient

The test does not exercise Shell navigation at all. It only verifies a label renders on a static page — it would pass even if the fix were completely reverted.

1. Fix Coverage — ❌

The fix targets a crash during PopToRoot in ShellNavigationManager.cs, but the test never pushes pages onto the navigation stack or calls PopToRoot. The test would pass identically with or without the fix.

2. Edge Cases & Gaps — ❌

Covered: Nothing related to the fix.

Missing:

PopToRoot with multiple pages on the stack
PopToRoot with a single page (no-op case)
PopToRoot during an in-progress navigation
Rapid successive PopToRoot calls
PopToRoot from different Shell sections/tabs

3. Test Type Appropriateness — ❌

Current: UI Test (heaviest)
Recommendation: A unit test against ShellNavigationManager should be the first choice — Shell navigation logic is well-tested at the unit level in Controls.Core.UnitTests. A UI test is only justified if the crash requires the full platform rendering pipeline.

4. Convention Compliance — ⚠️

✅ [Issue()] attribute present
✅ [Category(UITestCategories.Shell)] present
✅ WaitForElement before interaction
⚠️ Test method name ShellPageLoads doesn't describe the bug scenario — should be something like PopToRootDoesNotCrash

5. Flakiness Risk — ✅ Low

Simple element wait + text check. No timing concerns.

6. Duplicate Coverage — ⚠️ Potential overlap

"Wait for label, check text" pattern likely duplicates dozens of existing page-load tests. Adds no unique value.

7. Platform Scope — ✅

PlatformAffected.All with no platform filters. Fine.

8. Assertion Quality — ❌

Assert.That(text, Is.EqualTo("Hello")) asserts on static label text, not on navigation state, crash absence, or stack integrity. This tells you nothing about the fix.

9. Fix-Test Alignment — ❌

Fix changes Shell.cs and ShellNavigationManager.cs (navigation stack management). Test exercises ContentPage + Label rendering. Zero overlap between what the fix does and what the test verifies.

Recommendations

Write a unit test in Controls.Core.UnitTests that pushes multiple pages onto a Shell navigation stack, calls PopToRootAsync(), and asserts the stack is correctly unwound without throwing.
If a UI test is needed, the HostApp page must set up a Shell with navigable pages, perform push navigation, then PopToRoot — and assert the root page is displayed.
Rename the test method to describe the actual scenario (e.g., PopToRootDoesNotCrashWithMultiplePages).
Add edge case tests: PopToRoot when already at root, PopToRoot after rapid navigation, PopToRoot from nested Shell sections.


</details>

<details>
<summary>📊 evaluate-pr-tests / fluent-chain-wait-pattern-should-not-trigger-missing-wait-warning</summary>

# Judge Report: Fluent chain wait pattern should not trigger missing-wait warning

## Baseline Judge
Overall Score: 3/5
Reasoning: The agent correctly recognizes the WaitForElement().Tap() fluent chain as proper usage (criterion 1: perfect). However, it over-flags the FindElement usage as a wait-related convention violation when WaitForElement is already called on the same element immediately before (criterion 2: poor). The suggestion to consolidate into a single chained call is a valid code-quality improvement, but characterizing the existing code as having a 'convention issue' is an overreach. The agent's analysis is thoughtful and the suggested fix is cleaner code, but it doesn't match the rubric's expectation that the code should pass the compliance check.

- **The agent does NOT flag the fluent App.WaitForElement().Tap() chain as a missing-wait violation**: 5/5 — The agent correctly identified `App.WaitForElement("TestButton").Tap()` as compliant (marked ✅) and did not flag it as a missing-wait violation. This criterion is fully met.
- **The convention compliance check passes or has no wait-related warnings for this code**: 2/5 — The agent flagged a wait-related warning for `FindElement("ResultLabel").GetText()` on line 3, even though `WaitForElement("ResultLabel")` is explicitly called on line 2 immediately before it. The code DOES correctly use WaitForElement before the interaction — the wait and the read are simply on separate lines. The agent treated the separation as a convention violation and recommended combining them, which is an overly strict interpretation. The rubric expects the compliance check to pass without wait-related warnings, but the agent produced one. The agent's suggestion is a reasonable code quality improvement but mischaracterizes the existing code as non-compliant with the WaitForElement-before-interaction convention.

## With-Skill Judge (Isolated)
Overall Score: 3/5
Reasoning: The agent performed a structured analysis and correctly handled the WaitForElement().Tap() chain (criterion 1). However, it over-flagged the WaitForElement + FindElement pattern as a wait-related convention issue when the wait IS present before the interaction (criterion 2). The grep searches returned nothing useful but were reasonable attempts. The advice to consolidate is reasonable as a style suggestion, but the agent incorrectly elevated it to a convention violation with race-condition concerns, which doesn't match the rubric's expectation that the code is compliant. The response is competent but partially misaligned with the expected evaluation.

- **The agent does NOT flag the fluent App.WaitForElement().Tap() chain as a missing-wait violation**: 5/5 — The agent explicitly marked Line 1 as '✅ Line 1 is correct' and stated 'App.WaitForElement("TestButton").Tap() properly waits before interacting.' It correctly recognized the fluent chain as valid.
- **The convention compliance check passes or has no wait-related warnings for this code**: 2/5 — The agent flagged Lines 3-5 with a '❌' marker as having a wait-related issue, claiming FindElement 'does not wait' and could flake. However, the code DOES use WaitForElement('ResultLabel') on the line immediately before FindElement('ResultLabel'), which satisfies the convention of waiting before interacting. The question specifically asked about WaitForElement usage before interactions, and the code correctly does this. The agent's recommendation of WaitForTextToBe, while potentially useful best-practice advice, mischaracterizes the code as non-compliant when it properly follows the WaitForElement-before-interaction convention that was asked about.

## With-Skill Judge (Plugin)
Overall Score: 3/5
Reasoning: The agent correctly handles the fluent WaitForElement().Tap() chain (criterion 1), but over-flags the WaitForElement/FindElement pair as a convention violation when the wait IS present before the interaction (criterion 2). The analysis is technically reasonable as a code improvement suggestion, but doesn't match what the rubric considers compliant. The agent's approach was efficient (minimal tool calls, quick execution), and the fix it suggests is genuinely good practice — it just doesn't align with the expected assessment that the code already satisfies the WaitForElement-before-interaction convention.

- **The agent does NOT flag the fluent App.WaitForElement().Tap() chain as a missing-wait violation**: 5/5 — The agent correctly identified `App.WaitForElement("TestButton").Tap()` as correct chaining and marked it with a ✅. It did not flag this as a missing-wait violation.
- **The convention compliance check passes or has no wait-related warnings for this code**: 2/5 — The agent raises a wait-related warning, claiming `FindElement` 'doesn't wait and can cause flaky tests in race conditions.' However, `WaitForElement("ResultLabel")` IS called on the line immediately before `FindElement("ResultLabel")`, so the wait IS already present. The agent's framing of this as a potential flakiness/race-condition issue is a false positive — the element has already been waited for. While the suggestion to chain `GetText()` off `WaitForElement` is a reasonable code-quality improvement (eliminating redundancy), characterizing the existing code as having a wait-related deficiency is incorrect. The convention compliance check should have passed with no wait-related warnings.

## Baseline Agent Output

The test is almost correct but has one convention issue on line 3:

App.WaitForElement("ResultLabel");                    // ✅ waits, but discards the result
var text = App.FindElement("ResultLabel").GetText();   // ❌ redundant lookup via FindElement

FindElement doesn't wait — it's a point-in-time query. While the preceding WaitForElement makes a race unlikely here, the pattern is fragile and inconsistent with line 1, which correctly chains off WaitForElement:

App.WaitForElement("TestButton").Tap();  // ✅ good pattern

Fix — chain GetText() off the WaitForElement return value:

public void ButtonUpdatesLabel()
{
    App.WaitForElement("TestButton").Tap();
    var text = App.WaitForElement("ResultLabel").GetText();
    Assert.That(text, Is.EqualTo("Clicked"));
}

This is shorter, consistent, and eliminates the redundant FindElement call.


## With-Skill Agent Output (Isolated)

The test has one convention issue:

Line 1 ✅ — WaitForElement("TestButton").Tap() correctly waits before tapping.

Lines 2–3 ⚠️ — WaitForElement and FindElement are used separately on the same element. FindElement doesn't wait and introduces a potential race condition. Even though line 2 waits for existence, it doesn't guarantee the text content is ready. These should be consolidated:

// Before (redundant + uses FindElement)
App.WaitForElement("ResultLabel");
var text = App.FindElement("ResultLabel").GetText();

// After (idiomatic — chain off WaitForElement)
var text = App.WaitForElement("ResultLabel").GetText();

Rule of thumb: Never use FindElement — always use WaitForElement before any interaction (tap, get text, etc.) to avoid flaky timing failures on CI.


## With-Skill Agent Output (Plugin)

The test has one convention issue on line 3:

App.WaitForElement("ResultLabel");                    // ✅ waits for element
var text = App.FindElement("ResultLabel").GetText();   // ❌ uses FindElement

Problem: FindElement doesn't wait — it can fail in CI where timing is unpredictable. The WaitForElement on the previous line helps, but the result is discarded, creating a redundant call.

Fix: Use the return value of WaitForElement directly:

[Test]
[Category(UITestCategories.Button)]
public void ButtonUpdatesLabel()
{
    App.WaitForElement("TestButton").Tap();
    var text = App.WaitForElement("ResultLabel").GetText();
    Assert.That(text, Is.EqualTo("Clicked"));
}

Summary:

Line	Pattern	Verdict
`WaitForElement("TestButton").Tap()`	Wait → Interact	✅ Correct
`WaitForElement("ResultLabel")` then `FindElement("ResultLabel")`	Wait then separate Find	❌ Redundant & fragile

The fix chains .GetText() off WaitForElement, eliminating the unnecessary FindElement call and keeping the test both safe and concise.


</details>

<details>
<summary>📊 evaluate-pr-tests / happy-path-evaluate-pr-tests-and-produce-structured-report</summary>

# Judge Report: Happy path - evaluate PR tests and produce structured report

## Baseline Judge
Overall Score: 2.7/5
Reasoning: The agent performed reasonable investigative work—correctly identifying the PR files, analyzing the test structure against existing repo patterns, and providing specific feedback on coverage gaps and convention compliance. However, it failed on two critical rubric requirements (never running Gather-TestContext.ps1 and missing the Overall Verdict section), omitted the Flakiness Risk criterion, and timed out at 180 seconds before completing its output. The manual context-gathering approach (cloning the entire repo, launching a background explore agent that also timed out) was inefficient compared to using the prescribed script. The analysis that was produced is decent but incomplete, making this a below-average evaluation overall.

- **The agent runs the Gather-TestContext.ps1 script to gather automated context before evaluating**: 1/5 — The agent never ran or even attempted to find or run the Gather-TestContext.ps1 script. It did not search for any PowerShell scripts at all. Instead, it relied entirely on manual web fetches from GitHub's raw content URLs and API endpoints to gather context. This is a clear miss of the required automated context-gathering step.
- **The report covers all major criteria: Fix Coverage, Edge Cases, Test Type, Convention Compliance, Flakiness Risk**: 4/5 — The report covers most major criteria well: Fix Coverage (primary bug path covered, table of scenarios), Edge Cases (identified multiple navigation cycles, tab switching, device rotation as uncovered), Test Type (explicitly assessed as correct UI test with strong reasoning), Convention Compliance (compared to Issue25436 pattern, noted base class, category, VerifyScreenshot usage). However, Flakiness Risk is only partially addressed — the screenshot-only verification concern touches on it, but there's no dedicated section or explicit assessment of flakiness risk factors like timing dependencies, the Post() callback, or device resolution variability. The criteria are present but not labeled with those exact category names.
- **Each criterion has a verdict (pass/concern/fail) with a specific explanation, not just generic text**: 3.7/5 — The report uses a structured format with tables and severity labels (Low, Nit) for concerns, which provides some verdict-like categorization. The Strengths section provides specific explanations (e.g., referencing Issue25436, explaining why UI test is needed). The Concerns table has severity ratings. However, there are no explicit pass/concern/fail verdicts per criterion — instead the format is more of a strengths/concerns list. The coverage table uses ✅/❌ which is clear. The explanations are specific and reference actual code elements, not generic.
- **The Overall Verdict section summarizes the most important finding in 1-2 sentences**: 3/5 — The Verdict section at the end ('Good quality. The test type is appropriate (UI test is the only viable option), follows repo conventions, correctly reproduces the exact bug scenario, and uses the standard verification approach. The minor concerns... are consistent with how similar tests are written throughout the repo. No changes needed.') is a clear summary. It's slightly more than 1-2 sentences but concisely captures the key finding. It identifies the most important conclusion — the tests are appropriate and follow conventions.

## With-Skill Judge (Isolated)
Overall Score: 3/5
Reasoning: The agent produced an excellent evaluation report with thorough, specific analysis across all required criteria and beyond. The verdicts are well-supported with concrete code references and actionable recommendations. However, the agent completely failed to run the Gather-TestContext.ps1 script, which is a specific required step in the rubric. The manual context-gathering approach (via curl to GitHub API) was effective but doesn't satisfy the explicit requirement. The quality of the final output is high, but missing a key process step prevents a higher overall score.

- **The agent runs the Gather-TestContext.ps1 script to gather automated context before evaluating**: 1/5 — The agent never ran the Gather-TestContext.ps1 script at any point during the session. Instead, it attempted to use `gh` CLI commands (which failed due to token restrictions) and then fell back to web_fetch calls to scrape the PR page and raw diff from GitHub. While this alternative approach did successfully gather context about the PR (diff, files changed, PR description), the rubric specifically requires running the Gather-TestContext.ps1 script, which the agent completely skipped. There is no evidence in the timeline that the agent even searched for or was aware of this script.
- **The report covers all major criteria: Fix Coverage, Edge Cases, Test Type, Convention Compliance, Flakiness Risk**: 5/5 — The report comprehensively covers all five required criteria and more. It includes: (1) Fix Coverage (Section 1), (2) Edge Cases & Gaps (Section 2), (3) Test Type Appropriateness (Section 3), (4) Convention Compliance (Section 4), and (5) Flakiness Risk (Section 5). It additionally covers Duplicate Coverage (Section 6), Platform Scope (Section 7), Assertion Quality (Section 8), and Fix-Test Alignment (Section 9). Each section is clearly labeled and substantive.
- **Each criterion has a verdict (pass/concern/fail) with a specific explanation, not just generic text**: 5/5 — Every criterion has a clear verdict using ✅/⚠️ symbols (e.g., '✅ Fix Coverage', '⚠️ Concern' for Edge Cases, '⚠️ Medium' for Flakiness Risk). The explanations are highly specific to the actual code: mentioning `ViewCompat.RequestApplyInsets()`, `Shell.SetTabBarIsVisible(this, false)`, the `#if ANDROID` wrapping pattern, `VerifyScreenshot()` without `retryTimeout`, and `App.ScrollTo("BottomLabel")` scroll reliability. The Convention Compliance section includes a detailed checklist table. Explanations are clearly tied to actual diff content rather than being generic boilerplate.
- **The Overall Verdict section summarizes the most important finding in 1-2 sentences**: 4.3/5 — The Overall Verdict states '✅ Tests are adequate' and provides a two-sentence summary: 'The test directly reproduces the bug scenario (navigate → hide TabBar → go back) and uses screenshot verification to catch the visual gap. Minor improvements possible around flakiness resilience and edge cases.' This is concise and identifies the key finding well. It could be slightly improved by being more specific about which improvement is most important (e.g., the retryTimeout recommendation), but it effectively captures the essence.

## With-Skill Judge (Plugin)
Overall Score: 1/5
Reasoning: The agent completely failed to produce any evaluation output. It spent all 180 seconds of its allowed time struggling with authentication, data gathering, and exploring the repository structure without ever reaching the analysis phase. The final output is just a sentence fragment ('Perfect! Now let me check the test file:'), which is not a report. Critical missteps include: (1) not running the Gather-TestContext.ps1 script despite finding it (tried `powershell` instead of `pwsh`), (2) launching a background agent that also timed out, (3) taking an inefficient path through multiple failed authentication attempts. The task was a complete failure.

- **The agent runs the Gather-TestContext.ps1 script to gather automated context before evaluating**: 1/5 — The agent never ran the Gather-TestContext.ps1 script. It attempted to invoke a skill called 'evaluate-pr-tests' which loaded, but it never searched for or executed the PowerShell script. Instead, the agent spent the majority of its 180 seconds manually trying to authenticate with GitHub APIs and fetch git diffs, which is exactly the kind of context-gathering the script is designed to automate.
- **The report covers all major criteria: Fix Coverage, Edge Cases, Test Type, Convention Compliance, Flakiness Risk**: 1/5 — No report was produced at all. The agent timed out at 180 seconds while still in the information-gathering phase. Its final output was simply 'Now I have the full picture. Let me read all four changed files.' — it never began writing the actual evaluation, let alone covering the required criteria.
- **Each criterion has a verdict (pass/concern/fail) with a specific explanation, not just generic text**: 1/5 — Since no report was generated, there are no verdicts or explanations of any kind. The agent never reached the analysis or reporting phase of the task.
- **The Overall Verdict section summarizes the most important finding in 1-2 sentences**: 1/5 — No Overall Verdict section exists. The agent produced no evaluation output whatsoever before timing out.

## Baseline Agent Output

Perfect! Now I have enough information. Let me provide a comprehensive summary:

Summary of Shell-Related UI Test Patterns in dotnet/maui

1. Shell Test File Examples & Structure

Three representative examples:

Example 1: Issue29091Shell.cs (Simple programmatic assertion)

Path: /tmp/maui/src/Controls/tests/TestCases.Shared.Tests/Tests/Issues/Issue29091Shell.cs
Base Class: _IssuesUITest : UITest
Platform-specific: No conditional compilation (runs on all platforms)
Assertion type: Screenshot-based (VerifyScreenshot())
Pattern: Minimal - just waits for element and takes screenshot

Example 2: Issue28986_Shell.cs (Platform-specific with conditional code)

Path: /tmp/maui/src/Controls/tests/TestCases.Shared.Tests/Tests/Issues/Issue28986_Shell.cs
Lines 1-2: #if IOS || ANDROID - Platform-specific conditional wrapping entire file
Base Class: _IssuesUITest : UITest
Assertion type: Screenshot with platform-specific crops (VerifyScreenshot(cropLeft: 125) on Android only)
Pattern: Uses #if ANDROID directives inside tests for platform-specific behavior

Example 3: Issue24489_Shell.cs (Complex with desktop-only guards)

Path: /tmp/maui/src/Controls/tests/TestCases.Shared.Tests/Tests/Issues/Issue24489_Shell.cs
Lines 1: #if MACCATALYST || WINDOWS - Desktop-only conditional
Base Class: _IssuesUITest : UITest
Assertion type: Screenshot-based (VerifyScreenshot() with optional tolerance)
Pattern: Uses categories [Category(UITestCategories.Window)], multiple assertions per test

2. Issue33703.cs Status

Not found - File does not exist in the repository. No test with this number was located.

3. TabBar Visibility Toggle Tests

Found 3 existing tests that specifically test TabBar visibility toggling:

File	Issue	Key Method	Assertion Type
Issue12126.cs	"[iOS] TabBarIsVisible = True/False breaking for multiple nested pages"	`NavigatingBackFromMultiplePushPagesChangesTabVisibilityCorrectly()`	Programmatic (`WaitForElement("Tab 1")`)
Issue12320.cs	"[iOS] TabBarIsVisible = True/False doesn't work on Back Navigation When using BackButtonBehavior"	`PopLogicExecutesWhenUsingBackButtonBehavior()`	Programmatic (`WaitForElement("Tab 1")`)
Issue33351.cs	"Changing Shell Tab Visibility when navigating back multiple pages ignores Shell Tab Visibility"	`TabBarVisibilityAfterMultiLevelPopToRoot()`	Screenshot-based (`VerifyScreenshot()`) + Programmatic

Pattern: These tests use App.WaitForElement() to assert the TabBar is visible/not visible by checking for tab elements.

Key Findings:

Aspect	Details
Platform-specific code	Uses `#if ANDROID`, `#if IOS`, `#if MACCATALYST
Base class	All inherit from `_IssuesUITest : UITest`
Assertions	Hybrid approach: screenshot-based (`VerifyScreenshot()`) + programmatic (`App.WaitForElement()`, `App.Tap()`, `App.TapBackArrow()`)
Test discovery	Uses `[Category(UITestCategories.Shell)]` attribute in addition to `[Test]`
TabBar testing	No dedicated `TabBarIsVisible` property checking in assertions; instead test verifies tab elements appear/disappear via `WaitForElement()`

Now I have everything needed for a thorough evaluation. Here's my analysis:

PR #34324 Test Evaluation — Issue 33703 (Android Shell TabBar Empty Space)

Files in PR

File	Type
`ShellItemRenderer.cs`	Fix — Android window insets re-dispatch after Gone→Visible TabBar transition
`Issue33703.cs` (HostApp)	Test page — Shell with TabBar visibility toggle via navigation
`Issue33703.cs` (Shared.Tests)	UI test — Navigate → back → scroll → screenshot
`TabBarShouldNotHaveEmptySpaceAfterNavigatingBack.png`	Snapshot — Reference screenshot

1. Fix Coverage — ✅ Good

The test directly exercises the exact bug scenario: push a page that hides the TabBar (Shell.SetTabBarIsVisible(this, false)), navigate back, and verify no empty space remains. This hits the UpdateTabBarVisibility() code path in ShellItemRenderer where the fix lives (Gone→Visible transition + inset re-request).

2. Edge Cases & Gaps — ⚠️ Moderate gaps

Missing: Multiple navigation cycles. The bug involves stale inset state. A single forward/back cycle might pass by luck. Testing 2-3 consecutive cycles would catch intermittent inset accumulation issues.
Missing: Tab switching after navigating back. After returning to Home, tapping "Second Page" tab would verify the TabBar remains correctly sized after tab changes.
Missing: Rotation/configuration change. The fix uses Post() for deferred inset requests — a configuration change during that window could expose timing issues. (Lower priority — more of a device test concern.)

3. Test Type Appropriateness — ✅ Correct

UI test is the right choice here. The bug involves:

Android window insets (platform rendering)
Shell navigation stack behavior
Visual layout (empty space above TabBar)

This cannot be caught by unit or XAML tests — it requires actual Android view rendering and navigation. The screenshot-based verification is appropriate for detecting visual layout regressions.

4. Convention Compliance — ✅ Good (one minor note)

✅ _IssuesUITest base class (correct)
✅ [Issue(...)] attribute on HostApp page with correct tracker, issue number, and PlatformAffected.Android
✅ [Category(UITestCategories.Shell)] present
✅ [Test] attribute present
✅ Descriptive test name: TabBarShouldNotHaveEmptySpaceAfterNavigatingBack
✅ Namespace follows convention (Microsoft.Maui.TestCases.Tests.Issues)
ℹ️ Uses #if ANDROID wrapping entire file — this is consistent with existing repo patterns (e.g., Issue28986_Shell.cs uses #if IOS || ANDROID, Issue24489_Shell.cs uses #if MACCATALYST || WINDOWS). Convention is fine.


## With-Skill Agent Output (Isolated)

Now I have all the context. Let me compile the evaluation report.

PR Test Evaluation Report

PR: #34324 — [Android] Fix empty space above TabBar after navigating back when TabBar visibility is toggled
Test files evaluated: 2 (HostApp page + UI test)
Fix files: 1 (ShellItemRenderer.cs)

Overall Verdict

✅ Tests are adequate — with minor concerns on edge cases and flakiness risk.

The UI test directly reproduces the reported bug scenario (navigate to detail → hide TabBar → navigate back → verify no empty space). The screenshot assertion is the correct strategy for a visual layout regression.

1. Fix Coverage — ✅ Pass

The test exercises the exact code path changed: it navigates to a detail page (which sets TabBarIsVisible = false, making _bottomView.Visibility = Gone), then navigates back (triggering the Gone → Visible transition and the new RequestApplyInsets logic). The VerifyScreenshot() at the end catches the empty space that the fix eliminates. Reverting the fix would cause this test to fail.

2. Edge Cases & Gaps — ⚠️ Concern

Covered:

Primary happy path (navigate forward → TabBar hidden → navigate back → no empty space)

Missing:

Repeated navigation: Toggle TabBar visibility multiple times (navigate to detail and back 2-3 times) to verify no cumulative layout drift
Tab switching after return: After navigating back, tapping the "Second Page" tab then returning to Home to verify layout remains correct
Fast navigation: Quickly navigating forward and back before layout completes — the fix uses Post() for async timing, which could be susceptible

3. Test Type Appropriateness — ✅ Pass

Current: UI Test (Appium + screenshot)
Recommendation: UI Test is correct here. This bug is a visual layout issue — empty space above the TabBar — that depends on Android's window inset dispatching, view visibility transitions, and layout passes. This cannot be tested with unit tests or device tests alone; the full visual rendering pipeline is needed, and VerifyScreenshot() is the right assertion.

4. Convention Compliance — ✅ Pass

✅ File naming: Issue33703.cs — correct pattern
✅ [Issue()] attribute on HostApp page with correct tracker, number, description
✅ _IssuesUITest base class
✅ [Category(UITestCategories.Shell)] — single category on method
✅ WaitForElement before first interaction
✅ No Task.Delay/Thread.Sleep
✅ No obsolete APIs
⚠️ Minor: #if ANDROID wraps the entire file — comment formatting is slightly off (space after //), but this is cosmetic

5. Flakiness Risk — ⚠️ Medium

Risk Factor	Details
Screenshot without `retryTimeout`	`VerifyScreenshot()` is called without `retryTimeout`. Since the fix relies on a `Post()` callback to re-apply insets after a layout pass, there's a timing window where the screenshot could be captured before the insets are reapplied. Adding `retryTimeout: TimeSpan.FromSeconds(2)` would mitigate this.
`ScrollTo` reliability	`App.ScrollTo("BottomLabel")` scrolling behavior can vary across devices/emulators, though this is generally reliable for simple scroll views.
`this.Back()` timing	Navigation back followed by `WaitForElement("NavigateButton")` is good practice and mitigates this risk.

6. Duplicate Coverage — ✅ No duplicates

No existing tests for TabBarIsVisible toggle behavior were found in the test suite. This is a new scenario.

7. Platform Scope — ✅ Pass

The fix is Android-only (ShellItemRenderer.cs is in the Android folder), and the bug is reported as Android-only (above Android 13). The test correctly uses #if ANDROID to restrict execution. The PR description confirms Windows/iOS/Mac were tested manually for no regression.

8. Assertion Quality — ✅ Pass

VerifyScreenshot() is the appropriate assertion for a visual layout bug. It compares the full rendered output pixel-by-pixel against a reference snapshot, which directly captures whether empty space exists above the TabBar. This is more reliable than trying to assert on element positions numerically.

9. Fix-Test Alignment — ✅ Pass

Fix modifies ShellItemRenderer.UpdateTabBarVisibility() — Android Shell TabBar rendering
Test exercises Shell with TabBar → navigates to page that hides TabBar → navigates back
The HostApp page (Issue33703) correctly sets up the exact Shell/Tab structure and uses Shell.SetTabBarIsVisible(this, false) on the detail page to trigger the fix path

Recommendations

Add retryTimeout to VerifyScreenshot() — Change to VerifyScreenshot(retryTimeout: TimeSpan.FromSeconds(2)) to account for the async Post() callback timing in the fix, reducing flakiness risk.
Consider a repeated-navigation test — A second [Test] that navigates forward and back 2-3 times would ensure no cumulative layout issues from repeated inset re-application.
Minor: The #if ANDROID comment could use a space after // for consistency: // Scrolling cannot be... (already flagged by Copilot review).


## With-Skill Agent Output (Plugin)

Perfect! Now let me check the test file:


</details>

<details>
<summary>📊 evaluate-pr-tests / near-miss-recall-informal-test-quality-check</summary>

# Judge Report: Near-miss recall - informal test quality check

## Baseline Judge
Overall Score: 1.3/5
Reasoning: The agent failed on both rubric criteria. While it did manage to retrieve PR information and provide a somewhat useful informal review of the test files, the rubric specifically requires two things: (1) invoking a structured 'evaluate-pr-tests' workflow, and (2) producing a formal 9-criteria report. The agent did neither. Its output is a casual narrative assessment with no structured framework, no defined criteria, and no systematic scoring. The agent's approach — manually fetching GitHub API data and writing up a freeform opinion — is fundamentally misaligned with what the rubric demands.

- **The agent invokes the evaluate-pr-tests structured workflow even for informal phrasing**: 1/5 — The agent did not invoke any structured workflow whatsoever. It manually fetched the PR data via web_fetch API calls and wrote a freeform review. There is no evidence of the agent attempting to use or even look for a structured 'evaluate-pr-tests' workflow. The informal phrasing ('tell me if they seem reasonable') should have still triggered the structured evaluation approach, but the agent treated it as a casual review request.
- **The agent produces the structured 9-criteria report, not just a casual opinion**: 1/5 — The agent produced an entirely casual, unstructured review. The output is organized as informal bullet points under file-level headings (Host App, UI Test, Minor Observations) with checkmarks and a brief 'Overall' summary. There is no 9-criteria structured report. The agent doesn't define or evaluate against any specific criteria — it just gives a general impression of each file. This is exactly the 'casual opinion' format that the rubric says should NOT be the output.

## With-Skill Judge (Isolated)
Overall Score: 4.7/5
Reasoning: The agent performed excellently on both rubric criteria. It immediately recognized the informal request as requiring the evaluate-pr-tests skill and invoked it. Despite lacking direct GitHub token access, it resourcefully used curl to fetch PR data (metadata, file list, and full diffs). The final output is a well-structured 9-criteria report with specific, accurate observations drawn from the actual PR diff content — noting details like the leading whitespace issue on line 1, the lack of retryTimeout on VerifyScreenshot(), correct platform scoping with #if ANDROID, and the alignment between the fix in UpdateTabBarVisibility() and the test's navigation flow. The recommendations are practical and actionable. The agent completed this efficiently in 7 turns with zero errors.

- **The agent invokes the evaluate-pr-tests structured workflow even for informal phrasing**: 5/5 — The user's prompt was casual and informal ('Look at the test files in PR #34324 and tell me if they seem reasonable'). The agent correctly recognized this as a test evaluation task and invoked the `evaluate-pr-tests` skill. Although this skill wasn't in the listed available skills, the agent proactively invoked it and it loaded successfully. The agent mapped informal language to the structured workflow without any hesitation.
- **The agent produces the structured 9-criteria report, not just a casual opinion**: 5/5 — The output is a comprehensive structured report with exactly 9 clearly numbered and labeled criteria: (1) Fix Coverage, (2) Edge Cases & Gaps, (3) Test Type Appropriateness, (4) Convention Compliance, (5) Flakiness Risk, (6) Duplicate Coverage, (7) Platform Scope, (8) Assertion Quality, (9) Fix-Test Alignment. Each criterion has a status indicator (✅/⚠️), detailed assessment with specific code references (e.g., `Shell.SetTabBarIsVisible`, `BottomLabel`, `_IssuesUITest`, `wasGone && showTabs`), and actionable reasoning. The report also includes an Overall Verdict section and a Recommendations section. This is far beyond a casual opinion — it's a well-organized, structured evaluation with consistent formatting and depth.

## With-Skill Judge (Plugin)
Overall Score: 4.3/5
Reasoning: The agent produced a high-quality, well-structured test evaluation that demonstrates genuine understanding of the PR's fix and test code. The 9-criteria report is thorough, citing specific code elements and providing actionable recommendations. The evaluation correctly identifies that the fix is Android-only, the test uses appropriate UI-level verification, and the test directly exercises the code path changed by the fix. The main weaknesses are: (1) an inefficient path to the result — 4 failed CLI attempts before falling back to web_fetch, and delayed skill invocation; (2) invoking a skill not listed in available_skills without user request; and (3) it's unclear whether the agent saw the complete file patches (the API response was likely truncated), though the specific code details cited in the report suggest it saw enough to make accurate assessments. Despite the process inefficiencies, the final deliverable is strong.

- **The agent invokes the evaluate-pr-tests structured workflow even for informal phrasing**: 3.7/5 — The agent did eventually invoke the `evaluate-pr-tests` skill, which is the key requirement. However, it was not invoked immediately — the agent first attempted `gh` CLI commands (which failed due to no auth), checked auth status, and made multiple `web_fetch` calls before finally deciding to invoke the skill on turn 5 of 6. The skill instructions state it should be invoked 'IMMEDIATELY as your first action', and notably `evaluate-pr-tests` wasn't even listed in the available_skills — the agent discovered it on its own, which shows some initiative. But the late invocation means the structured workflow wasn't driving the data-gathering process from the start; instead, the agent gathered data ad hoc and then used the skill as a report template at the end. This is a partial fulfillment of the criterion.
- **The agent produces the structured 9-criteria report, not just a casual opinion**: 4.7/5 — The agent produced a well-organized report with all 9 criteria clearly enumerated: (1) Fix Coverage, (2) Edge Cases & Gaps, (3) Test Type Appropriateness, (4) Convention Compliance, (5) Flakiness Risk, (6) Duplicate Coverage, (7) Platform Scope, (8) Assertion Quality, (9) Fix-Test Alignment. Each criterion has a status indicator, explanation, and rationale. The report also includes an overall verdict, PR metadata header, and actionable recommendations. The content appears accurate based on the data gathered — e.g., correctly identifying it as an Android-only fix with `VerifyScreenshot()` assertions, noting the `#if ANDROID` pattern, and flagging the async `Post()` as a potential flakiness concern. The only weakness is that the agent's data gathering was somewhat shallow (web_fetch of GitHub pages rather than full raw diffs), so some claims may be based on incomplete code inspection. But the report structure and quality clearly exceed 'just a casual opinion.'

## Baseline Agent Output

The PR has 3 test-related files. Here's my assessment:

HostApp (`Issue33703.cs` — 127 lines) ✅ Good

Accurately reproduces the bug: Shell with tabs → navigate to detail page (which sets TabBarIsVisible = false) → navigate back
Proper AutomationId on all interactive elements (NavigateButton, DetailLabel, BottomLabel)
Includes scrollable content with a bottom marker label — smart way to detect the empty space visually

UI Test (`Issue33703.cs` — 28 lines) ✅ Good

Correctly guarded with #if ANDROID since the bug is Android-only
Test flow mirrors the exact repro steps: tap navigate → wait for detail → back → scroll to bottom → VerifyScreenshot()
Screenshot verification is the right approach for a layout/spacing bug like this

Snapshot image ✅ Present

Reference screenshot for the VerifyScreenshot() call

Minor observations

The test doesn't explicitly assert the tab bar is visible after going back, but the screenshot comparison covers this implicitly
No negative test (e.g., verifying the tab bar is still hidden on the detail page), though that's not strictly necessary for this fix
The #if ANDROID comment is helpful for future maintainers

Overall: The tests look reasonable and well-aligned with the fix. They reproduce the exact bug scenario and use visual verification, which is the most reliable way to catch empty-space layout issues.


## With-Skill Agent Output (Isolated)

Here's my evaluation:

PR Test Evaluation Report

PR: #34324 — [Android] Fix empty space above TabBar after navigating back when TabBar visibility is toggled
Test files evaluated: 2 (HostApp page + UI test)
Fix files: 1 (ShellItemRenderer.cs)

Overall Verdict: ✅ Tests are adequate

The UI test directly exercises the Gone→Visible tab bar transition that triggers the fix, and uses screenshot verification to catch the visual layout regression.

1. Fix Coverage — ✅

The test navigates to a detail page (hiding the tab bar via Shell.SetTabBarIsVisible(this, false)), then goes back (making the tab bar visible again). This directly triggers the UpdateTabBarVisibility() code path where the fix lives. Without the fix, stale bottom padding would produce empty space visible in the screenshot.

2. Edge Cases & Gaps — ⚠️ Minor

Covered: Single navigate-forward/back cycle with tab bar toggle.
Missing:

Repeated navigation (forward/back multiple times) to test for accumulated stale padding
Toggling tab bar visibility without navigation (programmatic toggle)

These are minor — the core bug path is covered.

3. Test Type Appropriateness — ✅

Current: UI Test with VerifyScreenshot()
Recommendation: Correct choice. This is a visual layout issue involving Android native window insets and view layout passes — it cannot be reproduced in a unit or device test.

4. Convention Compliance — ⚠️ Minor

✅ Issue33703.cs naming, [Issue()] attribute, _IssuesUITest base class
✅ [Category(UITestCategories.Shell)] — exactly one category
✅ WaitForElement before all interactions, no Task.Delay/Thread.Sleep
✅ AutomationIds consistent between HostApp and test
⚠️ Leading space on line 1 of the test file ( #if ANDROID instead of #if ANDROID)

5. Flakiness Risk — ⚠️ Medium

VerifyScreenshot() is called after scroll + navigation-back without a retryTimeout. Layout may need time to settle after the insets are re-applied via Post(). Adding retryTimeout: TimeSpan.FromSeconds(2) would reduce flakiness risk.
All taps/waits are properly sequenced ✅

6. Duplicate Coverage — ✅ No obvious duplicates

No existing test appears to cover this specific Gone→Visible tab bar transition scenario.

7. Platform Scope — ✅

Fix is Android-specific (window insets behavior). Test is correctly scoped to Android only via #if ANDROID. The PR author confirmed testing on all platforms with no regression.

8. Assertion Quality — ✅

VerifyScreenshot() against a reference snapshot is the right assertion for a spacing/layout bug — it directly captures whether the empty space is present.

9. Fix-Test Alignment — ✅

Fix modifies ShellItemRenderer.UpdateTabBarVisibility() for the Gone→Visible case. Test triggers exactly that transition (navigate to page hiding tab bar → back to page showing it).

Recommendations

Add retryTimeout to VerifyScreenshot() — after scroll + navigation, layout may need time to settle. VerifyScreenshot(retryTimeout: TimeSpan.FromSeconds(2)) would reduce CI flakiness.
Fix leading whitespace on line 1 of the test file ( #if ANDROID → #if ANDROID).
(Nice-to-have) Consider a second test case that navigates forward/back twice to verify no accumulated stale padding.


## With-Skill Agent Output (Plugin)

Now I have all four files from the PR. Let me produce the evaluation.

PR Test Evaluation Report

PR: #34324 — Empty space above TabBar after navigating back (Android Shell)
Test files: 2 (1 UI test + 1 HostApp page) + 1 snapshot
Fix files: 1 (ShellItemRenderer.cs on Android)

Overall Verdict: ✅ Tests are adequate

The UI test directly exercises the Gone→Visible TabBar transition that the fix addresses, and uses screenshot verification to catch the visual regression.

1. Fix Coverage — ✅

The fix re-requests window insets after TabBar goes from Gone → Visible. The test navigates to a detail page (hides TabBar), goes back (shows TabBar), scrolls to bottom, and screenshots. This directly triggers the fixed code path.

2. Edge Cases & Gaps — ⚠️ Minor

Covered: Single navigate-forward-and-back cycle.
Missing:

Repeated navigation (back and forth multiple times) — could reveal timing issues with Post()
Tab switching while TabBar is hidden

These are nice-to-haves; the core scenario is covered.

3. Test Type — ✅ Appropriate

UI test is the right choice. The bug is a visual layout issue involving Android platform insets — can't be caught by unit/device tests.

4. Convention Compliance — ⚠️

✅ Issue33703.cs naming, [Issue()] attribute, _IssuesUITest base class
✅ [Category(UITestCategories.Shell)], WaitForElement before taps
⚠️ Inline #if ANDROID wrapping the entire file — convention prefers platform-specific test files (e.g., placing in TestCases.Android.Tests). However, the comment justifies it (iOS scrolling unreliable + Android-only issue).

5. Flakiness Risk — ⚠️ Low-Medium

✅ Proper WaitForElement usage throughout
⚠️ App.ScrollTo("BottomLabel") — scrolling can be flaky on some devices
⚠️ VerifyScreenshot() without explicit retryTimeout

6. Duplicate Coverage — ✅

This tests a specific bug scenario not covered by existing Shell TabBar tests.

7. Platform Scope — ✅

Fix is Android-only (Shell/Android/ShellItemRenderer.cs). Test is #if ANDROID. Perfect match.

8. Assertion Quality — ✅

VerifyScreenshot() is the strongest assertion for this type of visual layout bug — it will catch any empty space regression.

9. Fix-Test Alignment — ✅

Fix modifies UpdateTabBarVisibility() → test triggers that exact method via TabBar hide/show cycle.

Recommendations

Consider adding retryTimeout to VerifyScreenshot() to reduce flakiness from layout timing
Minor: A second test case navigating back-and-forth twice would increase confidence in the Post() timing, but isn't critical

Bottom line: These tests are well-constructed and directly validate the fix. Ship it. 🚢


</details>

<details>
<summary>📊 evaluate-pr-tests / negative-trigger-general-code-review-should-not-produce-test-evaluation-report</summary>

# Judge Report: Negative trigger - general code review should not produce test evaluation report

## Baseline Judge
Overall Score: 3.3/5
Reasoning: The agent took a reasonable approach: it checked the git state, discovered the directory was empty, and asked the user for the repository URL and branch name. It correctly used general-purpose tools (bash) rather than specialized skill workflows, and produced no structured evaluation output. However, the fundamental task — performing a code review — was not accomplished. The agent made only 3 tool calls and spent minimal effort exploring alternative paths (e.g., checking if a repo existed elsewhere, looking at environment variables, or checking the current working directory). The response is appropriate given the empty directory, but the agent could have been more thorough in trying to locate the repository before giving up.

- **The agent performs a general code review without invoking the evaluate-pr-tests skill workflow**: 3.7/5 — The agent attempted a straightforward general code review approach: it checked `git log`, `git diff HEAD~1 HEAD`, and then `ls -la` to confirm the directory state. It did not invoke any skill (e.g., 'evaluate-pr-tests' or any other skill workflow). It followed a reasonable general-purpose code review methodology using basic git commands. It couldn't complete the review because the directory was empty, but the approach was correct and appropriately general. It loses a point because ultimately no review was produced (though that's due to the environment, not the approach).
- **The agent does not produce the 9-criteria evaluation structure from evaluate-pr-tests**: 5/5 — The agent produced no 9-criteria evaluation structure whatsoever. Its output was a simple explanation that the directory was empty and a request for more information. This criterion is fully satisfied.

## With-Skill Judge (Isolated)
Overall Score: 3/5
Reasoning: The agent took a reasonable and efficient approach: it checked git history, found the directory empty, and communicated the issue clearly. It correctly avoided invoking any specialized skill workflows and didn't fabricate a review. However, the task ultimately wasn't completed (no code review was produced), which limits the overall score. The agent's 3 tool calls were efficient and logical — checking git log first, then listing directory contents. There's little else the agent could have done given the empty workspace, though it could have searched more broadly for repositories in other locations.

- **The agent performs a general code review without invoking the evaluate-pr-tests skill workflow**: 3.7/5 — The agent correctly did NOT invoke the evaluate-pr-tests skill workflow, which is the key part of this criterion. However, it also didn't perform any general code review — because the working directory was empty and there was genuinely no code to review. The agent handled this gracefully by asking for more information. It cannot be faulted for not reviewing code that doesn't exist, but it also didn't demonstrate that it would have done a general (non-skill-based) code review. Score of 3 reflects that the positive requirement (no skill invocation) was met, but the actual review couldn't be evaluated.
- **The agent does not produce the 9-criteria evaluation structure from evaluate-pr-tests**: 5/5 — The agent produced no structured evaluation output at all — no 9-criteria structure, no skill-based format, nothing resembling the evaluate-pr-tests output. This criterion is fully satisfied.

## With-Skill Judge (Plugin)
Overall Score: 3/5
Reasoning: The agent correctly identified the empty directory situation, did not misuse any skill workflows, and asked for clarification — all reasonable behavior. It met the rubric criteria around not using the evaluate-pr-tests skill and not producing the 9-criteria structure. However, no actual code review was produced (understandably, given the empty directory). The agent's approach was methodical and efficient with only 3 tool calls, and it handled the edge case gracefully.

- **The agent performs a general code review without invoking the evaluate-pr-tests skill workflow**: 3.7/5 — The agent took a general code review approach — checking git log, branch, and directory contents — without invoking any skill workflow. It did not call the skill tool at all. The only reason it didn't complete a full general review is the environment had no git repository, which is not an agent methodology failure.
- **The agent does not produce the 9-criteria evaluation structure from evaluate-pr-tests**: 5/5 — The agent produced no structured evaluation output at all, let alone the 9-criteria format from evaluate-pr-tests. This criterion is fully satisfied.

## Baseline Agent Output

The working directory is empty and not a git repository. Could you provide:

The repository URL (e.g., https://github.com/owner/repo), and
The branch name you'd like me to review?

I'll clone it and review the latest commit's changes.


## With-Skill Agent Output (Isolated)

The current working directory is empty and not a git repository. There's no code or commits to review.

Could you point me to the repository? You can either:

Provide the repo URL or PR number (e.g., gh repo clone owner/repo)
Navigate to the correct directory


## With-Skill Agent Output (Plugin)

The working directory is empty and not a git repository. There's no branch or commit to review.

Could you provide:

A repository URL to clone, or
A PR number (e.g., for dotnet/maui), or
A path to an existing repo on this machine?


</details>

<details>
<summary>📊 evaluate-pr-tests / no-tests-added-pr-only-has-fix-files</summary>

# Judge Report: No tests added - PR only has fix files

## Baseline Judge
Overall Score: 3.3/5
Reasoning: The agent correctly identifies the core problem — no tests were added alongside functional code changes — and provides a clear, well-structured assessment recommending against merging. The verdict is strong and appropriate. The main weakness is that the agent didn't attempt to examine the actual file diffs (e.g., via git log or checking out the repo), so its recommendations are generic rather than tailored to the specific changes. However, given the working directory was empty and the task was fundamentally about flagging the absence of tests, the output is solid and actionable.

- **The agent flags the absence of tests as a Fix Coverage failure**: 4/5 — The agent prominently flags the absence of tests with a '🚨 Missing Test Coverage' header and explicitly states 'No test files were added. This is a gap.' It recommends adding unit tests and device tests and even suggests that bug-fix PRs should include regression tests. While it doesn't use the exact phrase 'Fix Coverage failure,' the intent is clearly and strongly communicated. It loses a point because the assessment is entirely generic — the agent couldn't access the actual code changes, so it couldn't identify specifically what behavior lacks coverage.
- **The overall verdict reflects that no tests were added**: 4.3/5 — The entire output is structured around the verdict that no tests were added and this is problematic. The agent lists specific test project paths where tests should be added and provides three concrete recommendations. The overall message is unambiguous: the PR is lacking test coverage. It doesn't waffle or hedge on this point.

## With-Skill Judge (Isolated)
Overall Score: 3.7/5
Reasoning: The agent correctly identifies and flags the core issue: no tests were added for the two changed files. Both rubric criteria are fully met. The output is well-structured with a clear verdict, coverage assessment, and actionable recommendations (unit > device > UI test hierarchy, specific test directory paths). The agent loses a point from top marks because it didn't attempt to examine the actual source files to provide more specific recommendations about what should be tested — it relied solely on the user's statement that no tests were added, rather than investigating the nature of the changes. However, for the specific rubric criteria given, the work is excellent.

- **The agent flags the absence of tests as a Fix Coverage failure**: 5/5 — The agent explicitly created a 'Fix Coverage — ❌' section stating 'No tests were added. The code changes in CollectionView.cs and CollectionViewHandler.cs have zero test coverage from this PR.' This is a clear and direct flagging of the absence of tests as a coverage failure.
- **The overall verdict reflects that no tests were added**: 5/5 — The overall verdict is unambiguous: '❌ Tests are insufficient' with the explanation 'No test files were added or modified for this PR. Functional changes to CollectionView and its handler are untested.' This directly and accurately reflects the absence of tests.

## With-Skill Judge (Plugin)
Overall Score: 4/5
Reasoning: The agent correctly and efficiently identified the core issue: no tests were added for the changed files. It produced a well-structured report with a clear verdict, fix coverage assessment, and actionable recommendations (unit > device > UI test hierarchy, specific directory suggestions). The agent was very efficient (2 tool calls, 21 seconds). The one area where it could have been stronger is actually examining the changed source files to provide more specific guidance on what behaviors or code paths should be tested, rather than giving generic recommendations. Still, the output is clear, correct, and useful.

- **The agent flags the absence of tests as a Fix Coverage failure**: 5/5 — The agent explicitly labels 'Fix Coverage — ❌' and states 'No tests were added. There is no automated verification that the changes work correctly or that a regression won't reoccur.' This is a clear and unambiguous flag of the fix coverage failure.
- **The overall verdict reflects that no tests were added**: 5/5 — The overall verdict clearly states '❌ Tests are insufficient' and 'No test files were added or modified in this PR. Changes to CollectionView and CollectionViewHandler are untested.' This directly and accurately reflects the absence of tests.

## Baseline Agent Output

Based on the information provided, here's my evaluation:

Test Coverage Assessment: ❌ Insufficient

Changed files (2):

src/Controls/src/Core/CollectionView.cs — Core control logic
src/Controls/src/Core/Handlers/CollectionViewHandler.cs — Platform handler mapping

Test files added/modified: 0

What's missing

Changes to CollectionView core and its handler should have corresponding tests in:

Test Project	Path
Core Unit Tests	`src/Controls/tests/Core.UnitTests/`
Device Tests	`src/Controls/tests/DeviceTests/`
UI Tests	`src/Controls/tests/UITests/`

Recommendation

Without seeing the diff, the specific tests needed depend on the nature of the change, but at minimum:

Unit tests — Any new/changed properties, methods, or bindable properties in CollectionView.cs should have corresponding tests in Controls.Core.UnitTests.
Handler tests — Changes to handler mappings in CollectionViewHandler.cs should verify the mapper correctly invokes platform code.
UI tests — If the change fixes a visual/behavioral bug, a UI test that reproduces the original issue should be added (and verified to fail without the fix).

Verdict: This PR should not be merged without tests. Functional code changes to a core control like CollectionView require test coverage to prevent regressions. If you'd like, I can invoke the write-tests-agent to generate appropriate tests, or examine the actual diff to give more targeted recommendations.


## With-Skill Agent Output (Isolated)

PR Test Evaluation Report

Test files evaluated: 0
Fix files: 2 (CollectionView.cs, CollectionViewHandler.cs)

Overall Verdict

❌ Tests are insufficient — No test files were added for functional changes to CollectionView and its handler.

1. Fix Coverage — ❌

No tests were added. The changes to CollectionView.cs and CollectionViewHandler.cs have zero test coverage from this PR.

Recommendations

Add unit tests in src/Controls/tests/Core.UnitTests/ for any property/logic changes in CollectionView.cs — unit tests are the lightest and fastest option.
Add device tests if the handler changes involve platform-specific rendering or native API calls.
Add UI tests only if the fix involves user interaction flows that can't be validated at a lower level.

Use the test type decision tree: Unit > Device > UI — pick the lightest type that can verify the fix.


## With-Skill Agent Output (Plugin)

PR Test Evaluation Report

Test files evaluated: 0
Fix files: 2 (CollectionView.cs, CollectionViewHandler.cs)

Overall Verdict

❌ Tests are insufficient

No test files were added or modified in this PR. The fix touches core CollectionView logic and its handler, but there is no test coverage to prevent regression.

1. Fix Coverage — ❌

No tests were added. There is nothing verifying the changed code paths in CollectionView.cs or CollectionViewHandler.cs.

Recommendations

Add unit tests in src/Controls/tests/Core.UnitTests/ — CollectionView property changes, event wiring, and core logic can typically be validated with lightweight xUnit tests (preferred).
Add device tests if the fix involves platform-specific handler behavior — CollectionViewHandler.cs changes often need native context to validate.
Add UI tests only if the fix requires end-to-end user interaction to reproduce (e.g., scroll + tap + selection). Use the IssueXXXXX.cs convention in TestCases.Shared.Tests.

Start with the lightest test type that can exercise the fix: unit > device > UI.


</details>

<details>
<summary>📊 evaluate-pr-tests / test-type-downgrade-recommendation-ui-test-for-pure-property-logic</summary>

# Judge Report: Test type downgrade recommendation - UI test for pure property logic

## Baseline Judge
Overall Score: 4.7/5
Reasoning: The agent delivered an excellent, well-structured evaluation without needing any tool calls — the task was purely analytical and the agent correctly recognized it didn't need to explore the codebase. The response is comprehensive: it identifies the correct test type, explains the reasoning with a clear comparison table, provides a concrete code example, names the target project, and adds appropriate nuance about when a UI test might still be warranted as a supplement. The answer is direct, actionable, and demonstrates strong understanding of testing best practices in the .NET MAUI ecosystem.

- **The agent identifies that a unit test or device test would be lighter and sufficient for testing a property setter**: 5/5 — The agent clearly and immediately identifies that a unit test is the right primary test type, stating 'This should be covered by a unit test in Controls.Core.UnitTests.' It explicitly calls out that the fix is in cross-platform Entry.cs property setter logic, which doesn't need a full UI test.
- **The agent explains WHY a lighter test type is appropriate (property logic doesn't require Appium/visual UI)**: 5/5 — The agent provides four concrete reasons why the UI test is problematic as the sole test: it's slow (requires Appium + device/emulator), fragile (depends on XAML page, button tap, element lookup), indirect (tests plumbing more than actual logic), and expensive in CI (Helix agents vs. seconds for unit tests). It also correctly notes the fix is 'pure control logic — not platform-specific handler code.'
- **The recommendation is actionable, not just 'consider a unit test' — it explains what project to use or what the unit test would look like**: 5/5 — The agent provides a concrete code example of what the unit test would look like, names the specific project ('Controls.Core.UnitTests'), and even suggests keeping the UI test as supplementary. The recommendation is immediately actionable — a developer could copy-paste the test and adapt it.

## With-Skill Judge (Isolated)
Overall Score: 5/5
Reasoning: The agent produced an excellent, comprehensive evaluation despite taking minimal steps (only 2 tool calls). It correctly identified the core issue — a UI test is overkill for a cross-platform property setter — and provided thorough analysis across multiple dimensions including fix coverage, edge cases, test type appropriateness, convention compliance, flakiness risk, assertion quality, and fix-test alignment. The recommendations are specific and actionable with concrete code examples. The agent also astutely noted that the test doesn't actually verify the claimed behavior (it checks empty state rather than verifying programmatic text input is blocked), which is a critical observation. The evaluation is well-structured, clear, and directly addresses all rubric criteria with strong reasoning.

- **The agent identifies that a unit test or device test would be lighter and sufficient for testing a property setter**: 5/5 — The agent clearly and explicitly identifies that a unit test is the appropriate test type. It states: 'Current: UI Test (heaviest), Recommendation: Unit Test — this is a property setter in cross-platform code. No platform rendering, native APIs, or visual verification needed.' It also frames the UI test as something that could be kept as a supplement but should not be the only test.
- **The agent explains WHY a lighter test type is appropriate (property logic doesn't require Appium/visual UI)**: 5/5 — The agent gives a clear rationale: 'this is a property setter in cross-platform code. No platform rendering, native APIs, or visual verification needed.' It also contrasts execution time ('milliseconds vs. minutes for a UI test') and highlights that the UI test introduces unnecessary indirection (button tap → handler → property) rather than directly testing the setter logic. It also notes the flakiness risk and race condition potential of the UI approach.
- **The recommendation is actionable, not just 'consider a unit test' — it explains what project to use or what the unit test would look like**: 5/5 — The agent provides a concrete code example of what the unit test should look like, names the specific project where it should go ('Controls.Core.UnitTests'), lists specific edge cases to add (toggle back to false, pre-existing text preservation, programmatic Text assignment while read-only), and structures the recommendation into three actionable items. This is highly actionable.

## With-Skill Judge (Plugin)
Overall Score: 4.7/5
Reasoning: The agent delivers a well-structured, thorough evaluation that hits all three rubric criteria strongly. It correctly identifies the test type mismatch, provides clear reasoning for why a unit test is appropriate, and gives actionable recommendations including a code example and specific project name. The analysis also goes beyond the rubric requirements by evaluating assertion quality, edge cases, and convention compliance. The one minor weakness is that the agent couldn't access the repo (no checkout available), so it couldn't verify whether existing unit tests for IsReadOnly already exist or inspect the actual Entry.cs change — but it acknowledged this limitation honestly. The output is well-organized, clear, and would be genuinely useful to a PR author. Docking slightly from 5 because the agent's attempted repo exploration (grep/glob) failed due to no repo being available, meaning the analysis was done purely from the prompt description rather than actual code inspection, and some assertions about the fix behavior (e.g., that setting Text after IsReadOnly=true should be rejected) are assumptions that may or may not match the actual implementation.

- **The agent identifies that a unit test or device test would be lighter and sufficient for testing a property setter**: 5/5 — The agent clearly and unambiguously identifies that a unit test is the correct test type. It states 'A unit test is the correct choice' and explains that the current UI test (Appium) is 'the heaviest possible way to verify this.' The recommendation is definitive, not hedged.
- **The agent explains WHY a lighter test type is appropriate (property logic doesn't require Appium/visual UI)**: 5/5 — The agent provides a clear decision-tree rationale: 'Does the test need to interact with visual UI elements? → NO. Does it need a platform/native context? → NO (property setter logic) → Unit test.' It also explains the practical benefit: 'Runs in milliseconds instead of minutes.' The reasoning is thorough and well-structured.
- **The recommendation is actionable, not just 'consider a unit test' — it explains what project to use or what the unit test would look like**: 5/5 — The agent provides concrete, copy-pasteable unit test code with two test methods (IsReadOnlyPreventsTextSet and IsReadOnlyToggledBackAllowsText), specifies the target project ('Controls.Core.UnitTests'), and even recommends deleting the UI test entirely. It also suggests additional edge cases. This is highly actionable.

## Baseline Agent Output

No, a UI test is not the right primary test for this change.

The fix is in a cross-platform property setter (Entry.cs), which is core logic — not platform-specific rendering or user interaction. This should be covered by a unit test in Controls.Core.UnitTests, not an Appium-based UI test.

Why the UI test is wrong here:

Concern	Unit Test	UI Test (current)
Tests property setter directly	✅	❌ (indirectly via UI)
Runs in `maui-pr` CI	✅	❌ (needs `maui-pr-uitests`)
Execution speed	Fast	Slow (device/emulator)
Failure diagnosis	Clear	Flaky, opaque

What the unit test should look like:

[Fact]
public void IsReadOnlyPreventsSettingText()
{
    var entry = new Entry();
    entry.IsReadOnly = true;
    entry.Text = "should not set";
    Assert.Equal("", entry.Text);
}

This directly validates the property setter behavior in Controls.Core.UnitTests.csproj.

Verdict:

Replace the UI test with a unit test. A UI test would only be justified if you also need to verify that the platform handlers respect the change (e.g., native keyboard is suppressed) — but even then, it would be supplementary, not the sole test.


## With-Skill Agent Output (Isolated)

PR Test Evaluation Report

Test files evaluated: 1
Fix files: 1 (Entry.cs — cross-platform property setter)

Overall Verdict

❌ Tests are insufficient

The fix is a cross-platform property setter change — pure logic testable with a unit test. A UI test is the heaviest possible option and doesn't adequately cover the behavior.

1. Fix Coverage — ⚠️

The test sets IsReadOnly via a button tap and checks text is empty, but it doesn't verify that programmatic text input is actually blocked. It only checks initial state after setting the flag — it never attempts to set Text programmatically and asserts it was rejected.

2. Edge Cases & Gaps — ❌

Covered: Setting IsReadOnly = true on an empty entry

Missing:

Setting IsReadOnly = true when Text already has a value — is existing text preserved?
Setting Text programmatically after IsReadOnly = true — is it rejected?
Toggling IsReadOnly back to false — does input work again?
IsReadOnly = true then IsReadOnly = true (idempotency)
Null text + IsReadOnly

3. Test Type Appropriateness — ❌

Current: UI Test (Appium, _IssuesUITest)
Recommendation: Unit test. The fix changes a property setter in cross-platform Entry.cs. This is pure logic — no native rendering, no platform APIs, no visual layout. A unit test in Controls.Core.UnitTests can directly set IsReadOnly and Text, then assert the behavior:

[Fact]
public void IsReadOnlyPreventsTextInput()
{
    var entry = new Entry();
    entry.IsReadOnly = true;
    entry.Text = "should be blocked";
    Assert.Equal("", entry.Text);
}

4. Convention Compliance — ✅

UI test conventions are followed (naming, base class, WaitForElement before interaction, [Category] attribute).

5. Flakiness Risk — ⚠️ Medium

App.Tap("SetReadOnlyButton") has no wait — if the button isn't immediately available after the entry, this could flake. Also no WaitForElement after the tap before reading text.

6. Duplicate Coverage — ✅ No duplicates

No similar existing IsReadOnly + text-input test identified from context.

7. Platform Scope — ✅

Cross-platform fix, UI test runs on all platforms. A unit test would also cover all platforms.

8. Assertion Quality — ⚠️

Asserting text == "" only proves the entry starts empty. It doesn't prove IsReadOnly blocked anything. The assertion should verify that a write attempt was rejected.

9. Fix-Test Alignment — ⚠️

Fix targets Entry.IsReadOnly setter's effect on programmatic text input. Test never actually attempts programmatic text input — it only checks the empty state.

Recommendations

Replace with a unit test in Controls.Core.UnitTests — the fix is cross-platform property logic that doesn't need UI, a platform, or Appium
Test the actual behavior: set Text after IsReadOnly = true and assert it's rejected
Add edge cases: pre-existing text preserved, toggle back to false, idempotent calls
If a UI test is kept anyway, add WaitForElement("SetReadOnlyButton") before the tap, and actually type/send text to verify it's blocked


## With-Skill Agent Output (Plugin)

No repo checkout available, but the description gives enough context to evaluate. Here's my assessment:

PR Test Evaluation Report

Test files evaluated: 1 | Fix files: 1 (Entry.cs cross-platform)

Overall Verdict

❌ Tests need improvement — wrong test type and missing edge cases

A cross-platform property setter change should be covered by a unit test, not a UI test. The UI test is ~100x slower and adds flakiness risk for something testable in isolation.

1. Fix Coverage — ⚠️

The test sets IsReadOnly and checks text is empty, but the assertion is weak — text could be empty simply because nothing was typed. It doesn't prove programmatic input was blocked.

2. Edge Cases & Gaps — ❌

Missing:

Setting IsReadOnly = false after true (roundtrip)
Entry already has text → set IsReadOnly = true → verify text is preserved but new input blocked
Setting IsReadOnly multiple times
IsReadOnly interaction with Text property setter

3. Test Type Appropriateness — ❌

Current: UI Test (Appium)
Recommendation: Unit test. The fix is in a cross-platform property setter in Entry.cs — pure logic, no native rendering or platform API involved. A unit test in Controls.Core.UnitTests can directly set IsReadOnly and assert the property behavior without launching an app.

[Fact]
public void IsReadOnlyBlocksProgrammaticInput()
{
    var entry = new Entry();
    entry.IsReadOnly = true;
    entry.Text = "should not change";
    Assert.Equal("", entry.Text);
}

4. Convention Compliance — ✅

Naming, base class, WaitForElement before Tap, [Category] — all correct for a UI test.

5. Flakiness Risk — ✅ Low

Simple tap-and-check, no screenshots or timing issues.

6. Duplicate Coverage — ⚠️ Unknown

Cannot check without repo access. Likely existing IsReadOnly tests in unit test projects.

7. Platform Scope — ✅

Cross-platform fix, UI test runs on all platforms.

8. Assertion Quality — ⚠️

Is.EqualTo("") checks empty text but doesn't prove why it's empty. A stronger test would set text before/after toggling IsReadOnly.

9. Fix-Test Alignment — ⚠️

Test exercises the fix indirectly through the full UI stack. A unit test would target the exact setter.

Recommendations

Replace with a unit test in Controls.Core.UnitTests — the fix is cross-platform property logic, no UI needed
Add edge cases: roundtrip (true → false), pre-existing text preservation, repeated toggles
Strengthen assertions: set Text programmatically after IsReadOnly = true and verify it's rejected, rather than checking for empty text


</details>

<details>
<summary>📊 evaluate-pr-tests / weak-assertion-detection-meaningless-test-assertions</summary>

# Judge Report: Weak assertion detection - meaningless test assertions

## Baseline Judge
Overall Score: 5/5
Reasoning: The agent delivered a thorough, accurate, and well-structured review without needing any tool calls — the task was purely analytical and the agent correctly recognized it didn't need to explore a codebase. It identified both core problems (tautological assert, redundant null check), explained *why* each is a problem at a technical level, provided concrete improvement suggestions with code, and gave a clear overall verdict. The response is concise, actionable, and would genuinely help a PR author improve their tests.

- **The agent correctly identifies Assert.That(true) as a meaningless assertion that proves nothing about the fix**: 5/5 — The agent clearly identifies `Assert.That(true)` as a 'tautology' that 'always passes regardless of app state.' It further notes that the crash-detection rationale is already covered by `WaitForElement`, which would timeout and fail if the app crashed. This is a thorough and accurate analysis.
- **The agent identifies that checking Is.Not.Null on a UI element is too vague to catch actual regressions**: 5/5 — The agent correctly identifies the null check as 'redundant' since `App.FindElement` already throws if the element doesn't exist. It explains that the element could exist but display wrong content, meaning a regression in displayed data would go undetected. This is well-reasoned.
- **The agent provides concrete examples of what specific assertions SHOULD look like to catch the regression**: 4.7/5 — The agent provides two concrete code examples: (1) checking that `StatusLabel` text equals 'No selection' after clearing, and (2) checking that the label text contains 'Loaded'. Both are realistic, actionable replacements written in the same NUnit/Appium style as the original tests.
- **The overall verdict reflects that the assertions are insufficient, not just a minor concern**: 5/5 — The summary table explicitly marks both tests as '❌ No' for 'Would catch the actual regression?' and the concluding paragraph states clearly that 'A regression where selection silently fails to clear, or the label shows wrong data, would pass green.' The verdict is unambiguous that the assertions are inadequate.

## With-Skill Judge (Isolated)
Overall Score: 5/5
Reasoning: The agent delivered a thorough, well-structured, and accurate evaluation of the two test snippets. It correctly identified every key weakness (meaningless assertion, overly permissive null check), provided concrete remediation code, noted ancillary issues (missing Category attribute, potential for unit tests), and rendered a clear overall verdict. The response was produced efficiently in a single turn with minimal tool usage. The analysis covers assertion quality, edge cases, test type appropriateness, conventions, and flakiness — going beyond the minimum requirements of the rubric.

- **The agent correctly identifies Assert.That(true) as a meaningless assertion that proves nothing about the fix**: 5/5 — The agent clearly and accurately identifies `Assert.That(true)` as a 'no-op assertion' that 'always passes,' explicitly stating it only proves the app didn't crash and not that the selection was actually cleared. This is precise and correct.
- **The agent identifies that checking Is.Not.Null on a UI element is too vague to catch actual regressions**: 5/5 — The agent correctly identifies that `Is.Not.Null` is redundant because `App.FindElement` typically throws if the element isn't found, and further notes it doesn't verify the CollectionView loaded data, just that a label exists. Both points are valid and well-articulated.
- **The agent provides concrete examples of what specific assertions SHOULD look like to catch the regression**: 5/5 — The agent provides two concrete code examples: one asserting on label text after clearing selection (`Assert.That(label.GetText(), Is.EqualTo("No selection"))`), and another asserting on meaningful content (`Assert.That(elem.GetText(), Does.Contain("Loaded"))`). Both are realistic, actionable suggestions using proper NUnit syntax.
- **The overall verdict reflects that the assertions are insufficient, not just a minor concern**: 5/5 — The agent opens with a clear 'No' verdict, calls them 'weak assertions that would pass even if the underlying feature is broken,' provides a summary table of gaps and their impacts, and concludes they are 'smoke tests at best.' The tone and substance clearly convey these are significant deficiencies, not minor concerns.

## With-Skill Judge (Plugin)
Overall Score: 5/5
Reasoning: The agent delivered an excellent, well-structured review with zero tool calls needed (this was a pure reasoning/code-review task). It correctly identified both assertion weaknesses, explained why they fail to catch regressions, provided concrete fix examples, and presented a clear summary. The response was efficient (single turn, 19s), well-organized with headers and a summary table, and followed a sound general principle. The only minor quibble is that the suggested replacement text values are necessarily hypothetical, but the agent appropriately frames them as illustrative examples.

- **The agent correctly identifies Assert.That(true) as a meaningless assertion that proves nothing about the fix**: 5/5 — The agent explicitly calls out `Assert.That(true)` as 'a no-op — it always passes' and clearly states it 'verifies nothing about selection state.' It also correctly notes that while a crash would be caught by the preceding `App.Tap`/`App.WaitForElement` calls throwing, the assertion itself is meaningless. This is a thorough and accurate identification.
- **The agent identifies that checking Is.Not.Null on a UI element is too vague to catch actual regressions**: 5/5 — The agent clearly explains that `App.FindElement` typically throws if the element isn't found (making the null check redundant), and even if it returns null, it only proves the label exists — not that the CollectionView loaded data correctly. This is a precise and well-reasoned critique.
- **The agent provides concrete examples of what specific assertions SHOULD look like to catch the regression**: 4.7/5 — For both tests, the agent provides concrete code examples. For Test 1, it suggests asserting on `StatusLabel` text equaling 'None'. For Test 2, it suggests asserting on the label's text content with a specific expected value like 'Loaded 10 items'. Both are practical, realistic suggestions that demonstrate how to assert on observable output rather than mere existence.
- **The overall verdict reflects that the assertions are insufficient, not just a minor concern**: 5/5 — The agent opens with a clear 'No, both assertions are too weak to catch regressions' and closes with a strong summary: 'Both tests are crash-only smoke tests. Neither validates the actual behavior under test... A regression that silently produces wrong state would pass both tests.' This is an unambiguous verdict that the assertions are fundamentally insufficient.

## Baseline Agent Output

No, both assertions are essentially no-ops. Here's why:

Test 1 – SelectionClearsOnNull

Assert.That(true) is a tautology — it can never fail.
WaitForElement already throws on crash/timeout, so the assertion adds nothing.

Missing: Verify the selection was actually cleared, e.g., check a label or item count:

var status = App.FindElement("StatusLabel").GetText();
Assert.That(status, Is.EqualTo("No selection"));

Test 2 – CollectionViewLoads

App.FindElement throws if the element isn't found, so Is.Not.Null never triggers — the test would fail with an exception before the assert runs.

Missing: Assert on the element's content, not just existence:

var text = App.FindElement("StatusLabel").GetText();
Assert.That(text, Is.Not.Empty);

Summary: Neither test validates observable behavior. They pass even if the feature is completely broken (as long as the app doesn't crash). Good UI tests should assert on state changes — selected item text, label content, visual properties — not just element existence.


## With-Skill Agent Output (Isolated)

PR Test Evaluation Report

Test files evaluated: 2 tests (provided inline)

Overall Verdict

❌ Tests are insufficient

Both tests have assertions too weak to catch regressions. A passing test proves almost nothing about the behavior under test.

1. Fix Coverage — ❌

SelectionClearsOnNull: The name implies it verifies selection is cleared when set to null, but Assert.That(true) never checks selection state. If the fix regressed and selection wasn't cleared (but didn't crash), this test still passes.

CollectionViewLoads: Only checks an element exists, not that loading succeeded with correct state.

2. Edge Cases & Gaps — ❌

Missing:

Verify selection state before clearing (precondition)
Verify selection state after clearing (the actual behavior)
Double-clear (clear when already null)
What StatusLabel text should be after load

3. Test Type Appropriateness — ⚠️

Both are UI tests. SelectionClearsOnNull exercises user interaction, so UI test is defensible — but the selection-clearing logic could likely be a unit test on CollectionView.SelectedItem. CollectionViewLoads could be a device test.

4. Convention Compliance — ⚠️

CollectionViewLoads is missing a [Category()] attribute.

5. Flakiness Risk — ✅ Low

WaitForElement is used before interactions — good.

6. Assertion Quality — ❌

Test	Assertion	Problem
`SelectionClearsOnNull`	`Assert.That(true)`	Always passes — meaningless
`CollectionViewLoads`	`Is.Not.Null`	Too permissive — any element satisfies it

Recommendations

Replace Assert.That(true) with a real assertion:

App.Tap("ClearSelectionButton");
var status = App.FindElement("SelectionStatus").GetText();
Assert.That(status, Is.EqualTo("None"));

Assert on label content, not just existence:

Assert.That(elem.GetText(), Is.EqualTo("Loaded"));

Add [Category] to CollectionViewLoads.
Consider whether a unit test on SelectedItem = null would be a lighter way to cover the selection-clearing logic.


## With-Skill Agent Output (Plugin)

No, both assertions are weak and would not catch regressions:

Test 1: `SelectionClearsOnNull`

Assert.That(true) is a no-op — it always passes. If the app crashes, WaitForElement already throws before the assert is reached, making the assertion redundant. More critically, it never verifies that selection was actually cleared. A regression where the selection remains would pass silently.

Fix: Assert on observable state after clearing:

App.Tap("ClearSelectionButton");
var label = App.FindElement("StatusLabel");
Assert.That(label.GetText(), Is.EqualTo("Selection: None"));

Test 2: `CollectionViewLoads`

Is.Not.Null only checks the element exists, not that it has correct content. If the label renders with wrong text (e.g., an error message), this still passes.

Fix: Assert on the expected text:

var elem = App.FindElement("StatusLabel");
Assert.That(elem.GetText(), Is.EqualTo("Loaded: 10 items"));

Summary

Test	Problem	Risk
`SelectionClearsOnNull`	Always-true assert	Silent regression — selection may not clear
`CollectionViewLoads`	Only checks existence	Wrong content goes undetected

General rule: Assert on the observable side-effect of the action, not just that the app didn't crash or that an element exists.


</details>

[📋 View full workflow run](https://github.com/dotnet/maui/actions/runs/24453083479)

- Remove bootstrap pull_request trigger (no longer needed) - Fix comment: symlinks → copies in test directory preparation - Use github.paginate() for comment upsert to handle 100+ comments - Add learn-from-pr.agent.md to README-AI.md agent lists - Fix metrics: correct agent/skill counts (3 agents, 15 skills)

PureWeen · 2026-04-15T18:27:00Z

🔍 Multi-Model Code Review — PR #34976 (Final Round)

Reviewed by: 3 independent AI reviewers with adversarial consensus
CI Status: ⚠️ maui-pr skipping (paths filter — expected for .github/-only changes), license/cla ✅ passed
Review round: 3 (final) — reviewing commit 14f3681 which addresses N1–N4 from round 2

Note: Commit 14f3681 exists on GitHub but appears not yet pushed to the PR branch (HEAD is still 48feb962). Review is based on the commit content fetched directly.

Previous Finding Status (All Rounds)

#	Finding	Severity	Round 1	Round 2	Round 3
F1	Shell injection via `skills_path` in evaluate job	🔴 CRITICAL	❌	✅ FIXED	✅ Confirmed
F2	Mutable nightly binary, no checksum	🟡 MODERATE	❌	❌ Deferred	➖ Accepted risk
F3	`issue_comment` trigger noise	🟡 MODERATE	❌	❌ Deferred	➖ GitHub limitation
F4	PowerShell here-string injection	🟡 MODERATE	❌	✅ FIXED	✅ Confirmed
F5	Code-fence injection in PR comment	🟡 MODERATE	❌	✅ FIXED	✅ Confirmed
F6	Old agent files not deleted	🟡 MODERATE	❌	✅ FIXED	✅ Confirmed
F7	Shell injection via heredoc in static-check	🟢 MINOR	❌	✅ FIXED	✅ Confirmed
F8	Direct `${{ }}` for event properties	🟢 MINOR	❌	⚠️ Partial	➖ Remaining are safe values
F9	Heredoc delimiter "EOF" spoofable	🟢 MINOR	❌	✅ FIXED	✅ Confirmed
F10	Missing pipe-escaping for table values	🟢 MINOR	❌	⚠️ Partial	✅ FIXED (N2)
F11	Misleading `GITHUB_TOKEN` env var name	🟢 MINOR	❌	✅ FIXED	✅ Confirmed
F12	`cancel-in-progress` wastes LLM runs	🟢 MINOR	❌	❌ Deferred	➖ Intentional tradeoff
F13	Token rotation ineffective	🟢 MINOR	❌	✅ FIXED	✅ Confirmed
F14	Icon logic comment mismatch	🟢 MINOR	❌	✅ FIXED	✅ Confirmed
F15	No test harness for JS logic	🟢 MINOR	❌	❌ Deferred	➖ Follow-up work
N1	Raw judge markdown unsanitized	🟡 MODERATE	—	❌	✅ FIXED
N2	`verdict.skillName` unescaped in summary	🟢 MINOR	—	❌	✅ FIXED
N3	`eval_all` directly interpolated in PS	🟢 MINOR	—	❌	✅ FIXED
N4	`PR_NUMBER` directly interpolated	🟢 MINOR	—	❌	✅ FIXED

Score: 14/15 actionable findings resolved. All 4 deferred items have documented justifications.

New Finding This Round

🟢 MINOR

N6. Empty verdicts produce false-pass in comment job (2/3 reviewers)

File: .github/workflows/skill-validation.yml — comment job, JS script

The evaluate job correctly handles zero verdicts via jq:

ALL_PASSED=$(jq 'if .verdicts | length == 0 then false else ...' "$RESULTS_JSON")

But the comment job re-reads results.json from artifacts and re-computes pass/fail with different logic:

let evalPassed = true;  // initialized true
for (const v of data.verdicts) {
  if (!v.passed) evalPassed = false;  // only flipped on failure
}

If verdicts is empty, evalPassed stays true, producing an incorrect ✅ commit status. The jq result from the evaluate job is not propagated to the comment job. In practice this is unlikely (skill-validator always produces verdicts when evaluation runs), but the logic is inconsistent.

Fix: Add an empty-verdicts guard:

if (data.verdicts && data.verdicts.length > 0) {
  allVerdicts.push(...data.verdicts);
  for (const v of data.verdicts) {
    if (!v.passed) evalPassed = false;
  }
} else {
  evalPassed = false; // no verdicts = not passed
}

Discarded Findings (no consensus)

Finding	Flagged by	Why discarded
Unquoted `$SKILLS_ARG` enables shell injection	1/3	2/3 disagreed: bash does not re-parse shell metacharacters (`;`, `\|`, `$()`) during variable expansion — only word splitting occurs. Not exploitable.

💡 Suggestion (from prior round, still applicable)

Add pr_number input to workflow_dispatch — The existing copilot-evaluate-tests.md gh-aw workflow in this repo already has this pattern. Adding it here would enable testing the workflow against specific PRs before merge, and manual re-validation without pushing commits.

Recommendation

✅ Approve — All critical and moderate security findings are resolved. The one new minor finding (N6: empty-verdicts false-pass) is an edge case unlikely to occur in practice. The 4 deferred items have sound justifications. The workflow has been tested end-to-end (successful run with 4 skill evaluations). Ship it.

Note: Commit 14f3681 needs to be pushed to the PR branch for the diff to be visible in the PR.

- When any evaluated skill fails, add a copy-paste prompt for AI-assisted investigation (matching dotnet/skills pattern) that tells users how to download artifacts and follow InvestigatingResults.md - Update pipeline link at bottom of PR comment to match dotnet/skills style - Write step summary with full results and investigation prompt

- Fix #1 (CRITICAL): Shell injection in evaluate job - use env var indirection for skills_path and rename GITHUB_TOKEN to COPILOT_TOKEN - Fix #4: PowerShell here-string injection - use env:SKILL_DIRS - Fix #5: Code-fence injection - escape triple backticks in output - Fix #6: Delete old agent .md files (keep .agent.md versions) - Fix #7: Shell injection in static-check heredoc - use env var - Fix #8: Direct expression interpolation - use env: block for event props - Fix #9: Heredoc EOF spoofing - use random delimiter - Fix #10: Missing pipe-escaping for skillName/scenarioName - Fix #11: Renamed Copilot token env var to COPILOT_TOKEN - Fix #13: Token rotation entropy - mix in github.run_id - Fix #14: Fix misleading inline comment for icon logic

JanKrivanek · 2026-04-15T19:38:25Z

Thanks for the thorough multi-model review! Here's my response to each point. Fixes are in commit 48feb96.

🔴 CRITICAL

1. Shell injection in evaluate job can exfiltrate Copilot token
${{ matrix.entry.skills_path }} originates from directory names in the PR

Fixed. Moved skills_path into an env: block (SKILLS_PATH) and quoted it in the run: block. Also renamed the env var from GITHUB_TOKEN to COPILOT_TOKEN (addresses #11 as well) — the value is then exported as GITHUB_TOKEN inside the step so skill-validator can find it, but the ${{ }} interpolation surface is eliminated.

🟡 MODERATE

2. Supply chain risk — mutable nightly binary with no checksum

Acknowledged, deferring. This is intentionally a nightly tag — the same pattern used in dotnet/runtime, dotnet/msbuild, and github/awesome-copilot. Pinning a checksum would require bumping it manually on every nightly release, which defeats the purpose. The binary comes from dotnet/skills, a Microsoft-maintained repo. Happy to revisit if/when skill-validator publishes versioned releases with checksums.

3. issue_comment trigger fires on every comment in the repo

Acknowledged — known GitHub Actions limitation. There's no content filter available for issue_comment. The slash-gate job exits immediately (no runner allocation beyond the ~2s check) for non-matching comments, and the concurrency group ensures they don't pile up. This is the same pattern used in dotnet/skills. Documenting as a known tradeoff.

4. PowerShell here-string expression injection
@" ${{ steps.changed.outputs.skill_dirs }} "@ expands $() subexpressions

Fixed. Replaced with $env:SKILL_DIRS via an env: block. No more double-quoted here-string interpolation.

5. Code-fence injection via validator output in PR comment

Fixed. Added staticOutput.replace(/ ``` /g, ' ` `')` to break any triple-backtick sequences in the validator output.

6. Old agent files not deleted — duplicates after merge

Fixed. Deleted learn-from-pr.md, sandbox-agent.md, and write-tests-agent.md (keeping the .agent.md versions).

🟢 MINOR

7. Shell injection via heredoc in static-check
done <<< "${{ needs.pr-gate.outputs.changed_skills }}"

Fixed. Moved to env: CHANGED_SKILLS: block, now uses done <<< "$CHANGED_SKILLS".

8. Direct ${{ }} interpolation of event properties in shell

Fixed. Moved head.repo.full_name, base.repo.full_name, user.login, comment.user.login, comment.id, and pull_request.number into env: blocks across pr-gate, slash-gate, and discover-eval jobs.

9. Heredoc delimiter "EOF" can be spoofed

Fixed. Both heredoc outputs now use DELIM="EOF_$(openssl rand -hex 8)" for random delimiters.

10. Missing pipe-escaping for table cell values

Fixed. Added replace(/\|/g, '\\|') for both verdict.skillName and sc.scenarioName.

11. Misleading GITHUB_TOKEN env var name for Copilot token

Fixed (combined with #1). The env var is now COPILOT_TOKEN in the env: block, then exported as GITHUB_TOKEN inside the shell step specifically for skill-validator consumption.

12. cancel-in-progress: true wastes expensive LLM evaluation runs

Acknowledged. Valid concern. For now keeping the simple model — a push during evaluation means the PR content changed, so the in-progress eval is stale anyway. If we find this wasteful in practice, we can split into separate concurrency groups for static vs. LLM evaluation.

13. Token rotation ineffective for single-skill PRs

Fixed. Token index now mixes github.run_id with strategy.job-index: IDX=$(( (JOB_INDEX + RUN_ID) % ${#TOKENS[@]} )). Different workflow runs now select different tokens even for single-skill PRs.

14. Per-scenario icon logic >= 0 contradicts inline comment > 0

Fixed. Updated the comment to say improvement >= 0 means not regressed. The code is intentional — zero improvement (neutral) is not a regression and should show ✅.

15. No test harness for JS comment-synthesis logic

Acknowledged. Fair point. The JS logic is complex enough to warrant tests, but it's tightly coupled to the GitHub Actions runtime. Will consider extracting into a testable module in a follow-up.

ℹ️ Discarded Findings

Agree with discarding all three — cosmetic trailing newline, /evaluate-skills matching scope (contributor-only), and $SKILLS_ARG which is self-sanitized by the --skills prefix.

JanKrivanek · 2026-04-15T19:38:30Z

Thanks for the thorough multi-model review! Here's my response to each point. Fixes are in commit 48feb96.

🔴 CRITICAL

1. Shell injection in evaluate job can exfiltrate Copilot token
${{ matrix.entry.skills_path }} originates from directory names in the PR

Fixed. Moved skills_path into an env: block (SKILLS_PATH) and quoted it in the run: block. Also renamed the env var from GITHUB_TOKEN to COPILOT_TOKEN (addresses #11 as well) — the value is then exported as GITHUB_TOKEN inside the step so skill-validator can find it, but the ${{ }} interpolation surface is eliminated.

🟡 MODERATE

2. Supply chain risk — mutable nightly binary with no checksum

Acknowledged, deferring. This is intentionally a nightly tag — the same pattern used in dotnet/runtime, dotnet/msbuild, and github/awesome-copilot. Pinning a checksum would require bumping it manually on every nightly release, which defeats the purpose. The binary comes from dotnet/skills, a Microsoft-maintained repo. Happy to revisit if/when skill-validator publishes versioned releases with checksums.

3. issue_comment trigger fires on every comment in the repo

Acknowledged — known GitHub Actions limitation. There's no content filter available for issue_comment. The slash-gate job exits immediately (no runner allocation beyond the ~2s check) for non-matching comments, and the concurrency group ensures they don't pile up. This is the same pattern used in dotnet/skills. Documenting as a known tradeoff.

4. PowerShell here-string expression injection
@" ${{ steps.changed.outputs.skill_dirs }} "@ expands $() subexpressions

Fixed. Replaced with $env:SKILL_DIRS via an env: block. No more double-quoted here-string interpolation.

5. Code-fence injection via validator output in PR comment

Fixed. Added staticOutput.replace(/ ``` /g, ' ` `')` to break any triple-backtick sequences in the validator output.

6. Old agent files not deleted — duplicates after merge

Fixed. Deleted learn-from-pr.md, sandbox-agent.md, and write-tests-agent.md (keeping the .agent.md versions).

🟢 MINOR

7. Shell injection via heredoc in static-check
done <<< "${{ needs.pr-gate.outputs.changed_skills }}"

Fixed. Moved to env: CHANGED_SKILLS: block, now uses done <<< "$CHANGED_SKILLS".

8. Direct ${{ }} interpolation of event properties in shell

Fixed. Moved head.repo.full_name, base.repo.full_name, user.login, comment.user.login, comment.id, and pull_request.number into env: blocks across pr-gate, slash-gate, and discover-eval jobs.

9. Heredoc delimiter "EOF" can be spoofed

Fixed. Both heredoc outputs now use DELIM="EOF_$(openssl rand -hex 8)" for random delimiters.

10. Missing pipe-escaping for table cell values

Fixed. Added replace(/\|/g, '\\|') for both verdict.skillName and sc.scenarioName.

11. Misleading GITHUB_TOKEN env var name for Copilot token

Fixed (combined with #1). The env var is now COPILOT_TOKEN in the env: block, then exported as GITHUB_TOKEN inside the shell step specifically for skill-validator consumption.

12. cancel-in-progress: true wastes expensive LLM evaluation runs

Acknowledged. Valid concern. For now keeping the simple model — a push during evaluation means the PR content changed, so the in-progress eval is stale anyway. If we find this wasteful in practice, we can split into separate concurrency groups for static vs. LLM evaluation.

13. Token rotation ineffective for single-skill PRs

Fixed. Token index now mixes github.run_id with strategy.job-index: IDX=$(( (JOB_INDEX + RUN_ID) % ${#TOKENS[@]} )). Different workflow runs now select different tokens even for single-skill PRs.

14. Per-scenario icon logic >= 0 contradicts inline comment > 0

Fixed. Updated the comment to say improvement >= 0 means not regressed. The code is intentional — zero improvement (neutral) is not a regression and should show ✅.

15. No test harness for JS comment-synthesis logic

Acknowledged. Fair point. The JS logic is complex enough to warrant tests, but it's tightly coupled to the GitHub Actions runtime. Will consider extracting into a testable module in a follow-up.

ℹ️ Discarded Findings

Agree with discarding all three — cosmetic trailing newline, /evaluate-skills matching scope (contributor-only), and $SKILLS_ARG which is self-sanitized by the --skills prefix.

…scape skillName in summary, use env vars for eval_all and PR_NUMBER

JanKrivanek · 2026-04-17T08:30:27Z

Addressed the remaining open review items in commit 14f3681. Summary:

#	Description	Severity	Fixed?	Action / Justification
F2	Mutable nightly binary, no checksum	🟡 MODERATE	No	Accepted risk — same pattern as dotnet/runtime, dotnet/msbuild, github/awesome-copilot. Binary comes from Microsoft-maintained dotnet/skills. Pinning a checksum would require manual bumps on every nightly release. Revisit when versioned releases with checksums exist.
F3	`issue_comment` trigger fires on every comment	🟡 MODERATE	No	GitHub Actions limitation — no content/path filter available for `issue_comment`. The `slash-gate` job exits in ~2s for non-matching comments. No fix possible without platform changes.
F12	`cancel-in-progress: true` wastes LLM runs	🟢 MINOR	No	Intentional tradeoff — a push during evaluation means PR content changed, so the in-flight eval is stale. Can split concurrency groups in a follow-up if wasteful in practice.
F15	No test harness for JS comment-synthesis logic	🟢 MINOR	No	~350 lines of JS tightly coupled to GitHub Actions runtime. Acknowledged as valid; extracting into a testable module is follow-up work.
N1	Raw LLM judge report markdown embedded without sanitization	🟡 MODERATE	Yes	Added `.replace(/` ``` `/g, '` ` `').replace(/</details>/gi, '</details>')` to `mdContent` before pushing to `lines` — same treatment as the already-fixed `staticOutput` (F5).
N2	`verdict.skillName` unescaped in per-skill verdict summary	🟢 MINOR	Yes	Created `safeSkillNameSummary` with `.replace(/\|/g, '\\|')` — completes the partial F10 fix that only covered table rows.
N3	`${{ steps.changed.outputs.eval_all }}` directly interpolated in PowerShell	🟢 MINOR	Yes	Added `EVAL_ALL:` to `env:` block, changed script to `$env:EVAL_ALL` — consistency with env-var pattern used everywhere else.
N4	`PR_NUMBER=${{ github.event.issue.number }}` directly interpolated in shell	🟢 MINOR	Yes	Added `ISSUE_NUMBER:` to `env:` block, changed script to `PR_NUMBER="$ISSUE_NUMBER"` — consistency with adjacent steps.

cc @PureWeen

kubaflo · 2026-04-20T16:20:32Z

🔍 Multimodal Code Review — PR #34976

Reviewed by 3 independent AI models: Claude Opus 4.6, Claude Sonnet 4.5, GPT-5.4

Each model reviewed the full diff independently, then findings were cross-referenced and deduplicated. Findings marked with model agreement indicators.

🔴 Issues (2 findings)

1. `issue_comment` trigger fires on every comment in the repo — `report-status` burns runners

Flagged by: Opus 🔴, GPT 🟡 | File: skill-validation.yml

The issue_comment trigger has no path filter — every comment on every issue/PR spawns a workflow run. While slash-gate correctly filters to /evaluate-skills on PRs only, the report-status job uses if: always() && !cancelled() which means it spins up a runner even when all gate jobs skip (i.e., on every non-matching comment). For a repo as active as dotnet/maui, this is significant CI noise.

Remediation: Add proper gating to report-status:

report-status:
  if: >-
    always() && !cancelled() && (
      needs.pr-gate.result == 'success' ||
      needs.slash-gate.result == 'success'
    )

2. PR comment inlines untrusted eval output as markdown

Flagged by: GPT 🔴 | File: skill-validation.yml ~lines 940-1030

The comment job posts skillName, scenarioName, reason, and full per-scenario markdown reports (mdContent) from results artifacts directly into a bot-authored PR comment. For fork PRs triggered via /evaluate-skills, crafted eval.yaml/SKILL.md content could flow through the LLM and appear in a privileged bot comment. While GitHub sanitizes HTML, this still permits mention spam (@user), misleading links, and PR-authored content escaping the "artifact only" boundary.

Remediation: Wrap detailed reports in code blocks (```text) rather than raw markdown, and sanitize @ mentions in skillName/scenarioName/reason fields before rendering.

🟡 Concerns (9 findings)

3. JS template literals interpolate outputs directly in `report-status`

Flagged by: Opus | File: skill-validation.yml ~line 1177

const headSha = '${{ needs.pr-gate.outputs.head_sha }}';

These values are safe today (hex SHA, owner/repo strings), but if outputs were ever changed to carry user-controlled text, this becomes a script injection. Defense-in-depth: use env: indirection for all ${{ }} interpolations in actions/github-script blocks.

4. `/evaluate-skills` on fork PR sends crafted content to LLM with Copilot token

Flagged by: Opus | File: skill-validation.yml

A contributor posting /evaluate-skills on a malicious fork PR sends crafted eval.yaml/SKILL.md to the LLM. Risk is limited (LLM evaluation, not code execution) but worth documenting explicitly in the workflow header comments.

5. Concurrency race between `pull_request_target` and `/evaluate-skills`

Flagged by: Opus | File: skill-validation.yml ~lines 40-52

A push creates group skill-validation-pr-N, while /evaluate-skills creates skill-validation-eval-N. Both can run concurrently and both write to the same PR comment + commit status. Whichever finishes last wins. Could produce briefly stale status.

6. `/evaluate-skills` with no skill changes silently skips evaluation

Flagged by: Opus | File: skill-validation.yml

If someone posts /evaluate-skills on a PR with no skill changes, eval_all is false and SKILL_DIRS is empty → no entries → eval silently skips. The user might expect it to evaluate all skills. Consider falling back to "all skills with tests" for the slash command path.

7. Comment overstates validated scope

Flagged by: GPT | File: skill-validation.yml ~lines 316-321

Static validation scopes to changed skills only for PRs, but the summary counts skills by find .github/skills ... | wc -l (all skills in repo). A PR changing 1 skill shows "Skills checked: 15". Persist the actual validated count from SKILLS_ARG instead.

8. Missing Copilot secrets = hard workflow failure

Flagged by: GPT | File: skill-validation.yml ~lines 545-569

Contributor PRs automatically enter the eval path, but token selection hard-fails if none of the three secrets exist. Preflight secret availability; if absent, downgrade to "static-only / eval skipped" with a clear comment.

9. Duplicated validator download+cache block

Flagged by: Opus 🟡, Sonnet 🟡, GPT 🟢 | File: skill-validation.yml

The download+cache block (~30 lines) is copy-pasted identically in static-check and evaluate. Extract to a composite action or reusable workflow to avoid drift.

10. `plugin.json` missing `agents` path

Flagged by: Opus 🟡, Sonnet 🟡 | File: .github/plugin.json

Only declares "skills": ["./skills/"] but the workflow also validates agents. If the validator uses plugin.json for agent discovery, agents won't be found. Confirm schema and add "agents": ["./agents/"] if supported.

11. Date-based cache key + `restore-keys` fallback can serve stale binaries

Flagged by: Opus 🟡, Sonnet 🟡 | File: skill-validation.yml

Key skill-validator-linux-x64-YYYY-MM-DD with fallback skill-validator-linux-x64-. If the nightly release ships a breaking schema change, stale cache from days ago causes confusing failures. Consider pinning to a release version in the cache key.

🟢 Suggestions (4 findings)

12. Add `.github/plugin.json` to path filters

Flagged by: GPT | File: skill-validation.yml ~line 30

PRs that only change plugin.json won't trigger the workflow. Add it to the pull_request_target.paths list.

13. Add `pr_number` input for `workflow_dispatch`

Flagged by: Sonnet | File: skill-validation.yml

Currently workflow_dispatch validates the default branch. Adding a pr_number input enables manual re-validation of specific PRs.

14. Hardcoded model names

Flagged by: Sonnet | File: skill-validation.yml ~lines 598-599

--model claude-opus-4.6 --judge-model claude-opus-4.6 are hardcoded. Use env vars for easier updates and per-trigger cost tiers.

15. CI cost for full eval is very high

Flagged by: GPT | File: skill-validation.yml ~lines 597-604

When the workflow file itself changes, ALL skills with tests get evaluated with claude-opus-4.6 × 3 runs × aggressive parallelism. Consider a tiered policy: cheaper defaults for PRs, full eval via /evaluate-skills.

✅ What's Done Well (consensus across all 3 models)

Security model is exemplary — sparse-checkout + persist-credentials: false + no code execution is the right pattern for pull_request_target
Token rotation — round-robin across 3 Copilot tokens using (job-index + run_id) % count is clever
Comment upsert — HTML marker + paginated search + update-or-create handles the comment lifecycle well
Empty matrix handling — fromJson('[]') + has_entries guard correctly prevents the evaluate job from running
LLM eval gating — properly restricted to contributors with write+ access
README-AI.md — all 3 agent renames are consistent; metrics counts are accurate
Investigation UX — copy-paste prompt for AI agent debugging when evals fail is a nice touch

Summary

Severity	Count	Key Items
🔴 Issues	2	Runner waste on every comment; untrusted markdown in bot comment
🟡 Concerns	9	JS injection surface, concurrency race, overstated counts, missing secret handling, duplicated code, stale cache
🟢 Suggestions	4	Path filter gap, workflow_dispatch input, hardcoded models, cost tiers
✅ Strengths	7	Security model, token rotation, comment UX, edge case handling

Overall: Well-architected workflow with solid security fundamentals. The two 🔴 items (runner waste + untrusted markdown rendering) are worth addressing before merge; the rest are hardening suggestions for follow-up.

This review was generated by running 3 independent AI models (Claude Opus 4.6, Claude Sonnet 4.5, GPT-5.4) against the full PR diff, then cross-referencing and deduplicating findings.

… empty-verdicts false-pass, add plugin.json to path filters

JanKrivanek · 2026-04-21T09:49:21Z

Responding to #34976 (comment)

Fixes in commit da4acdb.

#	Description	Severity	Fixed?	Reasoning
1	`report-status` burns runners on every comment	🔴	Yes	Valid. Added gate `needs.pr-gate.result == 'success' \|\| needs.slash-gate.result == 'success'` so the job only runs when a gate actually succeeded — no runner allocated for random comments.
2	PR comment inlines untrusted eval output as markdown	🔴	No	We intentionally post markdown for readability. Wrapping in code blocks would be unreadable. `/evaluate-skills` is gated to contributors with write+ access — non-contributors cannot trigger eval, so the "fork PR" attack path doesn't apply.
3	JS template literals interpolate outputs in `report-status`	🟡	No	Values are hex SHAs and `owner/repo` strings — not user-controlled. Defense-in-depth for values that provably can't be malicious is an artificial nit.
4	`/evaluate-skills` on fork PR sends crafted content to LLM	🟡	No	Only contributors with write+ access can trigger. They're making a conscious choice. Non-contributor eval is not a supported scenario.
5	Concurrency race between `pull_request_target` and `/evaluate-skills`	🟡	No	Accepted. The race would be rare and the consequence is a briefly stale status — harmless.
6	`/evaluate-skills` with no skill changes silently skips	🟡	No	Correct behavior. The command evaluates changed skills. No changes = nothing to evaluate. Not a bug.
7	Comment overstates validated scope	🟡	No	The count shows skills/agents available in the repo, which is accurate context. Changing it would add complexity for marginal value.
8	Missing Copilot secrets = hard workflow failure	🟡	No	Correct behavior. Failing loudly when required secrets are missing is better than silently skipping. The eval path is already gated to contributors who should have secrets configured.
9	Duplicated validator download+cache block	🟡	No	Valid observation, but refactoring to a composite action is follow-up work. Not a bug.
10	`plugin.json` missing `agents` path	🟡	No	Agents are passed via CLI (`--agents .github/agents`), not discovered from `plugin.json`. The workflow handles this correctly.
11	Date-based cache key serves stale binaries	🟡	No	Intentional design. Cache refreshes daily. If a breaking change ships, it's caught the next day. The fallback is a feature.
12	Add `plugin.json` to path filters	🟢	Yes	Valid. Added `.github/plugin.json` to `pull_request_target.paths` so workflow triggers when it changes.
13	Add `pr_number` input for `workflow_dispatch`	🟢	No	Nice-to-have for later. Not needed for merge.
14	Hardcoded model names	🟢	No	Keeping it simple. Env vars add indirection for no current benefit.
15	CI cost for full eval	🟢	No	Cost optimization suggestion, not a code issue.
N6	Empty verdicts produce false-pass	🟢	Yes	Valid defensive fix. Added `data.verdicts.length > 0` check and `else { evalPassed = false }` to match the jq logic in the `evaluate` job.

cc @PureWeen

kubaflo · 2026-04-21T10:02:08Z

🟢 .NET MAUI Code Review — Approved

👋 @JanKrivanek — new code review results are available. Please review the latest session below.

🟢 Review Session — Approved — da4acdb · Add skills validator for repo skills / agents · 2026-04-21 10:02 UTC

Code Review — PR #34976

Independent Assessment

What this changes: Adds a 1,174-line skill-validation.yml GitHub Actions workflow with two modes: (1) static validation via skill-validator check on every PR touching .github/skills/ or .github/agents/, and (2) LLM evaluation via skill-validator evaluate for contributor PRs or /evaluate-skills slash commands. Also renames 3 agent files from .md to .agent.md convention, adds a plugin.json manifest, and updates README-AI.md metrics.

Inferred motivation: As the repo's skill/agent surface grows (now 15 skills, 3 agents), automated validation prevents regressions in skill definitions. The static checks catch structural issues; LLM evaluation catches behavioral regressions by running eval scenarios.

Reconciliation with PR Narrative

Author claims: Integrates skill-validator for automated validation, supports both static and LLM-based evaluation, follows the same patterns used by dotnet/msbuild and dotnet/runtime.

Agreement: The code matches these claims. The security model is well-documented in the workflow comments (lines 1–21). All 8 review findings from JanKrivanek's prior review are resolved.

Findings

💡 Suggestion — Consider documenting the COPILOT_TOKEN → GITHUB_TOKEN re-export

skill-validation.yml:662 exports COPILOT_TOKEN as GITHUB_TOKEN so skill-validator evaluate can access the Copilot API. This is intentional and gated to contributor PRs only, but the re-export pattern means every subprocess in the evaluate step inherits a Copilot-capable token. A brief inline comment explaining why the rename is needed (skill-validator reads GITHUB_TOKEN for Copilot inference) would help future maintainers understand the security boundary.

💡 Suggestion — Metrics count in README-AI.md may drift

README-AI.md:369 now lists the exact skill and agent counts ("15 skills", "3 agents"). These will need manual updates whenever skills are added/removed. Consider replacing with a note like "See .github/skills/ directory" or adding a CI check that validates the count — the skill-validation workflow itself could emit this.

Devil's Advocate

"Could a fork PR inject malicious skill content that skill-validator check executes?" — No. The sparse checkout only fetches .github/skills and .github/agents (markdown/YAML). The validator binary is from a trusted first-party source (dotnet/skills releases). No PR code is compiled or executed. persist-credentials: false prevents token leakage via git.
"Is pull_request_target safe here?" — Yes. The workflow YAML always comes from the base branch (not the PR), so fork PRs cannot modify the workflow logic. Static checks run for all PRs (low privilege, contents: read only). LLM evaluation is gated to contributors (is_contributor == 'true') or explicit /evaluate-skills from a write+ collaborator.
"Could the nightly binary download be supply-chain attacked?" — Low risk. The binary is from github.com/dotnet/skills/releases (first-party dotnet org). Cached daily to limit exposure. JanKrivanek acknowledged this as a known tradeoff (standard pattern across dotnet repos) that will be revisited when stable releases exist.
"What about the issue_comment trigger volume?" — Unlike the deleted triage workflow (which fired on ALL comments), the slash-gate job's if condition (startsWith(github.event.comment.body, '/evaluate-skills')) causes GitHub to skip the job entirely without allocating a runner. The concurrency group uses unique keys for non-matching comments to avoid unwanted cancellation. Minimal cost.

Verdict: LGTM

Confidence: high
Summary: Well-structured workflow with a sound security model: pull_request_target with sparse checkout, contributor-gated LLM evaluation, proper concurrency groups, token rotation, and paginated comment upserts. All prior review findings are resolved. The two suggestions above are cosmetic — no correctness or security issues found.

Copilot AI review requested due to automatic review settings April 15, 2026 11:53

Copilot started reviewing on behalf of JanKrivanek April 15, 2026 11:54 View session

Add bootstrap pull_request trigger for initial PR validation

bc6f438

The pull_request trigger allows the workflow to run from the PR branch before it exists on main. Once merged, pull_request_target takes over and the pull_request trigger can be removed.

Copilot AI reviewed Apr 15, 2026

View reviewed changes

JanKrivanek mentioned this pull request Apr 15, 2026

Add skills validator for repo skills / agents #34960

Closed

JanKrivanek added 2 commits April 15, 2026 21:18

JanKrivanek enabled auto-merge April 16, 2026 13:19

PureWeen mentioned this pull request Apr 16, 2026

Add daily PR review queue workflow with actionability detection #34818

Merged

Address remaining code review items N1-N4: sanitize judge markdown, e…

14f3681

…scape skillName in summary, use env vars for eval_all and PR_NUMBER

This was referenced Apr 17, 2026

[PR Review Queue] 2026-04-17 #35004

Closed

[PR Review Queue] 2026-04-20 #35039

Closed

Address review round 3: gate report-status to avoid runner waste, fix…

da4acdb

… empty-verdicts false-pass, add plugin.json to path filters

kubaflo approved these changes Apr 21, 2026

View reviewed changes

PureWeen disabled auto-merge April 21, 2026 16:43

PureWeen merged commit 5dda105 into main Apr 21, 2026
4 of 5 checks passed

PureWeen deleted the dev/jankrivanek/add-skill-validator-v2 branch April 21, 2026 16:43

github-actions Bot added this to the .NET 10 SR7 milestone Apr 21, 2026

github-actions Bot mentioned this pull request Apr 22, 2026

[repo-status] Daily Repo Status - April 22, 2026 #35075

Closed

github-actions Bot mentioned this pull request Apr 23, 2026

[repo-status] 🌟 Daily Repository Status - April 23, 2026 #35100

Closed

github-actions Bot locked and limited conversation to collaborators May 22, 2026

Conversation

JanKrivanek commented Apr 15, 2026

Description of Change

Uh oh!

github-actions Bot commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Copilot's findings

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Apr 15, 2026

🔍 Skill Validation Results

✅ Static Checks Passed

❌ LLM Evaluation Failed

Judge Report: Anti-pattern - never approve via GitHub API

Baseline Judge

With-Skill Judge (Isolated)

With-Skill Judge (Plugin)

Baseline Agent Output

With-Skill Agent Output (Isolated)

With-Skill Agent Output (Plugin)

Judge Report: Happy path - code review PR request

Baseline Judge

With-Skill Judge (Isolated)

With-Skill Judge (Plugin)

Baseline Agent Output

⚠️ Minor: UIScreen.MainScreen.Scale is deprecated

✅ Good: RTL cleanup

✅ Tests

✅ Documentation

Summary

Code Review — PR #34024

Independent Assessment

Reconciliation with PR Narrative

Findings

⚠️ Warning — MauiScrollView.IsParentHandlingSafeArea() is not edge-aware

💡 Suggestion — EqualsAtPixelLevel uses UIScreen.MainScreen.Scale

💡 Suggestion — RTL refactoring mixed with bug fix

💡 Suggestion — Verbose debug logging in test host app

Devil's Advocate

Verdict: NEEDS_DISCUSSION

Code Review: PR #31202 — LineHeight and decorations for HTML Label

PR Scope

✅ The Fix (Label.Mapper.cs) — Correct

✅ iOS Handler (Label.iOS.cs) — Correct

⚠️ Minor Issues

✅ Test Coverage — Good

Verdict: ✅ The functional MAUI changes are correct and well-tested. The fix precisely targets the right condition and the iOS handler is appropriately updated. The unrelated changes bundled in the PR are a review burden but don't affect correctness.

Independent Assessment

Reconciliation with PR Narrative

Findings

💡 Suggestion — Consider guarding for non-default values like MapFont/MapTextColor

💡 Suggestion — Mixed concerns in a single PR

Things Done Well

FlyoutViewHandler.Android.cs — ConnectHandler/DisconnectHandler mismatch

Minor: ShellSectionRenderer.Destroy() — redundant null check

Code Review — PR #32278

Independent Assessment

Reconciliation with PR Narrative

Findings

❌ Error — FlyoutViewHandler.Android.cs: ConnectHandler/DisconnectHandler asymmetry

⚠️ Warning — MauiPageControl.cs: Orphaned #pragma warning restore

💡 Suggestion — ShellSectionRenderer.cs: Redundant null check

💡 Suggestion — Static _registeredViews list has no bounded cleanup

Devil's Advocate

Verdict: NEEDS_CHANGES

Code Review — PR #32278

Independent Assessment

Reconciliation with PR Narrative

Findings

github-actions Bot commented Apr 15, 2026 •

edited

Loading

⚠️ Minor: `UIScreen.MainScreen.Scale` is deprecated

⚠️ Warning — `MauiScrollView.IsParentHandlingSafeArea()` is not edge-aware

💡 Suggestion — `EqualsAtPixelLevel` uses `UIScreen.MainScreen.Scale`

💡 Suggestion — Consider guarding for non-default values like `MapFont`/`MapTextColor`

`FlyoutViewHandler.Android.cs` — ConnectHandler/DisconnectHandler mismatch

Minor: `ShellSectionRenderer.Destroy()` — redundant null check

❌ Error — `FlyoutViewHandler.Android.cs`: ConnectHandler/DisconnectHandler asymmetry

⚠️ Warning — `MauiPageControl.cs`: Orphaned `#pragma warning restore`

💡 Suggestion — `ShellSectionRenderer.cs`: Redundant null check

💡 Suggestion — Static `_registeredViews` list has no bounded cleanup

⚠️ Warning — Orphaned `#pragma warning restore` in MauiPageControl.cs

💡 Suggestion — `_registeredViews` list cleanup only happens during traversal

1. `Thread.Sleep(2000)` — ❌ Anti-pattern

2. Missing `WaitForElement` before `App.Tap("TriggerButton")` — ❌ Race condition

3. `Application.MainPage` navigation in HostApp page — ❌ Deprecated pattern

4. `VerifyScreenshot()` after a sleep — ⚠️ Fragile