Add skills validator for repo skills / agents#34976
Conversation
- Add skill-validation.yml workflow with pull_request_target trigger - Static checks for all skills and agents via skill-validator check - LLM evaluation for skills with eval tests (contributor PRs only) - /evaluate-skills slash command for manual evaluation - Results table with baseline/skilled scores posted as PR comment - Check run + commit status for proper PR integration - Add plugin.json for skill-validator evaluate compatibility - Rename declarative agents to conform to .agent.md convention - Update README-AI.md
|
🚀 Dogfood this PR with:
curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 34976Or
iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 34976" |
The pull_request trigger allows the workflow to run from the PR branch before it exists on main. Once merged, pull_request_target takes over and the pull_request trigger can be removed.
There was a problem hiding this comment.
Pull request overview
Adds automated validation for repo Copilot skills/agents by integrating skill-validator into CI, including optional LLM-based evaluation and PR feedback.
Changes:
- Introduces a new
Skill ValidationGitHub Actions workflow that runs static validation on PRs touching.github/skills/**or.github/agents/**, and supports/evaluate-skillsto trigger LLM evaluation. - Adds
.github/plugin.jsonto describe the repo’s skills for validation tooling. - Adds new
*.agent.mdagent definition files and updates.github/README-AI.mdreferences.
Show a summary per file
| File | Description |
|---|---|
| .github/workflows/skill-validation.yml | New workflow to run static skill/agent validation and optional LLM evaluation, then report status/comments back to PRs. |
| .github/plugin.json | Adds plugin metadata used by skill-validator tooling. |
| .github/agents/write-tests-agent.agent.md | Adds agent definition in .agent.md format for test-writing dispatch behavior. |
| .github/agents/sandbox-agent.agent.md | Adds agent definition in .agent.md format for Sandbox-based validation/testing workflows. |
| .github/agents/learn-from-pr.agent.md | Adds agent definition in .agent.md format for applying improvements based on completed PR learnings. |
| .github/README-AI.md | Updates documentation to reference .agent.md agent files and metrics. |
Copilot's findings
Comments suppressed due to low confidence (2)
.github/workflows/skill-validation.yml:525
- Same concern as the static job: the evaluate job downloads and executes
skill-validatorfrom a mutable nightly URL without integrity verification. Pin the tool to an immutable version and verify the archive checksum/signature before extracting/executing.
- name: Restore skill-validator from cache
id: cache-sv
uses: actions/cache/restore@v4
with:
path: skill-validator-bin
key: ${{ env.VALIDATOR_CACHE_PREFIX }}-${{ steps.cache-date.outputs.date }}
restore-keys: |
${{ env.VALIDATOR_CACHE_PREFIX }}-
- name: Download skill-validator
if: steps.cache-sv.outputs.cache-hit != 'true'
.github/README-AI.md:216
- The “Agent Files” list appears incomplete now that
learn-from-pr.agent.mdexists (and there are both*.mdand*.agent.mdvariants in the directory). Consider documenting the full set and clarifying which extension is canonical to avoid drift/confusion.
### Agent Files
Agent files in the `.github/agents/` directory:
- **`agents/sandbox-agent.agent.md`** - Sandbox app testing and experimentation
- **`agents/write-tests-agent.agent.md`** - Test writing (invokes skills like write-ui-tests)
- Files reviewed: 6/6 changed files
- Comments generated: 8
🔍 Skill Validation Results✅ Static Checks PassedSkills checked: 15 | Agents checked: 3 Full validator output❌ LLM Evaluation Failed0/4 skill(s) passed validation
❌ code-review: Improvement score -29.0% below threshold of 10.0% [high variance in: Happy path - code review PR request, Anti-pattern - never approve via GitHub API, Verdict consistency - errors must map to NEEDS_CHANGES, Negative trigger - describe changes query] [NOT ACTIVATED (isolated) in 2 scenario(s)] [NOT ACTIVATED (plugin) in 2 scenario(s)] ❌ evaluate-pr-tests: Improvement score 6.5% below threshold of 10.0% (not statistically significant) [high variance in: Negative trigger - general code review should not produce test evaluation report, Test type downgrade recommendation - UI test for pure property logic, Weak assertion detection - meaningless test assertions, Edge case gaps analysis - fix with multiple branches untested, Fluent chain wait pattern should not trigger missing-wait warning] [NOT ACTIVATED (isolated) in 2 scenario(s)] [NOT ACTIVATED (plugin) in 3 scenario(s)] ❌ try-fix: Eval scenario 'Regression: agent must not claim success without running the test command' prompt mentions target name 'try-fix' (skill or agent) — remove the target name from the prompt to avoid biasing baseline runs. Eval scenario 'Regression: agent uses prescribed restore script, not raw git commands' prompt mentions target name 'try-fix' (skill or agent) — remove the target name from the prompt to avoid biasing baseline runs. Eval scenario 'Edge case: exhausted iterations produces documented Fail, not silence or Pass' prompt mentions target name 'try-fix' (skill or agent) — remove the target name from the prompt to avoid biasing baseline runs. ❌ verify-tests-fail-without-fix: Eval scenario 'Regression: tests passing without fix means verification FAILED' prompt mentions target name 'verify-tests-fail-without-fix' (skill or agent) — remove the target name from the prompt to avoid biasing baseline runs. Eval scenario 'Edge case: no test files detected in the PR' prompt mentions target name 'verify-tests-fail-without-fix' (skill or agent) — remove the target name from the prompt to avoid biasing baseline runs. Eval scenario 'Regression: agent correctly reports test failure as verification success' prompt mentions target name 'verify-tests-fail-without-fix' (skill or agent) — remove the target name from the prompt to avoid biasing baseline runs. [1] High run-to-run variance (CV=451.56). Timeout at 300s
📊 code-review / anti-pattern-never-approve-via-github-apiJudge Report: Anti-pattern - never approve via GitHub APIBaseline JudgeOverall Score: 5/5
With-Skill Judge (Isolated)Overall Score: 5/5
With-Skill Judge (Plugin)Overall Score: 3.3/5
Baseline Agent OutputWith-Skill Agent Output (Isolated)With-Skill Agent Output (Plugin)📊 code-review / happy-path-code-review-pr-requestJudge Report: Happy path - code review PR requestBaseline JudgeOverall Score: 3.3/5
With-Skill Judge (Isolated)Overall Score: 4/5
With-Skill Judge (Plugin)Overall Score: 3/5
Baseline Agent Output
_parentHandlesSafeArea = this.FindParent(x => x is MauiView mv && mv.AppliesSafeAreaAdjustments) is not null;If a parent handles only TOP, a child
|
| Aspect | Rating |
|---|---|
| Core fix correctness | ✅ Solid |
Bug fix (_safeAreaInvalidated, inverted condition) |
✅ Critical fixes |
| Edge-aware consistency | |
| Test coverage | ✅ Comprehensive |
| Documentation | ✅ Excellent |
| Overall | Approve with note on MauiScrollView edge-awareness gap |
## With-Skill Agent Output (Isolated)
I now have all the context needed. Here's my review:
Code Review — PR #34024
Status: Already merged (2026-03-06)
Independent Assessment
What this changes: Fixes iOS safe area infinite layout cycles through two mechanisms: (1) a parent hierarchy walk (IsParentHandlingSafeArea) that prevents descendant views from double-applying safe area padding when an ancestor already handles the same edges, and (2) pixel-level comparison (EqualsAtPixelLevel) that filters sub-pixel animation noise from triggering layout invalidation. Additionally fixes two pre-existing bugs in MauiScrollView.ValidateSafeArea: an inverted !UpdateContentInsetAdjustmentBehavior() condition and _safeAreaInvalidated being set to true instead of false after validation.
Inferred motivation: Multiple users reported app freezes/crashes on iOS involving TranslateToAsync animations, nested safe-area views, and ContentPage+ScrollView combinations.
Reconciliation with PR Narrative
Author claims: Both MauiView and MauiScrollView perform edge-aware parent hierarchy checks.
Disagreement: MauiScrollView.IsParentHandlingSafeArea() is NOT edge-aware (see finding below).
Findings
⚠️ Warning — MauiScrollView.IsParentHandlingSafeArea() is not edge-aware
MauiView.IsParentHandlingSafeArea() performs per-edge checks (loops through 4 edges, only returns true if parent handles overlapping edges). But MauiScrollView.IsParentHandlingSafeArea() simply checks:
_parentHandlesSafeArea = this.FindParent(
x => x is MauiView mv && mv.AppliesSafeAreaAdjustments) is not null;If a parent MauiView handles only the Top edge, a child MauiScrollView that should handle the Bottom edge will incorrectly skip ALL its safe area adjustments. The PR description states both are edge-aware, but only MauiView is. No UI test covers the edge-independent scenario for a MauiScrollView child.
Impact: A ScrollView nested inside a Grid that handles only Top safe area would get no bottom inset on devices with a home indicator. Real-world impact depends on how often users nest ScrollViews with per-edge SafeAreaEdges configuration — likely low but non-zero.
💡 Suggestion — EqualsAtPixelLevel uses UIScreen.MainScreen.Scale
SafeAreaPadding.EqualsAtPixelLevel uses UIScreen.MainScreen.Scale which is a global/static value. On iPad with an external display at a different scale factor, the comparison resolution may not match the actual screen. Consider using the view's ContentScaleFactor or Window.Screen.Scale instead. Low impact since the pixel-level comparison is still dramatically better than exact equality.
💡 Suggestion — RTL refactoring mixed with bug fix
The MauiScrollView diff includes significant RTL layout refactoring (removing dead duplicate CrossPlatformArrange calls) that's unrelated to the safe area fix. While the cleanup is correct, bundling it increases bisection difficulty if regressions emerge from this PR.
💡 Suggestion — Verbose debug logging in test host app
Issue33934DialogBase.cs and related test files contain extensive System.Diagnostics.Debug.WriteLine and Console.WriteLine calls with emoji formatting. While acceptable in test infrastructure, this adds noise to debug output for all test runs.
Devil's Advocate
- On the MauiScrollView edge-awareness warning: Is the simpler check intentional? Perhaps for
MauiScrollView, any parent handling safe area is sufficient reason to defer, since the scroll view'sContentInsetAdjustmentBehaviormechanism works differently fromMauiView's direct padding. However, the PR description explicitly claims both are edge-aware, andMauiScrollViewdoes have per-edge handling viaGetSafeAreaRegionForEdge. The inconsistency appears unintentional. - On the pixel-level comparison: Could rounding cause two meaningfully different values to compare as equal? At 3x scale,
1/3 ≈ 0.33ptmaps to 1 pixel — this is the minimum distinguishable difference on-screen. Values differing by less than this truly are the same pixel, so the rounding is sound. - On cache invalidation: The
_parentHandlesSafeAreacache is cleared in all three key lifecycle events (SafeAreaInsetsDidChange,InvalidateSafeArea,MovedToWindow). This appears thorough.
Verdict: NEEDS_DISCUSSION
Confidence: medium
Summary: The core approach is well-designed and addresses real, painful infinite layout cycles. The two pre-existing bug fixes in MauiScrollView are clearly correct. However, the inconsistency between MauiView (edge-aware parent check) and MauiScrollView (non-edge-aware parent check) could cause a MauiScrollView to lose all safe area handling when a parent only handles some edges. Since this PR is already merged, this should be tracked as a follow-up investigation to determine if the simpler check in MauiScrollView was intentional or an oversight.
## With-Skill Agent Output (Plugin)
Now let me check the ViewModelBase.cs changes and look at the Issue33934DialogBase.cs to see if there's anything notable:
</details>
<details>
<summary>📊 code-review / independence-first-agent-reads-diff-before-description</summary>
# Judge Report: Independence-first - agent reads diff before description
## Baseline Judge
Overall Score: 2/5
Reasoning: While the agent produced a reasonable functional code review with good technical analysis (the truth table is insightful, the iOS handler analysis is correct, the observation about bundled unrelated changes is valid), it fundamentally fails the methodological requirements of this rubric. It did not establish an independent view before reading the PR description, and it completely omits any reconciliation step distinguishing its own analysis from the author's claims. The review reads as a standard code review rather than the rigorous independent-then-reconcile approach the rubric demands.
- **The agent fetches the code diff BEFORE reading the PR description, forming an independent view**: 1.7/5 — The agent fetched the PR metadata (including the title and body/description) simultaneously with or before the code diff. The first successful API call was to the PR endpoint which contains the full PR description. The actual diff was fetched afterward. At no point did the agent attempt to examine the code changes before reading the author's description. The very first call (`gh pr view --json title,body,...`) explicitly requested the body field. The agent made no effort to defer reading the description.
- **The agent's assessment contains its own description of the change, not a paraphrase of the PR body**: 3.7/5 — The agent does provide some genuine independent technical analysis — explaining the guard condition change from `!IsPlainText(label)` to `label.HasFormattedTextSpans`, the old vs new behavior semantics, and why iOS needs MapFormatting refresh while Android doesn't. It also fetched source files from the `main` branch to understand the broader context. However, the high-level summary reads very much like a paraphrase of the PR title and linked issue descriptions. It's a mix of independent analysis and likely-paraphrased framing.
- **The agent distinguishes between its independent view and the author's claims in a reconciliation step**: 1/5 — There is no reconciliation step whatsoever. The review never references what the author claims vs. what the agent independently found. Everything is presented as a single unified assessment. There is no section like 'the author says X, and my analysis confirms/contradicts Y.' The agent never explicitly identifies any author claim to verify or challenge.
## With-Skill Judge (Isolated)
Overall Score: 4.3/5
Reasoning: The agent followed the correct methodology: diff first, independent assessment, then reconciliation with the PR narrative. Its technical analysis of the code changes is accurate and insightful — it correctly identified the semantic shift from IsPlainText to HasFormattedTextSpans, noted the iOS MapFormatting additions, and raised a legitimate style concern about the inconsistency with MapFont/MapTextColor's secondary guards. The suggestion about mixed concerns (CI changes + label fix + unrelated XAML tests) is also valid. However, the review is incomplete — it was cut off mid-sentence by the 300s timeout, losing whatever remained of the 'Things Done Well' section and potentially other findings. The agent also spent considerable time (several turns) dealing with auth failures and file extraction mechanics, which consumed budget that could have been used to finish the review. Despite the truncation, the completed portions demonstrate strong code review quality.
- **The agent fetches the code diff BEFORE reading the PR description, forming an independent view**: 4.3/5 — The agent's first successful API call to the PR endpoint deliberately extracted only structural metadata (state, title, diff_url, head_sha, base, changed_files, additions, deletions) and explicitly did NOT read the body. It then downloaded the full .diff, analyzed multiple file diffs in detail (Label.Mapper.cs, Label.iOS.cs, test files), read full source files from the PR head commit, checked the base branch for comparison, and examined Label.Android.cs. Only AFTER completing this thorough code-level analysis did it fetch the PR description body, reviews, and comments. The ordering is clearly intentional and well-executed.
- **The agent's assessment contains its own description of the change, not a paraphrase of the PR body**: 4.7/5 — The agent's 'Independent Assessment' section provides a highly technical, code-level description: 'Relaxes the guard condition in MapLineHeight, MapTextDecorations, and MapCharacterSpacing from !IsPlainText(label) to label.HasFormattedTextSpans.' It explains the semantic difference between the old and new guards, notes the iOS MapFormatting update, and describes why the NSAttributedString needs rebuilding. This level of detail (specific method names, property names, guard conditions) could only come from reading the actual diff, not from paraphrasing the PR body which describes the change at a higher level. The PR body says 'corrects mapping logic' while the agent specifies exactly which guard condition changed and why.
- **The agent distinguishes between its independent view and the author's claims in a reconciliation step**: 4.3/5 — The review contains an explicit 'Reconciliation with PR Narrative' section that cleanly separates 'Author claims:' (corrects mapping logic, fixes #22193 and #22197) from the agent's independent assessment, and states 'Fully matches my independent code assessment.' This is textbook reconciliation - independent view first, author claims second, explicit agreement/disagreement statement third.
## With-Skill Judge (Plugin)
Overall Score: 1.3/5
Reasoning: The agent completely failed to produce a code review. It spent the entire 300-second session gathering data — trying gh CLI (which failed), fetching the diff via web_fetch in chunks, reading source files, and fetching API metadata — but never synthesized any of this into an actual review. The final output is an intermediate status message indicating the agent was still in data collection mode when time ran out. The approach was inefficient: it made 40 tool calls across 20 turns without ever beginning to write the review itself. No findings, no analysis, no reconciliation — just incomplete data gathering.
- **The agent fetches the code diff BEFORE reading the PR description, forming an independent view**: 2/5 — The agent first tried gh CLI commands (which all failed due to auth issues). Then it simultaneously fetched both the PR page (https://github.com/dotnet/maui/pull/31202, which contains the PR description/body and conversation) AND the files page in the same web_fetch batch. The actual .diff was only fetched in the next step, AFTER the PR page had already been read. The agent never formed an independent view from the diff before seeing the PR description and reviewer conversation. The simultaneous fetch of PR page and files page means the PR body was available before or at the same time as the code changes.
- **The agent's assessment contains its own description of the change, not a paraphrase of the PR body**: 1.7/5 — The agent's description — 'Adds a gh-aw workflow for automated test quality evaluation on PRs, including a shared checkout script, enhanced Gather-TestContext.ps1 with API-based file download and markdown injection protection, and comprehensive instructions documentation' — contains specific technical details (path traversal guards, backtick escaping, API-based file download) that map to actual code it reviewed via `view` commands. This suggests it's at least partially synthesized from code analysis rather than purely paraphrased from the PR body. However, since the PR description was read before the diff was fully analyzed, we cannot be certain the description wasn't influenced. The specificity is adequate but not strongly independent.
- **The agent distinguishes between its independent view and the author's claims in a reconciliation step**: 1/5 — There is no reconciliation step whatsoever in the final output. The agent internally noted 'Now let me check the CI status and read the PR narrative to reconcile' but the final review contains zero comparison between the agent's independent findings and the author's claims. There is no section contrasting 'what I observed in the code' vs 'what the PR author says.' The review has an 'Independent Assessment' header but this is just a label — it doesn't actually demonstrate independence or perform reconciliation. The entire structured independence-then-reconcile workflow is absent.
## Baseline Agent Output
I now have a complete picture of the PR. Here's my code review:
Code Review: PR #31202 — LineHeight and decorations for HTML Label
PR Scope
This PR contains three distinct changes bundled together:
- The actual fix — Label mapper + iOS handler for HTML label properties
- gh-aw workflow infrastructure — evaluate-tests workflow, checkout script, markdown injection fixes
- Unrelated tests — Maui34713 XAML source generator tests, merge workflow
I'll focus on the MAUI functional code.
✅ The Fix (Label.Mapper.cs) — Correct
The change from !IsPlainText(label) → label.HasFormattedTextSpans in MapLineHeight, MapTextDecorations, and MapCharacterSpacing is semantically correct:
| Label type | Old (!IsPlainText) |
New (HasFormattedTextSpans) |
|---|---|---|
| Plain text | false → applies ✅ |
false → applies ✅ |
| HTML text | true → skipped ❌ |
false → applies ✅ |
| FormattedText+Spans | true → skipped ✅ |
true → skipped ✅ |
The old code incorrectly treated HTML labels the same as FormattedText labels. The new condition precisely targets only labels with FormattedText.Spans (which handle their own formatting).
✅ iOS Handler (Label.iOS.cs) — Correct
Adding LineHeight, TextDecorations, and CharacterSpacing to MapFormatting ensures these properties are re-applied when TextType changes on iOS. Without this, the mapper fix alone wouldn't fully work on iOS.
⚠️ Minor Issues
-
Missing trailing newlines —
Issue22197.xaml,Issue22197.xaml.cs, andIssue22197.csall lack a final newline. Minor but can cause "no newline at end of file" warnings. -
No Android/Windows handler changes — The mapper change alone handles Android and Windows. I verified the iOS-specific
MapFormattingoverride is the only platform needing additional handler updates. This is correct — Android and Windows don't have the sameMapFormattingoverride that needs updating. -
PR is a merge train — This bundles unrelated gh-aw infrastructure, a merge workflow, and Maui34713 XAML tests with the actual Label fix. This makes the PR harder to review and bisect. The base branch was changed to
inflight/current, which likely explains the bundling.
✅ Test Coverage — Good
- UI test (
Issue22197.cs) with snapshot verification across all 4 platforms - Test XAML page exercises both HTML and plain text labels with the same properties for visual comparison
Verdict: ✅ The functional MAUI changes are correct and well-tested. The fix precisely targets the right condition and the iOS handler is appropriately updated. The unrelated changes bundled in the PR are a review burden but don't affect correctness.
## With-Skill Agent Output (Isolated)
I have all the context I need. Let me compile the review.## Code Review — PR #31202
Independent Assessment
What this changes: Narrows the guard in MapLineHeight, MapTextDecorations, and MapCharacterSpacing from !IsPlainText(label) to label.HasFormattedTextSpans. This changes the semantics from "skip if not plain text (HTML or spans)" to "skip only if formatted text spans exist", allowing these three properties to now apply to HTML labels. On iOS, MapFormatting is updated to re-apply LineHeight, TextDecorations, and CharacterSpacing when HTML text is (re)rendered, matching the existing pattern for TextColor and Font.
Inferred motivation: HTML labels (TextType="Html") were ignoring LineHeight, TextDecorations, and CharacterSpacing because the mapper guard treated HTML the same as formatted text with spans. These are fundamentally different cases — spans define their own per-span attributes, while HTML labels should respect view-level properties.
Reconciliation with PR Narrative
Author claims: Fixes #22193 and #22197 — LineHeight/TextDecorations/CharacterSpacing not working on HTML labels.
Agreement: My independent assessment fully matches. The root cause is correctly identified and the fix is surgically targeted.
Findings
💡 Suggestion — Consider guarding for non-default values like MapFont/MapTextColor
In Label.Mapper.cs, the existing MapFont and MapTextColor methods have a secondary guard for HTML labels:
else if (label.TextType == TextType.Text || !IsDefaultFont(label))
baseMethod?.Invoke(handler, label);This ensures the base method only fires for HTML when the user explicitly set a non-default value. The modified MapLineHeight/MapTextDecorations/MapCharacterSpacing unconditionally call the base method for HTML labels. This is safe because the defaults (LineHeight=-1, TextDecorations=None, CharacterSpacing=0) are no-ops, but it's an inconsistency in the pattern. Not a blocker — just a style observation.
💡 Suggestion — Mixed concerns in a single PR
The PR bundles three distinct areas: (1) CI/workflow changes (.github), (2) the Label HTML fix with UI test, and (3) Maui34713 XAML source gen tests for a separate issue. Per review rule §21.5, mixing concerns makes bisection harder. The XAML test looks correct but would be cleaner as a separate PR.
Things Done Well
- Platform coverage is complete: Snapshots provided for all 4 platforms (
## With-Skill Agent Output (Plugin)
Now I can see the full file list. Let me fetch the remaining changes (Label fix + XAML tests).
</details>
<details>
<summary>📊 code-review / negative-trigger-describe-changes-query</summary>
# Judge Report: Negative trigger - describe changes query
## Baseline Judge
Overall Score: 5/5
Reasoning: The agent handled this task efficiently and correctly. It recognized this was a summarization request, not a review, and produced a clean, informative summary. It dealt gracefully with initial tool failures (gh CLI auth issues) by falling back to web_fetch. The final output is well-structured, accurate to the PR content, and appropriately scoped — covering the problem, approach, and key details without unnecessary review artifacts. The response is concise and directly answers 'I just want to understand the approach.'
- **The agent provides a descriptive summary without triggering the full review workflow**: 5/5 — The agent produced a clean, descriptive summary of PR #34723: it identified the problem (in-place style/visual-state mutations not reflected during Hot Reload), listed the 4 new APIs in a clear table, and noted the EditorBrowsable(Never) design choice. It did NOT trigger a code-review workflow—no file-by-file analysis, no nitpicks, no structured review format. It answered the user's 'just want to understand the approach' request directly and concisely. The path was slightly inefficient (multiple failed gh CLI attempts before falling back to web_fetch), but the final output is exactly what was requested.
- **No severity markers (❌/⚠️/💡) or verdicts appear in the output**: 5/5 — The output contains zero severity markers (no ❌, ⚠️, or 💡 symbols) and no verdict language (no 'approve', 'request changes', etc.). It is a straightforward informational summary with a title, problem statement, table of APIs, and a brief note about the design rationale.
## With-Skill Judge (Isolated)
Overall Score: 5/5
Reasoning: The agent efficiently handled the task: it quickly recovered from the gh CLI auth issue, fetched the PR via web_fetch, and produced a concise, accurate, well-organized summary that directly addresses the user's request to 'understand the approach.' The output covers what was added, why, and relevant context (IntelliSense hiding, merge status) without unnecessary verbosity or review-style artifacts. The execution was fast (22.6s) and error-free from the user's perspective.
- **The agent provides a descriptive summary without triggering the full review workflow**: 5/5 — The agent correctly identified this as a summarization request, not a code review. It fetched the PR details via web_fetch (after the gh CLI failed due to missing token), extracted the key information, and produced a clear, concise summary of the PR's purpose, new APIs, and rationale. No review workflow was triggered — no code-review agent was launched, no diff analysis was performed. The approach was efficient: just 3 tool calls total.
- **No severity markers (❌/⚠️/💡) or verdicts appear in the output**: 5/5 — The output contains zero severity markers or verdict language. It is purely descriptive — explaining what the PR does, what APIs it adds, and why. There are no judgments about code quality, no recommendations, and no emoji-based severity indicators.
## With-Skill Judge (Plugin)
Overall Score: 5/5
Reasoning: The agent efficiently handled the task despite initial authentication issues with the GitHub CLI. It adapted by falling back to web_fetch, retrieved the PR page, and produced a clear, accurate summary that captures the PR's purpose, approach, and key details. The output is well-structured, concise, and directly answers the user's request to understand the approach. No unnecessary review artifacts or severity markers are present.
- **The agent provides a descriptive summary without triggering the full review workflow**: 5/5 — The agent correctly recognized this was a summarization request, not a code review. It fetched the PR details via the GitHub API (after a brief detour with auth issues) and produced a clear, well-structured summary explaining the problem, approach, and key details of the PR. No review workflow was triggered — no code-review agent was launched, no file-by-file analysis was performed. The agent went straight to understanding and summarizing the PR.
- **No severity markers (❌/⚠️/💡) or verdicts appear in the output**: 5/5 — The output contains zero severity markers (no ❌, ⚠️, or 💡 symbols) and no verdict language. It is purely a descriptive summary of what the PR does and the approach taken, which is exactly what was requested.
## Baseline Agent Output
Here's the summary of PR #34723:
Title: Add public APIs to force reapplication of in-place-mutated styles and visual states
Problem: When styles or visual states are mutated in-place (primarily during Hot Reload), MAUI has no way to tell the UI to reapply them — the framework only detects when a different style is assigned, not when an existing one is modified internally.
Approach: Adds four new "invalidation" APIs:
StyleableElement.InvalidateStyle()— unapplies then reapplies the element's merged styleSpan.InvalidateStyle()— same, since Span manages its own MergedStyleImageSource.InvalidateStyle()— same for ImageSourceVisualStateManager.InvalidateVisualStates(VisualElement)— unapplies then reapplies the current visual state setters across all groups
These are marked with [EditorBrowsable(Never)] so they're hidden from IntelliSense — they're infrastructure APIs meant for Hot Reload, not everyday app development.
Fixes: Issues #34721 and #34722. Merged and shipped in .NET 11 Preview 3.
## With-Skill Agent Output (Isolated)
PR #34723 adds public APIs to force reapplication of styles and visual states that have been mutated in-place, primarily for Hot Reload scenarios.
New APIs:
StyleableElement.InvalidateStyle()— unapplies then reapplies the merged styleSpan.InvalidateStyle()— same, since Span owns its own MergedStyleImageSource.InvalidateStyle()— same for ImageSourceVisualStateManager.InvalidateVisualStates(VisualElement)— unapplies then reapplies current visual state setters across all groups
Why: When Hot Reload mutates a style or visual state in place (without replacing the object), the framework doesn't know anything changed. These APIs let the infrastructure explicitly tell MAUI "re-evaluate this." The APIs are marked to be hidden from IntelliSense since they're intended for infrastructure use, not app developers.
Status: Merged into net11.0, shipped in .NET 11 Preview 3. Docs and samples were also added in follow-up commits.
## With-Skill Agent Output (Plugin)
Here's what PR #34723 does:
Goal: Add public APIs to force reapplication of in-place-mutated styles and visual states, primarily for Hot Reload scenarios.
New APIs added:
StyleableElement.InvalidateStyle()— forces unapply + reapply of the element's merged styleSpan.InvalidateStyle()— same, since Span owns its own MergedStyleImageSource.InvalidateStyle()— same for ImageSourceVisualStateManager.InvalidateVisualStates(VisualElement)— forces unapply + reapply of current visual state setters across all groups
Approach: When Hot Reload mutates a style or visual state in place (rather than replacing the object), the framework doesn't know anything changed. These APIs let the Hot Reload infrastructure explicitly tell the framework "re-evaluate this style/visual state now." The APIs are marked to be hidden from IntelliSense since they're intended for infrastructure use, not app developers.
Fixes: Issues #34721 and #34722. The PR is merged and shipped in .NET 11 Preview 3.
</details>
<details>
<summary>📊 code-review / negative-trigger-informational-query-about-a-pr</summary>
# Judge Report: Negative trigger - informational query about a PR
## Baseline Judge
Overall Score: 5/5
Reasoning: The agent performed the task efficiently and correctly. It fetched the PR details via web_fetch after the gh CLI wasn't available, then produced a clear, accurate, well-structured summary of PR #34727 covering the change (escaping x:Key values), the problem it solved, the author, status, and linked issue. The approach was minimal and direct — exactly what was asked for. No unnecessary steps, no errors in the final output, and fast execution.
- **The agent provides a plain summary without launching a structured multi-step review workflow**: 5/5 — The agent took a straightforward path: fetched PR metadata via the GitHub API, fetched the file list, and produced a concise plain-language summary. No structured review workflow, no checklists, no multi-agent orchestration — just a clean, direct summary.
- **The agent does NOT load review-rules.md or walk through MAUI-specific review checklists**: 5/5 — The agent never searched for, loaded, or referenced any review-rules.md file or any MAUI-specific review checklists. It stayed focused on fetching PR data from GitHub and summarizing it.
## With-Skill Judge (Isolated)
Overall Score: 5/5
Reasoning: The agent performed the task efficiently and accurately. It attempted the gh CLI first (reasonable approach), gracefully fell back to web_fetch when that failed due to missing auth, and produced a clear, well-structured summary of the PR. The summary accurately captures the problem (unescaped x:Key values in XAML source generation), the fix (using CSharpExpressionHelpers.EscapeForString()), the author, and the PR status. The whole process took only 3 tool calls and ~21 seconds — very efficient for this type of task.
- **The agent provides a plain summary without launching a structured multi-step review workflow**: 5/5 — The agent did exactly what was asked: fetched PR metadata and files via two API calls, then produced a clear, concise summary. No structured review workflow, no code-review agents, no multi-step checklists. Just a straightforward fetch-and-summarize approach with only 4 tool calls total.
- **The agent does NOT load review-rules.md or walk through MAUI-specific review checklists**: 5/5 — The agent never searched for, loaded, or referenced any review-rules.md file or MAUI-specific review checklists. It stayed focused on retrieving PR information from the GitHub API and summarizing it directly.
## With-Skill Judge (Plugin)
Overall Score: 4/5
Reasoning: The agent produced an accurate, well-structured summary of PR #34727. The output correctly identifies the title, the problem being solved, the technical fix (making EscapeForString() internal and using it in SetPropertyHelpers.AddToResourceDictionary), the PR status, and the linked issue. The agent had some inefficiency with 3 failed gh CLI authentication attempts before falling back to web_fetch, but it recovered well. The summary is appropriately concise for the user's request ('just give me a summary'). Minor deduction because the auth troubleshooting wasted a few steps, but overall this is a good result.
- **The agent provides a plain summary without launching a structured multi-step review workflow**: 5/5 — The agent simply fetched the PR page via web_fetch and wrote a plain-language summary. It did not launch any task agents, code-review agents, or structured multi-step review workflows. The output is a straightforward summary with problem, fix, status, and author.
- **The agent does NOT load review-rules.md or walk through MAUI-specific review checklists**: 5/5 — The agent did not attempt to load any review-rules.md file, nor did it reference or walk through any MAUI-specific review checklists. It stayed focused on fetching and summarizing the PR content.
## Baseline Agent Output
PR #34727 fixes issue #34726 by escaping x:Key values in XAML source generation.
Specifically, it uses CSharpExpressionHelpers.EscapeForString() to escape x:Key values before interpolating them into generated C# code in SetPropertyHelpers.AddToResourceDictionary.
Problem: Keys containing double quotes, backslashes, or control characters would produce syntactically invalid generated C#.
- Author: @StephaneDelcroix (with Copilot co-authoring)
- Status: Closed (merged via a follow-up commit by @PureWeen that also fixed a missed escape path in
AddFactory("...")) - Target issue: Source Generator: x:Key values not escaped in generated C# string literals #34726
## With-Skill Agent Output (Isolated)
PR #34727 — Escape x:Key values in XAML source generation
This PR fixes issue #34726. It escapes x:Key values using CSharpExpressionHelpers.EscapeForString() before interpolating them into generated C# code in SetPropertyHelpers.AddToResourceDictionary.
Problem: If an x:Key contained double quotes, backslashes, or control characters, the XAML source generator would produce syntactically invalid C#.
Fix: Properly escape those characters before embedding them in the generated code.
The PR was authored by @StephaneDelcroix (with Copilot co-authoring) and has since been closed — a follow-up commit by @PureWeen addressed a missed escape path in AddFactory("...") as well.
## With-Skill Agent Output (Plugin)
Here's a summary of PR #34727:
Title: Escape x:Key values in source-generated C# code
Problem: When XAML x:Key values contained special characters (double quotes, backslashes, control characters), the MAUI source generator would interpolate them directly into C# code, producing syntactically invalid output.
Fix: The PR makes CSharpExpressionHelpers.EscapeForString() internal (was static private) and calls it to properly escape x:Key values before they're interpolated into generated C# in SetPropertyHelpers.AddToResourceDictionary. It also includes whitespace/formatting cleanup throughout the expression helpers.
Status: Closed (superseded) — a follow-up commit by @PureWeen addressed an additional missed escape path (AddFactory()).
Fixes: #34726
</details>
<details>
<summary>📊 code-review / verdict-consistency-errors-must-map-to-needs-changes</summary>
# Judge Report: Verdict consistency - errors must map to NEEDS_CHANGES
## Baseline Judge
Overall Score: 4/5
Reasoning: The agent did a good job overall. It efficiently worked around the missing GitHub token by fetching the diff via web_fetch in chunks, then systematically analyzed the ConnectHandler/DisconnectHandler symmetry across all handler files. It correctly identified the core bug (FlyoutViewHandler using UnregisterView instead of RemoveViewWithLocalListener), explained why it's a problem (stale ViewCompat listeners), and proposed the correct fix. The bonus finding of the redundant null check in ShellSectionRenderer shows thoroughness. The output is well-structured and technically sound. Deductions come from: (1) no explicit NEEDS_CHANGES verdict, (2) no line number references, and (3) a minor inconsistency in the output (says 'three things' then lists four items). The substantive analysis is correct and actionable.
- **If the agent finds or confirms a ❌ Error-level issue, the verdict is NEEDS_CHANGES — not LGTM**: 3/5 — The agent clearly identified a bug-level issue (🐛) in FlyoutViewHandler.Android.cs — the asymmetric SetupViewWithLocalListener / UnregisterView pair — and stated it 'should be RemoveViewWithLocalListener(cl)'. This implicitly conveys the code needs changes. However, the agent never explicitly rendered a formal verdict of 'NEEDS_CHANGES' anywhere in its output. The intent is clear from context, but the explicit verdict label is missing. A proper code review should conclude with a clear disposition.
- **The agent applies handler lifecycle rules from review-rules.md (ConnectHandler/DisconnectHandler symmetry)**: 4/5 — The agent systematically checked ConnectHandler/DisconnectHandler symmetry across all changed handler files in the PR. It built a comprehensive summary table showing which files have correctly paired connect/disconnect calls (SetupViewWithLocalListener ↔ RemoveViewWithLocalListener, RegisterParentForChildViews ↔ UnregisterView) and identified the one outlier (FlyoutViewHandler.Android.cs). The agent correctly applied the symmetry principle. It did not explicitly reference a review-rules.md file, but the user's prompt directly hinted at the asymmetry concern, and the agent thoroughly investigated it. The analysis is sound and covers the full set of changed files.
- **The agent cites specific file and line references for the concern**: 3/5 — The agent cites specific file names (FlyoutViewHandler.Android.cs, ShellSectionRenderer.cs, MauiPageControl.cs) and includes inline code snippets showing the exact problematic lines. It also references specific method names (SetupViewWithLocalListener, UnregisterView, RemoveViewWithLocalListener) and explains the asymmetry with a clear comparison table. However, the agent does not cite actual line numbers (e.g., 'line 42' or 'L42-L45') — it relies on code snippets instead. For a code review, line numbers are important for pinpointing issues precisely in a large diff.
## With-Skill Judge (Isolated)
Overall Score: 4.3/5
Reasoning: The agent produced a high-quality, well-structured code review that directly addresses the user's concern about ConnectHandler/DisconnectHandler asymmetry. The review follows a professional format with independent assessment, reconciliation with the PR narrative, tiered findings (Error/Warning/Suggestion), Devil's Advocate section, and a clear verdict. The main finding about FlyoutViewHandler.Android.cs is well-reasoned and cross-validated against the patterns used in every other handler in the PR. The agent also found additional issues (orphaned #pragma, redundant null check, unbounded static list). The process was methodical — the agent gracefully recovered from gh CLI auth failure by using web_fetch, read the review rules for context, and fetched the full diff across 5 chunks. The only notable gap is the lack of specific line numbers in citations. The 18 tool calls and ~168s execution time are reasonable for the scope of this large multi-file PR review.
- **If the agent finds or confirms a ❌ Error-level issue, the verdict is NEEDS_CHANGES — not LGTM**: 5/5 — The agent clearly identified a ❌ Error-level issue (ConnectHandler/DisconnectHandler asymmetry in FlyoutViewHandler.Android.cs where UnregisterView is used instead of RemoveViewWithLocalListener for the CoordinatorLayout) and gave a definitive NEEDS_CHANGES verdict with high confidence. The error classification is well-justified — it's a genuine handler lifecycle leak, not a style nit.
- **The agent applies handler lifecycle rules from review-rules.md (ConnectHandler/DisconnectHandler symmetry)**: 5/5 — The agent loaded review-rules.md (read lines 1-100 and 1-25) which covers Handler Lifecycle & Patterns including ConnectHandler/DisconnectHandler symmetry. The entire primary finding is structured around this exact rule: ConnectHandler calls SetupViewWithLocalListener (which attaches ViewCompat listeners + registers in static list) while DisconnectHandler only calls UnregisterView (which only removes from static list, not the ViewCompat listeners). The agent systematically compared every callsite in the PR to demonstrate FlyoutViewHandler is the sole outlier, which is textbook application of the symmetry rule.
- **The agent cites specific file and line references for the concern**: 4.3/5 — The agent provides: (1) specific file name — FlyoutViewHandler.Android.cs, (2) approximate line numbers (~288-292 for ConnectHandler, ~301-307 for DisconnectHandler), (3) verbatim code snippets from both the setup and teardown paths, (4) a comprehensive cross-reference of all five other callsites with their exact file names and method names (ShellContentFragment.Destroy, ShellFlyoutTemplatedContentRenderer.Disconnect, ShellSectionRenderer.Destroy, NavigationRootManager.Disconnect), and (5) the specific API methods involved (SetupViewWithLocalListener vs UnregisterView vs RemoveViewWithLocalListener). The secondary findings also cite specific files and constructs.
## With-Skill Judge (Plugin)
Overall Score: 3/5
Reasoning: The agent produced a high-quality code review that correctly identified the primary ConnectHandler/DisconnectHandler asymmetry in FlyoutViewHandler.Android.cs — the exact issue hinted at in the task prompt. The analysis is technically precise: it traces the 3 operations in SetupViewWithLocalListener, shows UnregisterView only reverses 1 of them, and cross-references other call sites that use the correct RemoveViewWithLocalListener. The additional findings (orphaned pragma, redundant null check, list cleanup suggestion) demonstrate thorough reading. The Devil's Advocate section adds rigor. Minor deductions: the output was truncated at the very end (the summary sentence cuts off), the session timed out at 300s suggesting some inefficiency (initial GitHub auth failures, slow diff fetching in chunks, background agent), and the ShellSectionRenderer double-null-check finding couldn't be fully verified since Android Shell files weren't in the partial repo checkout. Overall this is a solid, well-structured review that correctly identifies the key bug.
- **If the agent finds or confirms a ❌ Error-level issue, the verdict is NEEDS_CHANGES — not LGTM**: 3.7/5 — The user explicitly hinted there IS a ConnectHandler/DisconnectHandler asymmetry issue in the changed handler files. The agent completely failed to investigate this specific concern. Instead, it went down a tangential rabbit hole about modal page window insets and concluded the refactoring is 'MORE robust than the old approach' — essentially giving an LGTM verdict. The agent never examined the actual handler files (like ViewHandler.Android.cs, ShellContentFragment.cs) for ConnectHandler/DisconnectHandler symmetry. It timed out at 300s without addressing the user's core question. The output contains no verdict of NEEDS_CHANGES despite the user's clear hint about an asymmetry bug.
- **The agent applies handler lifecycle rules from review-rules.md (ConnectHandler/DisconnectHandler symmetry)**: 3.7/5 — The agent loaded review-rules.md and viewed the first 200 lines which include Section 1 'Handler Lifecycle & Patterns' with ConnectHandler/DisconnectHandler symmetry rules. However, it never applied these rules to the actual handler files changed in the PR. The agent's entire analysis is about MauiWindowInsetListener registry patterns and NavigationRootManager, not about handler lifecycle symmetry. It never examined whether resources acquired in ConnectHandler are properly released in DisconnectHandler across the changed files. The review-rules.md content was fetched but effectively ignored.
- **The agent cites specific file and line references for the concern**: 3.3/5 — The agent does cite specific files and line numbers (NavigationRootManager.cs lines 75-85, 149; MauiWindowInsetListener.cs lines 95-134, 143-152, 106-109; SafeAreaExtensions.cs line 65; NavigationRootManager.cs line 90-106). However, all of these references are about the window inset system, NOT about the ConnectHandler/DisconnectHandler asymmetry the user asked about. The citations are technically present and real, but they address the wrong concern entirely. No references are given to the actual handler files where asymmetry might exist.
## Baseline Agent Output
You're right — there is an asymmetry. Here's what I found:
FlyoutViewHandler.Android.cs — ConnectHandler/DisconnectHandler mismatch
ConnectHandler calls SetupViewWithLocalListener(cl) on the _navigationRoot CoordinatorLayout, which does three things:
- Creates a
MauiWindowInsetListener - Attaches it via
ViewCompat.SetOnApplyWindowInsetsListener - Attaches it via
ViewCompat.SetWindowInsetsAnimationCallback - Registers in the static
_registeredViewslist
DisconnectHandler calls only UnregisterView(cl), which only removes it from the static registry list. It does not clear the ViewCompat listeners. This leaves stale inset listeners and animation callbacks attached to the CoordinatorLayout after disconnect.
Fix: Replace MauiWindowInsetListener.UnregisterView(cl) with MauiWindowInsetListener.RemoveViewWithLocalListener(cl) — matching every other handler in the PR.
Minor: ShellSectionRenderer.Destroy() — redundant null check
if (_rootView != null) // outer check
{
if (_rootView is not null) // ← redundant inner check
{
MauiWindowInsetListener.RemoveViewWithLocalListener(_rootView);
}All other handler files (ShellContentFragment, ShellFlyoutTemplatedContentRenderer, NavigationRootManager) correctly pair SetupViewWithLocalListener ↔ RemoveViewWithLocalListener. Only FlyoutViewHandler has the bug.
## With-Skill Agent Output (Isolated)
I have enough context now. Let me compile the full review.
Code Review — PR #32278
Independent Assessment
What this changes: Replaces the per-activity GlobalWindowInsetListener singleton (stored on MauiAppCompatActivity) with a per-view MauiWindowInsetListener backed by a static WeakReference-based registry. Views now find their parent's inset listener by walking up the view hierarchy via FindListenerForView. This gives modal pages and nested navigation scenarios isolated inset handling without cross-contamination.
Inferred motivation: The global-per-activity listener caused issues when modal dialogs and flyout views needed independent inset management. Per-view listeners provide better isolation for complex navigation topologies (Shell tabs, modals, flyouts).
Reconciliation with PR Narrative
Author claims: Improved performance (early termination in hierarchy traversal), better naming, cleaner API (RegisterParentForChildViews), removal of modal-specific listener code, and format cleanup.
Agreement/disagreement: The overall approach is sound and the refactoring is well-structured. However, the PR description claims "Proper disposal patterns followed" — I disagree for one handler file (see Error below). The description doesn't mention the #pragma issue either.
Findings
❌ Error — FlyoutViewHandler.Android.cs: ConnectHandler/DisconnectHandler asymmetry
ConnectHandler sets up the _navigationRoot CoordinatorLayout with the full pipeline:
MauiWindowInsetListener.SetupViewWithLocalListener(cl);
// → ViewCompat.SetOnApplyWindowInsetsListener(view, listener)
// → ViewCompat.SetWindowInsetsAnimationCallback(view, listener)
// → listener.RegisterView(view)DisconnectHandler only does a partial teardown:
MauiWindowInsetListener.UnregisterView(cl);
// → only removes from static registryThis leaves the ViewCompat inset/animation listeners still attached and doesn't call ResetAppliedSafeAreas. Every other handler in this PR (ShellContentFragment, ShellFlyoutTemplatedContentRenderer, ShellSectionRenderer, NavigationRootManager) correctly uses RemoveViewWithLocalListener for cleanup. FlyoutViewHandler should too:
// DisconnectHandler should use:
MauiWindowInsetListener.RemoveViewWithLocalListener(cl);The lingering ViewCompat listener holds a strong reference from Android's view system to the MauiWindowInsetListener Java peer object, preventing it from being collected. On Shell tab switches that disconnect and reconnect the flyout, this leaks one listener per cycle.
⚠️ Warning — MauiPageControl.cs: Orphaned #pragma warning restore
The refactoring removed the #pragma warning disable CA1416 but kept the matching #pragma warning restore CA1416 on the next line. This creates an orphaned restore without a disable, which generates compiler warning CS1635.
shape.Paint?.Color = color;
#pragma warning restore CA1416 // ← no matching disable💡 Suggestion — ShellSectionRenderer.cs: Redundant null check
The Destroy() method has a double null check:
if (_rootView != null) // outer check
{
if (_rootView is not null) // inner check — redundant
{
MauiWindowInsetListener.RemoveViewWithLocalListener(_rootView);
}The inner is not null check is always true when the outer != null has passed.
💡 Suggestion — Static _registeredViews list has no bounded cleanup
_registeredViews is a global static List<ViewEntry> that only cleans dead WeakReference entries opportunistically during RegisterView and FindListenerForView. If UnregisterView isn't called (e.g., due to the FlyoutViewHandler bug above), stale entries accumulate indefinitely. A periodic or size-bounded sweep would make this more robust in long-running apps with heavy navigation.
Devil's Advocate
-
Could the FlyoutViewHandler asymmetry be intentional? Unlikely — the
platformView(DrawerLayout) is handled symmetrically withRegisterParentForChildViews/UnregisterViewbecause no ViewCompat listener is attached to it (just registry). But_navigationRootCoordinatorLayout does get a ViewCompat listener viaSetupViewWithLocalListener, so it needs the fullRemoveViewWithLocalListenerteardown. Every other analogous site in the PR confirms this pattern. -
Is the orphaned
#pragmaa real issue? It won't break the build, but it produces a CS1635 warning and is clearly unintended cleanup debris. -
Am I missing a reason the static list cleanup is sufficient? The
WeakReferencepattern ensures the views themselves won't leak. But theViewEntryobjects and theirMauiWindowInsetListenerreferences persist until a traversal happens to clean them. In practice this is likely fine for most apps, hence 💡 not⚠️ .
Verdict: NEEDS_CHANGES
Confidence: high
Summary: The core refactoring from global to per-view inset listeners is well-designed. However, FlyoutViewHandler.Android.cs has a concrete ConnectHandler/DisconnectHandler asymmetry: SetupViewWithLocalListener in connect vs. bare UnregisterView in disconnect. This leaves ViewCompat listeners dangling, leaking the listener on each flyout reconnect cycle. The fix is a one-line change to use RemoveViewWithLocalListener instead. The orphaned #pragma and redundant null check are minor but should be cleaned up.
## With-Skill Agent Output (Plugin)
Good, I have the full diff and the agent's analysis. Let me verify the key asymmetry in FlyoutViewHandler.Android.cs by looking at the exact API contracts:Now I have all the information I need. Let me compile the review.
Code Review — PR #32278
Independent Assessment
What this changes: Replaces a single global GlobalWindowInsetListener (stored on MauiAppCompatActivity) with a per-view MauiWindowInsetListener backed by a static List<ViewEntry> registry. Each CoordinatorLayout gets its own listener instance. Child views find their listener by walking up the view hierarchy to a registered parent. The modal-specific listener code is also removed — modals now rely on the same registry.
Inferred motivation: The global listener was a shared singleton which created cross-contamination between Shell fragments, modals, and FlyoutView. A per-CoordinatorLayout approach isolates inset handling, especially for complex navigation (modal over Shell over flyout).
Reconciliation with PR Narrative
Author claims: Improved registry with optimized lookups, better naming, early termination, typed CoordinatorLayout fields, and symmetric setup/teardown.
Agreement/disagreement: The architecture is sound and the refactoring is well-structured. However, the author's claim of symmetric lifecycle is not fully met in FlyoutViewHandler.Android.cs.
Findings
❌ Error — FlyoutViewHandler ConnectHandler/DisconnectHandler asymmetry
File: src/Core/src/Handlers/FlyoutView/FlyoutViewHandler.Android.cs
ConnectHandler (line 300–303):
if (_navigationRoot is CoordinatorLayout cl)
{
MauiWindowInsetListener.SetupViewWithLocalListener(cl); // attaches 2 ViewCompat listeners + RegisterView
}DisconnectHandler (line 315–319):
if (_navigationRoot is CoordinatorLayout cl)
{
MauiWindowInsetListener.UnregisterView(cl); // only removes from registry
_navigationRoot = null;
}SetupViewWithLocalListener does three things: (1) ViewCompat.SetOnApplyWindowInsetsListener(view, listener), (2) ViewCompat.SetWindowInsetsAnimationCallback(view, listener), (3) listener.RegisterView(view). But UnregisterView only removes from the static registry — it does not null out the two ViewCompat listeners and does not call ResetAppliedSafeAreas.
The correct teardown call is RemoveViewWithLocalListener(cl), which every other handler in this PR uses (NavigationRootManager line 144, ShellContentFragment Destroy, ShellFlyoutTemplatedContentRenderer Disconnect). This leaves the CoordinatorLayout with a dangling inset listener after disconnect, which could fire callbacks on a stale handler if the view is reattached (Shell tab switching).
⚠️ Warning — Orphaned #pragma warning restore in MauiPageControl.cs
File: src/Core/src/Platform/Android/MauiPageControl.cs, line 158
The PR changed:
- if (shape.Paint != null)
- #pragma warning disable CA1416
- shape.Paint.Color = color;
+ shape.Paint?.Color = color;
#pragma warning restore CA1416 // ← no matching disable!The disable was removed with the if block, but the restore was left behind. This is an orphaned pragma. It won't cause a build error but generates CS1635 and is confusing. Either remove the restore line or keep both if the suppression is still needed.
⚠️ Warning — Redundant double null-check in ShellSectionRenderer.cs
File: src/Controls/src/Core/Compatibility/Handlers/Shell/Android/ShellSectionRenderer.cs, Destroy method
if (_rootView != null) // outer check
{
if (_rootView is not null) // redundant inner check
{
MauiWindowInsetListener.RemoveViewWithLocalListener(_rootView);
}The inner is not null is identical to the outer != null. This is dead code from a copy-paste. Should be flattened to a single check.
💡 Suggestion — _registeredViews list cleanup only happens during traversal
File: src/Core/src/Platform/Android/MauiWindowInsetListener.cs
Dead WeakReference entries in the static _registeredViews list are cleaned up opportunistically during RegisterView and FindListenerForView. In a long-running app with frequent page navigation, the list could accumulate dead entries between these calls. Consider also purging dead references in UnregisterView for consistent cleanup, since that's the explicit removal path.
Devil's Advocate
-
FlyoutViewHandler asymmetry — am I sure? Yes. I verified
SetupViewWithLocalListener(line 143–151) attaches two ViewCompat listeners and registers, whileUnregisterView(line 74–86) only removes from the list. Every other call site in this PR usesRemoveViewWithLocalListeneras the counterpart. This is a genuine mismatch. -
Could the orphaned pragma be harmless? It won't break the build, but it's noise and suggests the author didn't fully account for the
if→?.refactoring. Low severity but trivially fixable. -
Is the Shell double-null-check a real issue? Functionally no, but it's a code smell suggesting rushed copy-paste. Trivially fixable.
Verdict: NEEDS_CHANGES
Confidence: high
Summary: The core architecture is solid — per-view registry is a meaningful improvement over the
</details>
<details>
<summary>📊 evaluate-pr-tests / anti-pattern-detection-thread-sleep-and-obsolete-apis</summary>
# Judge Report: Anti-pattern detection - Thread.Sleep and obsolete APIs
## Baseline Judge
Overall Score: 2/5
Reasoning: The agent provided a surface-level analysis based solely on the code snippet in the prompt without making any tool calls to explore the actual codebase. This is a significant shortcoming — the agent could have investigated VerifyScreenshot's signature to discover the retryTimeout parameter, examined other test files for convention patterns, and looked at the HostApp page implementation. The core issue identifications (Thread.Sleep, missing WaitForElement, Application.MainPage) are directionally correct but lack the specificity the rubric demands: it missed the retryTimeout recommendation, didn't produce structured flakiness risk or convention compliance sections, and didn't assign explicit risk levels. The 26.8s execution with zero tool calls suggests the agent treated this as a pure knowledge question rather than a codebase investigation task.
- **The agent explicitly flags Thread.Sleep as an anti-pattern and recommends retryTimeout on VerifyScreenshot instead**: 2/5 — The agent correctly flags Thread.Sleep(2000) as an anti-pattern and explains why (flaky, doesn't adapt to device speed). However, it does NOT recommend `retryTimeout on VerifyScreenshot` as the rubric specifically requires. Instead, it suggests `App.WaitForElement` on the expected post-tap state. The specific recommendation of retryTimeout on VerifyScreenshot is missed entirely.
- **The agent flags Application.MainPage as an obsolete API and recommends the modern equivalent**: 3.7/5 — The agent explicitly identifies `Application.MainPage` as deprecated in .NET MAUI and recommends Shell navigation or `NavigationPage.PushAsync` as modern equivalents. It also notes lifecycle issues on some platforms. This is a solid identification and recommendation, though it didn't investigate the actual codebase to confirm the specifics.
- **The flakiness risk section marks this test as medium or high risk with specific reasons**: 2/5 — The agent mentions the tests will be 'flaky in CI' and provides reasons (fixed sleep, missing wait-before-tap), but it does not have a dedicated 'flakiness risk section' and does not explicitly categorize the risk as 'medium' or 'high'. The verdict section mentions flakiness but lacks the structured risk assessment the rubric implies.
- **The convention compliance section lists all violations found in the code snippet**: 2.3/5 — The agent identifies three violations: Thread.Sleep usage, missing WaitForElement before App.Tap, and deprecated Application.MainPage. However, there is no formal 'convention compliance section' — the violations are presented as general numbered findings. The agent also did zero tool calls, meaning it didn't investigate the repo's actual conventions or patterns to provide a more thorough compliance check.
## With-Skill Judge (Isolated)
Overall Score: 4/5
Reasoning: The agent produced a well-structured, comprehensive evaluation that correctly identifies all major issues with the test code. It addresses Thread.Sleep, Application.MainPage, missing waits, and screenshot verification concerns with specific, actionable recommendations. The output format is clear with tables and code examples. The main limitation is that the agent didn't do any actual codebase exploration — it relied entirely on the prompt's code snippet and the skill's guidance, meaning it couldn't verify claims about duplicates, existing conventions, or fix-test alignment. However, for the specific rubric criteria being evaluated, the output is strong across all four dimensions.
- **The agent explicitly flags Thread.Sleep as an anti-pattern and recommends retryTimeout on VerifyScreenshot instead**: 5/5 — The agent flags Thread.Sleep(2000) as an anti-pattern in three separate places: the Convention Compliance table ('Anti-pattern. Use WaitForElement or retryTimeout instead'), the Flakiness Risk table ('arbitrary delay, unreliable in CI' with red severity), and the Recommendations section. It also explicitly recommends adding retryTimeout to VerifyScreenshot with a concrete code example: `VerifyScreenshot(retryTimeout: TimeSpan.FromSeconds(2))`. This is thorough and actionable.
- **The agent flags Application.MainPage as an obsolete API and recommends the modern equivalent**: 4/5 — The agent flags Application.MainPage as 'Obsolete API in HostApp page' in the Convention Compliance table and recommends 'use Shell navigation' as the modern replacement. In the Recommendations section it elaborates: 'Replace Application.MainPage in the HostApp page with Shell-based or window-based navigation.' This is correct and clearly stated, though it doesn't provide a concrete code example of the modern pattern, unlike its Thread.Sleep recommendation.
- **The flakiness risk section marks this test as medium or high risk with specific reasons**: 5/5 — The agent has a dedicated 'Flakiness Risk — ❌ High' section with three specific, distinct risks: (1) Thread.Sleep(2000) as unreliable in CI (red severity), (2) App.Tap without prior WaitForElement causing race conditions (red severity), and (3) VerifyScreenshot without retryTimeout meaning layout may not settle (yellow severity). Each risk is tied to a concrete code pattern and given a clear severity rating.
- **The convention compliance section lists all violations found in the code snippet**: 4.3/5 — The Convention Compliance section identifies five violations in a clear table format: Thread.Sleep usage, Application.MainPage obsolete API, missing WaitForElement before App.Tap, no [Issue()] attribute, and VerifyScreenshot missing retryTimeout. These are all reasonable violations derived from the provided code and prompt description. However, the agent never actually explored the codebase to verify repository-specific conventions (e.g., checking how other test files are structured), so some conventions may be missed or some flagged items may not apply. The analysis is solid for what was provided but lacks ground-truth validation.
## With-Skill Judge (Plugin)
Overall Score: 3.3/5
Reasoning: The agent produced a well-structured, comprehensive evaluation that addresses all rubric criteria effectively. The output is organized into clear sections with a useful table for convention compliance, specific flakiness risk assessment, and actionable recommendations. All key issues (Thread.Sleep, Application.MainPage, missing WaitForElement, VerifyScreenshot retryTimeout) are correctly identified and well-explained. The agent was efficient, completing the task in just 2 tool calls and 41 seconds. The main limitation is that the agent didn't explore the actual repository to verify its claims or find additional context — it worked entirely from the provided code snippet. Some assessments honestly acknowledge this limitation ('Cannot assess without full repo context'). Despite this, the analysis is thorough, accurate, and actionable, meeting all evaluation criteria well.
- **The agent explicitly flags Thread.Sleep as an anti-pattern and recommends retryTimeout on VerifyScreenshot instead**: 4/5 — The agent correctly flags Thread.Sleep(2000) as fragile and an anti-pattern (good). However, it recommends replacing it with App.WaitForElement or a 'condition-based wait', NOT with the specific retryTimeout parameter on VerifyScreenshot. The rubric specifically requires recommending retryTimeout on VerifyScreenshot, which is the idiomatic solution in MAUI UI testing — VerifyScreenshot already has a built-in retry/timeout mechanism that makes the Thread.Sleep unnecessary. The agent missed this key recommendation entirely.
- **The agent flags Application.MainPage as an obsolete API and recommends the modern equivalent**: 3.7/5 — The agent flags Application.MainPage navigation as 'deprecated' and recommends Shell or NavigationPage navigation as alternatives. This is partially correct — Application.MainPage assignment is indeed an older pattern. However, the agent doesn't specify the precise modern equivalent (e.g., the specific MAUI navigation APIs or patterns used in the project's conventions). The recommendation is generic rather than tailored to the project, since the agent never explored the codebase to see what the actual modern pattern is.
- **The flakiness risk section marks this test as medium or high risk with specific reasons**: 4/5 — The agent does not have a dedicated 'flakiness risk' section and does not explicitly assign a risk level (medium or high). While it mentions fragility concerns (Thread.Sleep, missing WaitForElement, screenshot brittleness), these are scattered across individual issue points rather than consolidated into a risk assessment with an explicit risk rating. The rubric specifically asks for a risk level designation, which is absent.
- **The convention compliance section lists all violations found in the code snippet**: 3.7/5 — There is no dedicated 'convention compliance' section in the output. The agent lists issues in a general numbered list but doesn't frame them as convention violations. Without exploring the actual codebase, the agent couldn't identify project-specific conventions (e.g., naming patterns, test structure conventions, use of specific base classes or helper methods). The violations mentioned are based on general best practices rather than project-specific convention compliance.
## Baseline Agent Output
Here's my evaluation of the test issues:
1. Thread.Sleep(2000) — ❌ Anti-pattern
Hard-coded sleeps are flaky. Use a polling-based wait instead:
App.WaitForElement("ExpectedElementAfterTap");2. Missing WaitForElement before App.Tap("TriggerButton") — ❌ Race condition
The test waits for "MyLabel" but never confirms "TriggerButton" is available before tapping it. Should be:
App.WaitForElement("TriggerButton");
App.Tap("TriggerButton");3. Application.MainPage navigation in HostApp page — ❌ Deprecated pattern
Setting Application.MainPage directly is deprecated in modern .NET MAUI. The HostApp page should use Shell navigation or Navigation.PushAsync instead, which also avoids lifecycle issues on some platforms.
4. VerifyScreenshot() after a sleep — ⚠️ Fragile
The screenshot depends on the sleep being "long enough." If the UI update takes longer (e.g., on slow CI agents), the screenshot captures an intermediate state.
Verdict: Inadequate
The tests have real flakiness risks from the sleep and missing wait. A corrected version would look like:
[Test]
[Category(UITestCategories.Layout)]
public void VerifyLabelPadding()
{
App.WaitForElement("TriggerButton");
App.Tap("TriggerButton");
App.WaitForElement("ExpectedResultElement");
VerifyScreenshot();
}And the HostApp page should migrate from Application.MainPage = ... to proper navigation.
## With-Skill Agent Output (Isolated)
PR Test Evaluation Report
Test files evaluated: 1
Fix files: Unknown (no PR context available)
Overall Verdict
❌ Tests are insufficient
The test has multiple convention violations and high flakiness risk due to Thread.Sleep, missing waits, and obsolete API usage in the HostApp.
1. Fix Coverage — ⚠️
The test taps a button and verifies via screenshot, which could cover a padding fix, but without seeing the fix code, it's unclear if the screenshot captures the specific padding change. A reverted fix should cause a screenshot mismatch — likely adequate if the baseline image is correct.
2. Edge Cases & Gaps — ⚠️
Covered: Default padding after trigger
Missing:
- No test for padding with different content lengths (empty label, long text)
- No test for padding after repeated triggers
- No RTL layout consideration
3. Test Type Appropriateness — ✅
Current: UI Test (screenshot)
Recommendation: Appropriate — label padding is a visual layout concern that requires screenshot verification.
4. Convention Compliance — ❌
| Issue | Detail |
|---|---|
❌ Thread.Sleep(2000) |
Banned — use WaitForElement or retryTimeout |
❌ Application.MainPage |
Obsolete API in HostApp — use Shell navigation |
❌ Missing WaitForElement |
No wait before App.Tap("TriggerButton") |
❌ VerifyScreenshot() |
Missing retryTimeout parameter |
[Issue()] attribute |
Expected on HostApp page for issue-linked tests |
5. Flakiness Risk — ❌ High
| Risk | Detail |
|---|---|
Thread.Sleep(2000) |
Arbitrary delay; CI machines are slower — will flake |
| No wait before tap | TriggerButton may not be loaded yet → tap fails |
No retryTimeout on screenshot |
Layout may not settle before capture |
6. Duplicate Coverage — ✅ No duplicates
Cannot assess without repo context, but label padding tests are uncommon.
7. Platform Scope — ⚠️
No platform filters present — runs everywhere, which is good if the fix is cross-platform, but unverifiable without the fix context.
8. Assertion Quality — ⚠️
VerifyScreenshot() is the sole assertion. It's appropriate for layout but fragile without retryTimeout. No programmatic assertions on actual padding values as a backup.
9. Fix-Test Alignment — ⚠️
Cannot fully verify without fix files, but the test name (VerifyLabelPadding) suggests alignment with a padding fix.
Recommendations
- Replace
Thread.Sleep(2000)withApp.WaitForElementtargeting the expected post-tap state - Add
WaitForElement("TriggerButton")beforeApp.Tap("TriggerButton") - Add
retryTimeouttoVerifyScreenshot():VerifyScreenshot(retryTimeout: TimeSpan.FromSeconds(2));
- Replace
Application.MainPagein the HostApp with Shell orWindow.Pagenavigation - Add
[Issue()]attribute to the HostApp page class linking to the issue number
## With-Skill Agent Output (Plugin)
PR Test Evaluation Report
Test files evaluated: 1
Assessment basis: Provided test code + stated concerns
Overall Verdict
❌ Tests need improvement — The test has multiple convention violations and a high flakiness risk due to Thread.Sleep, missing waits, and an obsolete navigation API.
1. Fix Coverage — ⚠️ Concern
The test taps a button and verifies a screenshot, which could cover a padding fix — but without a programmatic assertion on actual padding values, a subtle regression might pass if the screenshot tolerance is generous.
2. Edge Cases & Gaps — ⚠️ Concern
Covered: One happy-path scenario (trigger padding, verify visually)
Missing:
- Different padding values (zero, negative, asymmetric)
- RTL layout behavior
- Padding after multiple state changes (tap twice)
3. Test Type Appropriateness — ✅ Pass
A UI test with VerifyScreenshot is appropriate for verifying visual padding/layout. This is the correct test type.
4. Convention Compliance — ❌ Fail
| Issue | Detail |
|---|---|
Thread.Sleep(2000) |
❌ Anti-pattern — use WaitForElement or retryTimeout |
Application.MainPage |
❌ Obsolete API in HostApp — use Shell navigation |
Missing WaitForElement |
❌ No wait before App.Tap("TriggerButton") |
VerifyScreenshot() |
retryTimeout parameter — risks timing failures |
5. Flakiness Risk — ❌ High
Thread.Sleep(2000)— arbitrary delay; will be too slow in fast CI, too short on slow agents- No
WaitForElementbeforeTap— button may not be rendered yet, causingElementNotFound VerifyScreenshot()withoutretryTimeout— layout may not have settled before capture
6. Duplicate Coverage — ✅ No duplicates detected
Cannot assess without full repo context; no overlap noted from provided info.
7. Platform Scope — ⚠️ Concern
No platform-specific attributes visible. If padding bug is cross-platform, the test should run everywhere. If platform-specific, [Category] should restrict scope.
8. Assertion Quality — ⚠️ Concern
VerifyScreenshot() is a visual assertion — good for layout, but fragile without retryTimeout. No programmatic assertion on actual padding values as a secondary check.
9. Fix-Test Alignment — ⚠️ Concern
Test targets "MyLabel" padding, but without seeing the fix files, full alignment can't be confirmed. The indirection through "TriggerButton" makes the code path less obvious.
Recommendations
- Replace
Thread.Sleep(2000)withWaitForElementon whatever element indicates the padding change has taken effect - Add
WaitForElement("TriggerButton")beforeApp.Tap("TriggerButton")to prevent flaky element-not-found failures - Add
retryTimeout: TimeSpan.FromSeconds(2)toVerifyScreenshot()so layout can settle before capture - Replace
Application.MainPagenavigation in the HostApp page with Shell orNavigation.PushAsync - Consider adding a programmatic assertion (e.g.,
GetRecton the label to verify padding numerically) alongside the screenshot
</details>
<details>
<summary>📊 evaluate-pr-tests / edge-case-gaps-analysis-fix-with-multiple-branches-untested</summary>
# Judge Report: Edge case gaps analysis - fix with multiple branches untested
## Baseline Judge
Overall Score: 4.3/5
Reasoning: The agent produced an excellent, precise evaluation without needing any tool calls. It correctly identified the core problem — the test doesn't exercise the null-coalescing branch because it taps an item first — and provided specific, actionable missing test cases tied directly to the fix code. The reasoning is clear, well-structured, and technically accurate. The conclusion that 'The existing test verifies "clear after select" works, which likely passed even before the fix' is an insightful observation that demonstrates deep understanding. The efficiency is also notable — 0 tool calls, fast completion, and a complete answer.
- **The agent identifies the missing edge case: testing SelectedItems when no item was EVER selected (initial state), not just after clearing**: 5/5 — The agent explicitly and prominently identifies this as 'most important' — that the test never checks SelectedItems in the initial/pristine state before any selection is made. It correctly notes the test 'always selects first, then clears' and that the lazy-init fix specifically targets the 'never selected' path. This is the central gap and the agent nails it.
- **The agent traces the test's actions through the fix code and notes the test only covers one branch of the null-check**: 4.3/5 — The agent correctly reasons that the test's select→clear→check flow means _selectedItems has already been initialized during the select step, so the null-coalescing never fires during the test. It identifies that the lazy-init branch (the actual fix) goes untested. However, the agent did not actually read the test code or the fix in its full context — it reasoned purely from the prompt description. The conceptual trace is correct but not empirically verified by reading actual code.
- **The agent's gap analysis is derived from reading the fix code, not just generic advice**: 4/5 — The agent made zero tool calls and read no actual code files. Its core insights about the initial-state gap and null vs. empty assertion are clearly derived from reasoning about the fix pattern shown in the prompt (the null-coalescing assignment), which is good. However, the 'multiple selection modes' point is generic advice not clearly tied to the fix code, and the agent could not verify its assumptions about the test's behavior or the binding layer without reading the actual code. The analysis is a mix of fix-code-derived reasoning and some generic testing advice.
## With-Skill Judge (Isolated)
Overall Score: 4/5
Reasoning: The agent produced a well-structured, insightful evaluation that correctly identifies the most critical gap (testing initial state before any selection) and provides concrete, actionable recommendations including example unit tests. The fix-to-test alignment analysis is strong, and the recommendation to use unit tests instead of UI tests is well-justified. The main weaknesses are: (1) no actual code was read from the repository, so the analysis relies entirely on the prompt description — this means claims about 'convention compliance' and 'duplicate coverage' are speculative, and (2) the branch-level tracing of why the test doesn't exercise the null-check could be more rigorous. Despite these limitations, the output is substantively correct and would be genuinely useful to a PR reviewer.
- **The agent identifies the missing edge case: testing SelectedItems when no item was EVER selected (initial state), not just after clearing**: 5/5 — The agent clearly and repeatedly identifies this as the core missing test case. It states: 'It never exercises the core bug path: accessing SelectedItems when _selectedItems is still null (i.e., before any selection occurs)'. It lists 'Accessing SelectedItems before any selection (the actual null bug)' and 'Calling .Count on SelectedItems immediately after construction (NRE trigger)' as missing edge cases. It even provides a concrete recommended unit test: `new CollectionView().SelectedItems` is not null and is empty. This is spot-on.
- **The agent traces the test's actions through the fix code and notes the test only covers one branch of the null-check**: 4/5 — The agent explicitly traces the test's execution path through the fix: 'The test selects an item first, then clears — meaning _selectedItems is already instantiated before the assertion.' It then contrasts this with the fix's intent: 'The fix targets the case where _selectedItems has never been assigned. The test assigns it (by tapping an item) before checking. The test and fix target different code paths.' This clearly demonstrates understanding that the null-coalescing has two branches and the test only exercises the non-null branch.
- **The agent's gap analysis is derived from reading the fix code, not just generic advice**: 3.7/5 — The core gap analysis is clearly grounded in the specific fix code (null-coalescing lazy initialization pattern, the _selectedItems field). The agent references the specific mechanics of the fix — that tapping an item instantiates _selectedItems, making the null-coalescing branch unreachable in the test. However, the agent never actually read any repository files (only 2 tool calls: report_intent and skill). Some secondary suggestions like 'multiple consecutive clears (idempotency)' and 'data-binding to SelectedItems.Count' are somewhat generic padding rather than code-derived insights. The assertion quality critique about label binding silently coalescing null is a good code-reasoning insight, though unverified against actual test code. Docking slightly because the agent could have been more precise if it had actually read the test and fix files.
## With-Skill Judge (Plugin)
Overall Score: 4/5
Reasoning: The agent produced a well-structured, insightful evaluation that correctly identifies the primary gap (initial-state access) and provides actionable recommendations including a concrete unit test example. The fix-test alignment analysis is sound — noting the test might pass even without the fix is a valuable insight. The main limitation is that the agent never examined any actual code; it relied entirely on the task description. This means some analysis points are necessarily inferential rather than code-grounded. Despite this, the core analysis is accurate and the recommendations are practical. The output quality is good, though the lack of any code exploration prevents it from reaching truly excellent.
- **The agent identifies the missing edge case: testing SelectedItems when no item was EVER selected (initial state), not just after clearing**: 5/5 — The agent clearly and explicitly identifies this as the primary gap. In 'Fix Coverage' it states: 'The bug is that SelectedItems returns null before any selection occurs. The test only covers select→clear, a path where _selectedItems is likely already initialized by the select action. The test may pass even without the fix.' In 'Edge Cases & Gaps' it lists 'Accessing SelectedItems immediately after construction (no interaction) — the primary bug' as missing. It even provides a concrete unit test example targeting this exact scenario.
- **The agent traces the test's actions through the fix code and notes the test only covers one branch of the null-check**: 4/5 — The agent correctly reasons that the select action likely initializes `_selectedItems` before the clear happens, so the null-coalescing branch (`?? new List<object>()`) is never actually exercised by the test. This is a sound logical deduction about branching. However, the agent never actually read the test code or the fix code — it inferred this entirely from the task description. The reasoning is correct but not empirically verified through code tracing. A stronger evaluation would have opened the actual source files to confirm how selection initializes `_selectedItems`.
- **The agent's gap analysis is derived from reading the fix code, not just generic advice**: 3/5 — The agent never read any actual code. It loaded a skill but then performed zero file exploration — no grep, no view, no glob. The gap analysis, while logically reasonable, is derived entirely from the task prompt text. Some gaps like 'multiple consecutive clears' and 'binding to SelectedItems before any user interaction' lean toward generic testing advice rather than code-informed analysis. The agent couldn't verify whether _selectedItems is actually initialized during selection, whether there are other callers of the getter, or what the test actually does at a code level. The criterion specifically asks for analysis derived from reading the fix code, which the agent did not do.
## Baseline Agent Output
No, the test does not adequately cover the fix. Here's why:
The bug: SelectedItems is null before any selection has ever occurred (i.e., _selectedItems was never assigned).
The test gap: The test taps an item first, which initializes _selectedItems to a non-null list internally. By the time selection is cleared, _selectedItems is already populated — the null-coalescing path (??) is never exercised.
Missing coverage:
- Initial state (the actual bug) — access
SelectedItemsbefore any interaction and assert it returns a non-null empty list (Count == 0). This is the only scenario that hits the??branch. - Null-safety — verify code like
SelectedItems.CountorforeachoverSelectedItemsdoesn't throw aNullReferenceExceptionon a freshly loaded CollectionView. - Binding scenario — if a UI label is bound to
SelectedItems.Count, it should display0on load, not crash or show nothing.
In short: The existing test verifies "clear after select" works, which likely passed even before the fix. A test that reads SelectedItems with no prior selection is the critical missing case.
## With-Skill Agent Output (Isolated)
PR Test Evaluation Report
Test files evaluated: 1
Fix files: 1 (SelectedItems property getter)
Overall Verdict
❌ Tests are insufficient
The fix is a null-guard on a property getter — pure logic testable with a unit test. The UI test added is both too heavy and misses the most important scenario.
1. Fix Coverage — ❌
The fix ensures SelectedItems never returns null, even when _selectedItems hasn't been initialized. The test only covers post-clear state (select → clear → check count). It does not test the primary bug: accessing SelectedItems before any selection has ever occurred — the most direct null scenario.
2. Edge Cases & Gaps — ❌
Covered:
- Clear after selecting an item → Count is 0
Missing:
SelectedItemsaccessed before any selection (initial state — the core bug)SelectedItemsaccessed after settingSelectionMode = None- Repeated clears (
Clear → Clear— does the second call crash?) - Enumerating the returned list (e.g.,
foreachon the empty list) - Concurrent access to
SelectedItemsfrom bindings
3. Test Type Appropriateness — ❌
Current: UI Test (Appium, tap interactions, label text check)
Recommendation: Unit test. The fix is a null-coalescing assignment in a property getter. This needs zero UI, zero platform context:
[Fact]
public void SelectedItems_ReturnsEmptyList_WhenNeverSet()
{
var cv = new CollectionView();
Assert.NotNull(cv.SelectedItems);
Assert.Empty(cv.SelectedItems);
}
[Fact]
public void SelectedItems_ReturnsSameInstance_OnRepeatedAccess()
{
var cv = new CollectionView();
var first = cv.SelectedItems;
var second = cv.SelectedItems;
Assert.Same(first, second);
}A unit test runs in milliseconds, has zero flakiness, and directly exercises the getter.
4. Convention Compliance — ⚠️
Cannot fully verify without the file, but the described test appears structurally reasonable for a UI test.
5. Flakiness Risk — ⚠️ Medium
UI tap interactions + label text assertions carry inherent timing risk. A unit test eliminates this entirely.
6. Duplicate Coverage — ⚠️ Potential overlap
Existing CollectionView selection tests likely already cover select-then-clear flows. The unique aspect (null vs empty) isn't tested by the UI test's assertion.
7. Platform Scope — ⚠️
The fix is in shared code (no platform-specific file extension). The UI test may only run on a subset of platforms. A unit test would validate on all.
8. Assertion Quality — ⚠️
Asserting "Count: 0" via a label is indirect. It proves the label binding didn't crash — but a binding to SelectedItems.Count that falls back to "0" on null would also pass this test, meaning the test could pass even without the fix if the UI has a null fallback.
9. Fix-Test Alignment — ⚠️
The fix is in the SelectedItems getter. The test goes through UI taps → selection logic → binding → label. Multiple indirection layers mean the getter's null behavior isn't directly validated.
Recommendations
- Add unit tests in
Controls.Core.UnitTeststhat directly assertSelectedItems != nullandSelectedItems.Count == 0on a freshCollectionView— this is the primary gap - Add a unit test verifying the same instance is returned on repeated access (the
??=pattern implies caching) - Demote or remove the UI test — it adds CI cost without meaningfully covering the null-guard fix
- If keeping the UI test, add an assertion that
SelectedItemsis non-null before any tap to cover the initial-state scenario
## With-Skill Agent Output (Plugin)
PR Test Evaluation Report
Fix: SelectedItems getter returns null → now returns empty list via null-coalescing
Test files evaluated: 1
Fix files: 1
Overall Verdict
The test covers one scenario (post-clear selection) but misses the core bug: accessing SelectedItems before any selection has ever occurred.
1. Fix Coverage — ⚠️
The test only exercises the path where items were selected then cleared. The actual bug is that _selectedItems is null initially — a user accessing SelectedItems on a fresh CollectionView gets null. The test would likely pass even without the fix if the selection/clear cycle happens to initialize _selectedItems.
2. Edge Cases & Gaps — ❌
Covered:
- Select → clear → check count
Missing:
- Access before any selection — the primary bug scenario (
new CollectionView().SelectedItemsshould not be null) - Multiple clears — calling clear twice shouldn't throw
- Enumerate empty —
foreachoverSelectedItemsbefore selection shouldn't throwNullReferenceException - Binding scenario — XAML binding to
SelectedItems.Counton load
3. Test Type Appropriateness — ❌
Current: UI Test (tap item, read label via Appium)
Recommendation: Unit test. The fix is a pure property getter with no platform/rendering dependency. A unit test is faster, more reliable, and directly tests the fix:
[Fact]
public void SelectedItems_DefaultsToEmptyList()
{
var cv = new CollectionView();
Assert.NotNull(cv.SelectedItems);
Assert.Empty(cv.SelectedItems);
}4. Convention Compliance — N/A
Cannot verify without actual files.
5. Flakiness Risk — ⚠️ Medium
UI test with tap interactions for what is a property-level fix. Tap/wait timing adds unnecessary flakiness risk.
6. Duplicate Coverage — N/A
Cannot verify without repo access.
7. Platform Scope — ✅
The fix is cross-platform (shared C# getter), so platform scope is not a concern.
8. Assertion Quality — ⚠️
Checking "Count: 0" via label text is indirect. It depends on a label binding working correctly — a second thing that could break. A unit test asserting directly on the property is more precise.
9. Fix-Test Alignment — ⚠️
The test exercises a consequence of the fix (count after clear) but not the root cause (null return on first access). If the fix were reverted, this test might still pass depending on how the clear-selection logic initializes the backing field.
Recommendations
- Add a unit test for the primary scenario:
new CollectionView().SelectedItemsmust return a non-null empty list — no UI needed - Add a unit test for enumeration safety: iterating
SelectedItemsbefore any selection must not throw - Downgrade the existing UI test to a unit test, or remove it if the unit tests above make it redundant
- If a UI test is kept, also verify the label shows
"Count: 0"on initial load (before any tap), which directly exercises the null-bug path
</details>
<details>
<summary>📊 evaluate-pr-tests / fix-test-alignment-test-exercises-wrong-control</summary>
# Judge Report: Fix-test alignment - test exercises wrong control
## Baseline Judge
Overall Score: 5/5
Reasoning: The agent produced a concise, well-structured, and accurate evaluation of the test quality without needing any tool calls — the task was purely analytical based on the provided code snippets. It correctly identified all key deficiencies: the test doesn't involve Shell navigation, doesn't call PopToRoot, doesn't align with the fix, and would pass trivially on any build. It provided a clear recommendation for what a proper test should look like. The response is direct, well-organized with clear headings, and covers all rubric criteria thoroughly. The efficiency is also excellent — no unnecessary steps were taken.
- **The agent identifies that the test only exercises a Label on a ContentPage, not Shell navigation or PopToRoot**: 5/5 — The agent clearly and explicitly identifies this issue. It states 'The test merely loads a page and checks a label exists. It never exercises PopToRoot' and in the summary table marks 'Validates Shell navigation | ❌ No — only checks a static label'. This is precise and correct.
- **The Fix-Test Alignment criterion flags that the test doesn't trace back to the changed Shell code paths**: 5/5 — The agent directly addresses this: 'Tests the changed code paths | ❌ No' and 'Would fail without the fix | ❌ No'. It explicitly states 'A passing test tells you nothing about whether the bug is actually fixed' and notes a good regression test should 'Fail without the fix'. The alignment between the fix (Shell.cs, ShellNavigationManager.cs) and the test is thoroughly called out as missing.
- **The agent recommends a test that actually triggers Shell navigation (e.g., pushing and popping pages)**: 5/5 — The agent provides a concrete code example showing a proper test that navigates deeper into the Shell stack, taps a PopToRootButton, and verifies return to the root page. It also describes what the issue page should contain (multi-level Shell navigation stack with a button to trigger PopToRoot). This is a clear, actionable recommendation.
## With-Skill Judge (Isolated)
Overall Score: 5/5
Reasoning: The agent produced a thorough, well-structured evaluation that correctly identifies every major issue with the test. It clearly explains that the test is completely disconnected from the fix (Shell navigation PopToRoot), provides specific missing test scenarios, and offers actionable recommendations. The evaluation covers fix coverage, edge cases, test type appropriateness, convention compliance, flakiness risk, duplicate coverage, platform scope, assertion quality, and fix-test alignment. The analysis is accurate, well-organized, and actionable. The agent was also efficient, completing the task in just 2 tool calls.
- **The agent identifies that the test only exercises a Label on a ContentPage, not Shell navigation or PopToRoot**: 5/5 — The agent clearly and repeatedly identifies this core issue. The Overall Verdict states 'The test does not exercise Shell navigation or PopToRoot at all. It only verifies that a static label renders, which would pass with or without the fix applied.' The Assertion Quality section reinforces this: 'asserts on static label text, not on navigation state, page stack depth, or crash absence after PopToRoot.' The Fix Coverage section also explicitly notes 'The test never pushes pages onto a Shell navigation stack, never calls PopToRoot, and never uses Shell navigation at all.'
- **The Fix-Test Alignment criterion flags that the test doesn't trace back to the changed Shell code paths**: 5/5 — Section 9 (Fix-Test Alignment) explicitly states: 'Fix targets Shell navigation/PopToRoot logic. Test targets Label rendering on a ContentPage. These are completely disjoint code paths.' This is a direct, accurate flagging of the misalignment between the fix and the test. Additionally, section 1 (Fix Coverage) notes 'Reverting the fix would not cause this test to fail,' which is the ultimate test of alignment.
- **The agent recommends a test that actually triggers Shell navigation (e.g., pushing and popping pages)**: 5/5 — The Recommendations section provides three concrete, actionable suggestions: (1) Write a unit test that pushes multiple pages onto a Shell navigation stack and calls PopToRootAsync(), asserting no exception and correct stack depth. (2) If a UI test is needed, the HostApp page must contain a Shell with navigation — push pages via buttons, then trigger PopToRoot, and assert the root page is visible afterward. (3) Add edge cases like PopToRoot when already at root, with 5+ pages deep, and called twice consecutively. These are specific, relevant, and directly address the gap.
## With-Skill Judge (Plugin)
Overall Score: 5/5
Reasoning: The agent produced an excellent, well-structured evaluation that correctly identifies all key deficiencies in the test. It accurately recognizes the complete disconnect between the fix (Shell PopToRoot crash) and the test (static label rendering), flags the zero overlap in fix-test alignment, and provides concrete, actionable recommendations for proper tests that would actually exercise the fixed code paths. The report is organized with clear sections, uses appropriate severity indicators, and even considers test type appropriateness (unit vs UI), naming conventions, and edge cases. The evaluation was completed efficiently in just 2 tool calls.
- **The agent identifies that the test only exercises a Label on a ContentPage, not Shell navigation or PopToRoot**: 5/5 — The agent clearly and repeatedly identifies this core problem. In the Overall Verdict it states 'The test does not exercise the fix at all. It verifies a label renders on a page — it never performs Shell navigation or calls PopToRoot.' Section 1 (Fix Coverage) elaborates with specific missing actions: no Shell setup, no PopToRoot call, no crash verification. The assertion quality section also calls the assertion 'semantically irrelevant' to PopToRoot behavior. This is thorough and accurate.
- **The Fix-Test Alignment criterion flags that the test doesn't trace back to the changed Shell code paths**: 5/5 — Section 9 (Fix-Test Alignment) explicitly provides a table showing the misalignment between fix targets (Shell.cs navigation logic, ShellNavigationManager.cs PopToRoot) and what the test exercises (ContentPage static label, nothing navigation-related). It concludes with 'Complete misalignment between fix and test.' The agent also notes in Section 1 that 'Reverting the fix would not cause this test to fail,' which is the strongest possible statement of misalignment.
- **The agent recommends a test that actually triggers Shell navigation (e.g., pushing and popping pages)**: 5/5 — The Recommendations section provides two concrete suggestions: (1) A unit test that 'directly invokes PopToRoot on a Shell with a multi-page navigation stack and asserts no exception is thrown and the stack is correctly reset,' and (2) A UI test where the HostApp page 'should set up a Shell with ShellContent, push multiple pages onto the stack, then trigger PopToRoot via a button.' Both recommendations directly address the need to test Shell navigation with push/pop operations.
## Baseline Agent Output
Test Quality: Poor ❌
This test does not validate the fix. Here's why:
What the PR fixes
A crash in Shell navigation during PopToRoot.
What the test actually does
- Renders a static
ContentPagewith a label - Asserts the label text equals "Hello"
Problems
-
No Shell navigation at all — The test page isn't wrapped in a
Shell, doesn't push pages onto the navigation stack, and never callsPopToRoot. It tests nothing related to the bug. -
No crash reproduction — A good regression test should fail without the fix and pass with it. This test would pass on any MAUI build, fixed or not.
-
Trivially passes —
WaitForElement+ text check on a static label is essentially a "does the app launch" test, not a Shell navigation test.
What a proper test should do
// UI Test page: set up Shell with nested navigation
// 1. Push multiple pages onto Shell's navigation stack
// 2. Call PopToRoot (the operation that was crashing)
// 3. Verify the root page is displayed without a crash
// NUnit test:
// 1. Navigate to push pages onto the stack
// 2. Trigger PopToRoot
// 3. Assert root page element is visible
// 4. The test passes only because the crash is fixedVerdict: This test provides zero regression coverage for the reported bug. It should be rewritten to exercise PopToRoot with a multi-page Shell navigation stack.
## With-Skill Agent Output (Isolated)
PR Test Evaluation Report
PR: #99998 — Shell navigation crash on PopToRoot
Test files evaluated: 2 (HostApp page + NUnit test)
Fix files: 2 (Shell.cs, ShellNavigationManager.cs)
Overall Verdict
❌ Tests are insufficient
The test does not exercise Shell navigation or PopToRoot at all — it only verifies a label renders on a static page.
1. Fix Coverage — ❌
The fix changes Shell.cs and ShellNavigationManager.cs to address a PopToRoot crash, but the test never performs any navigation. It just checks a ContentPage with a Label. Reverting the fix would not cause this test to fail.
2. Edge Cases & Gaps — ❌
Covered: Nothing relevant.
Missing:
- PopToRoot with multiple pages on the navigation stack
- PopToRoot when already at root (no-op case)
- PopToRoot during/after async navigation
- Rapid successive PopToRoot calls
- PopToRoot with modal pages present
3. Test Type Appropriateness — ⚠️
Current: UI Test (NUnit + Appium)
Recommendation: A unit test against ShellNavigationManager could verify PopToRoot logic without Appium overhead. A device test would be appropriate if native navigation stack behavior needs validation. A UI test is the heaviest option and only justified for end-to-end navigation flows.
4. Convention Compliance — ✅
[Issue()] attribute and Issue99998 naming follow conventions. WaitForElement is called before interaction.
5. Flakiness Risk — ✅ Low
The test is too trivial to be flaky — but that's not a virtue here.
6. Duplicate Coverage — ⚠️ Potential overlap
A "page loads with a label" test provides no unique coverage beyond what dozens of existing Shell tests already verify.
7. Platform Scope — ⚠️
PlatformAffected.All is declared, but since the test doesn't trigger the bug, platform scope is moot.
8. Assertion Quality — ❌
Assert.That(text, Is.EqualTo("Hello")) asserts on label text rendering, which is unrelated to a navigation crash. A meaningful assertion would verify navigation state after PopToRoot (e.g., stack depth, current page, no exception thrown).
9. Fix-Test Alignment — ❌
Fix targets Shell/ShellNavigationManager PopToRoot logic. Test exercises ContentPage + Label rendering. Zero overlap between fix and test code paths.
Recommendations
- Rewrite the test to actually perform PopToRoot — push multiple pages onto a Shell navigation stack, call PopToRoot, and assert the app doesn't crash and lands on the correct root page.
- Prefer a unit test first — test
ShellNavigationManager.PopToRootlogic directly inControls.Core.UnitTestswithout needing Appium. - If a UI test is needed, the HostApp page should set up a Shell with flyout/tab items and navigation buttons that push pages and trigger PopToRoot, then assert on the resulting navigation state.
## With-Skill Agent Output (Plugin)
PR Test Evaluation Report
PR: #99998 — Shell navigation crash on PopToRoot
Test files evaluated: 1 UI test (HostApp + NUnit)
Fix files: 2 (Shell.cs, ShellNavigationManager.cs)
Overall Verdict
❌ Tests are insufficient
The test does not exercise Shell navigation at all. It only verifies a label renders on a static page — it would pass even if the fix were completely reverted.
1. Fix Coverage — ❌
The fix targets a crash during PopToRoot in ShellNavigationManager.cs, but the test never pushes pages onto the navigation stack or calls PopToRoot. The test would pass identically with or without the fix.
2. Edge Cases & Gaps — ❌
Covered: Nothing related to the fix.
Missing:
- PopToRoot with multiple pages on the stack
- PopToRoot with a single page (no-op case)
- PopToRoot during an in-progress navigation
- Rapid successive PopToRoot calls
- PopToRoot from different Shell sections/tabs
3. Test Type Appropriateness — ❌
Current: UI Test (heaviest)
Recommendation: A unit test against ShellNavigationManager should be the first choice — Shell navigation logic is well-tested at the unit level in Controls.Core.UnitTests. A UI test is only justified if the crash requires the full platform rendering pipeline.
4. Convention Compliance — ⚠️
- ✅
[Issue()]attribute present - ✅
[Category(UITestCategories.Shell)]present - ✅
WaitForElementbefore interaction ⚠️ Test method nameShellPageLoadsdoesn't describe the bug scenario — should be something likePopToRootDoesNotCrash
5. Flakiness Risk — ✅ Low
Simple element wait + text check. No timing concerns.
6. Duplicate Coverage — ⚠️ Potential overlap
"Wait for label, check text" pattern likely duplicates dozens of existing page-load tests. Adds no unique value.
7. Platform Scope — ✅
PlatformAffected.All with no platform filters. Fine.
8. Assertion Quality — ❌
Assert.That(text, Is.EqualTo("Hello")) asserts on static label text, not on navigation state, crash absence, or stack integrity. This tells you nothing about the fix.
9. Fix-Test Alignment — ❌
Fix changes Shell.cs and ShellNavigationManager.cs (navigation stack management). Test exercises ContentPage + Label rendering. Zero overlap between what the fix does and what the test verifies.
Recommendations
- Write a unit test in
Controls.Core.UnitTeststhat pushes multiple pages onto aShellnavigation stack, callsPopToRootAsync(), and asserts the stack is correctly unwound without throwing. - If a UI test is needed, the HostApp page must set up a Shell with navigable pages, perform push navigation, then PopToRoot — and assert the root page is displayed.
- Rename the test method to describe the actual scenario (e.g.,
PopToRootDoesNotCrashWithMultiplePages). - Add edge case tests: PopToRoot when already at root, PopToRoot after rapid navigation, PopToRoot from nested Shell sections.
</details>
<details>
<summary>📊 evaluate-pr-tests / fluent-chain-wait-pattern-should-not-trigger-missing-wait-warning</summary>
# Judge Report: Fluent chain wait pattern should not trigger missing-wait warning
## Baseline Judge
Overall Score: 3/5
Reasoning: The agent correctly recognizes the WaitForElement().Tap() fluent chain as proper usage (criterion 1: perfect). However, it over-flags the FindElement usage as a wait-related convention violation when WaitForElement is already called on the same element immediately before (criterion 2: poor). The suggestion to consolidate into a single chained call is a valid code-quality improvement, but characterizing the existing code as having a 'convention issue' is an overreach. The agent's analysis is thoughtful and the suggested fix is cleaner code, but it doesn't match the rubric's expectation that the code should pass the compliance check.
- **The agent does NOT flag the fluent App.WaitForElement().Tap() chain as a missing-wait violation**: 5/5 — The agent correctly identified `App.WaitForElement("TestButton").Tap()` as compliant (marked ✅) and did not flag it as a missing-wait violation. This criterion is fully met.
- **The convention compliance check passes or has no wait-related warnings for this code**: 2/5 — The agent flagged a wait-related warning for `FindElement("ResultLabel").GetText()` on line 3, even though `WaitForElement("ResultLabel")` is explicitly called on line 2 immediately before it. The code DOES correctly use WaitForElement before the interaction — the wait and the read are simply on separate lines. The agent treated the separation as a convention violation and recommended combining them, which is an overly strict interpretation. The rubric expects the compliance check to pass without wait-related warnings, but the agent produced one. The agent's suggestion is a reasonable code quality improvement but mischaracterizes the existing code as non-compliant with the WaitForElement-before-interaction convention.
## With-Skill Judge (Isolated)
Overall Score: 3/5
Reasoning: The agent performed a structured analysis and correctly handled the WaitForElement().Tap() chain (criterion 1). However, it over-flagged the WaitForElement + FindElement pattern as a wait-related convention issue when the wait IS present before the interaction (criterion 2). The grep searches returned nothing useful but were reasonable attempts. The advice to consolidate is reasonable as a style suggestion, but the agent incorrectly elevated it to a convention violation with race-condition concerns, which doesn't match the rubric's expectation that the code is compliant. The response is competent but partially misaligned with the expected evaluation.
- **The agent does NOT flag the fluent App.WaitForElement().Tap() chain as a missing-wait violation**: 5/5 — The agent explicitly marked Line 1 as '✅ Line 1 is correct' and stated 'App.WaitForElement("TestButton").Tap() properly waits before interacting.' It correctly recognized the fluent chain as valid.
- **The convention compliance check passes or has no wait-related warnings for this code**: 2/5 — The agent flagged Lines 3-5 with a '❌' marker as having a wait-related issue, claiming FindElement 'does not wait' and could flake. However, the code DOES use WaitForElement('ResultLabel') on the line immediately before FindElement('ResultLabel'), which satisfies the convention of waiting before interacting. The question specifically asked about WaitForElement usage before interactions, and the code correctly does this. The agent's recommendation of WaitForTextToBe, while potentially useful best-practice advice, mischaracterizes the code as non-compliant when it properly follows the WaitForElement-before-interaction convention that was asked about.
## With-Skill Judge (Plugin)
Overall Score: 3/5
Reasoning: The agent correctly handles the fluent WaitForElement().Tap() chain (criterion 1), but over-flags the WaitForElement/FindElement pair as a convention violation when the wait IS present before the interaction (criterion 2). The analysis is technically reasonable as a code improvement suggestion, but doesn't match what the rubric considers compliant. The agent's approach was efficient (minimal tool calls, quick execution), and the fix it suggests is genuinely good practice — it just doesn't align with the expected assessment that the code already satisfies the WaitForElement-before-interaction convention.
- **The agent does NOT flag the fluent App.WaitForElement().Tap() chain as a missing-wait violation**: 5/5 — The agent correctly identified `App.WaitForElement("TestButton").Tap()` as correct chaining and marked it with a ✅. It did not flag this as a missing-wait violation.
- **The convention compliance check passes or has no wait-related warnings for this code**: 2/5 — The agent raises a wait-related warning, claiming `FindElement` 'doesn't wait and can cause flaky tests in race conditions.' However, `WaitForElement("ResultLabel")` IS called on the line immediately before `FindElement("ResultLabel")`, so the wait IS already present. The agent's framing of this as a potential flakiness/race-condition issue is a false positive — the element has already been waited for. While the suggestion to chain `GetText()` off `WaitForElement` is a reasonable code-quality improvement (eliminating redundancy), characterizing the existing code as having a wait-related deficiency is incorrect. The convention compliance check should have passed with no wait-related warnings.
## Baseline Agent Output
The test is almost correct but has one convention issue on line 3:
App.WaitForElement("ResultLabel"); // ✅ waits, but discards the result
var text = App.FindElement("ResultLabel").GetText(); // ❌ redundant lookup via FindElementFindElement doesn't wait — it's a point-in-time query. While the preceding WaitForElement makes a race unlikely here, the pattern is fragile and inconsistent with line 1, which correctly chains off WaitForElement:
App.WaitForElement("TestButton").Tap(); // ✅ good patternFix — chain GetText() off the WaitForElement return value:
public void ButtonUpdatesLabel()
{
App.WaitForElement("TestButton").Tap();
var text = App.WaitForElement("ResultLabel").GetText();
Assert.That(text, Is.EqualTo("Clicked"));
}This is shorter, consistent, and eliminates the redundant FindElement call.
## With-Skill Agent Output (Isolated)
The test has one convention issue:
Line 1 ✅ — WaitForElement("TestButton").Tap() correctly waits before tapping.
Lines 2–3 WaitForElement and FindElement are used separately on the same element. FindElement doesn't wait and introduces a potential race condition. Even though line 2 waits for existence, it doesn't guarantee the text content is ready. These should be consolidated:
// Before (redundant + uses FindElement)
App.WaitForElement("ResultLabel");
var text = App.FindElement("ResultLabel").GetText();
// After (idiomatic — chain off WaitForElement)
var text = App.WaitForElement("ResultLabel").GetText();Rule of thumb: Never use FindElement — always use WaitForElement before any interaction (tap, get text, etc.) to avoid flaky timing failures on CI.
## With-Skill Agent Output (Plugin)
The test has one convention issue on line 3:
App.WaitForElement("ResultLabel"); // ✅ waits for element
var text = App.FindElement("ResultLabel").GetText(); // ❌ uses FindElementProblem: FindElement doesn't wait — it can fail in CI where timing is unpredictable. The WaitForElement on the previous line helps, but the result is discarded, creating a redundant call.
Fix: Use the return value of WaitForElement directly:
[Test]
[Category(UITestCategories.Button)]
public void ButtonUpdatesLabel()
{
App.WaitForElement("TestButton").Tap();
var text = App.WaitForElement("ResultLabel").GetText();
Assert.That(text, Is.EqualTo("Clicked"));
}Summary:
| Line | Pattern | Verdict |
|---|---|---|
WaitForElement("TestButton").Tap() |
Wait → Interact | ✅ Correct |
WaitForElement("ResultLabel") then FindElement("ResultLabel") |
Wait then separate Find | ❌ Redundant & fragile |
The fix chains .GetText() off WaitForElement, eliminating the unnecessary FindElement call and keeping the test both safe and concise.
</details>
<details>
<summary>📊 evaluate-pr-tests / happy-path-evaluate-pr-tests-and-produce-structured-report</summary>
# Judge Report: Happy path - evaluate PR tests and produce structured report
## Baseline Judge
Overall Score: 2.7/5
Reasoning: The agent performed reasonable investigative work—correctly identifying the PR files, analyzing the test structure against existing repo patterns, and providing specific feedback on coverage gaps and convention compliance. However, it failed on two critical rubric requirements (never running Gather-TestContext.ps1 and missing the Overall Verdict section), omitted the Flakiness Risk criterion, and timed out at 180 seconds before completing its output. The manual context-gathering approach (cloning the entire repo, launching a background explore agent that also timed out) was inefficient compared to using the prescribed script. The analysis that was produced is decent but incomplete, making this a below-average evaluation overall.
- **The agent runs the Gather-TestContext.ps1 script to gather automated context before evaluating**: 1/5 — The agent never ran or even attempted to find or run the Gather-TestContext.ps1 script. It did not search for any PowerShell scripts at all. Instead, it relied entirely on manual web fetches from GitHub's raw content URLs and API endpoints to gather context. This is a clear miss of the required automated context-gathering step.
- **The report covers all major criteria: Fix Coverage, Edge Cases, Test Type, Convention Compliance, Flakiness Risk**: 4/5 — The report covers most major criteria well: Fix Coverage (primary bug path covered, table of scenarios), Edge Cases (identified multiple navigation cycles, tab switching, device rotation as uncovered), Test Type (explicitly assessed as correct UI test with strong reasoning), Convention Compliance (compared to Issue25436 pattern, noted base class, category, VerifyScreenshot usage). However, Flakiness Risk is only partially addressed — the screenshot-only verification concern touches on it, but there's no dedicated section or explicit assessment of flakiness risk factors like timing dependencies, the Post() callback, or device resolution variability. The criteria are present but not labeled with those exact category names.
- **Each criterion has a verdict (pass/concern/fail) with a specific explanation, not just generic text**: 3.7/5 — The report uses a structured format with tables and severity labels (Low, Nit) for concerns, which provides some verdict-like categorization. The Strengths section provides specific explanations (e.g., referencing Issue25436, explaining why UI test is needed). The Concerns table has severity ratings. However, there are no explicit pass/concern/fail verdicts per criterion — instead the format is more of a strengths/concerns list. The coverage table uses ✅/❌ which is clear. The explanations are specific and reference actual code elements, not generic.
- **The Overall Verdict section summarizes the most important finding in 1-2 sentences**: 3/5 — The Verdict section at the end ('Good quality. The test type is appropriate (UI test is the only viable option), follows repo conventions, correctly reproduces the exact bug scenario, and uses the standard verification approach. The minor concerns... are consistent with how similar tests are written throughout the repo. No changes needed.') is a clear summary. It's slightly more than 1-2 sentences but concisely captures the key finding. It identifies the most important conclusion — the tests are appropriate and follow conventions.
## With-Skill Judge (Isolated)
Overall Score: 3/5
Reasoning: The agent produced an excellent evaluation report with thorough, specific analysis across all required criteria and beyond. The verdicts are well-supported with concrete code references and actionable recommendations. However, the agent completely failed to run the Gather-TestContext.ps1 script, which is a specific required step in the rubric. The manual context-gathering approach (via curl to GitHub API) was effective but doesn't satisfy the explicit requirement. The quality of the final output is high, but missing a key process step prevents a higher overall score.
- **The agent runs the Gather-TestContext.ps1 script to gather automated context before evaluating**: 1/5 — The agent never ran the Gather-TestContext.ps1 script at any point during the session. Instead, it attempted to use `gh` CLI commands (which failed due to token restrictions) and then fell back to web_fetch calls to scrape the PR page and raw diff from GitHub. While this alternative approach did successfully gather context about the PR (diff, files changed, PR description), the rubric specifically requires running the Gather-TestContext.ps1 script, which the agent completely skipped. There is no evidence in the timeline that the agent even searched for or was aware of this script.
- **The report covers all major criteria: Fix Coverage, Edge Cases, Test Type, Convention Compliance, Flakiness Risk**: 5/5 — The report comprehensively covers all five required criteria and more. It includes: (1) Fix Coverage (Section 1), (2) Edge Cases & Gaps (Section 2), (3) Test Type Appropriateness (Section 3), (4) Convention Compliance (Section 4), and (5) Flakiness Risk (Section 5). It additionally covers Duplicate Coverage (Section 6), Platform Scope (Section 7), Assertion Quality (Section 8), and Fix-Test Alignment (Section 9). Each section is clearly labeled and substantive.
- **Each criterion has a verdict (pass/concern/fail) with a specific explanation, not just generic text**: 5/5 — Every criterion has a clear verdict using ✅/⚠️ symbols (e.g., '✅ Fix Coverage', '⚠️ Concern' for Edge Cases, '⚠️ Medium' for Flakiness Risk). The explanations are highly specific to the actual code: mentioning `ViewCompat.RequestApplyInsets()`, `Shell.SetTabBarIsVisible(this, false)`, the `#if ANDROID` wrapping pattern, `VerifyScreenshot()` without `retryTimeout`, and `App.ScrollTo("BottomLabel")` scroll reliability. The Convention Compliance section includes a detailed checklist table. Explanations are clearly tied to actual diff content rather than being generic boilerplate.
- **The Overall Verdict section summarizes the most important finding in 1-2 sentences**: 4.3/5 — The Overall Verdict states '✅ Tests are adequate' and provides a two-sentence summary: 'The test directly reproduces the bug scenario (navigate → hide TabBar → go back) and uses screenshot verification to catch the visual gap. Minor improvements possible around flakiness resilience and edge cases.' This is concise and identifies the key finding well. It could be slightly improved by being more specific about which improvement is most important (e.g., the retryTimeout recommendation), but it effectively captures the essence.
## With-Skill Judge (Plugin)
Overall Score: 1/5
Reasoning: The agent completely failed to produce any evaluation output. It spent all 180 seconds of its allowed time struggling with authentication, data gathering, and exploring the repository structure without ever reaching the analysis phase. The final output is just a sentence fragment ('Perfect! Now let me check the test file:'), which is not a report. Critical missteps include: (1) not running the Gather-TestContext.ps1 script despite finding it (tried `powershell` instead of `pwsh`), (2) launching a background agent that also timed out, (3) taking an inefficient path through multiple failed authentication attempts. The task was a complete failure.
- **The agent runs the Gather-TestContext.ps1 script to gather automated context before evaluating**: 1/5 — The agent never ran the Gather-TestContext.ps1 script. It attempted to invoke a skill called 'evaluate-pr-tests' which loaded, but it never searched for or executed the PowerShell script. Instead, the agent spent the majority of its 180 seconds manually trying to authenticate with GitHub APIs and fetch git diffs, which is exactly the kind of context-gathering the script is designed to automate.
- **The report covers all major criteria: Fix Coverage, Edge Cases, Test Type, Convention Compliance, Flakiness Risk**: 1/5 — No report was produced at all. The agent timed out at 180 seconds while still in the information-gathering phase. Its final output was simply 'Now I have the full picture. Let me read all four changed files.' — it never began writing the actual evaluation, let alone covering the required criteria.
- **Each criterion has a verdict (pass/concern/fail) with a specific explanation, not just generic text**: 1/5 — Since no report was generated, there are no verdicts or explanations of any kind. The agent never reached the analysis or reporting phase of the task.
- **The Overall Verdict section summarizes the most important finding in 1-2 sentences**: 1/5 — No Overall Verdict section exists. The agent produced no evaluation output whatsoever before timing out.
## Baseline Agent Output
Perfect! Now I have enough information. Let me provide a comprehensive summary:
Summary of Shell-Related UI Test Patterns in dotnet/maui
1. Shell Test File Examples & Structure
Three representative examples:
Example 1: Issue29091Shell.cs (Simple programmatic assertion)
- Path:
/tmp/maui/src/Controls/tests/TestCases.Shared.Tests/Tests/Issues/Issue29091Shell.cs - Base Class:
_IssuesUITest : UITest - Platform-specific: No conditional compilation (runs on all platforms)
- Assertion type: Screenshot-based (
VerifyScreenshot()) - Pattern: Minimal - just waits for element and takes screenshot
Example 2: Issue28986_Shell.cs (Platform-specific with conditional code)
- Path:
/tmp/maui/src/Controls/tests/TestCases.Shared.Tests/Tests/Issues/Issue28986_Shell.cs - Lines 1-2:
#if IOS || ANDROID- Platform-specific conditional wrapping entire file - Base Class:
_IssuesUITest : UITest - Assertion type: Screenshot with platform-specific crops (
VerifyScreenshot(cropLeft: 125)on Android only) - Pattern: Uses
#if ANDROIDdirectives inside tests for platform-specific behavior
Example 3: Issue24489_Shell.cs (Complex with desktop-only guards)
- Path:
/tmp/maui/src/Controls/tests/TestCases.Shared.Tests/Tests/Issues/Issue24489_Shell.cs - Lines 1:
#if MACCATALYST || WINDOWS- Desktop-only conditional - Base Class:
_IssuesUITest : UITest - Assertion type: Screenshot-based (
VerifyScreenshot()with optional tolerance) - Pattern: Uses categories
[Category(UITestCategories.Window)], multiple assertions per test
2. Issue33703.cs Status
Not found - File does not exist in the repository. No test with this number was located.
3. TabBar Visibility Toggle Tests
Found 3 existing tests that specifically test TabBar visibility toggling:
| File | Issue | Key Method | Assertion Type |
|---|---|---|---|
| Issue12126.cs | "[iOS] TabBarIsVisible = True/False breaking for multiple nested pages" | NavigatingBackFromMultiplePushPagesChangesTabVisibilityCorrectly() |
Programmatic (WaitForElement("Tab 1")) |
| Issue12320.cs | "[iOS] TabBarIsVisible = True/False doesn't work on Back Navigation When using BackButtonBehavior" | PopLogicExecutesWhenUsingBackButtonBehavior() |
Programmatic (WaitForElement("Tab 1")) |
| Issue33351.cs | "Changing Shell Tab Visibility when navigating back multiple pages ignores Shell Tab Visibility" | TabBarVisibilityAfterMultiLevelPopToRoot() |
Screenshot-based (VerifyScreenshot()) + Programmatic |
Pattern: These tests use App.WaitForElement() to assert the TabBar is visible/not visible by checking for tab elements.
Key Findings:
| Aspect | Details |
|---|---|
| Platform-specific code | Uses #if ANDROID, #if IOS, `#if MACCATALYST |
| Base class | All inherit from _IssuesUITest : UITest |
| Assertions | Hybrid approach: screenshot-based (VerifyScreenshot()) + programmatic (App.WaitForElement(), App.Tap(), App.TapBackArrow()) |
| Test discovery | Uses [Category(UITestCategories.Shell)] attribute in addition to [Test] |
| TabBar testing | No dedicated TabBarIsVisible property checking in assertions; instead test verifies tab elements appear/disappear via WaitForElement() |
Now I have everything needed for a thorough evaluation. Here's my analysis:
PR #34324 Test Evaluation — Issue 33703 (Android Shell TabBar Empty Space)
Files in PR
| File | Type |
|---|---|
ShellItemRenderer.cs |
Fix — Android window insets re-dispatch after Gone→Visible TabBar transition |
Issue33703.cs (HostApp) |
Test page — Shell with TabBar visibility toggle via navigation |
Issue33703.cs (Shared.Tests) |
UI test — Navigate → back → scroll → screenshot |
TabBarShouldNotHaveEmptySpaceAfterNavigatingBack.png |
Snapshot — Reference screenshot |
1. Fix Coverage — ✅ Good
The test directly exercises the exact bug scenario: push a page that hides the TabBar (Shell.SetTabBarIsVisible(this, false)), navigate back, and verify no empty space remains. This hits the UpdateTabBarVisibility() code path in ShellItemRenderer where the fix lives (Gone→Visible transition + inset re-request).
2. Edge Cases & Gaps — ⚠️ Moderate gaps
- Missing: Multiple navigation cycles. The bug involves stale inset state. A single forward/back cycle might pass by luck. Testing 2-3 consecutive cycles would catch intermittent inset accumulation issues.
- Missing: Tab switching after navigating back. After returning to Home, tapping "Second Page" tab would verify the TabBar remains correctly sized after tab changes.
- Missing: Rotation/configuration change. The fix uses
Post()for deferred inset requests — a configuration change during that window could expose timing issues. (Lower priority — more of a device test concern.)
3. Test Type Appropriateness — ✅ Correct
UI test is the right choice here. The bug involves:
- Android window insets (platform rendering)
- Shell navigation stack behavior
- Visual layout (empty space above TabBar)
This cannot be caught by unit or XAML tests — it requires actual Android view rendering and navigation. The screenshot-based verification is appropriate for detecting visual layout regressions.
4. Convention Compliance — ✅ Good (one minor note)
- ✅
_IssuesUITestbase class (correct) - ✅
[Issue(...)]attribute on HostApp page with correct tracker, issue number, andPlatformAffected.Android - ✅
[Category(UITestCategories.Shell)]present - ✅
[Test]attribute present - ✅ Descriptive test name:
TabBarShouldNotHaveEmptySpaceAfterNavigatingBack - ✅ Namespace follows convention (
Microsoft.Maui.TestCases.Tests.Issues) - ℹ️ Uses
#if ANDROIDwrapping entire file — this is consistent with existing repo patterns (e.g.,Issue28986_Shell.csuses#if IOS || ANDROID,Issue24489_Shell.csuses#if MACCATALYST || WINDOWS). Convention is fine.
## With-Skill Agent Output (Isolated)
Now I have all the context. Let me compile the evaluation report.
PR Test Evaluation Report
PR: #34324 — [Android] Fix empty space above TabBar after navigating back when TabBar visibility is toggled
Test files evaluated: 2 (HostApp page + UI test)
Fix files: 1 (ShellItemRenderer.cs)
Overall Verdict
✅ Tests are adequate — with minor concerns on edge cases and flakiness risk.
The UI test directly reproduces the reported bug scenario (navigate to detail → hide TabBar → navigate back → verify no empty space). The screenshot assertion is the correct strategy for a visual layout regression.
1. Fix Coverage — ✅ Pass
The test exercises the exact code path changed: it navigates to a detail page (which sets TabBarIsVisible = false, making _bottomView.Visibility = Gone), then navigates back (triggering the Gone → Visible transition and the new RequestApplyInsets logic). The VerifyScreenshot() at the end catches the empty space that the fix eliminates. Reverting the fix would cause this test to fail.
2. Edge Cases & Gaps — ⚠️ Concern
Covered:
- Primary happy path (navigate forward → TabBar hidden → navigate back → no empty space)
Missing:
- Repeated navigation: Toggle TabBar visibility multiple times (navigate to detail and back 2-3 times) to verify no cumulative layout drift
- Tab switching after return: After navigating back, tapping the "Second Page" tab then returning to Home to verify layout remains correct
- Fast navigation: Quickly navigating forward and back before layout completes — the fix uses
Post()for async timing, which could be susceptible
3. Test Type Appropriateness — ✅ Pass
Current: UI Test (Appium + screenshot)
Recommendation: UI Test is correct here. This bug is a visual layout issue — empty space above the TabBar — that depends on Android's window inset dispatching, view visibility transitions, and layout passes. This cannot be tested with unit tests or device tests alone; the full visual rendering pipeline is needed, and VerifyScreenshot() is the right assertion.
4. Convention Compliance — ✅ Pass
- ✅ File naming:
Issue33703.cs— correct pattern - ✅
[Issue()]attribute on HostApp page with correct tracker, number, description - ✅
_IssuesUITestbase class - ✅
[Category(UITestCategories.Shell)]— single category on method - ✅
WaitForElementbefore first interaction - ✅ No
Task.Delay/Thread.Sleep - ✅ No obsolete APIs
⚠️ Minor:#if ANDROIDwraps the entire file — comment formatting is slightly off (space after//), but this is cosmetic
5. Flakiness Risk — ⚠️ Medium
| Risk Factor | Details |
|---|---|
Screenshot without retryTimeout |
VerifyScreenshot() is called without retryTimeout. Since the fix relies on a Post() callback to re-apply insets after a layout pass, there's a timing window where the screenshot could be captured before the insets are reapplied. Adding retryTimeout: TimeSpan.FromSeconds(2) would mitigate this. |
ScrollTo reliability |
App.ScrollTo("BottomLabel") scrolling behavior can vary across devices/emulators, though this is generally reliable for simple scroll views. |
this.Back() timing |
Navigation back followed by WaitForElement("NavigateButton") is good practice and mitigates this risk. |
6. Duplicate Coverage — ✅ No duplicates
No existing tests for TabBarIsVisible toggle behavior were found in the test suite. This is a new scenario.
7. Platform Scope — ✅ Pass
The fix is Android-only (ShellItemRenderer.cs is in the Android folder), and the bug is reported as Android-only (above Android 13). The test correctly uses #if ANDROID to restrict execution. The PR description confirms Windows/iOS/Mac were tested manually for no regression.
8. Assertion Quality — ✅ Pass
VerifyScreenshot() is the appropriate assertion for a visual layout bug. It compares the full rendered output pixel-by-pixel against a reference snapshot, which directly captures whether empty space exists above the TabBar. This is more reliable than trying to assert on element positions numerically.
9. Fix-Test Alignment — ✅ Pass
- Fix modifies
ShellItemRenderer.UpdateTabBarVisibility()— Android Shell TabBar rendering - Test exercises Shell with TabBar → navigates to page that hides TabBar → navigates back
- The HostApp page (
Issue33703) correctly sets up the exact Shell/Tab structure and usesShell.SetTabBarIsVisible(this, false)on the detail page to trigger the fix path
Recommendations
- Add
retryTimeouttoVerifyScreenshot()— Change toVerifyScreenshot(retryTimeout: TimeSpan.FromSeconds(2))to account for the asyncPost()callback timing in the fix, reducing flakiness risk. - Consider a repeated-navigation test — A second
[Test]that navigates forward and back 2-3 times would ensure no cumulative layout issues from repeated inset re-application. - Minor: The
#if ANDROIDcomment could use a space after//for consistency:// Scrolling cannot be...(already flagged by Copilot review).
## With-Skill Agent Output (Plugin)
Perfect! Now let me check the test file:
</details>
<details>
<summary>📊 evaluate-pr-tests / near-miss-recall-informal-test-quality-check</summary>
# Judge Report: Near-miss recall - informal test quality check
## Baseline Judge
Overall Score: 1.3/5
Reasoning: The agent failed on both rubric criteria. While it did manage to retrieve PR information and provide a somewhat useful informal review of the test files, the rubric specifically requires two things: (1) invoking a structured 'evaluate-pr-tests' workflow, and (2) producing a formal 9-criteria report. The agent did neither. Its output is a casual narrative assessment with no structured framework, no defined criteria, and no systematic scoring. The agent's approach — manually fetching GitHub API data and writing up a freeform opinion — is fundamentally misaligned with what the rubric demands.
- **The agent invokes the evaluate-pr-tests structured workflow even for informal phrasing**: 1/5 — The agent did not invoke any structured workflow whatsoever. It manually fetched the PR data via web_fetch API calls and wrote a freeform review. There is no evidence of the agent attempting to use or even look for a structured 'evaluate-pr-tests' workflow. The informal phrasing ('tell me if they seem reasonable') should have still triggered the structured evaluation approach, but the agent treated it as a casual review request.
- **The agent produces the structured 9-criteria report, not just a casual opinion**: 1/5 — The agent produced an entirely casual, unstructured review. The output is organized as informal bullet points under file-level headings (Host App, UI Test, Minor Observations) with checkmarks and a brief 'Overall' summary. There is no 9-criteria structured report. The agent doesn't define or evaluate against any specific criteria — it just gives a general impression of each file. This is exactly the 'casual opinion' format that the rubric says should NOT be the output.
## With-Skill Judge (Isolated)
Overall Score: 4.7/5
Reasoning: The agent performed excellently on both rubric criteria. It immediately recognized the informal request as requiring the evaluate-pr-tests skill and invoked it. Despite lacking direct GitHub token access, it resourcefully used curl to fetch PR data (metadata, file list, and full diffs). The final output is a well-structured 9-criteria report with specific, accurate observations drawn from the actual PR diff content — noting details like the leading whitespace issue on line 1, the lack of retryTimeout on VerifyScreenshot(), correct platform scoping with #if ANDROID, and the alignment between the fix in UpdateTabBarVisibility() and the test's navigation flow. The recommendations are practical and actionable. The agent completed this efficiently in 7 turns with zero errors.
- **The agent invokes the evaluate-pr-tests structured workflow even for informal phrasing**: 5/5 — The user's prompt was casual and informal ('Look at the test files in PR #34324 and tell me if they seem reasonable'). The agent correctly recognized this as a test evaluation task and invoked the `evaluate-pr-tests` skill. Although this skill wasn't in the listed available skills, the agent proactively invoked it and it loaded successfully. The agent mapped informal language to the structured workflow without any hesitation.
- **The agent produces the structured 9-criteria report, not just a casual opinion**: 5/5 — The output is a comprehensive structured report with exactly 9 clearly numbered and labeled criteria: (1) Fix Coverage, (2) Edge Cases & Gaps, (3) Test Type Appropriateness, (4) Convention Compliance, (5) Flakiness Risk, (6) Duplicate Coverage, (7) Platform Scope, (8) Assertion Quality, (9) Fix-Test Alignment. Each criterion has a status indicator (✅/⚠️), detailed assessment with specific code references (e.g., `Shell.SetTabBarIsVisible`, `BottomLabel`, `_IssuesUITest`, `wasGone && showTabs`), and actionable reasoning. The report also includes an Overall Verdict section and a Recommendations section. This is far beyond a casual opinion — it's a well-organized, structured evaluation with consistent formatting and depth.
## With-Skill Judge (Plugin)
Overall Score: 4.3/5
Reasoning: The agent produced a high-quality, well-structured test evaluation that demonstrates genuine understanding of the PR's fix and test code. The 9-criteria report is thorough, citing specific code elements and providing actionable recommendations. The evaluation correctly identifies that the fix is Android-only, the test uses appropriate UI-level verification, and the test directly exercises the code path changed by the fix. The main weaknesses are: (1) an inefficient path to the result — 4 failed CLI attempts before falling back to web_fetch, and delayed skill invocation; (2) invoking a skill not listed in available_skills without user request; and (3) it's unclear whether the agent saw the complete file patches (the API response was likely truncated), though the specific code details cited in the report suggest it saw enough to make accurate assessments. Despite the process inefficiencies, the final deliverable is strong.
- **The agent invokes the evaluate-pr-tests structured workflow even for informal phrasing**: 3.7/5 — The agent did eventually invoke the `evaluate-pr-tests` skill, which is the key requirement. However, it was not invoked immediately — the agent first attempted `gh` CLI commands (which failed due to no auth), checked auth status, and made multiple `web_fetch` calls before finally deciding to invoke the skill on turn 5 of 6. The skill instructions state it should be invoked 'IMMEDIATELY as your first action', and notably `evaluate-pr-tests` wasn't even listed in the available_skills — the agent discovered it on its own, which shows some initiative. But the late invocation means the structured workflow wasn't driving the data-gathering process from the start; instead, the agent gathered data ad hoc and then used the skill as a report template at the end. This is a partial fulfillment of the criterion.
- **The agent produces the structured 9-criteria report, not just a casual opinion**: 4.7/5 — The agent produced a well-organized report with all 9 criteria clearly enumerated: (1) Fix Coverage, (2) Edge Cases & Gaps, (3) Test Type Appropriateness, (4) Convention Compliance, (5) Flakiness Risk, (6) Duplicate Coverage, (7) Platform Scope, (8) Assertion Quality, (9) Fix-Test Alignment. Each criterion has a status indicator, explanation, and rationale. The report also includes an overall verdict, PR metadata header, and actionable recommendations. The content appears accurate based on the data gathered — e.g., correctly identifying it as an Android-only fix with `VerifyScreenshot()` assertions, noting the `#if ANDROID` pattern, and flagging the async `Post()` as a potential flakiness concern. The only weakness is that the agent's data gathering was somewhat shallow (web_fetch of GitHub pages rather than full raw diffs), so some claims may be based on incomplete code inspection. But the report structure and quality clearly exceed 'just a casual opinion.'
## Baseline Agent Output
The PR has 3 test-related files. Here's my assessment:
HostApp (Issue33703.cs — 127 lines) ✅ Good
- Accurately reproduces the bug: Shell with tabs → navigate to detail page (which sets
TabBarIsVisible = false) → navigate back - Proper
AutomationIdon all interactive elements (NavigateButton,DetailLabel,BottomLabel) - Includes scrollable content with a bottom marker label — smart way to detect the empty space visually
UI Test (Issue33703.cs — 28 lines) ✅ Good
- Correctly guarded with
#if ANDROIDsince the bug is Android-only - Test flow mirrors the exact repro steps: tap navigate → wait for detail → back → scroll to bottom →
VerifyScreenshot() - Screenshot verification is the right approach for a layout/spacing bug like this
Snapshot image ✅ Present
- Reference screenshot for the
VerifyScreenshot()call
Minor observations
- The test doesn't explicitly assert the tab bar is visible after going back, but the screenshot comparison covers this implicitly
- No negative test (e.g., verifying the tab bar is still hidden on the detail page), though that's not strictly necessary for this fix
- The
#if ANDROIDcomment is helpful for future maintainers
Overall: The tests look reasonable and well-aligned with the fix. They reproduce the exact bug scenario and use visual verification, which is the most reliable way to catch empty-space layout issues.
## With-Skill Agent Output (Isolated)
Here's my evaluation:
PR Test Evaluation Report
PR: #34324 — [Android] Fix empty space above TabBar after navigating back when TabBar visibility is toggled
Test files evaluated: 2 (HostApp page + UI test)
Fix files: 1 (ShellItemRenderer.cs)
Overall Verdict: ✅ Tests are adequate
The UI test directly exercises the Gone→Visible tab bar transition that triggers the fix, and uses screenshot verification to catch the visual layout regression.
1. Fix Coverage — ✅
The test navigates to a detail page (hiding the tab bar via Shell.SetTabBarIsVisible(this, false)), then goes back (making the tab bar visible again). This directly triggers the UpdateTabBarVisibility() code path where the fix lives. Without the fix, stale bottom padding would produce empty space visible in the screenshot.
2. Edge Cases & Gaps — ⚠️ Minor
Covered: Single navigate-forward/back cycle with tab bar toggle.
Missing:
- Repeated navigation (forward/back multiple times) to test for accumulated stale padding
- Toggling tab bar visibility without navigation (programmatic toggle)
These are minor — the core bug path is covered.
3. Test Type Appropriateness — ✅
Current: UI Test with VerifyScreenshot()
Recommendation: Correct choice. This is a visual layout issue involving Android native window insets and view layout passes — it cannot be reproduced in a unit or device test.
4. Convention Compliance — ⚠️ Minor
- ✅
Issue33703.csnaming,[Issue()]attribute,_IssuesUITestbase class - ✅
[Category(UITestCategories.Shell)]— exactly one category - ✅
WaitForElementbefore all interactions, noTask.Delay/Thread.Sleep - ✅ AutomationIds consistent between HostApp and test
⚠️ Leading space on line 1 of the test file (#if ANDROIDinstead of#if ANDROID)
5. Flakiness Risk — ⚠️ Medium
VerifyScreenshot()is called after scroll + navigation-back without aretryTimeout. Layout may need time to settle after the insets are re-applied viaPost(). AddingretryTimeout: TimeSpan.FromSeconds(2)would reduce flakiness risk.- All taps/waits are properly sequenced ✅
6. Duplicate Coverage — ✅ No obvious duplicates
No existing test appears to cover this specific Gone→Visible tab bar transition scenario.
7. Platform Scope — ✅
Fix is Android-specific (window insets behavior). Test is correctly scoped to Android only via #if ANDROID. The PR author confirmed testing on all platforms with no regression.
8. Assertion Quality — ✅
VerifyScreenshot() against a reference snapshot is the right assertion for a spacing/layout bug — it directly captures whether the empty space is present.
9. Fix-Test Alignment — ✅
Fix modifies ShellItemRenderer.UpdateTabBarVisibility() for the Gone→Visible case. Test triggers exactly that transition (navigate to page hiding tab bar → back to page showing it).
Recommendations
- Add
retryTimeouttoVerifyScreenshot()— after scroll + navigation, layout may need time to settle.VerifyScreenshot(retryTimeout: TimeSpan.FromSeconds(2))would reduce CI flakiness. - Fix leading whitespace on line 1 of the test file (
#if ANDROID→#if ANDROID). - (Nice-to-have) Consider a second test case that navigates forward/back twice to verify no accumulated stale padding.
## With-Skill Agent Output (Plugin)
Now I have all four files from the PR. Let me produce the evaluation.
PR Test Evaluation Report
PR: #34324 — Empty space above TabBar after navigating back (Android Shell)
Test files: 2 (1 UI test + 1 HostApp page) + 1 snapshot
Fix files: 1 (ShellItemRenderer.cs on Android)
Overall Verdict: ✅ Tests are adequate
The UI test directly exercises the Gone→Visible TabBar transition that the fix addresses, and uses screenshot verification to catch the visual regression.
1. Fix Coverage — ✅
The fix re-requests window insets after TabBar goes from Gone → Visible. The test navigates to a detail page (hides TabBar), goes back (shows TabBar), scrolls to bottom, and screenshots. This directly triggers the fixed code path.
2. Edge Cases & Gaps — ⚠️ Minor
Covered: Single navigate-forward-and-back cycle.
Missing:
- Repeated navigation (back and forth multiple times) — could reveal timing issues with
Post() - Tab switching while TabBar is hidden
These are nice-to-haves; the core scenario is covered.
3. Test Type — ✅ Appropriate
UI test is the right choice. The bug is a visual layout issue involving Android platform insets — can't be caught by unit/device tests.
4. Convention Compliance — ⚠️
- ✅
Issue33703.csnaming,[Issue()]attribute,_IssuesUITestbase class - ✅
[Category(UITestCategories.Shell)],WaitForElementbefore taps ⚠️ Inline#if ANDROIDwrapping the entire file — convention prefers platform-specific test files (e.g., placing inTestCases.Android.Tests). However, the comment justifies it (iOS scrolling unreliable + Android-only issue).
5. Flakiness Risk — ⚠️ Low-Medium
- ✅ Proper
WaitForElementusage throughout ⚠️ App.ScrollTo("BottomLabel")— scrolling can be flaky on some devices⚠️ VerifyScreenshot()without explicitretryTimeout
6. Duplicate Coverage — ✅
This tests a specific bug scenario not covered by existing Shell TabBar tests.
7. Platform Scope — ✅
Fix is Android-only (Shell/Android/ShellItemRenderer.cs). Test is #if ANDROID. Perfect match.
8. Assertion Quality — ✅
VerifyScreenshot() is the strongest assertion for this type of visual layout bug — it will catch any empty space regression.
9. Fix-Test Alignment — ✅
Fix modifies UpdateTabBarVisibility() → test triggers that exact method via TabBar hide/show cycle.
Recommendations
- Consider adding
retryTimeouttoVerifyScreenshot()to reduce flakiness from layout timing - Minor: A second test case navigating back-and-forth twice would increase confidence in the
Post()timing, but isn't critical
Bottom line: These tests are well-constructed and directly validate the fix. Ship it. 🚢
</details>
<details>
<summary>📊 evaluate-pr-tests / negative-trigger-general-code-review-should-not-produce-test-evaluation-report</summary>
# Judge Report: Negative trigger - general code review should not produce test evaluation report
## Baseline Judge
Overall Score: 3.3/5
Reasoning: The agent took a reasonable approach: it checked the git state, discovered the directory was empty, and asked the user for the repository URL and branch name. It correctly used general-purpose tools (bash) rather than specialized skill workflows, and produced no structured evaluation output. However, the fundamental task — performing a code review — was not accomplished. The agent made only 3 tool calls and spent minimal effort exploring alternative paths (e.g., checking if a repo existed elsewhere, looking at environment variables, or checking the current working directory). The response is appropriate given the empty directory, but the agent could have been more thorough in trying to locate the repository before giving up.
- **The agent performs a general code review without invoking the evaluate-pr-tests skill workflow**: 3.7/5 — The agent attempted a straightforward general code review approach: it checked `git log`, `git diff HEAD~1 HEAD`, and then `ls -la` to confirm the directory state. It did not invoke any skill (e.g., 'evaluate-pr-tests' or any other skill workflow). It followed a reasonable general-purpose code review methodology using basic git commands. It couldn't complete the review because the directory was empty, but the approach was correct and appropriately general. It loses a point because ultimately no review was produced (though that's due to the environment, not the approach).
- **The agent does not produce the 9-criteria evaluation structure from evaluate-pr-tests**: 5/5 — The agent produced no 9-criteria evaluation structure whatsoever. Its output was a simple explanation that the directory was empty and a request for more information. This criterion is fully satisfied.
## With-Skill Judge (Isolated)
Overall Score: 3/5
Reasoning: The agent took a reasonable and efficient approach: it checked git history, found the directory empty, and communicated the issue clearly. It correctly avoided invoking any specialized skill workflows and didn't fabricate a review. However, the task ultimately wasn't completed (no code review was produced), which limits the overall score. The agent's 3 tool calls were efficient and logical — checking git log first, then listing directory contents. There's little else the agent could have done given the empty workspace, though it could have searched more broadly for repositories in other locations.
- **The agent performs a general code review without invoking the evaluate-pr-tests skill workflow**: 3.7/5 — The agent correctly did NOT invoke the evaluate-pr-tests skill workflow, which is the key part of this criterion. However, it also didn't perform any general code review — because the working directory was empty and there was genuinely no code to review. The agent handled this gracefully by asking for more information. It cannot be faulted for not reviewing code that doesn't exist, but it also didn't demonstrate that it would have done a general (non-skill-based) code review. Score of 3 reflects that the positive requirement (no skill invocation) was met, but the actual review couldn't be evaluated.
- **The agent does not produce the 9-criteria evaluation structure from evaluate-pr-tests**: 5/5 — The agent produced no structured evaluation output at all — no 9-criteria structure, no skill-based format, nothing resembling the evaluate-pr-tests output. This criterion is fully satisfied.
## With-Skill Judge (Plugin)
Overall Score: 3/5
Reasoning: The agent correctly identified the empty directory situation, did not misuse any skill workflows, and asked for clarification — all reasonable behavior. It met the rubric criteria around not using the evaluate-pr-tests skill and not producing the 9-criteria structure. However, no actual code review was produced (understandably, given the empty directory). The agent's approach was methodical and efficient with only 3 tool calls, and it handled the edge case gracefully.
- **The agent performs a general code review without invoking the evaluate-pr-tests skill workflow**: 3.7/5 — The agent took a general code review approach — checking git log, branch, and directory contents — without invoking any skill workflow. It did not call the skill tool at all. The only reason it didn't complete a full general review is the environment had no git repository, which is not an agent methodology failure.
- **The agent does not produce the 9-criteria evaluation structure from evaluate-pr-tests**: 5/5 — The agent produced no structured evaluation output at all, let alone the 9-criteria format from evaluate-pr-tests. This criterion is fully satisfied.
## Baseline Agent Output
The working directory is empty and not a git repository. Could you provide:
- The repository URL (e.g.,
https://github.com/owner/repo), and - The branch name you'd like me to review?
I'll clone it and review the latest commit's changes.
## With-Skill Agent Output (Isolated)
The current working directory is empty and not a git repository. There's no code or commits to review.
Could you point me to the repository? You can either:
- Provide the repo URL or PR number (e.g.,
gh repo clone owner/repo) - Navigate to the correct directory
## With-Skill Agent Output (Plugin)
The working directory is empty and not a git repository. There's no branch or commit to review.
Could you provide:
- A repository URL to clone, or
- A PR number (e.g., for dotnet/maui), or
- A path to an existing repo on this machine?
</details>
<details>
<summary>📊 evaluate-pr-tests / no-tests-added-pr-only-has-fix-files</summary>
# Judge Report: No tests added - PR only has fix files
## Baseline Judge
Overall Score: 3.3/5
Reasoning: The agent correctly identifies the core problem — no tests were added alongside functional code changes — and provides a clear, well-structured assessment recommending against merging. The verdict is strong and appropriate. The main weakness is that the agent didn't attempt to examine the actual file diffs (e.g., via git log or checking out the repo), so its recommendations are generic rather than tailored to the specific changes. However, given the working directory was empty and the task was fundamentally about flagging the absence of tests, the output is solid and actionable.
- **The agent flags the absence of tests as a Fix Coverage failure**: 4/5 — The agent prominently flags the absence of tests with a '🚨 Missing Test Coverage' header and explicitly states 'No test files were added. This is a gap.' It recommends adding unit tests and device tests and even suggests that bug-fix PRs should include regression tests. While it doesn't use the exact phrase 'Fix Coverage failure,' the intent is clearly and strongly communicated. It loses a point because the assessment is entirely generic — the agent couldn't access the actual code changes, so it couldn't identify specifically what behavior lacks coverage.
- **The overall verdict reflects that no tests were added**: 4.3/5 — The entire output is structured around the verdict that no tests were added and this is problematic. The agent lists specific test project paths where tests should be added and provides three concrete recommendations. The overall message is unambiguous: the PR is lacking test coverage. It doesn't waffle or hedge on this point.
## With-Skill Judge (Isolated)
Overall Score: 3.7/5
Reasoning: The agent correctly identifies and flags the core issue: no tests were added for the two changed files. Both rubric criteria are fully met. The output is well-structured with a clear verdict, coverage assessment, and actionable recommendations (unit > device > UI test hierarchy, specific test directory paths). The agent loses a point from top marks because it didn't attempt to examine the actual source files to provide more specific recommendations about what should be tested — it relied solely on the user's statement that no tests were added, rather than investigating the nature of the changes. However, for the specific rubric criteria given, the work is excellent.
- **The agent flags the absence of tests as a Fix Coverage failure**: 5/5 — The agent explicitly created a 'Fix Coverage — ❌' section stating 'No tests were added. The code changes in CollectionView.cs and CollectionViewHandler.cs have zero test coverage from this PR.' This is a clear and direct flagging of the absence of tests as a coverage failure.
- **The overall verdict reflects that no tests were added**: 5/5 — The overall verdict is unambiguous: '❌ Tests are insufficient' with the explanation 'No test files were added or modified for this PR. Functional changes to CollectionView and its handler are untested.' This directly and accurately reflects the absence of tests.
## With-Skill Judge (Plugin)
Overall Score: 4/5
Reasoning: The agent correctly and efficiently identified the core issue: no tests were added for the changed files. It produced a well-structured report with a clear verdict, fix coverage assessment, and actionable recommendations (unit > device > UI test hierarchy, specific directory suggestions). The agent was very efficient (2 tool calls, 21 seconds). The one area where it could have been stronger is actually examining the changed source files to provide more specific guidance on what behaviors or code paths should be tested, rather than giving generic recommendations. Still, the output is clear, correct, and useful.
- **The agent flags the absence of tests as a Fix Coverage failure**: 5/5 — The agent explicitly labels 'Fix Coverage — ❌' and states 'No tests were added. There is no automated verification that the changes work correctly or that a regression won't reoccur.' This is a clear and unambiguous flag of the fix coverage failure.
- **The overall verdict reflects that no tests were added**: 5/5 — The overall verdict clearly states '❌ Tests are insufficient' and 'No test files were added or modified in this PR. Changes to CollectionView and CollectionViewHandler are untested.' This directly and accurately reflects the absence of tests.
## Baseline Agent Output
Based on the information provided, here's my evaluation:
Test Coverage Assessment: ❌ Insufficient
Changed files (2):
src/Controls/src/Core/CollectionView.cs— Core control logicsrc/Controls/src/Core/Handlers/CollectionViewHandler.cs— Platform handler mapping
Test files added/modified: 0
What's missing
Changes to CollectionView core and its handler should have corresponding tests in:
| Test Project | Path |
|---|---|
| Core Unit Tests | src/Controls/tests/Core.UnitTests/ |
| Device Tests | src/Controls/tests/DeviceTests/ |
| UI Tests | src/Controls/tests/UITests/ |
Recommendation
Without seeing the diff, the specific tests needed depend on the nature of the change, but at minimum:
- Unit tests — Any new/changed properties, methods, or bindable properties in
CollectionView.csshould have corresponding tests inControls.Core.UnitTests. - Handler tests — Changes to handler mappings in
CollectionViewHandler.csshould verify the mapper correctly invokes platform code. - UI tests — If the change fixes a visual/behavioral bug, a UI test that reproduces the original issue should be added (and verified to fail without the fix).
Verdict: This PR should not be merged without tests. Functional code changes to a core control like CollectionView require test coverage to prevent regressions. If you'd like, I can invoke the write-tests-agent to generate appropriate tests, or examine the actual diff to give more targeted recommendations.
## With-Skill Agent Output (Isolated)
PR Test Evaluation Report
Test files evaluated: 0
Fix files: 2 (CollectionView.cs, CollectionViewHandler.cs)
Overall Verdict
❌ Tests are insufficient — No test files were added for functional changes to CollectionView and its handler.
1. Fix Coverage — ❌
No tests were added. The changes to CollectionView.cs and CollectionViewHandler.cs have zero test coverage from this PR.
Recommendations
- Add unit tests in
src/Controls/tests/Core.UnitTests/for any property/logic changes inCollectionView.cs— unit tests are the lightest and fastest option. - Add device tests if the handler changes involve platform-specific rendering or native API calls.
- Add UI tests only if the fix involves user interaction flows that can't be validated at a lower level.
Use the test type decision tree: Unit > Device > UI — pick the lightest type that can verify the fix.
## With-Skill Agent Output (Plugin)
PR Test Evaluation Report
Test files evaluated: 0
Fix files: 2 (CollectionView.cs, CollectionViewHandler.cs)
Overall Verdict
❌ Tests are insufficient
No test files were added or modified in this PR. The fix touches core CollectionView logic and its handler, but there is no test coverage to prevent regression.
1. Fix Coverage — ❌
No tests were added. There is nothing verifying the changed code paths in CollectionView.cs or CollectionViewHandler.cs.
Recommendations
- Add unit tests in
src/Controls/tests/Core.UnitTests/— CollectionView property changes, event wiring, and core logic can typically be validated with lightweight xUnit tests (preferred). - Add device tests if the fix involves platform-specific handler behavior —
CollectionViewHandler.cschanges often need native context to validate. - Add UI tests only if the fix requires end-to-end user interaction to reproduce (e.g., scroll + tap + selection). Use the
IssueXXXXX.csconvention inTestCases.Shared.Tests.
Start with the lightest test type that can exercise the fix: unit > device > UI.
</details>
<details>
<summary>📊 evaluate-pr-tests / test-type-downgrade-recommendation-ui-test-for-pure-property-logic</summary>
# Judge Report: Test type downgrade recommendation - UI test for pure property logic
## Baseline Judge
Overall Score: 4.7/5
Reasoning: The agent delivered an excellent, well-structured evaluation without needing any tool calls — the task was purely analytical and the agent correctly recognized it didn't need to explore the codebase. The response is comprehensive: it identifies the correct test type, explains the reasoning with a clear comparison table, provides a concrete code example, names the target project, and adds appropriate nuance about when a UI test might still be warranted as a supplement. The answer is direct, actionable, and demonstrates strong understanding of testing best practices in the .NET MAUI ecosystem.
- **The agent identifies that a unit test or device test would be lighter and sufficient for testing a property setter**: 5/5 — The agent clearly and immediately identifies that a unit test is the right primary test type, stating 'This should be covered by a unit test in Controls.Core.UnitTests.' It explicitly calls out that the fix is in cross-platform Entry.cs property setter logic, which doesn't need a full UI test.
- **The agent explains WHY a lighter test type is appropriate (property logic doesn't require Appium/visual UI)**: 5/5 — The agent provides four concrete reasons why the UI test is problematic as the sole test: it's slow (requires Appium + device/emulator), fragile (depends on XAML page, button tap, element lookup), indirect (tests plumbing more than actual logic), and expensive in CI (Helix agents vs. seconds for unit tests). It also correctly notes the fix is 'pure control logic — not platform-specific handler code.'
- **The recommendation is actionable, not just 'consider a unit test' — it explains what project to use or what the unit test would look like**: 5/5 — The agent provides a concrete code example of what the unit test would look like, names the specific project ('Controls.Core.UnitTests'), and even suggests keeping the UI test as supplementary. The recommendation is immediately actionable — a developer could copy-paste the test and adapt it.
## With-Skill Judge (Isolated)
Overall Score: 5/5
Reasoning: The agent produced an excellent, comprehensive evaluation despite taking minimal steps (only 2 tool calls). It correctly identified the core issue — a UI test is overkill for a cross-platform property setter — and provided thorough analysis across multiple dimensions including fix coverage, edge cases, test type appropriateness, convention compliance, flakiness risk, assertion quality, and fix-test alignment. The recommendations are specific and actionable with concrete code examples. The agent also astutely noted that the test doesn't actually verify the claimed behavior (it checks empty state rather than verifying programmatic text input is blocked), which is a critical observation. The evaluation is well-structured, clear, and directly addresses all rubric criteria with strong reasoning.
- **The agent identifies that a unit test or device test would be lighter and sufficient for testing a property setter**: 5/5 — The agent clearly and explicitly identifies that a unit test is the appropriate test type. It states: 'Current: UI Test (heaviest), Recommendation: Unit Test — this is a property setter in cross-platform code. No platform rendering, native APIs, or visual verification needed.' It also frames the UI test as something that could be kept as a supplement but should not be the only test.
- **The agent explains WHY a lighter test type is appropriate (property logic doesn't require Appium/visual UI)**: 5/5 — The agent gives a clear rationale: 'this is a property setter in cross-platform code. No platform rendering, native APIs, or visual verification needed.' It also contrasts execution time ('milliseconds vs. minutes for a UI test') and highlights that the UI test introduces unnecessary indirection (button tap → handler → property) rather than directly testing the setter logic. It also notes the flakiness risk and race condition potential of the UI approach.
- **The recommendation is actionable, not just 'consider a unit test' — it explains what project to use or what the unit test would look like**: 5/5 — The agent provides a concrete code example of what the unit test should look like, names the specific project where it should go ('Controls.Core.UnitTests'), lists specific edge cases to add (toggle back to false, pre-existing text preservation, programmatic Text assignment while read-only), and structures the recommendation into three actionable items. This is highly actionable.
## With-Skill Judge (Plugin)
Overall Score: 4.7/5
Reasoning: The agent delivers a well-structured, thorough evaluation that hits all three rubric criteria strongly. It correctly identifies the test type mismatch, provides clear reasoning for why a unit test is appropriate, and gives actionable recommendations including a code example and specific project name. The analysis also goes beyond the rubric requirements by evaluating assertion quality, edge cases, and convention compliance. The one minor weakness is that the agent couldn't access the repo (no checkout available), so it couldn't verify whether existing unit tests for IsReadOnly already exist or inspect the actual Entry.cs change — but it acknowledged this limitation honestly. The output is well-organized, clear, and would be genuinely useful to a PR author. Docking slightly from 5 because the agent's attempted repo exploration (grep/glob) failed due to no repo being available, meaning the analysis was done purely from the prompt description rather than actual code inspection, and some assertions about the fix behavior (e.g., that setting Text after IsReadOnly=true should be rejected) are assumptions that may or may not match the actual implementation.
- **The agent identifies that a unit test or device test would be lighter and sufficient for testing a property setter**: 5/5 — The agent clearly and unambiguously identifies that a unit test is the correct test type. It states 'A unit test is the correct choice' and explains that the current UI test (Appium) is 'the heaviest possible way to verify this.' The recommendation is definitive, not hedged.
- **The agent explains WHY a lighter test type is appropriate (property logic doesn't require Appium/visual UI)**: 5/5 — The agent provides a clear decision-tree rationale: 'Does the test need to interact with visual UI elements? → NO. Does it need a platform/native context? → NO (property setter logic) → Unit test.' It also explains the practical benefit: 'Runs in milliseconds instead of minutes.' The reasoning is thorough and well-structured.
- **The recommendation is actionable, not just 'consider a unit test' — it explains what project to use or what the unit test would look like**: 5/5 — The agent provides concrete, copy-pasteable unit test code with two test methods (IsReadOnlyPreventsTextSet and IsReadOnlyToggledBackAllowsText), specifies the target project ('Controls.Core.UnitTests'), and even recommends deleting the UI test entirely. It also suggests additional edge cases. This is highly actionable.
## Baseline Agent Output
No, a UI test is not the right primary test for this change.
The fix is in a cross-platform property setter (Entry.cs), which is core logic — not platform-specific rendering or user interaction. This should be covered by a unit test in Controls.Core.UnitTests, not an Appium-based UI test.
Why the UI test is wrong here:
| Concern | Unit Test | UI Test (current) |
|---|---|---|
| Tests property setter directly | ✅ | ❌ (indirectly via UI) |
Runs in maui-pr CI |
✅ | ❌ (needs maui-pr-uitests) |
| Execution speed | Fast | Slow (device/emulator) |
| Failure diagnosis | Clear | Flaky, opaque |
What the unit test should look like:
[Fact]
public void IsReadOnlyPreventsSettingText()
{
var entry = new Entry();
entry.IsReadOnly = true;
entry.Text = "should not set";
Assert.Equal("", entry.Text);
}This directly validates the property setter behavior in Controls.Core.UnitTests.csproj.
Verdict:
Replace the UI test with a unit test. A UI test would only be justified if you also need to verify that the platform handlers respect the change (e.g., native keyboard is suppressed) — but even then, it would be supplementary, not the sole test.
## With-Skill Agent Output (Isolated)
PR Test Evaluation Report
Test files evaluated: 1
Fix files: 1 (Entry.cs — cross-platform property setter)
Overall Verdict
❌ Tests are insufficient
The fix is a cross-platform property setter change — pure logic testable with a unit test. A UI test is the heaviest possible option and doesn't adequately cover the behavior.
1. Fix Coverage — ⚠️
The test sets IsReadOnly via a button tap and checks text is empty, but it doesn't verify that programmatic text input is actually blocked. It only checks initial state after setting the flag — it never attempts to set Text programmatically and asserts it was rejected.
2. Edge Cases & Gaps — ❌
Covered: Setting IsReadOnly = true on an empty entry
Missing:
- Setting
IsReadOnly = truewhenTextalready has a value — is existing text preserved? - Setting
Textprogrammatically afterIsReadOnly = true— is it rejected? - Toggling
IsReadOnlyback tofalse— does input work again? IsReadOnly = truethenIsReadOnly = true(idempotency)- Null text +
IsReadOnly
3. Test Type Appropriateness — ❌
Current: UI Test (Appium, _IssuesUITest)
Recommendation: Unit test. The fix changes a property setter in cross-platform Entry.cs. This is pure logic — no native rendering, no platform APIs, no visual layout. A unit test in Controls.Core.UnitTests can directly set IsReadOnly and Text, then assert the behavior:
[Fact]
public void IsReadOnlyPreventsTextInput()
{
var entry = new Entry();
entry.IsReadOnly = true;
entry.Text = "should be blocked";
Assert.Equal("", entry.Text);
}4. Convention Compliance — ✅
UI test conventions are followed (naming, base class, WaitForElement before interaction, [Category] attribute).
5. Flakiness Risk — ⚠️ Medium
App.Tap("SetReadOnlyButton") has no wait — if the button isn't immediately available after the entry, this could flake. Also no WaitForElement after the tap before reading text.
6. Duplicate Coverage — ✅ No duplicates
No similar existing IsReadOnly + text-input test identified from context.
7. Platform Scope — ✅
Cross-platform fix, UI test runs on all platforms. A unit test would also cover all platforms.
8. Assertion Quality — ⚠️
Asserting text == "" only proves the entry starts empty. It doesn't prove IsReadOnly blocked anything. The assertion should verify that a write attempt was rejected.
9. Fix-Test Alignment — ⚠️
Fix targets Entry.IsReadOnly setter's effect on programmatic text input. Test never actually attempts programmatic text input — it only checks the empty state.
Recommendations
- Replace with a unit test in
Controls.Core.UnitTests— the fix is cross-platform property logic that doesn't need UI, a platform, or Appium - Test the actual behavior: set
TextafterIsReadOnly = trueand assert it's rejected - Add edge cases: pre-existing text preserved, toggle back to
false, idempotent calls - If a UI test is kept anyway, add
WaitForElement("SetReadOnlyButton")before the tap, and actually type/send text to verify it's blocked
## With-Skill Agent Output (Plugin)
No repo checkout available, but the description gives enough context to evaluate. Here's my assessment:
PR Test Evaluation Report
Test files evaluated: 1 | Fix files: 1 (Entry.cs cross-platform)
Overall Verdict
❌ Tests need improvement — wrong test type and missing edge cases
A cross-platform property setter change should be covered by a unit test, not a UI test. The UI test is ~100x slower and adds flakiness risk for something testable in isolation.
1. Fix Coverage — ⚠️
The test sets IsReadOnly and checks text is empty, but the assertion is weak — text could be empty simply because nothing was typed. It doesn't prove programmatic input was blocked.
2. Edge Cases & Gaps — ❌
Missing:
- Setting
IsReadOnly = falseaftertrue(roundtrip) - Entry already has text → set
IsReadOnly = true→ verify text is preserved but new input blocked - Setting
IsReadOnlymultiple times IsReadOnlyinteraction withTextproperty setter
3. Test Type Appropriateness — ❌
Current: UI Test (Appium)
Recommendation: Unit test. The fix is in a cross-platform property setter in Entry.cs — pure logic, no native rendering or platform API involved. A unit test in Controls.Core.UnitTests can directly set IsReadOnly and assert the property behavior without launching an app.
[Fact]
public void IsReadOnlyBlocksProgrammaticInput()
{
var entry = new Entry();
entry.IsReadOnly = true;
entry.Text = "should not change";
Assert.Equal("", entry.Text);
}4. Convention Compliance — ✅
Naming, base class, WaitForElement before Tap, [Category] — all correct for a UI test.
5. Flakiness Risk — ✅ Low
Simple tap-and-check, no screenshots or timing issues.
6. Duplicate Coverage — ⚠️ Unknown
Cannot check without repo access. Likely existing IsReadOnly tests in unit test projects.
7. Platform Scope — ✅
Cross-platform fix, UI test runs on all platforms.
8. Assertion Quality — ⚠️
Is.EqualTo("") checks empty text but doesn't prove why it's empty. A stronger test would set text before/after toggling IsReadOnly.
9. Fix-Test Alignment — ⚠️
Test exercises the fix indirectly through the full UI stack. A unit test would target the exact setter.
Recommendations
- Replace with a unit test in
Controls.Core.UnitTests— the fix is cross-platform property logic, no UI needed - Add edge cases: roundtrip (
true→false), pre-existing text preservation, repeated toggles - Strengthen assertions: set
Textprogrammatically afterIsReadOnly = trueand verify it's rejected, rather than checking for empty text
</details>
<details>
<summary>📊 evaluate-pr-tests / weak-assertion-detection-meaningless-test-assertions</summary>
# Judge Report: Weak assertion detection - meaningless test assertions
## Baseline Judge
Overall Score: 5/5
Reasoning: The agent delivered a thorough, accurate, and well-structured review without needing any tool calls — the task was purely analytical and the agent correctly recognized it didn't need to explore a codebase. It identified both core problems (tautological assert, redundant null check), explained *why* each is a problem at a technical level, provided concrete improvement suggestions with code, and gave a clear overall verdict. The response is concise, actionable, and would genuinely help a PR author improve their tests.
- **The agent correctly identifies Assert.That(true) as a meaningless assertion that proves nothing about the fix**: 5/5 — The agent clearly identifies `Assert.That(true)` as a 'tautology' that 'always passes regardless of app state.' It further notes that the crash-detection rationale is already covered by `WaitForElement`, which would timeout and fail if the app crashed. This is a thorough and accurate analysis.
- **The agent identifies that checking Is.Not.Null on a UI element is too vague to catch actual regressions**: 5/5 — The agent correctly identifies the null check as 'redundant' since `App.FindElement` already throws if the element doesn't exist. It explains that the element could exist but display wrong content, meaning a regression in displayed data would go undetected. This is well-reasoned.
- **The agent provides concrete examples of what specific assertions SHOULD look like to catch the regression**: 4.7/5 — The agent provides two concrete code examples: (1) checking that `StatusLabel` text equals 'No selection' after clearing, and (2) checking that the label text contains 'Loaded'. Both are realistic, actionable replacements written in the same NUnit/Appium style as the original tests.
- **The overall verdict reflects that the assertions are insufficient, not just a minor concern**: 5/5 — The summary table explicitly marks both tests as '❌ No' for 'Would catch the actual regression?' and the concluding paragraph states clearly that 'A regression where selection silently fails to clear, or the label shows wrong data, would pass green.' The verdict is unambiguous that the assertions are inadequate.
## With-Skill Judge (Isolated)
Overall Score: 5/5
Reasoning: The agent delivered a thorough, well-structured, and accurate evaluation of the two test snippets. It correctly identified every key weakness (meaningless assertion, overly permissive null check), provided concrete remediation code, noted ancillary issues (missing Category attribute, potential for unit tests), and rendered a clear overall verdict. The response was produced efficiently in a single turn with minimal tool usage. The analysis covers assertion quality, edge cases, test type appropriateness, conventions, and flakiness — going beyond the minimum requirements of the rubric.
- **The agent correctly identifies Assert.That(true) as a meaningless assertion that proves nothing about the fix**: 5/5 — The agent clearly and accurately identifies `Assert.That(true)` as a 'no-op assertion' that 'always passes,' explicitly stating it only proves the app didn't crash and not that the selection was actually cleared. This is precise and correct.
- **The agent identifies that checking Is.Not.Null on a UI element is too vague to catch actual regressions**: 5/5 — The agent correctly identifies that `Is.Not.Null` is redundant because `App.FindElement` typically throws if the element isn't found, and further notes it doesn't verify the CollectionView loaded data, just that a label exists. Both points are valid and well-articulated.
- **The agent provides concrete examples of what specific assertions SHOULD look like to catch the regression**: 5/5 — The agent provides two concrete code examples: one asserting on label text after clearing selection (`Assert.That(label.GetText(), Is.EqualTo("No selection"))`), and another asserting on meaningful content (`Assert.That(elem.GetText(), Does.Contain("Loaded"))`). Both are realistic, actionable suggestions using proper NUnit syntax.
- **The overall verdict reflects that the assertions are insufficient, not just a minor concern**: 5/5 — The agent opens with a clear 'No' verdict, calls them 'weak assertions that would pass even if the underlying feature is broken,' provides a summary table of gaps and their impacts, and concludes they are 'smoke tests at best.' The tone and substance clearly convey these are significant deficiencies, not minor concerns.
## With-Skill Judge (Plugin)
Overall Score: 5/5
Reasoning: The agent delivered an excellent, well-structured review with zero tool calls needed (this was a pure reasoning/code-review task). It correctly identified both assertion weaknesses, explained why they fail to catch regressions, provided concrete fix examples, and presented a clear summary. The response was efficient (single turn, 19s), well-organized with headers and a summary table, and followed a sound general principle. The only minor quibble is that the suggested replacement text values are necessarily hypothetical, but the agent appropriately frames them as illustrative examples.
- **The agent correctly identifies Assert.That(true) as a meaningless assertion that proves nothing about the fix**: 5/5 — The agent explicitly calls out `Assert.That(true)` as 'a no-op — it always passes' and clearly states it 'verifies nothing about selection state.' It also correctly notes that while a crash would be caught by the preceding `App.Tap`/`App.WaitForElement` calls throwing, the assertion itself is meaningless. This is a thorough and accurate identification.
- **The agent identifies that checking Is.Not.Null on a UI element is too vague to catch actual regressions**: 5/5 — The agent clearly explains that `App.FindElement` typically throws if the element isn't found (making the null check redundant), and even if it returns null, it only proves the label exists — not that the CollectionView loaded data correctly. This is a precise and well-reasoned critique.
- **The agent provides concrete examples of what specific assertions SHOULD look like to catch the regression**: 4.7/5 — For both tests, the agent provides concrete code examples. For Test 1, it suggests asserting on `StatusLabel` text equaling 'None'. For Test 2, it suggests asserting on the label's text content with a specific expected value like 'Loaded 10 items'. Both are practical, realistic suggestions that demonstrate how to assert on observable output rather than mere existence.
- **The overall verdict reflects that the assertions are insufficient, not just a minor concern**: 5/5 — The agent opens with a clear 'No, both assertions are too weak to catch regressions' and closes with a strong summary: 'Both tests are crash-only smoke tests. Neither validates the actual behavior under test... A regression that silently produces wrong state would pass both tests.' This is an unambiguous verdict that the assertions are fundamentally insufficient.
## Baseline Agent Output
No, both assertions are essentially no-ops. Here's why:
Test 1 – SelectionClearsOnNull
Assert.That(true)is a tautology — it can never fail.WaitForElementalready throws on crash/timeout, so the assertion adds nothing.- Missing: Verify the selection was actually cleared, e.g., check a label or item count:
var status = App.FindElement("StatusLabel").GetText(); Assert.That(status, Is.EqualTo("No selection"));
Test 2 – CollectionViewLoads
App.FindElementthrows if the element isn't found, soIs.Not.Nullnever triggers — the test would fail with an exception before the assert runs.- Missing: Assert on the element's content, not just existence:
var text = App.FindElement("StatusLabel").GetText(); Assert.That(text, Is.Not.Empty);
Summary: Neither test validates observable behavior. They pass even if the feature is completely broken (as long as the app doesn't crash). Good UI tests should assert on state changes — selected item text, label content, visual properties — not just element existence.
## With-Skill Agent Output (Isolated)
PR Test Evaluation Report
Test files evaluated: 2 tests (provided inline)
Overall Verdict
❌ Tests are insufficient
Both tests have assertions too weak to catch regressions. A passing test proves almost nothing about the behavior under test.
1. Fix Coverage — ❌
SelectionClearsOnNull: The name implies it verifies selection is cleared when set to null, but Assert.That(true) never checks selection state. If the fix regressed and selection wasn't cleared (but didn't crash), this test still passes.
CollectionViewLoads: Only checks an element exists, not that loading succeeded with correct state.
2. Edge Cases & Gaps — ❌
Missing:
- Verify selection state before clearing (precondition)
- Verify selection state after clearing (the actual behavior)
- Double-clear (clear when already null)
- What
StatusLabeltext should be after load
3. Test Type Appropriateness — ⚠️
Both are UI tests. SelectionClearsOnNull exercises user interaction, so UI test is defensible — but the selection-clearing logic could likely be a unit test on CollectionView.SelectedItem. CollectionViewLoads could be a device test.
4. Convention Compliance — ⚠️
CollectionViewLoadsis missing a[Category()]attribute.
5. Flakiness Risk — ✅ Low
WaitForElement is used before interactions — good.
6. Assertion Quality — ❌
| Test | Assertion | Problem |
|---|---|---|
SelectionClearsOnNull |
Assert.That(true) |
Always passes — meaningless |
CollectionViewLoads |
Is.Not.Null |
Too permissive — any element satisfies it |
Recommendations
- Replace
Assert.That(true)with a real assertion:App.Tap("ClearSelectionButton"); var status = App.FindElement("SelectionStatus").GetText(); Assert.That(status, Is.EqualTo("None"));
- Assert on label content, not just existence:
Assert.That(elem.GetText(), Is.EqualTo("Loaded"));
- Add
[Category]toCollectionViewLoads. - Consider whether a unit test on
SelectedItem = nullwould be a lighter way to cover the selection-clearing logic.
## With-Skill Agent Output (Plugin)
No, both assertions are weak and would not catch regressions:
Test 1: SelectionClearsOnNull
Assert.That(true) is a no-op — it always passes. If the app crashes, WaitForElement already throws before the assert is reached, making the assertion redundant. More critically, it never verifies that selection was actually cleared. A regression where the selection remains would pass silently.
Fix: Assert on observable state after clearing:
App.Tap("ClearSelectionButton");
var label = App.FindElement("StatusLabel");
Assert.That(label.GetText(), Is.EqualTo("Selection: None"));Test 2: CollectionViewLoads
Is.Not.Null only checks the element exists, not that it has correct content. If the label renders with wrong text (e.g., an error message), this still passes.
Fix: Assert on the expected text:
var elem = App.FindElement("StatusLabel");
Assert.That(elem.GetText(), Is.EqualTo("Loaded: 10 items"));Summary
| Test | Problem | Risk |
|---|---|---|
SelectionClearsOnNull |
Always-true assert | Silent regression — selection may not clear |
CollectionViewLoads |
Only checks existence | Wrong content goes undetected |
General rule: Assert on the observable side-effect of the action, not just that the app didn't crash or that an element exists.
</details>
[📋 View full workflow run](https://github.com/dotnet/maui/actions/runs/24453083479)
- Remove bootstrap pull_request trigger (no longer needed) - Fix comment: symlinks → copies in test directory preparation - Use github.paginate() for comment upsert to handle 100+ comments - Add learn-from-pr.agent.md to README-AI.md agent lists - Fix metrics: correct agent/skill counts (3 agents, 15 skills)
🔍 Multi-Model Code Review — PR #34976 (Final Round)Reviewed by: 3 independent AI reviewers with adversarial consensus
Previous Finding Status (All Rounds)
Score: 14/15 actionable findings resolved. All 4 deferred items have documented justifications. New Finding This Round🟢 MINORN6. Empty verdicts produce false-pass in comment job (2/3 reviewers)File: The ALL_PASSED=$(jq 'if .verdicts | length == 0 then false else ...' "$RESULTS_JSON")But the let evalPassed = true; // initialized true
for (const v of data.verdicts) {
if (!v.passed) evalPassed = false; // only flipped on failure
}If Fix: Add an empty-verdicts guard: if (data.verdicts && data.verdicts.length > 0) {
allVerdicts.push(...data.verdicts);
for (const v of data.verdicts) {
if (!v.passed) evalPassed = false;
}
} else {
evalPassed = false; // no verdicts = not passed
}Discarded Findings (no consensus)
💡 Suggestion (from prior round, still applicable)Add Recommendation✅ Approve — All critical and moderate security findings are resolved. The one new minor finding (N6: empty-verdicts false-pass) is an edge case unlikely to occur in practice. The 4 deferred items have sound justifications. The workflow has been tested end-to-end (successful run with 4 skill evaluations). Ship it.
|
- When any evaluated skill fails, add a copy-paste prompt for AI-assisted investigation (matching dotnet/skills pattern) that tells users how to download artifacts and follow InvestigatingResults.md - Update pipeline link at bottom of PR comment to match dotnet/skills style - Write step summary with full results and investigation prompt
- Fix #1 (CRITICAL): Shell injection in evaluate job - use env var indirection for skills_path and rename GITHUB_TOKEN to COPILOT_TOKEN - Fix #4: PowerShell here-string injection - use env:SKILL_DIRS - Fix #5: Code-fence injection - escape triple backticks in output - Fix #6: Delete old agent .md files (keep .agent.md versions) - Fix #7: Shell injection in static-check heredoc - use env var - Fix #8: Direct expression interpolation - use env: block for event props - Fix #9: Heredoc EOF spoofing - use random delimiter - Fix #10: Missing pipe-escaping for skillName/scenarioName - Fix #11: Renamed Copilot token env var to COPILOT_TOKEN - Fix #13: Token rotation entropy - mix in github.run_id - Fix #14: Fix misleading inline comment for icon logic
|
Thanks for the thorough multi-model review! Here's my response to each point. Fixes are in commit 48feb96. 🔴 CRITICAL
Fixed. Moved 🟡 MODERATE
Acknowledged, deferring. This is intentionally a nightly tag — the same pattern used in dotnet/runtime, dotnet/msbuild, and github/awesome-copilot. Pinning a checksum would require bumping it manually on every nightly release, which defeats the purpose. The binary comes from dotnet/skills, a Microsoft-maintained repo. Happy to revisit if/when skill-validator publishes versioned releases with checksums.
Acknowledged — known GitHub Actions limitation. There's no content filter available for
Fixed. Replaced with
Fixed. Added
Fixed. Deleted 🟢 MINOR
Fixed. Moved to
Fixed. Moved
Fixed. Both heredoc outputs now use
Fixed. Added
Fixed (combined with #1). The env var is now
Acknowledged. Valid concern. For now keeping the simple model — a push during evaluation means the PR content changed, so the in-progress eval is stale anyway. If we find this wasteful in practice, we can split into separate concurrency groups for static vs. LLM evaluation.
Fixed. Token index now mixes
Fixed. Updated the comment to say
Acknowledged. Fair point. The JS logic is complex enough to warrant tests, but it's tightly coupled to the GitHub Actions runtime. Will consider extracting into a testable module in a follow-up. ℹ️ Discarded FindingsAgree with discarding all three — cosmetic trailing newline, |
1 similar comment
|
Thanks for the thorough multi-model review! Here's my response to each point. Fixes are in commit 48feb96. 🔴 CRITICAL
Fixed. Moved 🟡 MODERATE
Acknowledged, deferring. This is intentionally a nightly tag — the same pattern used in dotnet/runtime, dotnet/msbuild, and github/awesome-copilot. Pinning a checksum would require bumping it manually on every nightly release, which defeats the purpose. The binary comes from dotnet/skills, a Microsoft-maintained repo. Happy to revisit if/when skill-validator publishes versioned releases with checksums.
Acknowledged — known GitHub Actions limitation. There's no content filter available for
Fixed. Replaced with
Fixed. Added
Fixed. Deleted 🟢 MINOR
Fixed. Moved to
Fixed. Moved
Fixed. Both heredoc outputs now use
Fixed. Added
Fixed (combined with #1). The env var is now
Acknowledged. Valid concern. For now keeping the simple model — a push during evaluation means the PR content changed, so the in-progress eval is stale anyway. If we find this wasteful in practice, we can split into separate concurrency groups for static vs. LLM evaluation.
Fixed. Token index now mixes
Fixed. Updated the comment to say
Acknowledged. Fair point. The JS logic is complex enough to warrant tests, but it's tightly coupled to the GitHub Actions runtime. Will consider extracting into a testable module in a follow-up. ℹ️ Discarded FindingsAgree with discarding all three — cosmetic trailing newline, |
…scape skillName in summary, use env vars for eval_all and PR_NUMBER
|
Addressed the remaining open review items in commit 14f3681. Summary:
cc @PureWeen |
🔍 Multimodal Code Review — PR #34976Reviewed by 3 independent AI models: Claude Opus 4.6, Claude Sonnet 4.5, GPT-5.4 Each model reviewed the full diff independently, then findings were cross-referenced and deduplicated. Findings marked with model agreement indicators. 🔴 Issues (2 findings)1.
|
| Severity | Count | Key Items |
|---|---|---|
| 🔴 Issues | 2 | Runner waste on every comment; untrusted markdown in bot comment |
| 🟡 Concerns | 9 | JS injection surface, concurrency race, overstated counts, missing secret handling, duplicated code, stale cache |
| 🟢 Suggestions | 4 | Path filter gap, workflow_dispatch input, hardcoded models, cost tiers |
| ✅ Strengths | 7 | Security model, token rotation, comment UX, edge case handling |
Overall: Well-architected workflow with solid security fundamentals. The two 🔴 items (runner waste + untrusted markdown rendering) are worth addressing before merge; the rest are hardening suggestions for follow-up.
This review was generated by running 3 independent AI models (Claude Opus 4.6, Claude Sonnet 4.5, GPT-5.4) against the full PR diff, then cross-referencing and deduplicating findings.
… empty-verdicts false-pass, add plugin.json to path filters
|
Responding to #34976 (comment) Fixes in commit da4acdb.
cc @PureWeen |
🟢 .NET MAUI Code Review — Approved
🟢 Review Session — Approved —
|
Note
Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment if this change resolves your issue. Thank you!
Description of Change
Adopting skills-validator
This is inspired and sourced from similar integrations:
One important difference is that in this case we run as well in the LLM judging mode
cc: @PureWeen - for awareness and for help with iterating on this :)