Summary
Convert the repo's latent product contract into a repeatable benchmark suite with explicit pass/fail evidence.
This issue was generated from an org-wide EvalOps mining pass on 2026-05-10 07:57 UTC. It combines live GitHub repo signals with a per-repo arXiv search. Treat the research links as grounding for a concrete implementation, not as a request for a literature review.
Repo Evidence
- Repository description: Context-aware AI desktop assistant for macOS
- Tree signals: 1 docs file, 1 workflow, 0 proto files, 6 test-like files.
- package.json:132 includes latent-spec language: "extendInfo": { "NSMicrophoneUsageDescription": "Kestrel needs microphone access to record meeting audio.", "NSAppleEventsUsageDescription": "Kestrel needs to read browser tab information for context.",
- package.json:133 includes latent-spec language: "NSMicrophoneUsageDescription": "Kestrel needs microphone access to record meeting audio.", "NSAppleEventsUsageDescription": "Kestrel needs to read browser tab information for context.", "NSAccessibilityUsageDescription": "Kestrel uses accessibility features to understand what you're working on.",
- .claude/skills/ax-agent-optimize/SKILL.md:3 includes latent-spec language: name: ax-agent-optimize description: This skill helps an LLM generate correct AxAgent tuning and evaluation code using @ax-llm/ax. Use when the user asks about agent.optimize(...), judgeOptions, eval datasets, optimization targets, saved optimizedProgram artifacts, or recursive optimization guidance. version: "19.0.33"
- .claude/skills/ax-agent-optimize/SKILL.md:9 includes latent-spec language: Use this skill for agent.optimize(...) workflows. Prefer short, modern, copyable patterns. Do not repeat general agent-authoring guidance unless the user needs it.
- .claude/skills/ax-agent-optimize/SKILL.md:16 includes latent-spec language: - If the user wants reusable improvements, include artifact save/load. - If the user wants cost or recursion behavior improved, make the eval tasks expose those tradeoffs explicitly.
- .claude/skills/ax-agent-optimize/SKILL.md:23 includes latent-spec language: - Prefer the built-in judge path for open-ended assistant tasks: judgeAI plus judgeOptions. - Only reach for a plain typed AxGen evaluator when the user needs LLM-as-judge behavior outside the built-in agent.optimize(...) flow. - Default optimize target is root.actor; use target: 'responder' or explicit pro
Research Grounding
Repo axes: desktop, security, evaluation, data
Search keywords: npm, run, build, sdk, kestrel, app, context, contextkit, native, electron, meeting, swift
- arXiv:2504.18575v3 WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks (Ivan Evtimov, Arman Zharmagambetov, Aaron Grattafiori, Chuan Guo, Kamalika Chaudhuri), 2025.
- arXiv:2506.14866v2 OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents (Thomas Kuntz, Agatha Duzan, Hao Zhao, Francesco Croce, Zico Kolter, Nicolas Flammarion), 2025.
- arXiv:2510.04257v1 AgentTypo: Adaptive Typographic Prompt Injection Attacks against Black-box Multimodal Agents (Yanjie Li, Yiming Cao, Dong Wang, Bin Xiao), 2025.
- arXiv:2511.20597v1 BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents (Kaiyuan Zhang, Mark Tenenholtz, Kyle Polley, Jerry Ma, Denis Yarats, Ninghui Li), 2025.
- arXiv:2507.05445v1 A Systematization of Security Vulnerabilities in Computer Use Agents (Daniel Jones, Giorgio Severi, Martin Pouliot, Gary Lopez, Joris de Gruyter, Santiago Zanella-Beguelin), 2025.
- arXiv:2602.09222v1 MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks (Georgios Syros, Evan Rose, Brian Grinstead, Christoph Kerschbaumer, William Robertson, Cristina Nita-Rotaru), 2026.
- arXiv:2506.02456v2 VPI-Bench: Visual Prompt Injection Attacks for Computer-Use Agents (Tri Cao, Bennett Lim, Yue Liu, Yuan Sui, Yuexin Li, Shumin Deng), 2025.
- arXiv:2604.25562v1 SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents (Mengyao Du, Han Fang, Haokai Ma, Jiahao Chen, Kai Xu, Quanjun Yin), 2026.
- arXiv:2510.03705v1 Backdoor-Powered Prompt Injection Attacks Nullify Defense Methods (Yulin Chen, Haoran Li, Yuan Sui, Yangqiu Song, Bryan Hooi), 2025.
- arXiv:2604.12284v1 WebAgentGuard: A Reasoning-Driven Guard Model for Detecting Prompt Injection Attacks in Web Agents (Yulin Chen, Tri Cao, Haoran Li, Yue Liu, Yibo Li, Yufei He), 2026.
What To Build
- Define the smallest representative kestrel golden workflow and capture expected inputs, outputs, and evidence artifacts.
- Add fixtures for a successful path, an ambiguous/degraded path, and a failure path.
- Publish a command that local agents and CI can run before shipping related changes.
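The three fixture paths and the pass/fail evidence could be sketched roughly as follows. Everything here is an assumption for illustration: the `Outcome`, `Fixture`, and `EvidenceRecord` types, the `stubWorkflow` stand-in, and the "microphone" permission string are hypothetical names, not taken from the kestrel codebase. The real golden workflow would replace `stubWorkflow`.

```typescript
// Hypothetical types; none of these names exist in the kestrel repo.
type Outcome = "pass" | "degraded" | "fail";

interface WorkflowInput {
  grantedPermissions: string[];
  transcript: string;
}

interface Fixture {
  name: string;
  input: WorkflowInput;
  expected: Outcome;
}

interface EvidenceRecord {
  fixture: string;
  expected: Outcome;
  actual: Outcome;
  ok: boolean;
}

// Stand-in for the real golden workflow (assumption): fails without
// microphone permission, degrades on an empty transcript, passes otherwise.
function stubWorkflow(input: WorkflowInput): Outcome {
  if (input.grantedPermissions.indexOf("microphone") === -1) return "fail";
  if (input.transcript.trim() === "") return "degraded";
  return "pass";
}

// Runs every fixture and records expected vs. actual as an evidence artifact.
function runSuite(
  fixtures: Fixture[],
  workflow: (input: WorkflowInput) => Outcome = stubWorkflow
): EvidenceRecord[] {
  return fixtures.map((f) => {
    const actual = workflow(f.input);
    return { fixture: f.name, expected: f.expected, actual, ok: actual === f.expected };
  });
}

// One fixture per path named in the list above.
const fixtures: Fixture[] = [
  { name: "success", input: { grantedPermissions: ["microphone"], transcript: "Agenda item one." }, expected: "pass" },
  { name: "degraded", input: { grantedPermissions: ["microphone"], transcript: "" }, expected: "degraded" },
  { name: "failure", input: { grantedPermissions: [], transcript: "Agenda item one." }, expected: "fail" },
];

const evidence = runSuite(fixtures);
const suitePassed = evidence.every((e) => e.ok);
console.log(JSON.stringify({ suitePassed, evidence }, null, 2));
```

A published command (for example an npm script, whose name would be up to the maintainers) could wrap a runner like this and exit non-zero when `suitePassed` is false, so local agents and CI read the same evidence.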
Acceptance Criteria
Notes
- Generated issue 1/5 for evalops/kestrel by evalops_org_miner.py.
- Before implementation, confirm the sampled latent-spec snippets still match main; this issue intentionally cites exact file paths/lines where the mining pass saw them.
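One way to confirm that a cited snippet still sits at its recorded line is sketched below. The helper name `snippetStillMatches` and the three-line default window are assumptions made for illustration. The window exists because the mined snippets appear to have been flattened from several source lines into one. Reading the file from main is left to the caller.

```typescript
// Collapse whitespace runs so a flattened snippet compares cleanly
// against multi-line source text.
function normalizeWhitespace(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

// Hypothetical helper: returns true when `snippet` appears within `span`
// lines of `fileText`, starting at the 1-indexed `line`.
function snippetStillMatches(
  fileText: string,
  line: number,
  snippet: string,
  span: number = 3
): boolean {
  const lines = fileText.split("\n");
  const excerpt = lines.slice(line - 1, line - 1 + span).join(" ");
  return normalizeWhitespace(excerpt).indexOf(normalizeWhitespace(snippet)) !== -1;
}
```

A snippet that no longer matches at its cited line suggests the evidence has drifted and should be re-sampled before fixtures are built on it.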