Skip to content

Add adversarial safety fixtures for kestrel (context-aware desktop assistant macos) #91

@haasonsaas

Description

@haasonsaas

Summary

Exercise prompt/tool/data poisoning and fail-closed behavior for the repo's most sensitive agent-facing path.

This issue was generated from an org-wide EvalOps mining pass on 2026-05-10 07:57 UTC. It combines live GitHub repo signals with a per-repo arXiv search. Treat the research links as grounding for a concrete implementation, not as a request for a literature review.

Repo Evidence

  • Repository description: Context-aware AI desktop assistant for macOS
  • Tree signals: 1 docs files, 1 workflows, 0 proto files, 6 test-like files.
  • package.json:132 includes latent-spec language: "extendInfo": { "NSMicrophoneUsageDescription": "Kestrel needs microphone access to record meeting audio.", "NSAppleEventsUsageDescription": "Kestrel needs to read browser tab information for context.",
  • package.json:133 includes latent-spec language: "NSMicrophoneUsageDescription": "Kestrel needs microphone access to record meeting audio.", "NSAppleEventsUsageDescription": "Kestrel needs to read browser tab information for context.", "NSAccessibilityUsageDescription": "Kestrel uses accessibility features to understand what you're working on.",
  • .claude/skills/ax-agent-optimize/SKILL.md:3 includes latent-spec language: name: ax-agent-optimize description: This skill helps an LLM generate correct AxAgent tuning and evaluation code using @ax-llm/ax. Use when the user asks about agent.optimize(...), judgeOptions, eval datasets, optimization targets, saved optimizedProgram artifacts, or recursive optimization guidance. version: "19.0.33"
  • .claude/skills/ax-agent-optimize/SKILL.md:9 includes latent-spec language: Use this skill for agent.optimize(...) workflows. Prefer short, modern, copyable patterns. Do not repeat general agent-authoring guidance unless the user needs it.
  • .claude/skills/ax-agent-optimize/SKILL.md:16 includes latent-spec language: - If the user wants reusable improvements, include artifact save/load. - If the user wants cost or recursion behavior improved, make the eval tasks expose those tradeoffs explicitly.
  • .claude/skills/ax-agent-optimize/SKILL.md:23 includes latent-spec language: - Prefer the built-in judge path for open-ended assistant tasks: judgeAI plus judgeOptions. - Only reach for a plain typed AxGen evaluator when the user needs LLM-as-judge behavior outside the built-in agent.optimize(...) flow. - Default optimize target is root.actor; use target: 'responder' or explicit pro

Research Grounding

Repo axes: desktop, security, evaluation, data

Search keywords: npm, run, build, sdk, kestrel, app, context, contextkit, native, electron, meeting, swift

  • arXiv:2504.18575v3 WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks (Ivan Evtimov, Arman Zharmagambetov, Aaron Grattafiori, Chuan Guo, Kamalika Chaudhuri), 2025.
  • arXiv:2506.14866v2 OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents (Thomas Kuntz, Agatha Duzan, Hao Zhao, Francesco Croce, Zico Kolter, Nicolas Flammarion), 2025.
  • arXiv:2510.04257v1 AgentTypo: Adaptive Typographic Prompt Injection Attacks against Black-box Multimodal Agents (Yanjie Li, Yiming Cao, Dong Wang, Bin Xiao), 2025.
  • arXiv:2511.20597v1 BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents (Kaiyuan Zhang, Mark Tenenholtz, Kyle Polley, Jerry Ma, Denis Yarats, Ninghui Li), 2025.
  • arXiv:2507.05445v1 A Systematization of Security Vulnerabilities in Computer Use Agents (Daniel Jones, Giorgio Severi, Martin Pouliot, Gary Lopez, Joris de Gruyter, Santiago Zanella-Beguelin), 2025.
  • arXiv:2602.09222v1 MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks (Georgios Syros, Evan Rose, Brian Grinstead, Christoph Kerschbaumer, William Robertson, Cristina Nita-Rotaru), 2026.
  • arXiv:2506.02456v2 VPI-Bench: Visual Prompt Injection Attacks for Computer-Use Agents (Tri Cao, Bennett Lim, Yue Liu, Yuan Sui, Yuexin Li, Shumin Deng), 2025.
  • arXiv:2604.25562v1 SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents (Mengyao Du, Han Fang, Haokai Ma, Jiahao Chen, Kai Xu, Quanjun Yin), 2026.
  • arXiv:2510.03705v1 Backdoor-Powered Prompt Injection Attacks Nullify Defense Methods (Yulin Chen, Haoran Li, Yuan Sui, Yangqiu Song, Bryan Hooi), 2025.
  • arXiv:2604.12284v1 WebAgentGuard: A Reasoning-Driven Guard Model for Detecting Prompt Injection Attacks in Web Agents (Yulin Chen, Tri Cao, Haoran Li, Yue Liu, Yibo Li, Yufei He), 2026.

What To Build

  • Add adversarial fixtures for untrusted UI, filesystem, and desktop-control inputs.
  • Document the intended fail-closed behavior and any allowed degraded-mode fallback.
  • Add regression coverage that proves unsafe inputs do not silently reach the privileged path.

Acceptance Criteria

  • A short design note names the repo-specific workflow, threat or correctness model, and the research assumptions being adopted.
  • A runnable check, fixture, or verifier exercises the new contract in CI or an equivalent local command documented in the repo.
  • The implementation emits or stores enough evidence for a downstream agent/operator to cite inputs, decisions, and outputs.
  • At least one negative/degraded-mode case is covered so failures are observable rather than silently accepted.
  • Documentation links the new behavior to the relevant EvalOps platform primitive or explicitly records why this repo remains standalone.

Notes

  • Generated issue 3/5 for evalops/kestrel by evalops_org_miner.py.
  • Before implementation, confirm the sampled latent-spec snippets still match main; this issue intentionally cites exact file paths/lines where the mining pass saw them.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions