Summary
Exercise prompt/tool/data poisoning and fail-closed behavior for the repo's most sensitive agent-facing path.
This issue was generated from an org-wide EvalOps mining pass on 2026-05-10 07:57 UTC. It combines live GitHub repo signals with a per-repo arXiv search. Treat the research links as grounding for a concrete implementation, not as a request for a literature review.
Repo Evidence
- Repository description: Maestro — multi-model coding agent with TUI, web, IDE, Slack, and GitHub interfaces
- Tree signals: 79 docs files, 19 workflows, 1 proto files, 737 test-like files.
AGENTS.md:31 includes latent-spec language: Critical: Consult .github/workflows/ (evals.yml, nx-ci.yml, release.yml) to mirror CI environments.
AGENTS.md:36 includes latent-spec language: * Full Test Suite: npx nx run maestro:test --skip-nx-cache (Builds tui + maestro-web automatically). Run after every code change. * Linting: bun run bun:lint (Biome + Eval Verifier). Run after every code change. * Runtime Commands: Avoid long-lived dev/watch servers (e.g., npm run dev) unless th
AGENTS.md:54 includes latent-spec language: 2. Never force-push to main. This rewrites shared history and breaks collaborators. 3. Atomic commits only. Each commit should be one logical change. Don't mix unrelated changes. 4. Never use --force or --force-with-lease on shared branches.
AGENTS.md:187 includes latent-spec language: 4. Stop tasks when done - they'll auto-cleanup on Maestro exit, but explicit stops are cleaner 5. Use restart policies for resilient services - ideal for dev servers that should recover from crashes 6. Direct execution is safer - omit shell parameter for simple commands without pipes
AGENTS.md:374 includes latent-spec language: 4. Wire handler in src/cli-tui/tui-renderer.ts: ```typescript
CLAUDE.md:31 includes latent-spec language: Critical: Consult .github/workflows/ (evals.yml, nx-ci.yml, release.yml) to mirror CI environments.
Research Grounding
Repo axes: tooling, evaluation, security, desktop
Search keywords: run, maestro, use, dev, bun, never, servers, commands, test, build, command, npx
- arXiv:2508.07575v1 MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark (Shiqing Fan, Xichen Ding, Liang Zhang, Linjian Mo), 2025.
- arXiv:2603.24943v1 FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol (Jie Zhu, Yimin Tian, Boyang Li, Kehao Wu, Zhongzhi Liang, Junhui Li), 2026.
- arXiv:2508.12566v1 Help or Hurdle? Rethinking Model Context Protocol-Augmented Large Language Models (Wei Song, Haonan Zhong, Ziqi Ding, Jingling Xue, Yuekang Li), 2025.
- arXiv:2602.01129v1 SMCP: Secure Model Context Protocol (Xinyi Hou, Shenao Wang, Yifan Zhang, Ziluo Xue, Yanjie Zhao, Cai Fu), 2026.
- arXiv:2603.00123v1 CT-Flow: Orchestrating CT Interpretation Workflow with Model Context Protocol Servers (Yannian Gu, Xizhuo Zhang, Linjie Mu, Yongrui Yu, Zhongzhen Huang, Shaoting Zhang), 2026.
- arXiv:2507.19570v1 MCP4EDA: LLM-Powered Model Context Protocol RTL-to-GDSII Automation with Backend Aware Synthesis Optimization (Yiting Wang, Wanghao Ye, Yexiao He, Yiran Chen, Gang Qu, Ang Li), 2025.
- arXiv:2604.13849v1 MCPThreatHive: Automated Threat Intelligence for Model Context Protocol Ecosystems (Yi Ting Shen, Kentaroh Toyoda, Alex Leung), 2026.
- arXiv:2506.14683v2 Unified Software Engineering Agent as AI Software Engineer (Leonhard Applis, Yuntong Zhang, Shanchao Liang, Nan Jiang, Lin Tan, Abhik Roychoudhury), 2025.
- arXiv:2503.23803v2 Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute (Yingwei Ma, Yongbin Li, Yihong Dong, Xue Jiang, Rongyu Cao, Jue Chen), 2025.
- arXiv:2506.19998v1 Doc2Agent: Scalable Generation of Tool-Using Agents from API Documentation (Xinyi Ni, Haonan Jian, Qiuyang Wang, Vedanshi Chetan Shah, Pengyu Hong), 2025.
What To Build
- Add adversarial fixtures for untrusted UI, filesystem, and desktop-control inputs.
- Document the intended fail-closed behavior and any allowed degraded-mode fallback.
- Add regression coverage that proves unsafe inputs do not silently reach the privileged path.
Acceptance Criteria
Notes
- Generated issue 3/5 for
evalops/maestro by evalops_org_miner.py.
- Before implementation, confirm the sampled latent-spec snippets still match
main; this issue intentionally cites exact file paths/lines where the mining pass saw them.
Summary
Exercise prompt/tool/data poisoning and fail-closed behavior for the repo's most sensitive agent-facing path.
This issue was generated from an org-wide EvalOps mining pass on 2026-05-10 07:57 UTC. It combines live GitHub repo signals with a per-repo arXiv search. Treat the research links as grounding for a concrete implementation, not as a request for a literature review.
Repo Evidence
AGENTS.md:31includes latent-spec language: Critical: Consult.github/workflows/(evals.yml,nx-ci.yml,release.yml) to mirror CI environments.AGENTS.md:36includes latent-spec language: * Full Test Suite:npx nx run maestro:test --skip-nx-cache(Buildstui+maestro-webautomatically). Run after every code change. * Linting:bun run bun:lint(Biome + Eval Verifier). Run after every code change. * Runtime Commands: Avoid long-liveddev/watch servers (e.g.,npm run dev) unless thAGENTS.md:54includes latent-spec language: 2. Never force-push to main. This rewrites shared history and breaks collaborators. 3. Atomic commits only. Each commit should be one logical change. Don't mix unrelated changes. 4. Never use--forceor--force-with-leaseon shared branches.AGENTS.md:187includes latent-spec language: 4. Stop tasks when done - they'll auto-cleanup on Maestro exit, but explicit stops are cleaner 5. Use restart policies for resilient services - ideal for dev servers that should recover from crashes 6. Direct execution is safer - omitshellparameter for simple commands without pipesAGENTS.md:374includes latent-spec language: 4. Wire handler insrc/cli-tui/tui-renderer.ts: ```typescriptCLAUDE.md:31includes latent-spec language: Critical: Consult.github/workflows/(evals.yml,nx-ci.yml,release.yml) to mirror CI environments.Research Grounding
Repo axes: tooling, evaluation, security, desktop
Search keywords: run, maestro, use, dev, bun, never, servers, commands, test, build, command, npx
What To Build
Acceptance Criteria
Notes
evalops/maestrobyevalops_org_miner.py.main; this issue intentionally cites exact file paths/lines where the mining pass saw them.